
TRELLIS image-large on RTX 5090: Image-to-3D Mesh Generation with the Blackwell Build Path

3d · advanced · 16GB+ VRAM · May 25, 2026
models
tools
  • PyTorch
  • Conda
prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM, Blackwell sm_120)
  • Linux (Ubuntu tested; Windows requires extra steps — see Issue #3)
  • CUDA Toolkit 12.8 (required for sm_120)
  • Conda + Python 3.10+

What You'll Build

A working install of Microsoft's TRELLIS image-large (1.2B-parameter image-to-3D mesh generator, MIT-licensed) on an RTX 5090 — capable of converting a single input image into a textured GLB mesh, a Gaussian-splat representation, or a radiance field. The default setup.sh install path does not work on Blackwell GPUs; this recipe walks through the community-maintained Blackwell build path that compiles xformers, kaolin, and diffoctreerast from source against PyTorch + CUDA 12.8.

Hardware data: RTX 5090 (32 GB VRAM) · canonical 16 GB minimum per TRELLIS README · See benchmark data

⚠️ Known issue: Stock TRELLIS fails on RTX 5090 with the error "NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation". Multiple submodules (kaolin, xformers, diffoctreerast, spconv) need to be built against PyTorch's cu128 wheel rather than the default cu118. See Issue #243.

ℹ️ Image-to-3D mesh, not video or interactive world model. TRELLIS image-large turns a single image into a 3D asset (Gaussians, radiance field, or textured mesh — exportable as .glb and .ply). It is distinct from the waypoint-1-5 real-time world model (also in our 3d vertical) and from the newer TRELLIS.2-4B (Microsoft's 4B-parameter successor with a 24 GB minimum).

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (per README, verified on A100 / A6000) | RTX 5090 (32 GB) |
| RAM | 16 GB system RAM | |
| Storage | ~3.3 GB for model weights (HF tree); ~20 GB total with conda env and CUDA extensions | |
| Software | CUDA 12.8, Conda, Python 3.10+, PyTorch ≥ 2.7.1 + cu128 | |

Installation

The default setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast from the TRELLIS README is hard-coded to PyTorch 2.4.0 + CUDA 11.8 and will fail on Blackwell. The steps below follow the community-tested Blackwell path documented in maepopi's fork README and corroborated in TRELLIS Issue #243. Note that maepopi's fork (PR #257) was closed without merge — the upstream README still does not contain Blackwell instructions, so this is the de-facto community recipe.

1. Verify CUDA 12.8 toolkit

nvcc --version
# Expected: release 12.8 or newer

If nvcc reports an older release, install CUDA Toolkit 12.8 before continuing.
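
If you would rather script this check (for example at the top of a provisioning script), the sketch below parses the `release X.Y` token from `nvcc --version` and compares it against 12.8. The function names are illustrative, not part of TRELLIS or CUDA, and the sketch assumes the usual `release 12.8, V12.8.x` wording in nvcc's output.

```python
import re
import shutil
import subprocess

MIN_RELEASE = (12, 8)  # toolkit version this recipe requires for sm_120


def parse_nvcc_release(version_text: str):
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_is_new_enough(version_text: str, minimum=MIN_RELEASE) -> bool:
    """True when the parsed release is at least `minimum`."""
    release = parse_nvcc_release(version_text)
    return release is not None and release >= minimum


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("nvcc not found on PATH")
    else:
        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        ).stdout
        print("OK" if toolkit_is_new_enough(out) else "Install CUDA Toolkit 12.8 or newer")
```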

2. Clone the repo

git clone --recurse-submodules https://github.com/microsoft/TRELLIS.git
cd TRELLIS

3. Run partial setup (skip xformers / diffoctreerast / kaolin — we'll build those from source)

. ./setup.sh --new-env --basic --flash-attn --spconv --nvdiffrast

This creates a fresh trellis conda env, installs basic Python dependencies, builds flash-attn, installs spconv-cu120, and builds nvdiffrast. Activate the env if not already active: conda activate trellis.

4. Replace torch with cu128 wheel (sm_120 support)

PyTorch 2.7.0+ shipped pre-built CUDA 12.8 wheels with native sm_120 support; install the latest nightly to be safe:

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

This replaces the torch installed in step 3. CUDA extensions built in step 3 may need to be rebuilt against the new torch; if you hit an undefined symbol error later, rebuild the offending extension.
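
A quick way to confirm the swap took effect is to ask torch which architectures its wheel was compiled for. The sketch below uses `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`; the helper `arch_is_supported` is a hypothetical name, and its exact-match rule is a simplification that ignores PTX forward compatibility.

```python
def arch_is_supported(capability, arch_list):
    """True if the wheel's compiled arch list names this (major, minor) capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)  # expected (12, 0) on an RTX 5090
            archs = torch.cuda.get_arch_list()         # arch names compiled into this wheel
            print("capability:", cap, "covered by wheel:", arch_is_supported(cap, archs))
        else:
            print("CUDA is not available to torch")
```

If the last line reports `False`, the cu128 wheel did not replace the original install and step 4 should be repeated inside the trellis env.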

5. Build xformers from source

The PyPI xformers wheels lag behind PyTorch nightly; build against your installed torch:

git clone --recurse-submodules https://github.com/facebookresearch/xformers.git
cd xformers
pip install -e .
cd ..

6. Build diffoctreerast from source

mkdir -p /tmp/extensions
git clone --recurse-submodules https://github.com/JeffreyXiang/diffoctreerast.git /tmp/extensions/diffoctreerast
pip install /tmp/extensions/diffoctreerast

7. Install kaolin v0.18.0+ (sm_120 support)

NVIDIA Kaolin maintainer Caenorst confirmed kaolin v0.18.0 supports the latest PyTorch / CUDA versions (Issue #243 comment):

git clone https://github.com/NVIDIAGameWorks/kaolin
cd kaolin
export IGNORE_TORCH_VER=1
pip install "Cython >= 0.29.37"
pip install -e .
cd ..

If pip install -e . fails on a cuda_post_cflags / Unknown CUDA arch ("12.0") or GPU not supported error, ensure you're on kaolin master (v0.18.0+); older releases hard-coded the supported arch list.

8. Install Gradio demo dependencies (optional but recommended)

. ./setup.sh --demo

9. Re-pin torchvision (the demo setup may downgrade it)

pip uninstall -y torchvision
pip install --pre torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
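
To see whether the demo setup actually downgraded anything, one option is to compare the local build tag on the installed wheels. This is a minimal sketch that assumes the nightly wheels carry a `+cu128` suffix in their version string; `cuda_tag` and `report` are illustrative helpers, not part of pip or PyTorch.

```python
from importlib import metadata


def cuda_tag(version: str):
    """Return the local build tag of a wheel version (e.g. 'cu128'), or None."""
    return version.split("+", 1)[1] if "+" in version else None


def report(packages=("torch", "torchvision"), expected="cu128"):
    """Map each package to True/False (tag matches) or None (not installed)."""
    result = {}
    for name in packages:
        try:
            result[name] = cuda_tag(metadata.version(name)) == expected
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    print(report())
```

A `False` for torchvision next to a `True` for torch suggests the downgrade this step guards against.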

Running

Verify the install with the upstream example.py — it downloads the weights from HuggingFace on first run (~3.3 GB to ~/.cache/huggingface/hub/):

python example.py

You should get five files in the working directory:

  • sample_gs.mp4 — turntable video of the 3D Gaussian representation
  • sample_rf.mp4 — turntable of the radiance field
  • sample_mesh.mp4 — turntable of the normal-shaded mesh
  • sample.glb — textured GLB exportable to Blender / Unity / web viewers
  • sample.ply — raw 3D Gaussian point cloud
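
A small check like the following can confirm that all five files were written and are non-empty. It is a sketch that relies only on the file names listed above; `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

# File names as listed above for the upstream example.py run.
EXPECTED_OUTPUTS = (
    "sample_gs.mp4",
    "sample_rf.mp4",
    "sample_mesh.mp4",
    "sample.glb",
    "sample.ply",
)


def missing_outputs(directory="."):
    """Return the expected files that are absent or empty in `directory`."""
    root = Path(directory)
    return [
        name
        for name in EXPECTED_OUTPUTS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_outputs()
    print("all five outputs present" if not missing else f"missing: {missing}")
```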

For the interactive Gradio demo:

python app.py

Then open the URL it prints (default http://127.0.0.1:7860). The demo lets you drop in a single image, runs the same TrellisImageTo3DPipeline.from_pretrained("microsoft/TRELLIS-image-large") pipeline, and previews the Gaussian / radiance / mesh outputs side-by-side.

Results

  • Speed: No RTX 5090–named TRELLIS measurement has been published. Once a community benchmark lands via /contribute, this section will pick it up. For now, see /check/trellis-image-large/rtx-5090 for the live data.
  • VRAM usage: The canonical TRELLIS README states "An NVIDIA GPU with at least 16GB of memory is necessary. The code has been verified on NVIDIA A100 and A6000 GPUs." The 5090's 32 GB envelope comfortably exceeds that floor with ~16 GB of headroom — see "Spending the headroom" below for productive uses of the spare capacity.
  • Quality notes: TRELLIS image-large is a 1.2B-parameter SLAT (Structured LATent) flow model — see the arXiv paper for the architecture. It outputs three representations from one pass; the GLB export from postprocessing_utils.to_glb(...) is the most directly usable downstream artifact (drop into Blender, Three.js, or any GLTF-aware viewer).

For the full benchmark data, see /check/trellis-image-large/rtx-5090.

Spending the headroom

A 5090 (32 GB) is roughly 2× over-provisioned for the canonical TRELLIS image-large workload (16 GB minimum per README). Concrete uses for the spare ~16 GB:

  • Larger texture maps. The postprocessing_utils.to_glb(...) call in example.py defaults to texture_size=1024; bump to 2048 or 4096 for higher-fidelity surface detail on the exported .glb. Texture baking is where VRAM pressure spikes in the mesh stage.
  • Skip the offload forks. Community forks like 0lento/TRELLIS (8 GB target) and the FP16 TRELLIS-BOX (~50% VRAM reduction) stream models between CPU and GPU to fit smaller cards — on a 5090 you can keep everything resident for faster per-image throughput.
  • Multi-image conditioning. Use the multi-image conditioning support added 2024-12-18 to condition on 2-4 input views simultaneously; each extra view costs more VRAM but produces noticeably more consistent geometry.
  • Colocate with an image generator. Pair TRELLIS with a smaller image-gen model (e.g. flux-2-klein-4b at ~9 GB FP8) on the same card to build a text→image→3D pipeline without a second GPU.
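
The arithmetic behind these options is simple enough to sketch. The helpers below are illustrative only: texel count is a rough proxy for texture-bake cost rather than a measured VRAM figure, and the 16 GB and ~9 GB values are the ones quoted above, not benchmarks.

```python
DEFAULT_TEXTURE_SIZE = 1024  # default named above for to_glb(...) in example.py


def texel_scale(texture_size, baseline=DEFAULT_TEXTURE_SIZE):
    """How many times more texels a square texture has than the baseline."""
    return (texture_size / baseline) ** 2


def remaining_headroom_gb(total_gb=32, trellis_floor_gb=16, colocated_gb=0):
    """VRAM left after the README's 16 GB floor and any colocated model."""
    return total_gb - trellis_floor_gb - colocated_gb


for size in (1024, 2048, 4096):
    print(size, f"{texel_scale(size):.0f}x texels")
print("headroom with a ~9 GB image model:", remaining_headroom_gb(colocated_gb=9), "GB")
```

Doubling `texture_size` quadruples the texel count, so 4096 carries 16 times the texels of the 1024 default; colocating a ~9 GB model leaves about 7 GB above the README floor.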

Troubleshooting

NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation

The pre-built PyTorch 2.4.0 wheel that setup.sh --basic installs is compiled for CUDA 11.8 and predates Blackwell. The fix is step 4 above — install PyTorch from the nightly/cu128 index. The canonical tracking thread is Issue #243, which collects working install paths from multiple contributors (maepopi, SanBingYouYong, zhizdev).

Unknown CUDA arch ("12.0") or GPU not supported

Reported by Polytoo on Issue #243. It is raised when the torch.utils.cpp_extension used to build an extension doesn't recognize compute_120. Rebuild the offending extension after step 4: usually kaolin (step 7) or xformers (step 5). Make sure you're on the upstream master of each (kaolin v0.18.0+, xformers latest); older tagged releases pre-date Blackwell.

flash_attn import fails after step 4 (undefined symbol: _ZN3c105ErrorC...)

Swapping in a PyTorch nightly often breaks ABI compatibility with a flash_attn build that was compiled against the previous torch. Either rebuild flash_attn from source against the installed torch, or set the xformers backend before importing TRELLIS:

import os
os.environ['ATTN_BACKEND'] = 'xformers'  # before any TRELLIS import

TRELLIS supports both flash-attn and xformers attention backends — see the Minimal Example at the top of the upstream README. FlashAttention 2 itself does not currently ship sm_120 kernels — coverage tracked at Dao-AILab/flash-attention#2168. The xformers backend works on Blackwell.
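
If the same script has to run on machines where flash_attn may or may not import, the backend can be chosen at startup. The sketch below is one possible approach: `pick_attn_backend` is a hypothetical helper, and it assumes the two backend names accepted by `ATTN_BACKEND` are `flash-attn` and `xformers`, as in the upstream README.

```python
import importlib
import os


def pick_attn_backend(probe_module="flash_attn"):
    """Return 'flash-attn' if the probe module imports cleanly, else 'xformers'."""
    try:
        importlib.import_module(probe_module)
    except (ImportError, OSError):
        # An ABI mismatch ("undefined symbol") surfaces as ImportError at import time.
        return "xformers"
    return "flash-attn"


if __name__ == "__main__":
    # Must run before any TRELLIS import, as with the manual override above.
    os.environ["ATTN_BACKEND"] = pick_attn_backend()
    print("ATTN_BACKEND =", os.environ["ATTN_BACKEND"])
```

Note that a clean import does not prove flash_attn has working sm_120 kernels, so on Blackwell setting `xformers` explicitly remains the more conservative choice.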

GLIBCXX_3.4.30 not found at import time

conda install -c conda-forge libstdcxx-ng

The system libstdc++ shipped with older Ubuntu LTS lags the version Caffe2 / PyTorch nightly needs. The conda-forge package is the safe override.
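
To check whether a given libstdc++ actually exports the required symbol version, you can scan the library for its `GLIBCXX_*` strings, similar to running `strings` on it. This sketch assumes the active conda env keeps its copy at `$CONDA_PREFIX/lib/libstdc++.so.6`; the helper names are illustrative.

```python
import os
import re
from pathlib import Path


def glibcxx_versions(data: bytes):
    """Return the sorted GLIBCXX_x.y.z version tuples found in a binary blob."""
    found = re.findall(rb"GLIBCXX_(\d+)\.(\d+)\.(\d+)", data)
    return sorted({tuple(int(part) for part in parts) for parts in found})


def provides(data: bytes, required=(3, 4, 30)) -> bool:
    """True if the blob advertises the required GLIBCXX version."""
    return required in glibcxx_versions(data)


if __name__ == "__main__":
    # Assumption: the conda env's libstdc++ lives under $CONDA_PREFIX/lib.
    lib = Path(os.environ.get("CONDA_PREFIX", "/usr")) / "lib" / "libstdc++.so.6"
    if lib.is_file():
        print(lib, "provides GLIBCXX_3.4.30:", provides(lib.read_bytes()))
    else:
        print("no libstdc++.so.6 at", lib)
```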

Tremendous VRAM allocation request (Tried to allocate 196.89 GiB)

Issue #79: diffoctreerast can mis-size its allocation when given certain input image shapes (transparency / unusual aspect ratios). Pre-process input images to a square aspect ratio; the upstream app.py does this automatically, so if you call pipeline.run directly, mirror its preprocessing.
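
For direct pipeline.run callers, a simple pad-to-square step might look like the sketch below. It is not the upstream preprocessing (which does more than adjust aspect ratio), and whether it avoids the allocation problem in Issue #79 is unverified; `square_canvas` and `pad_to_square` are hypothetical helpers, and the second one requires Pillow.

```python
def square_canvas(width, height):
    """Return (side, x_offset, y_offset) for centering a width x height image on a square."""
    side = max(width, height)
    return side, (side - width) // 2, (side - height) // 2


def pad_to_square(image, fill=0):
    """Center a PIL image on a square canvas of the same mode (requires Pillow)."""
    from PIL import Image  # imported lazily so the geometry helper has no dependencies

    side, x, y = square_canvas(*image.size)
    canvas = Image.new(image.mode, (side, side), fill)
    canvas.paste(image, (x, y))
    return canvas
```

A 640×480 input ends up centered on a 640×640 canvas with 80 pixels of padding above and below; the padded image can then be passed to the pipeline in place of the original.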

Windows install

Windows is documented as not fully tested by Microsoft — see Issue #3. Issue #259 collects a community Windows installer with pre-compiled libraries for RTX 50-series. For first-class Blackwell support on Windows, that installer is currently the path of least resistance; the steps above target Linux.