
TRELLIS.2-4B on RTX 5090: First Consumer Card That Hits the 24 GB Floor for Image-to-3D

3d · advanced · 24GB+ VRAM · May 25, 2026
Prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM) — the first consumer card that cleanly meets Microsoft's official 24 GB minimum
  • Linux (Ubuntu 22.04 / 24.04 or WSL2) — Microsoft has only tested the code on Linux
  • Python 3.10+, CUDA Toolkit 12.4+ (12.8 or 13.0 recommended for native sm_120 builds), conda
  • 32 GB+ system RAM (64 GB recommended — the pipeline retains significant CPU memory after inference)
  • ~17.4 GB free disk space in total: 15.12 GiB (~16.2 GB) for the TRELLIS.2-4B checkpoints plus ~1.2 GB for the DINOv3 image encoder
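The storage figure mixes binary (GiB) and decimal (GB) units, so it is worth converting once before you start the download. The sketch below is a hypothetical preflight helper, not part of the TRELLIS.2 repo; the two sizes are the ones quoted in this recipe, and the 2 GB safety margin is an assumption.

```python
import shutil

GIB = 1024 ** 3  # binary gibibyte, in bytes
GB = 1000 ** 3   # decimal gigabyte, in bytes

# Download sizes as quoted in this recipe.
CKPT_BYTES = 15.12 * GIB  # TRELLIS.2-4B ckpts/
DINO_BYTES = 1.21 * GB    # DINOv3 ViT-L image encoder


def required_gb() -> float:
    """Combined download size in decimal GB."""
    return (CKPT_BYTES + DINO_BYTES) / GB


def has_room(path: str = ".", margin_gb: float = 2.0) -> bool:
    """True if `path` has the download size plus a safety margin free.

    margin_gb is an assumed cushion, not a figure from the README.
    """
    free_gb = shutil.disk_usage(path).free / GB
    return free_gb >= required_gb() + margin_gb


print(f"need ~{required_gb():.1f} GB; enough room here: {has_room()}")
```

Point `path` at the drive that will hold ./models if it differs from the current directory.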

What You'll Build

Generate high-fidelity textured 3D assets from single images locally on an RTX 5090 with TRELLIS.2-4B, Microsoft's 4-billion-parameter image-to-3D flow-matching transformer. The model outputs PBR-ready meshes with base color, roughness, metallic, and opacity channels at voxel-grid resolutions from 512³ up to 1536³. The 5090 is the first consumer NVIDIA GPU that cleanly clears Microsoft's documented 24 GB minimum: the 16 GB Blackwell cards (RTX 5060 Ti, RTX 5070 Ti) fall short of it, and the 24 GB RTX 4090 / RTX 3090 sit right on the floor with no headroom for the reconstruction stage.

Hardware data: RTX 5090 (32 GB VRAM) · 4B image-to-3D pipeline · See benchmark data

Variant pin. This recipe is for TRELLIS.2-4B at the canonical Hugging Face slug microsoft/TRELLIS.2-4B. Note the dot in TRELLIS.2: slugs written with a dash in place of the dot do not resolve to this model (Hugging Face answers them with HTTP 401). It is NOT for the older microsoft/TRELLIS-image-large (the 2024 v1 generation, a different architecture, also image-to-3D) and NOT for the TRELLIS-text family.

Known issue — Blackwell mesh extraction. TRELLIS.2's to_glb() pipeline calls cumesh.remeshing.remesh_narrow_band_dc, which silently produces fragmented output (thousands of disconnected triangle patches instead of a connected surface) on every Blackwell GPU (RTX 5070 / 5070 Ti / 5080 / 5090). See Troubleshooting — the ThatButters/trellis2-blackwell-fix repo provides a marching-cubes workaround that produces watertight meshes suitable for 3D printing.

Requirements

Component | Minimum | Tested
GPU | 24 GB VRAM per the official TRELLIS.2 README | RTX 5090 (32 GB)
RAM | 32 GB | 64 GB (Issue #168: VRAM-leak workaround recommends subprocess on ≤ 32 GB systems)
Storage | ~17 GB (TRELLIS.2-4B ckpts/ = 15.12 GiB + DINOv3 ViT-L = 1.21 GB)
Software | Linux + conda, Python 3.10+, CUDA 12.4+ (12.8 or 13.0 for native sm_120 builds), PyTorch 2.7+ (cu128)

The 16 GB Blackwell cards (RTX 5060 Ti, RTX 5070 Ti) fall short of the documented minimum; the 24 GB cards (RTX 4090, RTX 3090) meet it with no headroom for the reconstruction stage. With a community-reported runtime peak of 26-27 GB, the 5090's 32 GB leaves roughly 5-6 GB of margin before the reconstruction stage runs out of memory.
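The headroom claim is plain subtraction, and spelling it out makes it easy to redo for another card or another observed peak. A minimal sketch follows; the function name is hypothetical, the 26-27 GB peak is the community-reported figure quoted in this recipe, and the card sizes are nominal marketing capacities rather than the slightly lower limit the driver reports.

```python
def headroom_gb(card_gb: float, peak_gb: float) -> float:
    """VRAM left over at the runtime peak; negative means the run cannot fit."""
    return card_gb - peak_gb


# Nominal capacities (GB) for the cards discussed above.
CARDS = {"RTX 5090": 32.0, "RTX 4090 / 3090": 24.0, "16 GB cards": 16.0}

for name, capacity in CARDS.items():
    low = headroom_gb(capacity, 27.0)   # against the upper end of the reported peak
    high = headroom_gb(capacity, 26.0)  # against the lower end of the reported peak
    print(f"{name}: {low:+.0f} to {high:+.0f} GB of headroom")
```

A negative result only says the reported peak would not fit as-is; it does not account for lower resolutions or offloading.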

Installation

There are two viable paths on Blackwell. The native Python install (option A) mirrors Microsoft's documented setup but requires building flash-attn and the CUDA extensions from source for sm_120. The ComfyUI port (option B, visualbruno/ComfyUI-Trellis2) ships prebuilt Windows and Linux wheels and skips that source build, whose flash-attn step alone takes roughly 30-60 minutes. Option B is recommended unless you need the bare pipeline API.

1A. Native install — clone the repo

git clone -b main https://github.com/microsoft/TRELLIS.2.git --recursive
cd TRELLIS.2

2A. Native install — run setup.sh

The official setup uses conda and compiles the CUDA extensions in place. On Blackwell you'll need PyTorch 2.7+ on a cu128 (or newer cu130) wheel and a CUDA Toolkit that supports sm_120. Per the official README: "By default the trellis2 environment will use pytorch 2.6.0 with CUDA 12.4. If you want to use a different version of CUDA, you can remove the --new-env flag and manually install the required dependencies."

Recommended sequence for RTX 5090 (Linux), drawing on the community-confirmed environment in Issue #143 and the build steps in Issue #19:

# Create env manually so we control the PyTorch version
conda create -n trellis2 python=3.10 -y
conda activate trellis2

# PyTorch 2.7.x+ with cu128 (sm_120 native)
pip install torch==2.9.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Set CUDA arch to Blackwell BEFORE running setup
export TORCH_CUDA_ARCH_LIST="12.0"
export CUDA_HOME=/usr/local/cuda-12.8  # adjust to your install path

# Run setup without --new-env (we already created it manually)
. ./setup.sh --basic --flash-attn --nvdiffrast --nvdiffrec --cumesh --o-voxel --flexgemm

flash-attn will compile from source — this is the long step (expect ~30-60 minutes depending on CPU). Per ben-mkiv's confirmed setup in Issue #19: "I got this running on 50 series card with this setup [...] Python: 3.13 PyTorch: 2.9.1+cu128 CUDA version: 12.8 local CUDA toolkit to build against: 12.8.1-r1 [...] however, flash-attention needed to be build from source, which took quite a while."

1B. ComfyUI port — install the custom node + prebuilt wheels

If you already run ComfyUI, this path skips the source compile. Install ComfyUI first (latest version) per its own README, then:

cd ComfyUI/custom_nodes
git clone https://github.com/visualbruno/ComfyUI-Trellis2.git
cd ..

# Install the prebuilt wheels — choose the set matching your Torch/Python combo
# Linux / Torch 2.7.0 / Python 3.11 (example; check wheels/ folder for your stack):
python -m pip install custom_nodes/ComfyUI-Trellis2/wheels/Linux/Torch270/cumesh-1.0-cp311-cp311-linux_x86_64.whl
python -m pip install custom_nodes/ComfyUI-Trellis2/wheels/Linux/Torch270/nvdiffrast-0.4.0-cp311-cp311-linux_x86_64.whl
python -m pip install custom_nodes/ComfyUI-Trellis2/wheels/Linux/Torch270/nvdiffrec_render-0.0.0-cp311-cp311-linux_x86_64.whl
python -m pip install custom_nodes/ComfyUI-Trellis2/wheels/Linux/Torch270/flex_gemm-0.0.1-cp311-cp311-linux_x86_64.whl
python -m pip install custom_nodes/ComfyUI-Trellis2/wheels/Linux/Torch270/o_voxel-0.0.1-cp311-cp311-linux_x86_64.whl
python -m pip install -r custom_nodes/ComfyUI-Trellis2/requirements.txt

The port maintainer states it was "Tested on Windows 11 with Python 3.11 and Torch = 2.7.0 + cu128" per the ComfyUI-Trellis2 README, and the changelog notes Linux wheels and FP8 model support landed in February 2026. Check the repo's wheels/ folder for the exact filename for your Torch / Python / OS combo.
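Picking the right wheel comes down to matching the CPython tag in the filename against the interpreter that runs ComfyUI. The sketch below derives that tag and filters a directory listing. It assumes the wheels keep the cpXY-cpXY-linux_x86_64 naming seen in the commands above; the helper names, and the cp312 filename in the usage note, are hypothetical.

```python
import sys


def wheel_tag(major: int, minor: int, platform: str = "linux_x86_64") -> str:
    """Build the '<python>-<abi>-<platform>' suffix used in the wheel names above."""
    cp = f"cp{major}{minor}"
    return f"{cp}-{cp}-{platform}"


def matching_wheels(filenames, major=None, minor=None):
    """Keep only the wheels built for the given (default: running) Python."""
    major = sys.version_info.major if major is None else major
    minor = sys.version_info.minor if minor is None else minor
    suffix = wheel_tag(major, minor) + ".whl"
    return [name for name in filenames if name.endswith(suffix)]


print("this interpreter needs wheels ending in:", wheel_tag(*sys.version_info[:2]) + ".whl")
```

For example, matching_wheels(["cumesh-1.0-cp311-cp311-linux_x86_64.whl", "cumesh-1.0-cp312-cp312-linux_x86_64.whl"], 3, 11) keeps only the first name. Feed it os.listdir() of the Torch folder you picked; an empty result means no prebuilt wheel exists for your Python.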

3. Download the model weights

The TRELLIS.2-4B checkpoints (~15.12 GiB total across encoder/decoder/flow stages) live at microsoft/TRELLIS.2-4B:

huggingface-cli download microsoft/TRELLIS.2-4B --local-dir ./models/TRELLIS.2-4B

You also need the DINOv3 image encoder (~1.21 GB) — per the ComfyUI-Trellis2 REQUIREMENTS section: "You need to have access to facebook dinov3 models in order to use Trellis.2 [...] Clone the repository in ComfyUI models folder under facebook/dinov3-vitl16-pretrain-lvd1689m."

huggingface-cli download facebook/dinov3-vitl16-pretrain-lvd1689m \
  --local-dir ./models/facebook/dinov3-vitl16-pretrain-lvd1689m

The DINOv3 repo is gated: accept the license on its Hugging Face page first, then authenticate before downloading, either by running huggingface-cli login or by exporting an access token as HF_TOKEN (older huggingface_hub releases read HUGGING_FACE_HUB_TOKEN).
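A gated download that fails halfway through a batch script is annoying to debug, so a quick check for a token in the environment can save a round trip. This is a minimal sketch with a hypothetical helper name. It assumes the token is supplied through HF_TOKEN or the older HUGGING_FACE_HUB_TOKEN variable, and it does not detect a token cached by huggingface-cli login.

```python
import os

# Variable names checked, newest first. Assumption: your huggingface_hub
# release reads at least one of these.
TOKEN_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def token_source(env=None):
    """Return the name of the first non-empty token variable, or None."""
    env = os.environ if env is None else env
    for name in TOKEN_VARS:
        if env.get(name, "").strip():
            return name
    return None


source = token_source()
if source:
    print("token found in", source)
else:
    print("no token variable set; run huggingface-cli login or export one")
```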

Running

Native install — minimal example

Save the following as run_trellis.py in the cloned TRELLIS.2 directory and place a single input image at assets/example_image/T.png:

import os
os.environ['OPENCV_IO_ENABLE_OPENEXR'] = '1'
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # Can save GPU memory
import cv2
import imageio
from PIL import Image
import torch
from trellis2.pipelines import Trellis2ImageTo3DPipeline
from trellis2.utils import render_utils
from trellis2.renderers import EnvMap
import o_voxel

# 1. Setup Environment Map
envmap = EnvMap(torch.tensor(
    cv2.cvtColor(cv2.imread('assets/hdri/forest.exr', cv2.IMREAD_UNCHANGED), cv2.COLOR_BGR2RGB),
    dtype=torch.float32, device='cuda'
))

# 2. Load Pipeline
pipeline = Trellis2ImageTo3DPipeline.from_pretrained("microsoft/TRELLIS.2-4B")
pipeline.cuda()

# 3. Load Image & Run
image = Image.open("assets/example_image/T.png")
mesh = pipeline.run(image)[0]
mesh.simplify(16777216)  # nvdiffrast limit

# 4. Render Video
video = render_utils.make_pbr_vis_frames(render_utils.render_video(mesh, envmap=envmap))
imageio.mimsave("sample.mp4", video, fps=15)

# 5. Export to GLB
glb = o_voxel.postprocess.to_glb(
    vertices=mesh.vertices,
    faces=mesh.faces,
    attr_volume=mesh.attrs,
    coords=mesh.coords,
    attr_layout=mesh.layout,
    voxel_size=mesh.voxel_size,
    aabb=[[-0.5, -0.5, -0.5], [0.5, 0.5, 0.5]],
    decimation_target=1000000,
    texture_size=4096,
    remesh=True,
    remesh_band=1,
    remesh_project=0,
    verbose=True,
)
glb.export("sample.glb", extension_webp=True)

This snippet is reproduced verbatim from the minimal example in the official GitHub README. The PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True environment variable is recommended by Microsoft for reducing peak VRAM.

Then run:

python run_trellis.py

Output:

  • sample.mp4 — a video visualizing the generated 3D asset with PBR materials
  • sample.glb — the extracted 3D asset in GLB format

Web demo

The repo also ships a Gradio-style web UI:

python app.py

For PBR texture generation on an existing mesh, see example_texturing.py / app_texturing.py per the official README's PBR Texture Generation section.

ComfyUI port

After installing the wheels, launch ComfyUI normally and load one of the example workflows from ComfyUI/custom_nodes/ComfyUI-Trellis2/example_workflows/. The port adds nodes including "Sparse Generator with ReconViaGen," "Mesh with Voxel Cascade Generator," "Mesh Texturing Multi-View," and FP8 model variants — see the changelog in the ComfyUI-Trellis2 README for the full feature matrix.

Results

  • Speed: Microsoft publishes only H100 timings (~3 s at 512³, ~17 s at 1024³, ~60 s at 1536³). The H100 is a datacenter Hopper card and not a close-arch sibling of the Blackwell RTX 5090 — extrapolation would be misleading, so no consumer-card speed number is quoted here. Once a community benchmark lands, /check/trellis-2/rtx-5090 will carry the measured figure; please submit yours via /contribute.
  • VRAM usage: ~24 GB minimum per the canonical README prerequisites; the community-reported runtime peak on RTX 5090 32 GB lands at ~26-27 GB allocated going into the reconstruction stage — the 5090's 32 GB envelope leaves a workable ~5-6 GB margin where the 24 GB cards have none.
  • Quality notes: TRELLIS.2-4B is a pre-trained foundation model without RLHF or aesthetic fine-tuning per the model card. Outputs reflect the training-data distribution; expect to iterate on input images. Raw meshes may contain small holes; Microsoft ships mesh post-processing scripts for hole-filling when watertight geometry is needed (e.g. 3D printing).
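The H100 timings in the Speed bullet do say something about how cost grows with resolution on a single card, even though they say nothing about the 5090. The sketch below compares the growth in nominal grid cells with the growth in published time for each step. It uses only the three H100 figures quoted above; the function name is hypothetical, and the dense cell count is a simplification that ignores any sparsity in the model's actual representation.

```python
# H100 timings published by Microsoft, as quoted in the Speed bullet (seconds).
H100_SECONDS = {512: 3.0, 1024: 17.0, 1536: 60.0}


def step_ratios(lo: int, hi: int) -> tuple[float, float]:
    """Return (growth in nominal grid cells, growth in published time)."""
    cell_ratio = (hi / lo) ** 3
    time_ratio = H100_SECONDS[hi] / H100_SECONDS[lo]
    return cell_ratio, time_ratio


for lo, hi in ((512, 1024), (1024, 1536)):
    cells, seconds = step_ratios(lo, hi)
    print(f"{lo} -> {hi}: cells x{cells:.2f}, time x{seconds:.2f}")
```

These ratios describe the H100 only; they are not a basis for predicting absolute times on other hardware.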

For the full benchmark data, see /check/trellis-2/rtx-5090.

Troubleshooting

Blackwell mesh extraction: fragmented GLB output

The single most disruptive Blackwell-specific bug. Symptom: your .glb looks like a point cloud — thousands of disconnected vertices with no connected surface, suspiciously small file size, slicer software (PrusaSlicer / Bambu Studio) shows an empty build plate. Root cause per the trellis2-blackwell-fix README: the cumesh.remeshing.remesh_narrow_band_dc CUDA kernel produces incorrect results on the Blackwell instruction set (sm_120) — it completes without errors but the fragments that should be connected remain disconnected.
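Because the kernel fails silently, it helps to test an export for fragmentation rather than eyeball it. One way is to count connected components: a healthy single-object mesh has a handful at most, while the failure described above yields thousands. The following is an illustrative pure-Python sketch using union-find over the face list; it is not taken from the trellis2-blackwell-fix repo, and the function name is hypothetical.

```python
def count_components(faces) -> int:
    """Count connected pieces of a triangle mesh.

    `faces` is an iterable of (a, b, c) vertex-index triples. Two faces belong
    to the same piece when they share at least one vertex index.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for a, b, c in faces:
        union(a, b)
        union(b, c)

    return len({find(v) for v in parent})


# Two triangles sharing an edge form one piece; a third, separate one makes two.
print(count_components([(0, 1, 2), (1, 2, 3), (4, 5, 6)]))
```

If trimesh is installed, the faces array of trimesh.load("sample.glb", force="mesh") should work as input. Note that duplicated vertices count as separate pieces here, which is the behavior you want when hunting for this bug.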

The workaround at ThatButters/trellis2-blackwell-fix bypasses CuMesh remeshing entirely and reconstructs a watertight mesh from the raw voxel coordinates using marching cubes:

git clone https://github.com/ThatButters/trellis2-blackwell-fix
cd trellis2-blackwell-fix
pip install numpy scipy scikit-image trimesh

Then in your inference script:

import blackwell_fix
blackwell_fix.patch_all()  # MUST come before importing trellis2

from trellis2.pipelines import Trellis2ImageTo3DPipeline
# ... rest of your pipeline as usual ...
mesh = pipeline.run(image)[0]
result = blackwell_fix.voxel_to_mesh(mesh, target_height_mm=100.0)
result.export("output.stl")

The output is watertight and suitable for 3D printing, with detail limited to the voxel grid resolution (~0.1 mm at typical print scales) per the Working Workaround section. The fix has been tested on RTX 5070 Ti (sm_120) — the same Blackwell instruction set as the 5090.
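The ~0.1 mm figure follows from dividing the print height by the grid resolution. A quick sketch of that arithmetic is below; the function name is hypothetical, and it assumes the model spans the full height of the voxel grid. A model that fills only part of the grid gets proportionally coarser detail.

```python
def voxel_pitch_mm(target_height_mm: float, grid_resolution: int) -> float:
    """Edge length of one voxel once the model is scaled to the target height."""
    return target_height_mm / grid_resolution


# Pitch at the three grid resolutions named in this recipe, for a 100 mm print.
for resolution in (512, 1024, 1536):
    pitch = voxel_pitch_mm(100.0, resolution)
    print(f"{resolution}: {pitch:.3f} mm per voxel")
```

Compare the result with your printer's layer height and nozzle width to judge whether a higher grid resolution would be visible in the print at all.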

Build failures: nvcc fatal: Unsupported gpu architecture 'compute_120'

The default setup.sh flow pulls CUDA 12.4 which predates sm_120 support. Per Issue #19 the failure mode is exactly: "CUDA 12.4 doesn't support sm_120 (only up to sm_90) [...] CUDA extensions (cumesh, o-voxel, flash-attn, nvdiffrast) fail to compile with nvcc fatal: Unsupported gpu architecture 'compute_120'."

Fix: install a CUDA Toolkit ≥ 12.8 (12.8 and 13.0 both work), point CUDA_HOME at that install, set TORCH_CUDA_ARCH_LIST="12.0" before running setup, and use a cu128 or cu130 PyTorch wheel. See the community-verified environment in Issue #19's comment from ehuqija217-max for the full sequence. Alternatively, the visualbruno/ComfyUI-Trellis2 prebuilt wheels (Windows and Linux) skip the compile entirely.
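You can catch this failure before starting a long build by checking which toolkit nvcc actually resolves to. The sketch below parses the release number from nvcc --version output and compares it with the 12.8 minimum named above. It is a hypothetical preflight helper, and it assumes the output contains a "release X.Y" fragment, as current toolkits print.

```python
import re
import shutil
import subprocess

MIN_SM120_TOOLKIT = (12, 8)  # first toolkit version this recipe uses for sm_120


def toolkit_version(nvcc_output: str):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def supports_sm120(nvcc_output: str) -> bool:
    """True when the reported toolkit is at least MIN_SM120_TOOLKIT."""
    version = toolkit_version(nvcc_output)
    return version is not None and version >= MIN_SM120_TOOLKIT


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH; check CUDA_HOME")
    else:
        output = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
        print("toolkit:", toolkit_version(output), "sm_120 ready:", supports_sm120(output))
```

If this reports an older toolkit than the one you installed, PATH is probably resolving to a different nvcc than CUDA_HOME points at.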

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Reported in Issue #154 on Linux conda environments built with the default setup.sh. The root cause is typically a version mismatch between the torch / torchvision / cuDNN builds that conda installs and the CUDA version your system driver supports. Verify with python -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())" and confirm that the reported CUDA version is one your driver supports. On Blackwell drivers (590-series and later), use the cu128 or cu130 wheels and avoid mixing system-pip and conda channels for torch.

VRAM not released between pipeline stages — OOM at reconstruction

A community user reports in Issue #168 on the ComfyUI port (official response pending) that on RTX 5090 + PyTorch 2.11 + CUDA 13.0, ~26-27 GB stays allocated after the shape SLat decoder completes, causing torch.OutOfMemoryError at reconstruct_mesh_dc_quad with "Currently allocated: 26.57 GiB / Device limit: 31.34 GiB / Free (according to CUDA): 55.44 MiB." The reporter identifies cudaMallocAsync as the underlying cause — comfy-aimdo's async memory pool does not respond to torch.cuda.empty_cache(). Workaround on this stack: use the standard ComfyUI core (avoiding comfy-aimdo's allocator override), or set PYTORCH_CUDA_ALLOC_CONF=backend:native at load time (not at runtime — a load-time assertion will fire if you try to switch mid-run).
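Since the allocator backend has to be chosen at load time, the safest place to set it is at the very top of the entry script, merged with any options already present. The following is a minimal sketch with hypothetical helper names. It handles only simple comma-separated key:value options, and refusing to run once torch is imported is a deliberately conservative assumption, not a rule stated by PyTorch.

```python
import os
import sys


def merge_alloc_conf(existing: str, **options: str) -> str:
    """Merge key:value allocator options; new values override existing keys.

    Only simple comma-separated key:value pairs are handled; bracketed list
    values that themselves contain commas are out of scope for this sketch.
    """
    settings = {}
    for item in (part.strip() for part in existing.split(",")):
        if item:
            key, _, value = item.partition(":")
            settings[key] = value
    settings.update(options)
    return ",".join(f"{key}:{value}" for key, value in settings.items())


def set_alloc_conf(**options: str) -> str:
    """Write the merged options to PYTORCH_CUDA_ALLOC_CONF before torch loads."""
    if "torch" in sys.modules:
        raise RuntimeError("set PYTORCH_CUDA_ALLOC_CONF before importing torch")
    merged = merge_alloc_conf(os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""), **options)
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = merged
    return merged


print(merge_alloc_conf("expandable_segments:True", backend="native"))
```

Call set_alloc_conf(backend="native") as the first statement of your script, ahead of any import that pulls in torch.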

For the related Blackwell crash on empty_cache(), see pending PR #116 by cuzelac which adds torch.cuda.synchronize() before each empty_cache() call: "torch.cuda.empty_cache() can free GPU memory that still has pending async work, causing CUDA error: illegal memory access on Blackwell GPUs (RTX 5090, sm_120)."
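Until that PR lands, you can apply the same ordering in your own scripts. The sketch below is not the code from PR #116, only an illustration of the synchronize-then-free pattern it describes. The helper name is hypothetical, and it takes the torch.cuda module as an argument so the ordering can be checked without a GPU.

```python
def release_cached_memory(cuda) -> None:
    """Wait for pending GPU work, then release cached allocator blocks.

    `cuda` is expected to be the torch.cuda module, or any object exposing
    synchronize() and empty_cache() with the same meaning.
    """
    cuda.synchronize()  # let queued kernels finish before memory is freed
    cuda.empty_cache()  # return cached blocks to the driver


class RecordingCuda:
    """Stand-in that records call order; useful for a dry run on any machine."""

    def __init__(self):
        self.calls = []

    def synchronize(self):
        self.calls.append("synchronize")

    def empty_cache(self):
        self.calls.append("empty_cache")


dry_run = RecordingCuda()
release_cached_memory(dry_run)
print(dry_run.calls)
```

In a real script, call release_cached_memory(torch.cuda) wherever you would otherwise call torch.cuda.empty_cache() directly.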

Inference VRAM leak across multiple runs

Issue #136 (open, community-reported, no maintainer response) reports growing GPU memory usage across iterations on the canonical Microsoft repo. Mitigation: between runs, explicitly del pipeline; torch.cuda.empty_cache() (with the torch.cuda.synchronize() guard from PR #116 above on Blackwell), or restart the Python process between large batches. The ThatButters subprocess pattern — pickling voxel coords to disk and running the mesh reconstruction in a fresh process — is the most robust mitigation for long batch runs.
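The process-restart mitigation is easy to automate: launch each job in a fresh interpreter so the operating system reclaims all GPU and CPU memory when the job exits. This is a minimal sketch of that idea, not the ThatButters implementation. The helper name and the per-image argument in the usage note are hypothetical, and the README example script would need to be adapted to accept an image path.

```python
import subprocess
import sys
from typing import Optional


def run_isolated(script: str, *args: str, timeout: Optional[float] = None) -> int:
    """Run `script` in a fresh Python interpreter and return its exit code.

    All memory held by the child, including its CUDA context, is released
    when the child process exits.
    """
    completed = subprocess.run([sys.executable, script, *args], timeout=timeout)
    return completed.returncode


def run_batch(script: str, inputs) -> dict:
    """Run one isolated job per input and collect the exit codes."""
    return {item: run_isolated(script, item) for item in inputs}
```

For example, run_batch("run_trellis.py", ["a.png", "b.png"]) would start two separate processes, assuming run_trellis.py reads the image path from sys.argv. The cost is reloading the pipeline for every job, which is the trade for a clean memory state.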