
TRELLIS image-large on RTX 5060 Ti: Image-to-3D Mesh Generation at the 16 GB Floor

3d · advanced · 16 GB+ VRAM · May 28, 2026
models
  • TRELLIS image-large
tools
  • PyTorch
  • Conda
prerequisites
  • NVIDIA RTX 5060 Ti 16 GB (Blackwell sm_120)
  • Linux (Ubuntu tested; Windows requires extra steps — see Issue #3)
  • CUDA Toolkit 12.8 (required for sm_120)
  • Conda + Python 3.10+

What You'll Build

A working install of Microsoft's TRELLIS image-large (a 1.2B-parameter, MIT-licensed image-to-3D mesh generator; arXiv:2412.01506) on an RTX 5060 Ti 16 GB, able to convert a single input image into a textured GLB mesh, a Gaussian-splat representation, or a radiance field. The 5060 Ti sits exactly at the model's officially stated 16 GB floor, so the recipe sticks to the default code path (no offloading, no quantization tricks). The Blackwell sm_120 build path mirrors the upstream community fix collected on Issue #243.

Hardware data: RTX 5060 Ti (16 GB VRAM) · canonical 16 GB minimum per TRELLIS README and Issue #5 (Microsoft collaborator JeffreyXiang) · See benchmark data

⚠️ Known issue: Stock TRELLIS fails on the RTX 5060 Ti with "NVIDIA GeForce RTX 5060 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation." The default setup.sh installs PyTorch 2.4.0 + CUDA 11.8 wheels that predate Blackwell, and several CUDA extensions (kaolin, xformers, diffoctreerast) must be rebuilt against PyTorch's cu128 wheel. The same fix applies across the Blackwell consumer lineup; see Issue #243 and Issue #343 (RTX 5060 Ti specifically).

ℹ️ Tight floor, no headroom. The RTX 5090 sibling recipe ran with ~16 GB of slack above the floor; the 5060 Ti is at the floor. The default code path fits in 16 GB, but texture baking at simplify=0 can OOM on detailed meshes (per Issue #31); see Troubleshooting for the mode='fast' and simplify=0.95 workarounds. For higher-VRAM cards, Microsoft also publishes TRELLIS.2-4B, a different model entirely with a 24 GB minimum.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 16 GB VRAM (per README, verified on A100 / A6000) | RTX 5060 Ti (16 GB) |
| RAM | 16 GB system RAM | |
| Storage | ~3.07 GB for model weights (HF tree API); ~20 GB total with conda env and CUDA extensions | |
| Software | CUDA 12.8, Conda, Python 3.10+, PyTorch ≥ 2.7.1 + cu128 | |

Installation

The default setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast from the TRELLIS README is hard-coded to PyTorch 2.4.0 + CUDA 11.8 and will fail on Blackwell. The steps below follow the community-tested Blackwell path documented in the maepopi fork README, which explicitly targets "RTX 5090 (or other Blackwell GPU)"; the sm_120 fix is shared across all Blackwell consumer cards, including the 5060 Ti. Caenorst's confirmation that kaolin v0.18.0 now supports CUDA 12.8 natively corroborates this path.

1. Verify CUDA 12.8 toolkit

nvcc --version
# Expected: release 12.8 or newer

If nvcc reports an older release, install CUDA Toolkit 12.8 before continuing.
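If you script this install, the version check can be automated. The sketch below uses hypothetical helper names (parse_nvcc_release, nvcc_ok, read_nvcc_version are illustrative, not part of TRELLIS) and assumes nvcc prints a line containing "release X.Y", as in the expected output above.

```python
import re
import shutil
import subprocess

MIN_CUDA = (12, 8)  # sm_120 needs CUDA Toolkit 12.8 or newer


def parse_nvcc_release(text: str) -> tuple[int, int]:
    """Extract (major, minor) from `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def nvcc_ok(text: str, minimum: tuple[int, int] = MIN_CUDA) -> bool:
    """True if the reported toolkit release is at least `minimum`."""
    return parse_nvcc_release(text) >= minimum


def read_nvcc_version() -> str:
    """Run `nvcc --version` and return its stdout (raises if nvcc is missing)."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        raise FileNotFoundError("nvcc is not on PATH")
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=True)
    return result.stdout
```

On a machine that still reports release 11.8, nvcc_ok(read_nvcc_version()) returns False, which is the cue to install CUDA Toolkit 12.8 before step 2.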

2. Clone the repo

git clone --recurse-submodules https://github.com/microsoft/TRELLIS.git
cd TRELLIS

3. Run partial setup (skip xformers / diffoctreerast / kaolin — we'll build those from source)

. ./setup.sh --new-env --basic --flash-attn --spconv --nvdiffrast

This creates a fresh trellis conda env, installs the basic Python dependencies, builds flash-attn, installs the spconv wheel matching the CUDA version setup.sh detects from the installed torch, and builds nvdiffrast. Activate the env if it is not already active: conda activate trellis.

4. Replace torch with cu128 wheel (sm_120 support)

Starting with the 2.7 release, PyTorch ships pre-built CUDA 12.8 wheels with native sm_120 support; this recipe pins 2.7.1 and its matching torchvision / torchaudio releases:

pip install torch==2.7.1+cu128 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128

This replaces the torch installed in step 3. CUDA extensions built in step 3 may need to be rebuilt against the new torch; if you hit an undefined symbol error later, rebuild the offending extension.

5. Build xformers from source

The PyPI xformers wheels lag behind PyTorch 2.7.1+cu128; build against your installed torch:

git clone --recurse-submodules https://github.com/facebookresearch/xformers.git
cd xformers
pip install -e .
cd ..

6. Build diffoctreerast from source

mkdir -p /tmp/extensions
git clone --recurse-submodules https://github.com/JeffreyXiang/diffoctreerast.git /tmp/extensions/diffoctreerast
pip install /tmp/extensions/diffoctreerast

7. Install kaolin v0.18.0+ (sm_120 support)

NVIDIA Kaolin maintainer Caenorst confirmed kaolin v0.18.0 supports the latest PyTorch / CUDA versions on Issue #243:

git clone https://github.com/NVIDIAGameWorks/kaolin
cd kaolin
export IGNORE_TORCH_VER=1
pip install "Cython >= 0.29.37"
pip install -e .
cd ..

If pip install -e . fails on a cuda_post_cflags / Unknown CUDA arch ("12.0") or GPU not supported error, ensure you're on kaolin master (v0.18.0+); older releases hard-coded the supported arch list.
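After the build, the version requirement can be checked programmatically. This hypothetical check assumes kaolin exposes its version as kaolin.__version__ and compares only the leading numeric release, ignoring suffixes such as .dev0 or +cu128.

```python
import re

KAOLIN_MIN = (0, 18, 0)  # first release this recipe relies on for sm_120


def parse_release(version: str) -> tuple[int, ...]:
    """Leading numeric part of a version string: '0.18.0.dev0' -> (0, 18, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def kaolin_new_enough(version: str, minimum: tuple[int, ...] = KAOLIN_MIN) -> bool:
    """Compare releases after zero-padding, so '0.18' counts as 0.18.0."""
    release = parse_release(version)
    width = max(len(release), len(minimum))
    release += (0,) * (width - len(release))
    padded_min = tuple(minimum) + (0,) * (width - len(minimum))
    return release >= padded_min
```

In the env, kaolin_new_enough(kaolin.__version__) returning False points back to the master-branch advice above.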

8. Install Gradio demo dependencies (optional but recommended)

. ./setup.sh --demo

9. Re-pin torchvision (the demo setup may downgrade it)

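pip uninstall -y torchvision
# 0.22.1 is the torchvision release built for torch 2.7.1; an unpinned install can pull a newer one that also upgrades torch
pip install torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128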

Running

Verify the install with the upstream example.py — it downloads the weights from HuggingFace on first run (~3.07 GB to ~/.cache/huggingface/hub/):

python example.py

You should get five files in the working directory:

  • sample_gs.mp4 — turntable video of the 3D Gaussian representation
  • sample_rf.mp4 — turntable of the radiance field
  • sample_mesh.mp4 — turntable of the normal-shaded mesh
  • sample.glb — textured GLB exportable to Blender / Unity / web viewers
  • sample.ply — raw 3D Gaussian point cloud

For the interactive Gradio demo:

python app.py

Then open the URL it prints (default http://127.0.0.1:7860). The demo lets you drop in a single image, runs the same TrellisImageTo3DPipeline.from_pretrained("microsoft/TRELLIS-image-large") pipeline, and previews the Gaussian / radiance / mesh outputs side-by-side.

Tightening texture baking for the 16 GB floor

The default postprocessing_utils.to_glb(...) call in example.py uses simplify=0.95 and texture_size=1024, which fits the 5060 Ti comfortably. If you call to_glb with simplify=0 (no mesh decimation) on a complex input, the texture-baking stage can OOM even on 24 GB cards (per PladsElsker on Issue #31). Keep simplify ≥ 0.9 on this card, and for very dense meshes set mode='fast' for the texture bake inside to_glb (0lento's workaround on Issue #31).
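To see why simplify=0 is so much heavier than the default, assume (as TRELLIS's decimation step appears to do) that simplify is the fraction of triangles removed, so the surviving count scales with 1 - simplify. This is a hypothetical back-of-envelope helper, not a VRAM model; bake memory does not necessarily scale linearly with triangle count.

```python
def faces_kept(n_faces: int, simplify: float) -> int:
    """Triangles left after decimation, treating `simplify` as the fraction removed."""
    if not 0.0 <= simplify < 1.0:
        raise ValueError("simplify must be in [0, 1)")
    return round(n_faces * (1.0 - simplify))


def relative_to_default(simplify: float, default: float = 0.95) -> float:
    """How many times more triangles survive than with the default simplify=0.95."""
    return (1.0 - simplify) / (1.0 - default)
```

For a raw mesh of 1,000,000 triangles, faces_kept gives 50,000 at the default 0.95, 100,000 at 0.9 and all 1,000,000 at 0, so simplify=0 hands the bake roughly 20 times the geometry of the default.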

Results

  • Speed: No RTX 5060 Ti–named TRELLIS measurement has been published. Once a community benchmark lands via /contribute, this section will pick it up. For now, see /check/trellis-image-large/rtx-5060-ti for the live data.
  • VRAM usage: The canonical TRELLIS README states "An NVIDIA GPU with at least 16GB of memory is necessary. The code has been verified on NVIDIA A100 and A6000 GPUs." — and Microsoft collaborator JeffreyXiang reiterated this in Issue #5: "Currently at least 16GB VRAM is required." The 5060 Ti is at the floor — the default code path fits, but with no headroom for simplify=0 texture bakes (see Troubleshooting).
  • Quality notes: TRELLIS image-large is a 1.2B-parameter SLAT (Structured LATent) flow model — see the arXiv paper for the architecture. It outputs three representations from one pass; the GLB export from postprocessing_utils.to_glb(...) is the most directly usable downstream artifact (drop into Blender, Three.js, or any GLTF-aware viewer).

For the full benchmark data, see /check/trellis-image-large/rtx-5060-ti.

Troubleshooting

NVIDIA GeForce RTX 5060 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation

The pre-built PyTorch 2.4.0 wheel that setup.sh --basic installs is compiled for CUDA 11.8 and predates Blackwell. The fix is step 4 above — install PyTorch 2.7.1+cu128 from the cu128 index. The canonical tracking thread is Issue #243, which collects working install paths from multiple contributors (maepopi, SanBingYouYong, zhizdev, Caenorst). RTX 5060 Ti is reported as a Blackwell target in Issue #343 where IgorAherne confirms his recompiled trellis-stable-projectorz build supports "5000 cards".

Unknown CUDA arch ("12.0") or GPU not supported

Reported by Polytoo on Issue #243. It fires when an extension is compiled with a torch.utils.cpp_extension (that is, a PyTorch install) that does not recognize compute capability 12.0, or when the extension's build script hard-codes an older arch list. Rebuild the offending extension after step 4, usually kaolin (step 7) or xformers (step 5), and make sure you're on the upstream master of each (kaolin v0.18.0+, latest xformers); older tagged releases predate Blackwell.

Texture-baking OOM at the 16 GB floor

The texture-bake stage in postprocessing_utils.to_glb(...) is the single largest VRAM consumer in the pipeline. On a 16 GB card with the default simplify=0.95 and texture_size=1024 the bake fits, but on detailed meshes with simplify=0 it can OOM (per Issue #31). Three remediations, in order:

  1. Keep simplify ≥ 0.9 (the default 0.95 is already safe).
  2. Set mode='fast' for the texture bake inside to_glb (0lento's diff).
  3. If calling the pipeline programmatically (not via app.py), del pipeline before invoking to_glb to free the SLAT decoder's VRAM for the bake stage (same comment).

If you still OOM, you have effectively outgrown the 16 GB floor — the 24 GB tier (RTX 4090 / 5090 sibling recipe) is the next stop.

flash_attn import fails after step 4 (undefined symbol: _ZN3c105ErrorC...)

Swapping in PyTorch 2.7.1+cu128 typically breaks ABI compatibility with the flash_attn that step 3 built against the old torch. Either rebuild flash_attn from source against the installed torch, or select the xformers backend before importing TRELLIS:

import os
os.environ['ATTN_BACKEND'] = 'xformers'  # before any TRELLIS import

TRELLIS supports both flash-attn and xformers attention backends — see the Minimal Example at the top of the upstream README. FlashAttention 2 itself does not currently ship sm_120 kernels — coverage tracked at Dao-AILab/flash-attention#2168. The xformers backend works on Blackwell.

GLIBCXX_3.4.30 not found at import time

conda install -c conda-forge libstdcxx-ng

The system libstdc++ shipped with older Ubuntu LTS releases can predate GLIBCXX_3.4.30, which some of the installed binaries require. The conda-forge package is the safe override.

Tremendous VRAM allocation request (Tried to allocate 196.89 GiB)

Issue #79: diffoctreerast can mis-size its allocation when given certain input image shapes (transparency, unusual aspect ratios). Pre-process input images to a square aspect ratio (the upstream app.py does this automatically; if calling pipeline.run directly, mirror its preprocessing).
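If you do prepare images yourself, the geometry of a centered square pad is simple. The sketch below only computes the canvas size and paste offset; it is a hypothetical helper that does not reproduce upstream's full preprocessing, and padding with transparent pixels is an assumption.

```python
def square_pad_box(width: int, height: int) -> tuple[int, int, int]:
    """Return (side, left, top) for centering a width x height image on a square canvas."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    side = max(width, height)
    return side, (side - width) // 2, (side - height) // 2
```

With Pillow, side, left, top = square_pad_box(*img.size), then canvas = Image.new("RGBA", (side, side), (0, 0, 0, 0)) followed by canvas.paste(img, (left, top)) produces the padded image.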

Windows install

Windows is documented as not fully tested by Microsoft — see Issue #3. For RTX 50-series on Windows, Issue #259 collects a 1-click installer from FurkanGozukara, and trellis-stable-projectorz v40 (recompiled for "5000 cards" per #343 comment) is the path of least resistance for first-class Blackwell support on Windows. The steps above target Linux.