TRELLIS image-large on RTX 5070 Ti: Image-to-3D Mesh Generation at the 16 GB Floor

What You'll Build

A working install of Microsoft's TRELLIS image-large (1.2B-parameter image-to-3D mesh generator, MIT-licensed, arXiv:2412.01506) on a 16 GB RTX 5070 Ti, capable of converting a single input image into a textured GLB mesh, a Gaussian-splat representation, or a radiance field. The 5070 Ti sits exactly at the model's officially stated 16 GB floor, so the recipe is structured around the default code path (no offloading, no quantization tricks). The Blackwell sm_120 build path here mirrors the upstream community fix collected on Issue #243.

ℹ️ Image-to-3D, not text-to-3D. TRELLIS image-large takes a single image as input and produces 3D representations (mesh / Gaussian splat / radiance field) — it does not generate 3D from a text prompt. It lives in our 3d vertical because the catalogue groups all 3D-asset generators together; the model card is explicit that the input is an image. Bring your own reference picture (or generate one with an image model first).

Hardware data: RTX 5070 Ti (16 GB GDDR7) · canonical 16 GB minimum per TRELLIS README and Issue #5 (Microsoft collaborator JeffreyXiang) · See benchmark data

⚠️ Known issue: Stock TRELLIS fails on the RTX 5070 Ti with the error `NVIDIA GeForce RTX 5070 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation`. The default setup.sh installs PyTorch 2.4.0 + CUDA 11.8 wheels that predate Blackwell, so PyTorch must be replaced with a cu128 wheel and the CUDA extensions (kaolin, xformers, diffoctreerast) built against it. The same fix path applies across the Blackwell consumer lineup; see Issue #243 and Issue #343.

ℹ️ Tight floor, no headroom. The RTX 5070 Ti has the same 16 GB GDDR7 envelope as the RTX 5080 and the 16 GB RTX 5060 Ti; all three sit at the model's floor. The 5070 Ti's compute and bandwidth determine how fast each pass runs, but they do not buy you VRAM headroom: the default code path fits in 16 GB, yet texture baking at simplify=0 can OOM on detailed meshes (per Issue #31). See Troubleshooting for the mode='fast' and simplify ≥ 0.9 workarounds. If you need more headroom, the RTX 5090 sibling recipe runs the same model with ~16 GB of slack above the floor, and the Microsoft team has TRELLIS.2-4B (a different model entirely, 24 GB minimum) for higher-VRAM cards.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (per README, verified on A100 / A6000)	RTX 5070 Ti (16 GB GDDR7, 256-bit, ~896 GB/s, 8960 CUDA cores, GB203 sm_120, 300 W)
RAM	16 GB system RAM	—
Storage	~3.30 GB for model weights (HF tree API); ~20 GB total with conda env and CUDA extensions	—
Software	CUDA 12.8, Conda, Python 3.10+, PyTorch ≥ 2.7.1 + cu128	—

Installation

The README's default command, setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast, is hard-coded to PyTorch 2.4.0 + CUDA 11.8 and will fail on Blackwell. The steps below follow the community-tested Blackwell path documented in the maepopi fork README. That README targets "RTX 5090 (or other Blackwell GPU)"; the sm_120 fix is shared across all Blackwell consumer cards, the RTX 5070 Ti included. The path is corroborated on Issue #243 by Caenorst (a contributor to NVIDIA's Kaolin repo), who confirmed that kaolin v0.18.0 supports current PyTorch / CUDA versions.

1. Verify CUDA 12.8 toolkit

nvcc --version
# Expected: release 12.8 or newer

If nvcc reports an older release, install CUDA Toolkit 12.8 before continuing.
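If you prefer to script the check, the same comparison can be done in a few lines of Python. This is a minimal sketch: the helper names nvcc_release and toolkit_ok are illustrative, and the parser assumes the usual "release X.Y" wording in the nvcc banner.

```python
import re
import subprocess


def nvcc_release(text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_ok(text, minimum=(12, 8)):
    """True when the reported toolkit release is at least `minimum`."""
    release = nvcc_release(text)
    return release is not None and release >= minimum


if __name__ == "__main__":
    try:
        banner = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        ).stdout
    except OSError:
        banner = ""  # nvcc is not on PATH
    print("CUDA toolkit OK" if toolkit_ok(banner) else "CUDA 12.8+ toolkit not found")
```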

2. Clone the repo

git clone --recurse-submodules https://github.com/microsoft/TRELLIS.git
cd TRELLIS

3. Run partial setup (skip xformers / diffoctreerast / kaolin — we'll build those from source)

. ./setup.sh --new-env --basic --spconv --nvdiffrast

This creates a fresh trellis conda env, installs basic Python dependencies, installs spconv-cu120, and builds nvdiffrast. We deliberately omit --flash-attn here because FlashAttention 2 does not yet ship sm_120 kernels (see Troubleshooting) — TRELLIS runs fine on the xformers backend on Blackwell. Activate the env if not already active: conda activate trellis.

4. Replace torch with cu128 wheel (sm_120 support)

PyTorch 2.7.1+ shipped pre-built CUDA 12.8 wheels with native sm_120 support:

pip install torch==2.7.1+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

This replaces the torch installed in step 3. CUDA extensions built in step 3 may need to be rebuilt against the new torch; if you hit an undefined symbol error later, rebuild the offending extension.
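One way to confirm the swap took effect is to ask PyTorch which GPU architectures its kernels were compiled for. The sketch below relies on torch.cuda.get_arch_list() and torch.cuda.get_device_capability(); the helper name arch_supported is illustrative and not part of PyTorch or TRELLIS.

```python
def arch_supported(arch_list, capability):
    """True when the (major, minor) capability has a matching sm_XY entry."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # lets the sketch run on a machine without PyTorch

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("device capability:", capability)
        print(
            "kernels for this GPU present:",
            arch_supported(torch.cuda.get_arch_list(), capability),
        )
    else:
        print("PyTorch with a visible CUDA device was not found")
```

On a 5070 Ti the capability should be reported as (12, 0); if the last line prints False, the old wheel is probably still the one being imported.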

5. Build xformers from source

The PyPI xformers wheels lag behind PyTorch 2.7.1+cu128; build against your installed torch — this is the attention backend TRELLIS will use on Blackwell:

git clone --recurse-submodules https://github.com/facebookresearch/xformers.git
cd xformers
pip install -e .
cd ..

6. Build diffoctreerast from source

mkdir -p /tmp/extensions
git clone --recurse-submodules https://github.com/JeffreyXiang/diffoctreerast.git /tmp/extensions/diffoctreerast
pip install /tmp/extensions/diffoctreerast

7. Install kaolin v0.18.0+ (sm_120 support)

Caenorst, an NVIDIAGameWorks/kaolin contributor, noted on Issue #243 that kaolin v0.18.0 supports current PyTorch / CUDA versions:

git clone --recursive https://github.com/NVIDIAGameWorks/kaolin
cd kaolin
export IGNORE_TORCH_VER=1
pip install "Cython >= 0.29.37"
pip install -e .
cd ..

If pip install -e . fails on a cuda_post_cflags / Unknown CUDA arch ("12.0") or GPU not supported error, ensure you're on kaolin master (v0.18.0+); older releases hard-coded the supported arch list.
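Once steps 5 to 7 have finished, a quick import pass shows which extensions actually load against the new torch. This is a minimal sketch: check_modules is an illustrative helper, and the module names in the list are the import names these packages are assumed to expose.

```python
import importlib


def check_modules(names):
    """Try to import each module and report 'ok' or the error that was raised."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = "ok"
        except Exception as exc:  # ImportError, or OSError from a mismatched .so
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Assumed import names for the packages installed in steps 3 to 7.
    wanted = ["torch", "xformers", "kaolin", "diffoctreerast", "spconv", "nvdiffrast"]
    for name, status in check_modules(wanted).items():
        print(f"{name:16s} {status}")
```

An undefined symbol message in this output usually points at an extension that was compiled against the previous torch and needs a rebuild.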

8. Install Gradio demo dependencies (optional but recommended)

. ./setup.sh --demo

9. Re-pin torchvision (the demo setup may downgrade it)

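pip uninstall -y torchvision
# torchvision 0.22.1 is the release paired with torch 2.7.1; an unpinned install can pull in a newer torch and break the extensions built above
pip install torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128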
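Because the demo dependencies can change installed packages, it is worth confirming afterwards that torch is still the cu128 build from step 4. The sketch below reads the installed versions through importlib.metadata; the helper names installed and torch_pin_ok are illustrative.

```python
from importlib import metadata


def installed(name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def torch_pin_ok(version, expected="2.7.1", cuda_tag="cu128"):
    """True when a version string such as '2.7.1+cu128' matches the step 4 pin."""
    if version is None:
        return False
    base, _, local = version.partition("+")
    return base == expected and local == cuda_tag


if __name__ == "__main__":
    print("torch      :", installed("torch"))
    print("torchvision:", installed("torchvision"))
    print("torch pin intact:", torch_pin_ok(installed("torch")))
```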

Running

Verify the install with the upstream example.py — it downloads the weights from HuggingFace on first run (~3.30 GB to ~/.cache/huggingface/hub/):

python example.py

You should get five files in the working directory:

sample_gs.mp4 — turntable video of the 3D Gaussian representation
sample_rf.mp4 — turntable of the radiance field
sample_mesh.mp4 — turntable of the normal-shaded mesh
sample.glb — textured GLB exportable to Blender / Unity / web viewers
sample.ply — raw 3D Gaussian point cloud
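To check a run without opening the directory by hand, the list above can be turned into a small script. This is a sketch only; missing_outputs is an illustrative helper, and the file names are the ones listed above.

```python
from pathlib import Path

# Output files written by example.py, as listed above.
EXPECTED = [
    "sample_gs.mp4",
    "sample_rf.mp4",
    "sample_mesh.mp4",
    "sample.glb",
    "sample.ply",
]


def missing_outputs(directory="."):
    """Return the expected output files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    print("all outputs present" if not missing else f"missing: {missing}")
```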

For the interactive Gradio demo:

python app.py

Then open the URL it prints (default http://127.0.0.1:7860). The demo lets you drop in a single image, runs the same TrellisImageTo3DPipeline.from_pretrained("microsoft/TRELLIS-image-large") pipeline, and previews the Gaussian / radiance / mesh outputs side-by-side.

Tightening texture baking for the 16 GB floor

The default postprocessing_utils.to_glb(...) call in example.py keeps simplify=0.95 and texture_size=1024, which fits within the 5070 Ti's 16 GB. If you call to_glb with simplify=0 (no mesh decimation) on a complex input, the texture-baking stage can OOM even on 24 GB cards (per PladsElsker on Issue #31). Keep simplify ≥ 0.9 on this card, and for very dense meshes switch the texture bake inside to_glb to mode='fast' (0lento's workaround on Issue #31).
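If you export from your own script, one way to avoid slipping below the safe range is to clamp the decimation ratio before calling to_glb. The sketch below is an illustration, not upstream code: bake_kwargs and export_glb are hypothetical helpers, the 0.9 floor is this recipe's recommendation, and the to_glb call mirrors the one in the upstream example (outputs['gaussian'][0], outputs['mesh'][0], simplify, texture_size).

```python
def bake_kwargs(simplify=0.95, texture_size=1024, min_simplify=0.9):
    """Return to_glb keyword arguments with simplify clamped to a 16 GB-safe floor."""
    if not 0.0 <= simplify < 1.0:
        raise ValueError("simplify must be in the range [0, 1)")
    return {"simplify": max(simplify, min_simplify), "texture_size": texture_size}


def export_glb(outputs, path="sample.glb", **overrides):
    """Bake and export a GLB from pipeline outputs (requires a TRELLIS install)."""
    # Imported lazily so the helper above can be used without TRELLIS present.
    from trellis.utils import postprocessing_utils

    glb = postprocessing_utils.to_glb(
        outputs["gaussian"][0],
        outputs["mesh"][0],
        **bake_kwargs(**overrides),
    )
    glb.export(path)
    return path
```

With this wrapper, export_glb(outputs, simplify=0.0) would still bake at 0.9 rather than attempting an undecimated mesh.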

Results

Speed: No TRELLIS measurement taken on an RTX 5070 Ti has been published. The 5070 Ti is built on the same Blackwell GB203 die as the RTX 5080 but with fewer enabled cores (8960 vs 10752 CUDA cores) and slightly lower memory bandwidth (~896 GB/s vs ~960 GB/s), so each pass should run somewhat slower than on an RTX 5080. Without a published 5070 Ti measurement, quoting a number here would be a guess. Once a community benchmark lands via /contribute, this section will pick it up.
VRAM usage: The canonical TRELLIS README states "An NVIDIA GPU with at least 16GB of memory is necessary. The code has been verified on NVIDIA A100 and A6000 GPUs." — and Microsoft collaborator JeffreyXiang reiterated this in Issue #5: "Currently at least 16GB VRAM is required." The 5070 Ti is at the floor — the default code path fits, but with no headroom for simplify=0 texture bakes (see Troubleshooting).
Quality notes: TRELLIS image-large is a 1.2B-parameter SLAT (Structured LATent) flow model — see the arXiv paper for the architecture. It outputs three representations from one pass; the GLB export from postprocessing_utils.to_glb(...) is the most directly usable downstream artifact (drop into Blender, Three.js, or any GLTF-aware viewer).

For the full benchmark data, see /check/trellis-image-large/rtx-5070-ti.

Troubleshooting

`NVIDIA GeForce RTX 5070 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation`

The pre-built PyTorch 2.4.0 wheel that setup.sh --basic installs is compiled for CUDA 11.8 and predates Blackwell. The fix is step 4 above — install PyTorch 2.7.1+cu128 from the cu128 index. The canonical tracking thread is Issue #243, which collects working install paths from multiple contributors (maepopi, SanBingYouYong, zhizdev, Caenorst). RTX 50-series Blackwell support is also tracked in Issue #343, where IgorAherne confirms his recompiled trellis-stable-projectorz build supports "5000 cards".

`Unknown CUDA arch ("12.0") or GPU not supported`

Reported by Polytoo on Issue #243. The error is raised by torch.utils.cpp_extension when an extension is compiled against a PyTorch build whose architecture list does not include compute capability 12.0. Confirm that step 4 took effect, then rebuild the offending extension: usually kaolin (step 7) or xformers (step 5). Make sure you're on the upstream master of each (kaolin v0.18.0+, xformers latest); older tagged releases predate Blackwell.
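When rebuilding, it can also help to state the target architecture explicitly instead of relying on auto-detection, for example when the build runs somewhere the GPU is not visible. TORCH_CUDA_ARCH_LIST is the environment variable PyTorch's extension builder reads; whether it resolves this particular error on your setup is an assumption, and rebuild_env is an illustrative helper.

```python
import os


def rebuild_env(base=None, arch="12.0"):
    """Copy an environment mapping and pin the CUDA arch used for extension builds."""
    env = dict(os.environ if base is None else base)
    env["TORCH_CUDA_ARCH_LIST"] = arch  # 12.0 corresponds to sm_120 (Blackwell consumer)
    return env


# Example use (not executed here): rebuild kaolin from its checkout with the
# pinned architecture.
#
#   import subprocess, sys
#   subprocess.run(
#       [sys.executable, "-m", "pip", "install", "-e", "."],
#       cwd="kaolin", env=rebuild_env(), check=True,
#   )
```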

Texture-baking OOM at the 16 GB floor

The texture-bake stage in postprocessing_utils.to_glb(...) is the single largest VRAM consumer in the pipeline. On a 16 GB card with the default simplify=0.95 and texture_size=1024 the bake fits, but on detailed meshes with simplify=0 it can OOM (per Issue #31). Three remediations, in order:

Keep simplify ≥ 0.9 (the default 0.95 is already safe).
Set mode='fast' for the texture bake inside to_glb (0lento's diff).
If calling the pipeline programmatically (not via app.py), del pipeline before invoking to_glb to free the SLAT decoder's VRAM for the bake stage (same comment).

If you still OOM, you have effectively outgrown the 16 GB floor; the 32 GB RTX 5090 sibling recipe is the next stop.
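For the third remediation, dropping the reference alone may not be enough, because PyTorch keeps freed blocks in its caching allocator. A minimal sketch, assuming you hold the pipeline in a variable named pipeline; free_vram is an illustrative helper built on gc.collect() and torch.cuda.empty_cache().

```python
import gc


def free_vram():
    """Collect garbage and release PyTorch's cached CUDA blocks.

    Returns True if the CUDA cache was emptied, False when PyTorch or a
    CUDA device is not available.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


# Intended order in a script (names follow the upstream example):
#
#   outputs = pipeline.run(image, seed=1)
#   del pipeline          # drop the last reference to the model weights
#   free_vram()           # hand the cached blocks back before baking
#   glb = postprocessing_utils.to_glb(outputs["gaussian"][0], outputs["mesh"][0])
```

The outputs themselves stay on the GPU, so this frees the model's memory, not the memory held by the generated representations.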

`flash_attn` import fails (`undefined symbol: _ZN3c105ErrorC...`)

This recipe skips --flash-attn in step 3 precisely because FlashAttention 2 does not currently ship sm_120 kernels — coverage is tracked at Dao-AILab/flash-attention#2168. If you installed flash-attn anyway and hit an ABI / undefined symbol error after pinning PyTorch 2.7.1+cu128, force the xformers backend before importing TRELLIS:

import os
os.environ['ATTN_BACKEND'] = 'xformers'  # before any TRELLIS import

TRELLIS supports both flash-attn and xformers attention backends — see the Minimal Example at the top of the upstream README. The xformers backend (step 5) works on Blackwell.
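If the same script has to run on machines with and without a working flash-attn, the switch can be made conditional. The sketch below tries the import and falls back to xformers only when it fails; ensure_attn_backend is an illustrative helper, and it must run before any TRELLIS import, as noted above.

```python
import importlib
import os


def ensure_attn_backend(module="flash_attn"):
    """Force the xformers backend when `module` cannot be imported.

    An ABI mismatch surfaces as ImportError (undefined symbol), so a real
    import attempt is used instead of only checking that the package exists.
    Returns the resulting ATTN_BACKEND value, or None if it was left unset.
    """
    try:
        importlib.import_module(module)
    except (ImportError, OSError):
        os.environ["ATTN_BACKEND"] = "xformers"
    return os.environ.get("ATTN_BACKEND")


if __name__ == "__main__":
    print("ATTN_BACKEND =", ensure_attn_backend())
```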

`GLIBCXX_3.4.30 not found` at import time

conda install -c conda-forge libstdcxx-ng

The system libstdc++ shipped with older Ubuntu LTS lags the version Caffe2 / PyTorch needs. The conda-forge package is the safe override.
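To see which libstdc++ actually carries the symbol version, you can scan the library files for the version string. This is a rough sketch: version_in_blob is an illustrative helper, and both candidate paths (the active environment's lib directory and the usual Ubuntu x86_64 location) are assumptions about your layout.

```python
import sys
from pathlib import Path


def version_in_blob(data, version="GLIBCXX_3.4.30"):
    """True when the symbol-version string occurs in the library's bytes."""
    return version.encode() in data


if __name__ == "__main__":
    # Assumed locations; adjust for your distribution and environment.
    candidates = [
        Path(sys.prefix) / "lib" / "libstdc++.so.6",
        Path("/usr/lib/x86_64-linux-gnu/libstdc++.so.6"),
    ]
    for path in candidates:
        if path.is_file():
            print(path, "->", version_in_blob(path.read_bytes()))
        else:
            print(path, "-> not found")
```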

Absurdly large VRAM allocation request (`Tried to allocate 196.89 GiB`)

Issue #79 — diffoctreerast can mis-size its allocation when given certain input image shapes (transparency / unusual aspect ratios). Pre-process input images to a square aspect ratio (the upstream app.py does this automatically; if calling pipeline.run directly, mirror its preprocessing).
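If you need to square images yourself, padding is one simple option. The sketch below centres the image on a transparent square canvas with Pillow; square_layout and pad_to_square are illustrative helpers, and padding with transparency (rather than cropping or a solid colour) is an assumption, not the upstream preprocessing.

```python
def square_layout(width, height):
    """Return (side, left, top) for centring a width x height image on a square."""
    side = max(width, height)
    return side, (side - width) // 2, (side - height) // 2


def pad_to_square(src, dst):
    """Write a square RGBA copy of `src` to `dst` (use a .png path for `dst`)."""
    from PIL import Image  # Pillow, imported lazily

    image = Image.open(src).convert("RGBA")
    side, left, top = square_layout(*image.size)
    canvas = Image.new("RGBA", (side, side), (0, 0, 0, 0))  # fully transparent
    canvas.paste(image, (left, top))
    canvas.save(dst)
    return side
```

For a 400 x 300 input this produces a 400 x 400 canvas with the image offset 50 pixels from the top.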

Windows install

Windows is documented as not fully tested by Microsoft — see Issue #3. For RTX 50-series on Windows, Issue #259 collects a full tutorial with pre-compiled libraries from FurkanGozukara, and trellis-stable-projectorz v40 (recompiled for "5000 cards" per #343 comment) is the path of least resistance for first-class Blackwell support on Windows. The steps above target Linux.