TRELLIS image-large on RTX 4070 Ti SUPER: Image-to-3D Mesh Generation at the 16 GB Floor

What You'll Build

A working install of Microsoft's TRELLIS image-large (a 1.2B-parameter image-to-3D model, MIT-licensed, arXiv:2412.01506) on an RTX 4070 Ti SUPER 16 GB, capable of converting a single input image into a textured GLB mesh, a 3D Gaussian-splat representation, or a radiance field. The 4070 Ti SUPER carries the same 16 GB GDDR6X envelope as the 4080 and sits exactly at the model's officially stated 16 GB floor, so the recipe is structured around the default code path (no offloading, no quantization tricks).

ℹ️ Image-to-3D, not text-to-3D. TRELLIS image-large takes a single image as input and produces 3D representations (mesh / Gaussian splat / radiance field); it does not generate 3D from a text prompt. It lives in our 3D vertical because the catalogue groups all 3D-asset generators together; the model card is explicit that the input is an image. Bring your own reference picture (or generate one with an image model first).

Hardware data: RTX 4070 Ti SUPER (16 GB VRAM) · canonical 16 GB minimum per the TRELLIS README (verified by Microsoft on A100 / A6000) · See benchmark data

✅ No Blackwell build gymnastics needed on Ada. On the RTX 50-series (Blackwell, sm_120), stock TRELLIS crashes with a "CUDA capability sm_120 is not compatible" error, and every CUDA submodule must be rebuilt against a cu128 wheel. The RTX 4070 Ti SUPER is Ada Lovelace (sm_89): the default PyTorch wheels that setup.sh installs already ship sm_89 kernels, and TRELLIS's custom CUDA ops (flash-attn, spconv, nvdiffrast, diffoctreerast, kaolin) all build for sm_89. The canonical one-line install from the upstream README runs as-is, and this recipe follows it verbatim.

⚠️ Tight floor, no headroom. The RTX 4070 Ti SUPER's 16 GB GDDR6X envelope is exactly the model's stated floor. The default code path fits in 16 GB, but texture baking at simplify=0 on a detailed mesh can OOM — even on 24 GB cards (per Issue #31). Keep simplify at its default 0.95 and see Troubleshooting for the mode='fast' workaround.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (per README, verified on A100 / A6000)	RTX 4070 Ti SUPER (16 GB)
RAM	16 GB system RAM	—
Storage	~3.30 GB for model weights (HF tree API); ~20 GB total with the conda env and compiled CUDA extensions	—
Software	CUDA 11.8 or 12.2, Conda, Python 3.8+, PyTorch (default `setup.sh` wheel)	—

Installation

On the RTX 4070 Ti SUPER (Ada, sm_89), the canonical install path from the TRELLIS README works without modification. The default PyTorch wheel installed by setup.sh --basic includes sm_89 kernels, and every CUDA submodule TRELLIS builds (xformers, flash-attn, diffoctreerast, spconv, mipgaussian, kaolin, nvdiffrast) compiles cleanly for sm_89 — so you do not need the cu128 wheel swap or source-build steps that the Blackwell siblings require.

1. Verify your CUDA toolkit

nvcc --version
# Expected: release 11.8 or 12.2 (needed to compile the CUDA submodules)
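
If you would rather check the toolkit version from a script, the following is a minimal sketch. It assumes nvcc is on PATH and that its output contains a fragment of the form "release X.Y"; the names parse_nvcc_release, installed_cuda_release and SUPPORTED_RELEASES are illustrative and not part of TRELLIS.

```python
import re
import shutil
import subprocess

# Toolkit releases this recipe names as expected (see the comment in step 1).
SUPPORTED_RELEASES = {"11.8", "12.2"}


def parse_nvcc_release(nvcc_output: str):
    """Return the 'X.Y' release string from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run nvcc and parse its release; None if nvcc is missing or unparsable."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release is None:
        print("nvcc not found on PATH (or its output could not be parsed)")
    elif release in SUPPORTED_RELEASES:
        print(f"CUDA toolkit {release}: matches the versions this recipe expects")
    else:
        print(f"CUDA toolkit {release}: not 11.8 or 12.2, submodule builds may fail")
```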

2. Clone the repo (with submodules)

git clone --recurse-submodules https://github.com/microsoft/TRELLIS.git
cd TRELLIS

3. Run the canonical setup

This is the exact one-liner from the upstream README. It creates a fresh trellis conda env, installs the basic Python dependencies, and builds all the CUDA extensions:

. ./setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast

On Ada (sm_89) every one of these flags builds successfully against the default torch wheel — including --flash-attn, which is unavailable on Blackwell (see Troubleshooting). Activate the env if it is not already active: conda activate trellis.
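
Once the env is active, a quick way to confirm that PyTorch sees the card as sm_89 is sketched below. It assumes PyTorch is importable in the trellis env; the ARCH_BY_CAPABILITY table and the describe_capability helper are illustrative names, and the table only covers the GPUs this recipe mentions.

```python
# Compute capability -> architecture name, for the GPUs named in this recipe.
ARCH_BY_CAPABILITY = {
    (8, 0): "Ampere (A100)",
    (8, 6): "Ampere (RTX 30-series, A6000)",
    (8, 9): "Ada Lovelace (RTX 40-series)",
    (12, 0): "Blackwell (RTX 50-series)",
}


def describe_capability(capability):
    """Format a (major, minor) tuple as 'sm_XY' plus a known architecture name."""
    major, minor = capability
    name = ARCH_BY_CAPABILITY.get((major, minor), "unknown architecture")
    return f"sm_{major}{minor}: {name}"


def main():
    try:
        import torch  # installed into the trellis env by setup.sh
    except ImportError:
        print("PyTorch is not importable; is the trellis env active?")
        return
    if not torch.cuda.is_available():
        print("PyTorch cannot see a CUDA device")
        return
    print(torch.cuda.get_device_name(0))
    print(describe_capability(torch.cuda.get_device_capability(0)))
    total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"{total_gib:.1f} GiB total VRAM")


if __name__ == "__main__":
    main()
```

On a 4070 Ti SUPER the capability line should read sm_89; anything else suggests the script is running against a different device or environment.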

4. (Optional) Install the Gradio demo dependencies

. ./setup.sh --demo

Running

Verify the install with the upstream example.py, which downloads the model weights from Hugging Face on first run (to ~/.cache/huggingface/hub/):

python example.py

You should get five files in the working directory:

sample_gs.mp4 — turntable video of the 3D Gaussian representation
sample_rf.mp4 — turntable of the radiance field
sample_mesh.mp4 — turntable of the normal-shaded mesh
sample.glb — textured GLB exportable to Blender / Unity / web viewers
sample.ply — raw 3D Gaussian point cloud
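
To check the five files above from a script rather than by eye, a small sketch follows. The file names come from the list above; the helper name missing_outputs is illustrative, and treating a zero-byte file as missing is an assumption about what counts as a failed run.

```python
from pathlib import Path

# The five artifacts example.py is expected to write (names from the list above).
EXPECTED_OUTPUTS = (
    "sample_gs.mp4",
    "sample_rf.mp4",
    "sample_mesh.mp4",
    "sample.glb",
    "sample.ply",
)


def missing_outputs(directory="."):
    """Return the expected files that are absent or empty in `directory`."""
    root = Path(directory)
    return [
        name
        for name in EXPECTED_OUTPUTS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All five example outputs are present")
```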

For the interactive Gradio demo:

python app.py

Then open the URL it prints (default http://127.0.0.1:7860). The demo lets you drop in a single image, runs the TrellisImageTo3DPipeline.from_pretrained("microsoft/TRELLIS-image-large") pipeline, and previews the Gaussian / radiance / mesh outputs side-by-side.

Tightening texture baking for the 16 GB floor

The default postprocessing_utils.to_glb(...) call in example.py keeps simplify=0.95 and texture_size=1024, which fits within the 4070 Ti SUPER's 16 GB. If you call to_glb yourself with simplify=0 (no mesh decimation) on a complex input, the texture-baking stage can OOM even on larger cards: community user PladsElsker reports it happening "even on 24GB cards" for large meshes in Issue #31. Keep simplify at 0.95 on this card, and for very dense meshes set mode='fast' in to_glb (see Troubleshooting).

Results

Speed: No TRELLIS image-large timing measured on an RTX 4070 Ti SUPER has been published, and the backend has no benchmark for this pair yet (/check/trellis-image-large/rtx-4070-ti-super returns verdict: unknown). TRELLIS publishes no GPU-specific timing in its README, and the only public figure for the original model is a ~30-second-per-image time reported on an RTX 3090. That card is a different architecture (Ampere, sm_86) and its number does not transfer to Ada, so it should not be read as an estimate for this card. If you run TRELLIS on a 4070 Ti SUPER, please submit your timing via /contribute and this section will pick it up.
VRAM usage: The canonical TRELLIS README states that an NVIDIA GPU with at least 16 GB of memory is necessary, and that the code has been verified on NVIDIA A100 and A6000 GPUs. Microsoft collaborator JeffreyXiang reiterated this in Issue #5: "Currently at least 16GB VRAM is required." The RTX 4070 Ti SUPER sits exactly at this floor — the default code path fits, but with no headroom for simplify=0 texture bakes (see Troubleshooting).
Quality notes: TRELLIS image-large is a 1.2B-parameter SLAT (Structured LATent) flow model — see the arXiv paper for the architecture. The weights ship in fp16 (the *_fp16.safetensors checkpoints on the HF tree); there is no separate FP8 build to install, so the install path is the same on Ada as on any supported card. It outputs three representations from a single pass; the GLB export from postprocessing_utils.to_glb(...) is the most directly usable downstream artifact (drop it into Blender, Three.js, or any GLTF-aware viewer).

For the full benchmark data, see /check/trellis-image-large/rtx-4070-ti-super.
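
If you plan to submit a timing, one way to capture wall-clock time and peak PyTorch-allocated VRAM around a call is sketched below. The timed helper is a hypothetical name, not part of TRELLIS. Note that torch.cuda.max_memory_allocated() only tracks memory obtained through PyTorch's allocator, so allocations made directly by custom CUDA extensions may not be counted.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds, peak_vram_gib or None)."""
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # do not charge earlier queued work to this call

    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    peak = torch.cuda.max_memory_allocated() / 2**30 if cuda else None
    return result, elapsed, peak


# Hypothetical usage with an already-loaded pipeline and a PIL image:
#   outputs, seconds, peak_gib = timed(pipeline.run, image, seed=1)
```

The first run also includes model download and kernel warm-up, so a second or third run is likely to be a fairer number to report.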

Troubleshooting

Texture-baking OOM at the 16 GB floor

The texture-bake stage in postprocessing_utils.to_glb(...) is the single largest VRAM consumer in the pipeline. On a 16 GB card with the default simplify=0.95 and texture_size=1024 the bake fits, but on detailed meshes with simplify=0 it can OOM — a community user, PladsElsker, notes this happens "even on 24GB cards" for large meshes in Issue #31. Three remediations, in order:

Keep simplify at 0.95 (the default is already safe).
Set mode='fast' in the to_glb call — community user 0lento posted this one-line diff in Issue #31, noting "It's going to consume way less VRAM and is many times faster."
If calling the pipeline programmatically (not via app.py), del pipeline before invoking to_glb to free the SLAT decoder's VRAM for the bake stage (same comment).

If you still OOM, you have effectively outgrown the 16 GB floor — a 24 GB card is the next stop.
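
The first and third remediations can be combined in a programmatic run, sketched below. The pipeline calls mirror the upstream example.py as referenced in this recipe (from_pretrained, run, to_glb with simplify=0.95 and texture_size=1024); the helper names release_cuda_cache and image_to_glb are illustrative, and adding torch.cuda.empty_cache() after del pipeline is an assumption on top of the Issue #31 suggestion, not something that comment prescribes.

```python
import gc


def release_cuda_cache() -> bool:
    """Run Python GC, then ask PyTorch to release its unused cached CUDA memory.

    Returns True if the CUDA cache was emptied, False if PyTorch or CUDA
    is unavailable.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


def image_to_glb(image_path, out_path="sample.glb"):
    """Generate a GLB from one image, dropping the pipeline before the bake."""
    from PIL import Image
    from trellis.pipelines import TrellisImageTo3DPipeline
    from trellis.utils import postprocessing_utils

    pipeline = TrellisImageTo3DPipeline.from_pretrained("microsoft/TRELLIS-image-large")
    pipeline.cuda()
    outputs = pipeline.run(Image.open(image_path), seed=1)

    # Drop the only reference to the pipeline so its weights can be freed
    # before the texture bake, the stage most likely to OOM on 16 GB.
    del pipeline
    release_cuda_cache()

    glb = postprocessing_utils.to_glb(
        outputs["gaussian"][0],
        outputs["mesh"][0],
        simplify=0.95,      # keep the default decimation ratio
        texture_size=1024,  # default texture resolution
    )
    glb.export(out_path)
    return out_path
```

The del only helps when no other variable still refers to the pipeline, which is why the sketch keeps the whole run inside one function.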

Choosing the attention backend

TRELLIS supports both flash-attn (the default) and xformers attention backends, selectable via the ATTN_BACKEND environment variable. On the RTX 4070 Ti SUPER (Ada, sm_89) FlashAttention 2 builds and runs natively — unlike the RTX 50-series (Blackwell, sm_120), where FA2 does not yet ship kernels and the xformers backend is required. You can leave the default in place. If you ever want to force xformers anyway, set it before importing TRELLIS:

import os
os.environ['ATTN_BACKEND'] = 'xformers'  # before any TRELLIS import
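
If the same script has to run in environments where flash-attn may not have been built, one option is to switch to xformers only when the flash_attn package cannot be found. This is a sketch, not upstream behaviour: configure_attention and flash_attn_installed are hypothetical helpers, and the only ATTN_BACKEND value it ever writes is the 'xformers' value shown above.

```python
import importlib.util
import os


def flash_attn_installed() -> bool:
    """True if the `flash_attn` package can be located in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def configure_attention(env=None, have_flash=None):
    """Set ATTN_BACKEND to 'xformers' only when flash-attn is missing.

    Returns the resulting ATTN_BACKEND value, or None if it was left unset
    (in which case TRELLIS uses its default backend).
    """
    env = os.environ if env is None else env
    if have_flash is None:
        have_flash = flash_attn_installed()
    if not have_flash:
        env["ATTN_BACKEND"] = "xformers"
    return env.get("ATTN_BACKEND")


if __name__ == "__main__":
    # Must run before any TRELLIS import, as noted above.
    print("ATTN_BACKEND:", configure_attention())
```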

`GLIBCXX_3.4.30 not found` at import time

conda install -c conda-forge libstdcxx-ng

The system libstdc++ shipped with older Ubuntu LTS lags the version PyTorch needs. The conda-forge package is the safe override.
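
To see which GLIBCXX versions a given libstdc++ actually provides, before or after the override, the version tags embedded in the library can be scanned. The sketch below assumes the conda env keeps its copy at lib/libstdc++.so.6 under sys.prefix, which is the usual layout but is not guaranteed; glibcxx_versions and provides are illustrative names.

```python
import re
import sys
from pathlib import Path

# Matches version tags such as GLIBCXX_3.4 or GLIBCXX_3.4.30 in the binary.
GLIBCXX_TAG = re.compile(rb"GLIBCXX_(\d+)\.(\d+)(?:\.(\d+))?")


def glibcxx_versions(blob: bytes):
    """Return the sorted, de-duplicated GLIBCXX version tuples found in `blob`."""
    found = set()
    for match in GLIBCXX_TAG.finditer(blob):
        # A missing patch component (e.g. GLIBCXX_3.4) is recorded as 0.
        found.add(tuple(int(part) if part else 0 for part in match.groups()))
    return sorted(found)


def provides(blob: bytes, required=(3, 4, 30)) -> bool:
    """True if the library bytes advertise the required GLIBCXX version."""
    return required in glibcxx_versions(blob)


if __name__ == "__main__":
    lib = Path(sys.prefix) / "lib" / "libstdc++.so.6"  # assumed conda layout
    if lib.exists():
        ok = provides(lib.read_bytes())
        print(f"{lib}: GLIBCXX_3.4.30 {'present' if ok else 'MISSING'}")
    else:
        print(f"{lib} not found; this env may be using the system libstdc++")
```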

Tremendous VRAM allocation request (`Tried to allocate ... GiB`)

Per Issue #79, diffoctreerast can mis-size its allocation when given certain input image shapes (transparency or unusual aspect ratios). Pre-process input images to a square aspect ratio: the upstream app.py does this automatically, and if you call pipeline.run directly you should mirror its preprocessing.
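
As a rough illustration of squaring an input before handing it to pipeline.run, the sketch below centres the image on a white square canvas and flattens any transparency. It is not the upstream preprocessing, and whether it avoids the Issue #79 allocation has not been verified here; square_padding and pad_to_square are hypothetical helpers, and the white background is an arbitrary choice.

```python
def square_padding(width, height):
    """Return (side, left, top): square canvas size and centring paste offset."""
    side = max(width, height)
    return side, (side - width) // 2, (side - height) // 2


def pad_to_square(src_path, dst_path):
    """Centre the image on a white square canvas and save it as RGB."""
    from PIL import Image  # Pillow, the same library example.py uses to load images

    image = Image.open(src_path).convert("RGBA")
    side, left, top = square_padding(*image.size)
    canvas = Image.new("RGB", (side, side), (255, 255, 255))
    # Passing the RGBA image as the mask uses its alpha band, so transparent
    # pixels show the white canvas instead of their hidden colour values.
    canvas.paste(image, (left, top), mask=image)
    canvas.save(dst_path)
    return dst_path
```

For example, a 640x480 input becomes a 640x640 image with 80 pixels of padding above and below.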

Windows install

Windows is documented as not fully tested by Microsoft — see Issue #3. The steps above target Linux. Report other problems via the submission form.