How much VRAM does TRELLIS image-large need?

About 16 GB — the minimum this recipe targets.

How hard is this setup?

Advanced. Follow the installation steps below.

TRELLIS image-large on RX 7800 XT: The First AMD ROCm Image-to-3D Mesh Path

What You'll Build

A working ROCm install of Microsoft's TRELLIS image-large (1.2B-parameter image-to-3D mesh generator, MIT-licensed, arXiv:2412.01506) on an AMD Radeon RX 7800 XT 16 GB — converting a single input image into a textured GLB mesh, a Gaussian-splat representation (.ply), or a radiance field. This is the first documented end-to-end AMD/ROCm path for TRELLIS image-large: it runs on the CalebisGross/TRELLIS-AMD fork, which replaces TRELLIS's two CUDA-only custom ops with AMD-compatible equivalents.

ℹ️ Image-to-3D, not text-to-3D. TRELLIS image-large takes a single image as input and produces 3D representations (mesh / Gaussian splat / radiance field) — it does not generate 3D from a text prompt. It lives in our 3d vertical because the catalogue groups all 3D-asset generators together; the model card is explicit that the input is an image. Bring your own reference picture (or generate one with an image model first).

⚠️ Why a fork, not the canonical repo. Stock TRELLIS depends on two CUDA-only kernels with no upstream ROCm build: spconv (sparse 3D convolution) and nvdiffrast (differentiable rasterization). The canonical TRELLIS README installs them via --spconv --nvdiffrast and states "An NVIDIA GPU with at least 16GB of memory is necessary." On RDNA3 there is no drop-in for either. The CalebisGross/TRELLIS-AMD fork substitutes torchsparse (built for the HIP backend) for spconv and a custom nvdiffrast-hip rasterizer for nvdiffrast — see "How the fork makes this work" below.

Hardware data: RX 7800 XT (16 GB GDDR6, gfx1101) · tested end-to-end on this exact card per the CalebisGross/TRELLIS-AMD README · See benchmark data

⚠️ Quality caveat (read before you rely on output): the AMD HIP coarse rasterizer carries an unresolved correctness defect the fork documents as ~7% silent triangle culls per frame during mesh rendering. The fork's author reports that "visual fidelity holds (mesh.mp4 fully recognizable across 300 frames)", but the underlying invariant violation is open. For visualization and prototyping this is fine; for production-grade meshes, validate critical geometry. Details in Troubleshooting below and in experiments/raster/findings.md.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM, RDNA3 (gfx1101)	RX 7800 XT (16 GB)
ROCm	6.4+	ROCm 6.4.2 (guide) / 7.2.1 (README)
PyTorch	ROCm build (torch+rocm)	2.9.1+rocm6.4 (guide) / 2.10.0+rocm7.0 (README)
RAM	16 GB+ (idle submodels offload to CPU between phases)	—
Storage	~3.3 GB model weights + deps	3.30 GB ckpts on disk
Software	Linux, Python 3.10+, libsparsehash-dev	Ubuntu/Debian tested

ℹ️ Version note. The fork's README reports its latest tested stack as ROCm 7.2.1 / torch 2.10.0+rocm7.0, while AMD_GPU_GUIDE.md and the bundled install_amd.sh pin ROCm 6.4.2 / torch 2.9.1+rocm6.4. Both configurations are documented as working on the RX 7800 XT. The install script below installs the ROCm 6.4 wheel; if you already run a newer ROCm, it detects an existing ROCm-enabled torch and skips the reinstall.
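The installer's own detection logic is not reproduced here, but a check of the same kind can be sketched in a few lines. The helper names below (parse_rocm_torch_version, meets_minimum) are hypothetical and not part of the fork. The only assumption is that PyTorch's ROCm wheels carry a +rocmX.Y suffix in their version string, as in the two versions quoted above.

```python
import re
from typing import Optional, Tuple

# Matches version strings such as "2.9.1+rocm6.4" or "2.10.0+rocm7.0".
_ROCM_TAG = re.compile(r"^(?P<torch>[0-9][^+]*)\+rocm(?P<major>\d+)\.(?P<minor>\d+)")


def parse_rocm_torch_version(version: str) -> Optional[Tuple[str, Tuple[int, int]]]:
    """Return (torch_version, (rocm_major, rocm_minor)), or None for non-ROCm builds."""
    match = _ROCM_TAG.match(version)
    if match is None:
        return None
    return match.group("torch"), (int(match.group("major")), int(match.group("minor")))


def meets_minimum(version: str, minimum: Tuple[int, int] = (6, 4)) -> bool:
    """True when the wheel is a ROCm build at or above the guide's ROCm 6.4 floor."""
    parsed = parse_rocm_torch_version(version)
    return parsed is not None and parsed[1] >= minimum
```

In a real environment you would pass torch.__version__ to these helpers. A non-None torch.version.hip is another indication of a ROCm build.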

Installation

1. Install the torchsparse build dependency

torchsparse needs libsparsehash headers at build time:

# Ubuntu/Debian
sudo apt-get install libsparsehash-dev
# Fedora:    sudo dnf install sparsehash-devel
# Arch:      sudo pacman -S google-sparsehash

2. Clone the fork and run the AMD installer

git clone https://github.com/CalebisGross/TRELLIS-AMD
cd TRELLIS-AMD
chmod +x install_amd.sh
./install_amd.sh

install_amd.sh creates a .venv, installs the ROCm PyTorch wheel (--index-url https://download.pytorch.org/whl/rocm6.4), then builds the three AMD-compatible extensions in order:

nvdiffrast-hip — pip install . --no-build-isolation from extensions/nvdiffrast-hip (the AMD-safe coarse/fine rasterizer).
diff-gaussian-rasterization — a manual HIP build via extensions/diff-gaussian-rasterization/build_hip.sh.
torchsparse — built for the HIP GPU backend with PYTORCH_ROCM_ARCH=gfx1100 ROCM_HOME=/opt/rocm FORCE_CUDA=1 pip install . --no-build-isolation.

ℹ️ The installer's FORCE_CUDA=1 is not a CUDA flag here — on a ROCm PyTorch build it tells the extension to compile the GPU (HIP) backend rather than a CPU-only build. This is the standard torchsparse-on-ROCm idiom.

Running

Activate the environment and launch the Gradio app with the AMD backend flags set:

source .venv/bin/activate
ATTN_BACKEND=sdpa XFORMERS_DISABLED=1 SPARSE_BACKEND=torchsparse \
  TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python app.py

Then open http://localhost:7860. The four environment flags are load-bearing on RDNA3:

ATTN_BACKEND=sdpa — use PyTorch SDPA attention (RDNA3 has no FlashAttention-2 prebuilt; SDPA is the supported path).
XFORMERS_DISABLED=1 — xformers is CUDA-oriented; disable it.
SPARSE_BACKEND=torchsparse — route TRELLIS's sparse convolutions through torchsparse instead of spconv. This is the spconv substitution; the model's trellis/modules/sparse backend selector reads this env var.
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 — enables the AOTriton attention kernels on ROCm.
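If you would rather not retype the flags, the same launch can be wrapped in a small Python script. This is a minimal sketch, not something the fork ships: build_env and launch are hypothetical names, and the script assumes it is run from the repository root with the .venv already activated.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional

# The four flags from the launch command above.
AMD_FLAGS: Dict[str, str] = {
    "ATTN_BACKEND": "sdpa",
    "XFORMERS_DISABLED": "1",
    "SPARSE_BACKEND": "torchsparse",
    "TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL": "1",
}


def build_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the given environment (default: os.environ) and apply the AMD flags."""
    env = dict(os.environ if base is None else base)
    env.update(AMD_FLAGS)
    return env


def launch(script: str = "app.py", extra_args: Optional[List[str]] = None) -> int:
    """Run the script with the current interpreter and return its exit code."""
    cmd = [sys.executable, script, *(extra_args or [])]
    return subprocess.run(cmd, env=build_env()).returncode
```

Calling launch() starts the Gradio app; launch("example.py") would start the scripted run. Launching a child process keeps the flags out of your interactive shell and ensures they are set before any library reads them.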

For a scripted (non-UI) run, example.py performs the same pipeline and is the file that implements the 16 GB-fit staging (see below). GLB export runs through five console-logged steps (mesh postprocess → UV → multiview render → texture bake → finalize).

Results

Speed: Per the fork's README timing table on the RX 7800 XT: 3D generation takes ~45 s (12 diffusion steps), Gaussian splatting runs at 145+ it/s (~30 s), mesh extraction takes ~60 s, and GLB export with textures takes 5–10 minutes. Texture baking is the heavy stage: the HIP rasterizer is serialized, so it is far slower than on an NVIDIA card. Our backend has no first-party benchmark for this pair yet (see /check/trellis-image-large/rx-7800-xt). If you run it, please contribute your numbers so the next reader gets a measured datapoint.
VRAM usage: Fits the RX 7800 XT's 16 GB. The default TRELLIS footprint exceeds 16 GB, so example.py splits the run into staged phases and moves idle submodels to CPU between phases; the _move_unused_models_to_cpu helper is documented in-code as required on 16 GB cards where the full pipeline footprint exceeds VRAM. The on-disk weights are 3.30 GB (HF Files). The runtime peak is a different figure: it is the per-phase footprint, which the staging keeps under 16 GB.
Quality notes: The ~7% silent-triangle-cull defect (above) is the main quality caveat. Hole-filling also uses 100 Hammersley views instead of upstream's 1000 (the HIP rasterizer hangs on degenerate pole views; the author reports hole detection "visually indistinguishable"). Mesh quality is otherwise the standard TRELLIS image-large output.
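The staging idea behind the 16 GB fit can be illustrated with a short sketch. This is not the fork's _move_unused_models_to_cpu, whose signature is not shown in this guide: stage_models, FakeModel, and the submodel names are all hypothetical. The sketch assumes only that each submodel exposes .to(device) the way torch.nn.Module does, and that the GPU device string is "cuda", which is also what ROCm builds of PyTorch use.

```python
from typing import Dict, Iterable


class FakeModel:
    """Stand-in for a torch.nn.Module: records which device it was moved to."""

    def __init__(self) -> None:
        self.device = "cpu"

    def to(self, device: str) -> "FakeModel":
        self.device = device
        return self


def stage_models(models: Dict[str, FakeModel], active: Iterable[str], device: str = "cuda") -> None:
    """Keep only the submodels needed for the current phase on the GPU."""
    needed = set(active)
    # Offload first so VRAM is freed before the next phase's models arrive.
    for name, model in models.items():
        if name not in needed:
            model.to("cpu")
    for name in needed:
        models[name].to(device)


# Hypothetical submodel names, for illustration only.
models = {
    "sparse_structure_flow": FakeModel(),
    "slat_flow": FakeModel(),
    "mesh_decoder": FakeModel(),
}
stage_models(models, active=["sparse_structure_flow"])  # phase 1
stage_models(models, active=["slat_flow"])              # phase 2
```

With real PyTorch modules, calling torch.cuda.empty_cache() after offloading may also be needed to release cached blocks. Whether the fork does so is not stated here.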

For the full benchmark data, see /check/trellis-image-large/rx-7800-xt.

How the fork makes this work

TRELLIS image-large is a multi-stage structured-3D-latent pipeline: a sparse-structure flow and a SLAT flow (the two ~1.1–1.2 GB DiT models) feed the mesh, Gaussian, and radiance-field decoders. Two stages depend on custom CUDA ops with no ROCm build, and a third extension needs a manual HIP build:

spconv → torchsparse (HIP). The sparse 3D convolutions are routed through torchsparse instead of spconv. The fork's trellis/modules/sparse/conv/__init__.py dispatches from .conv_torchsparse import * when SPARSE_BACKEND=torchsparse, and the installer builds torchsparse for the HIP backend — so the substitution is wired into the model graph, not merely available.
nvdiffrast → nvdiffrast-hip. The fork ships a rewritten coarse rasterizer (CoarseRasterSimple.inl) that replaces CUDA warp intrinsics (__syncwarp, __ballot_sync) — which deadlock on RDNA3 — with a serialized, HIP-safe path, plus a bounds-check fix for an out-of-range triHeader[i].misc read. The full debugging trail is in AMD_GPU_GUIDE.md and experiments/raster/.
diff-gaussian-rasterization gets a manual HIP build for the Gaussian-splat export.
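The environment-driven backend switch described for spconv can be sketched as follows. This is an illustration of the pattern, not the fork's code: select_conv_module is a hypothetical helper, and both the .conv_spconv module name and the spconv default are assumptions. Only the SPARSE_BACKEND variable and the .conv_torchsparse module are named in this guide.

```python
import os
from typing import Mapping, Optional

# Backend name -> submodule that the package __init__ would star-import.
# ".conv_spconv" is an assumed name; ".conv_torchsparse" is the one cited above.
_CONV_MODULES = {
    "spconv": ".conv_spconv",
    "torchsparse": ".conv_torchsparse",
}


def select_conv_module(env: Optional[Mapping[str, str]] = None, default: str = "spconv") -> str:
    """Return the relative module name chosen by SPARSE_BACKEND."""
    env = os.environ if env is None else env
    backend = env.get("SPARSE_BACKEND", default)
    if backend not in _CONV_MODULES:
        raise ValueError(f"unsupported SPARSE_BACKEND: {backend!r}")
    return _CONV_MODULES[backend]
```

Because the choice is made when the package is imported, the variable has to be set before the first import of the trellis package. That is why the launch command sets it on the command line.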

Troubleshooting

~7% silent triangle culls (rasterizer correctness)

Per experiments/raster/findings.md, about 7% of bin-queued triangles per frame hit the bounds-check else-branch in coarseRasterImpl, and in the file's words "The fix culls them silently". The author reports that visual fidelity holds across a 300-frame test, but the root cause (a missing .misc write, the "Phase C" hypothesis) is unresolved. Treat output as prototype-grade and validate critical geometry independently.
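One generic way to sanity-check an exported mesh is to count degenerate faces and open edges from its face list. The sketch below is a plain topology check, not a detector for the rasterizer defect itself, and mesh_report is a hypothetical helper. It assumes you have already loaded the triangle indices with a mesh library of your choice.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Face = Tuple[int, int, int]


def mesh_report(faces: Sequence[Face]) -> Dict[str, int]:
    """Count degenerate faces, boundary edges and non-manifold edges."""
    degenerate = 0
    edge_use: Counter = Counter()
    for a, b, c in faces:
        if len({a, b, c}) < 3:
            degenerate += 1  # a repeated vertex index means zero area
            continue
        for u, v in ((a, b), (b, c), (c, a)):
            edge_use[(min(u, v), max(u, v))] += 1
    return {
        "faces": len(faces),
        "degenerate": degenerate,
        "boundary_edges": sum(1 for n in edge_use.values() if n == 1),
        "non_manifold_edges": sum(1 for n in edge_use.values() if n > 2),
    }


# A closed tetrahedron has no boundary edges; dropping one face opens a hole.
tetra: List[Face] = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
closed = mesh_report(tetra)
opened = mesh_report(tetra[:-1])
```

Generated meshes are not guaranteed to be watertight, so it is more useful to compare these counts between runs, or against a reference export, than to expect zeros.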

GLB export takes 5–10 minutes

This is expected, not a hang. The coarse rasterizer is serialized and slower than NVIDIA's warp-parallel version, and texture baking runs 2500 optimization steps. The console logs five GLB-export steps; heavy CPU+GPU load during texture bake is normal.
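To judge whether a run is slow or actually stuck, it helps to add up the per-stage figures from the Results section. The sketch below does that arithmetic under two assumptions that are not in the fork's README: the stages run strictly one after another, and a run is only suspicious past twice the upper bound. Both helper names are hypothetical.

```python
from typing import Dict, Tuple

# (low, high) seconds per stage, taken from the timing figures quoted above.
STAGE_SECONDS: Dict[str, Tuple[int, int]] = {
    "3d_generation": (45, 45),
    "gaussian_splatting": (30, 30),
    "mesh_extraction": (60, 60),
    "glb_export": (5 * 60, 10 * 60),
}


def total_budget(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the per-stage bounds, assuming the stages run one after another."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def looks_hung(elapsed_seconds: float, slack: float = 2.0) -> bool:
    """Flag a run only once it exceeds the upper bound by a generous factor."""
    return elapsed_seconds > slack * total_budget(STAGE_SECONDS)[1]
```

Under those assumptions the total comes to 435–735 s, roughly 7 to 12 minutes end to end, with GLB export dominating.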

CUDA symbol / torchsparse "no attribute" errors

These mean a stock CUDA extension got picked up instead of the AMD-modified one, or torchsparse built without the GPU backend. Rebuild torchsparse for HIP:

cd extensions/torchsparse && CUDA_HOME=/opt/rocm FORCE_CUDA=1 pip install . --no-build-isolation
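Before rebuilding, it can help to see where Python would import each extension from, since a path outside the project's .venv suggests a stock build was picked up. The sketch below uses only the standard library. The import names in EXTENSIONS are assumptions based on the upstream packages and may differ in the fork; find_extensions is a hypothetical helper.

```python
import importlib.util
from typing import Dict, Iterable

# Import names assumed for the three extensions; adjust if the fork differs.
EXTENSIONS = ("nvdiffrast", "diff_gaussian_rasterization", "torchsparse")


def find_extensions(names: Iterable[str] = EXTENSIONS) -> Dict[str, str]:
    """Map each module name to the path it would be imported from, or 'MISSING'."""
    found: Dict[str, str] = {}
    for name in names:
        spec = importlib.util.find_spec(name)
        if spec is None:
            found[name] = "MISSING"
        else:
            found[name] = spec.origin or "namespace package"
    return found
```

Run it inside the activated .venv and print the result. find_spec locates a top-level module without importing it, so the check is safe even when the extension itself would fail to load.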

Empty mesh

Confirm the input image has a clear foreground subject after rembg background removal. If the mesh is too sparse, lower the Mesh Simplify slider toward 0 in the UI (or pass simplify=0.0 to to_glb()) to keep more triangles.
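A quick way to check the foreground after background removal is to measure how much of the image is still opaque. The sketch below works on a flat sequence of alpha values (0 to 255). foreground_fraction is a hypothetical helper, and the threshold of 128 is an arbitrary assumption.

```python
from typing import Iterable


def foreground_fraction(alpha_values: Iterable[int], threshold: int = 128) -> float:
    """Fraction of pixels whose alpha (0-255) is at or above the threshold."""
    total = 0
    opaque = 0
    for alpha in alpha_values:
        total += 1
        if alpha >= threshold:
            opaque += 1
    if total == 0:
        raise ValueError("image has no pixels")
    return opaque / total


# A 4x4 image whose centre 2x2 block is opaque: 4 of 16 pixels are foreground.
alpha = [255 if 1 <= x <= 2 and 1 <= y <= 2 else 0 for y in range(4) for x in range(4)]
coverage = foreground_fraction(alpha)
```

With Pillow, the alpha values of an RGBA image can be obtained as image.getchannel("A").tobytes(), which iterates as integers. A fraction near zero suggests the subject was removed along with the background.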

GPU hang / crash

Ensure ROCm 6.4+ and a ROCm PyTorch build (torch 2.9.1+rocm6.4 or newer). On gfx1101 this fork runs natively without HSA_OVERRIDE_GFX_VERSION: the card is officially ROCm-supported.

Report problems with this pair via the submission form.