self-hosted/ai
§01·recipe · 3d

TRELLIS.2-4B on Apple M4 Max: image-to-3D in unified memory via the community Metal port

3dadvanced17GB+ VRAMJun 23, 2026

This advanced recipe sets up TRELLIS.2-4B on the Apple M4 Max, needing about 17 GB of VRAM.

models
tools
prerequisites
  • Apple M4 Max with 48 GB unified memory (~32 GB GPU-addressable) — the README recommends 24 GB+ unified memory; the 48 GB M4 Max clears this comfortably
  • macOS on Apple Silicon (M1 or later), with the Xcode Metal Toolchain for the accelerated texture baker
  • Python 3.11+
  • ~18 GB free disk for TRELLIS.2-4B (15.12 GiB) + DINOv3 (1.21 GB) + RMBG-2.0 (0.88 GB)
  • A Hugging Face account with access granted to the gated DINOv3 and RMBG-2.0 weights

What You'll Build

Generate high-fidelity textured 3D assets from a single image locally with TRELLIS.2-4B — Microsoft's 4-billion-parameter image-to-3D flow-matching transformer — on an Apple M4 Max, with no NVIDIA GPU and no CUDA. The model outputs PBR-ready meshes (base color, roughness, metallic) at voxel-grid resolutions from 512³ up to 1536³, using its O-Voxel sparse representation (Microsoft model card: "a state-of-the-art large 3D generative model designed for high-fidelity image-to-3D generation", "O-Voxel", "Parameters: 4 Billion", "Resolution: Varies from 512³ to 1536³"). The path to Apple Silicon is a community Metal/MPS port, not Microsoft's official repo.

Hardware data: Apple M4 Max (48 GB unified memory, ~32 GB GPU-addressable) · 4B image-to-3D pipeline, ~18 GB weights on disk · See benchmark data

⚠️ The Apple path is a community Metal port, not official Microsoft support. Microsoft's canonical TRELLIS.2 repo is CUDA-only — it builds flex_gemm and nvdiffrast and assumes an NVIDIA GPU. The recipe below uses shivampkumar/trellis-mac, a community port whose README line 1 reads "TRELLIS.2 for Apple Silicon" and which describes itself as "a port of Microsoft's TRELLIS.2 — a state-of-the-art image-to-3D model — from CUDA-only to Apple Silicon via PyTorch MPS. No NVIDIA GPU required." It works (a real M-series run is cited below) but is community-maintained — treat it accordingly.

Variant pin — this is v2, not v1. This recipe targets TRELLIS.2-4B at canonical slug microsoft/TRELLIS.2-4B (note the dot in TRELLIS.2). It is NOT the older microsoft/TRELLIS-image-large (the 2024 v1, a different architecture). The distinction is load-bearing on Apple: v1 depends on spconv, which has no Metal port, so v1 has no working Apple path. v2's sparse-conv kernel (flex_gemm) does have a Metal replacement (mtlgemm), which is why an Apple port exists for v2 only.

Requirements

ComponentMinimumTested
GPU / memory24 GB+ unified memory recommended (per the port README)Apple M4 Max (48 GB unified memory, ~32 GB GPU-addressable)
RAMSame pool — unified48 GB unified
Storage~18 GB (TRELLIS.2-4B 15.12 GiB + DINOv3 1.21 GB + RMBG-2.0 0.88 GB)~18 GB
SoftwaremacOS on Apple Silicon (M1+), Python 3.11+, Xcode Metal Toolchain

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. The M4 Max's 48 GB of unified memory is shared by CPU and GPU; on a sub-64 GB Mac macOS lets the GPU address roughly two-thirds of it by default — about ~32 GB safe (~36 GB optimistic) via Metal's recommendedMaxWorkingSetSize. The TRELLIS.2-4B checkpoints total 15.12 GiB on disk (HF tree API) and the port also downloads the DINOv3 image encoder (1.21 GB) and the RMBG-2.0 background remover (0.88 GB) — about 18 GB of weights in all. The port itself reports that "Memory usage peaks at around 18GB unified memory during generation" (README Performance notes). At ~18 GB peak against the M4 Max's ~32 GB default GPU-addressable pool, this pipeline fits comfortably and needs no iogpu.wired_limit_mb tuning for the default 512 pipeline.

Installation

The install path is the shivampkumar/trellis-mac port. There is nothing CUDA-shaped to install — no cu12x PyTorch wheel, no flash-attn build, no flex_gemm/nvdiffrast compile. The port replaces each CUDA kernel with a Metal/MPS equivalent (see What Was Ported): flex_gemmmtlgemm, nvdiffrastmtldiffrast, flash_attn → PyTorch SDPA, with a cumesh decode step skipped in favour of fast_simplification.

1. Clone the port

git clone https://github.com/shivampkumar/trellis-mac.git
cd trellis-mac

2. Download the Xcode Metal Toolchain (for the accelerated texture baker)

xcodebuild -downloadComponent MetalToolchain

Per the README, this lets setup.sh build the Metal-accelerated texture baker. Without it, setup falls back to a pure "Python KDTree baker (slower, slightly lower quality)." To skip the Metal build entirely (e.g. on older hardware), prefix setup with SKIP_METAL=1 (see step 4).

3. Log into Hugging Face and request the gated weights

hf auth login

TRELLIS.2-4B itself is MIT-licensed and ungated, but the port also downloads two gated auxiliary models. Request access (the README notes "usually instant approval" for these) before running setup:

4. Run setup (creates the venv, installs deps, clones & patches TRELLIS.2, builds the Metal backends)

bash setup.sh
source .venv/bin/activate

To skip the Metal build and use the pure-Python baker:

SKIP_METAL=1 bash setup.sh
source .venv/bin/activate

setup.sh pre-clones its Git dependencies into deps/ so all network I/O happens up front; the TRELLIS.2-4B / DINOv3 / RMBG-2.0 weights (~18 GB) download on first generation and cache locally.

Running

After activating the environment, generate a 3D model from a single image:

python generate.py path/to/image.png

With options (verbatim from the port's Usage section):

python generate.py photo.png --seed 123 --output my_model --pipeline-type 512

Documented flags: --seed (default 42), --output (default output_3d), --pipeline-type (512, 1024, 1024_cascade; default 512), --texture-size (512, 1024, 2048; default 1024), and --no-texture to export geometry only. The output is a GLB with base-color, metallic, and roughness textures, written to the current directory (or the --output path). The default 512 pipeline is the lightest path and the one the cited reference run used.

Results

  • Speed: No first-party Apple M4 Max benchmark for this pair exists yet — /check/trellis-2/m4-max returns verdict: unknown with zero measurements. We deliberately do not quote an M4 Max time. The port's README cites "~5 minutes 13 seconds on M4 Pro" (24GB, cold start, weights cached, cool machine, pipeline type 512) (README). That confirms the pipeline runs on M4-generation Apple Silicon — but the M4 Pro is a different chip than the M4 Max (the M4 Max has substantially higher memory bandwidth, ~546 GB/s vs the M4 Pro's lower figure), so that timing is an existence-proof that the path works, not an M4 Max throughput number. If you run this on an M4 Max, please contribute your timing so we can seed a real datapoint.
  • Memory usage: The port reports generation "peaks at around 18GB unified memory" (README Performance notes) — TRELLIS.2-4B + DINOv3 + RMBG-2.0 resident, plus generation/baking activations and the sparse-voxel structures. The M4 Max's ~32 GB default GPU-addressable pool clears this with headroom; the port recommends 24 GB+ unified memory and the cited run fit a 24 GB machine.
  • Quality notes: TRELLIS.2-4B outputs PBR-ready meshes (base color, roughness, metallic, opacity) at 512³–1536³ (model card). The port's reference output is "400K+ vertex meshes with baked PBR textures". Note two community-port quality caveats from the What Was Ported table: SDPA attention is "padded, not fused — room for improvement", and the cumesh decode step is skipped on meshes large enough to crash the Metal port (replaced with fast_simplification before baking). Without the Metal Toolchain, texture baking falls back to a pure-Python KDTree baker that is "slower, slightly lower quality" — the README measures the bake taking ~15s instead of ~11s, with coverage near UV chart boundaries slightly softer.

For the full benchmark data (and to be the first to populate it), see /check/trellis-2/m4-max.

Troubleshooting

Tried to install flash-attn / a cu12x wheel / build flex_gemm or nvdiffrast — it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no flex_gemm/nvdiffrast build on a Mac. The shivampkumar/trellis-mac port replaces every CUDA kernel with a Metal/MPS equivalent (mtlgemm, mtldiffrast, PyTorch SDPA) and patches all .cuda() calls to use the active MPS device. If a generic TRELLIS.2 tutorial tells you to pip install flash-attn, install a cu128 PyTorch wheel, or compile flex_gemm for sm_120, skip those steps entirely — bash setup.sh from the port is the complete Apple path.

"Access to model … is restricted" when weights download

DINOv3 and RMBG-2.0 are gated on Hugging Face. Run hf auth login and accept the license on each model page before generating: DINOv3 ViT-L and RMBG-2.0 (the README notes "usually instant approval"). TRELLIS.2-4B itself is ungated and MIT-licensed, so it downloads without an access step. Licensing note: while TRELLIS.2-4B is MIT, RMBG-2.0 is CC BY-NC 4.0 (non-commercial) — because it sits in the default pipeline as the background remover, 3D assets you generate this way are encumbered for non-commercial use unless you swap in a different background-removal step or obtain BRIA's commercial agreement.

Texture baking is slow or lower quality than expected

The Metal-accelerated texture baker requires the Xcode Metal Toolchain (xcodebuild -downloadComponent MetalToolchain). If you ran SKIP_METAL=1 bash setup.sh or the toolchain was unavailable, the port falls back to a pure-Python KDTree baker (xatlas UV unwrap + scipy cKDTree inverse-distance weighting), which the README calls "slower, slightly lower quality". Re-run setup with the toolchain installed to enable the Metal stack (mtldiffrast / mtlbvh / mtlmesh).

Mesh extraction or BVH instability on very large meshes

The port pre-simplifies the decoder mesh to ~200K faces with fast_simplification before handing it to the Metal BVH, because "the BVH builder is unstable on 800K+ face inputs" (What Was Ported). If you hit instability at higher pipeline resolutions, stay on the default --pipeline-type 512, or remove deps/ and re-run bash setup.sh to reset local clone state (a documented recovery step in the README).

No other widely-reported issues on this Apple port. Report problems via the submission form.

common questions
How much VRAM does TRELLIS.2-4B need?

About 17 GB — the minimum this recipe targets.

Which GPUs is TRELLIS.2-4B tested on?

Apple M4 Max (48 GB).

How hard is this setup?

Advanced — follow the steps above.