Hunyuan3D-2.1 on RTX 4070: Image-to-Mesh 3D Generation (Shape-Only)

What You'll Build

A working image-to-3D pipeline using Tencent's Hunyuan3D-2.1 shape model: drop in a reference photo, get back an untextured .glb mesh ready for any DCC tool (Blender, Unity, Unreal, three.js). Texture generation is skipped on purpose — see the box below.

Hardware data: RTX 4070 (12GB VRAM) · ~10GB peak for shape generation · Benchmark data: /check/hunyuan-3d/rtx-4070

⚠️ Why shape-only on 12GB? The official Hunyuan3D-2.1 README states: "It takes 10 GB VRAM for shape generation, 21GB for texture generation and 29GB for shape and texture generation in total." On a 12GB card the shape stage's 10GB peak fits with a little room to spare, but the texture stage's 21GB peak is far above the ceiling and will OOM. The texture stage overflows larger cards too: in Issue #15 the owner of a 16GB Blackwell GPU (a mobile RTX 5080) reports being "able to generate models just fine on my mobile 5080 but unfortunately lack the VRAM for texture painting", and in Issue #80 a user reports the texture stage OOMing on a 24GB card. If 16GB and even 24GB can't texture, 12GB certainly can't. Generate the mesh here, then texture it on a rented A100/L40S or in a tool like Substance Painter. If you must texture locally, see the mmgp-offload workaround in Troubleshooting.
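The budget in the callout above comes down to a comparison against three numbers. The following is a minimal sketch: the peak figures are the ones quoted from the README, while `STAGE_PEAK_GB` and `stages_that_fit` are illustrative names, not part of the Hunyuan3D codebase.

```python
# Peak VRAM per stage in GB, as quoted from the official README.
STAGE_PEAK_GB = {"shape": 10, "texture": 21, "shape+texture": 29}


def stages_that_fit(usable_vram_gb: float) -> list[str]:
    """Return the stages whose quoted peak fits in the given usable VRAM."""
    return [name for name, peak in STAGE_PEAK_GB.items() if peak <= usable_vram_gb]


if __name__ == "__main__":
    # Nominal card sizes; real usable VRAM is lower once a display is attached.
    for label, vram_gb in [("12GB card", 12), ("16GB card", 16), ("24GB card", 24)]:
        fits = stages_that_fit(vram_gb)
        print(f"{label}: {', '.join(fits) if fits else 'nothing fits'}")
```

By the README figures alone a 24GB card should fit the texture stage, yet the Issue #80 report suggests real peaks can exceed the quoted numbers, so treat the result as an optimistic bound rather than a guarantee.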

ℹ️ Meshes, not images. Hunyuan3D-2.1 produces 3D geometry (.glb meshes), not 2D pictures. It sits in our 3D vertical. The shape stage covered here outputs an untextured mesh; colour and material come from the separate texture stage, which is out of scope on this 12GB card.

Requirements

Component	Minimum	Tested
GPU	10GB VRAM (shape pipeline)	RTX 4070 (12GB)
RAM	16GB	—
Storage	~7GB for the shape DiT checkpoint (~15GB if you also pull the VAE/texture folders)	—
Software	Python 3.10, PyTorch 2.5.1+cu124	—
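Before the first download it can be worth confirming that the disk holding the model cache has room for the checkpoint. This is a small sketch under stated assumptions: it assumes the weights land in the default huggingface_hub cache (`HF_HOME` if set, otherwise `~/.cache/huggingface`), and the helper names are hypothetical. If you have redirected the cache another way, pass that path instead.

```python
import os
import shutil
from pathlib import Path

REQUIRED_GB = 7  # approximate shape DiT checkpoint size from the table above


def hf_cache_root() -> Path:
    """Assumed cache root: HF_HOME if set, else ~/.cache/huggingface."""
    return Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))


def nearest_existing(path: Path) -> Path:
    """Walk up until an existing directory is found (the cache may not exist yet)."""
    while not path.exists():
        path = path.parent
    return path


def has_room(path: Path, required_gb: float = REQUIRED_GB) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` * 10^9 bytes free."""
    free_bytes = shutil.disk_usage(nearest_existing(path)).free
    return free_bytes >= required_gb * 1e9


if __name__ == "__main__":
    root = hf_cache_root()
    verdict = "enough" if has_room(root) else "NOT enough"
    print(f"{root}: {verdict} free space for a ~{REQUIRED_GB}GB download")
```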

The RTX 4070 is an Ada Lovelace AD104 (sm_89) card: 5888 CUDA cores, ~504 GB/s memory bandwidth, 12GB GDDR6X on a 192-bit bus, PCIe Gen4 x16, 200W TGP. Because sm_89 kernels ship in the stock cu124 PyTorch wheel, no special CUDA build is required (this is the key difference from Blackwell 50-series cards, which need a cu128 wheel for sm_120).

⚠️ Watch your display headroom. The shape stage's 10GB peak leaves only ~0.5–1.3GB of slack on a 12GB card once a desktop/monitor is attached (a desktop card typically exposes ~10.5–11.3GB usable). On a headless Linux box (~11.6GB usable) there's more room. If you OOM mid-generation with a monitor plugged in, free VRAM by closing the browser/compositor, or run headless.
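One way to see how much slack you actually have is to ask PyTorch for free VRAM just before generating. The sketch below uses `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device. The 10GB peak is the README figure, treated here as GiB for simplicity, and `headroom_gib` is a hypothetical helper name.

```python
SHAPE_PEAK_GIB = 10.0  # README shape-stage figure, treated as GiB (assumption)


def headroom_gib(free_bytes: int, peak_gib: float = SHAPE_PEAK_GIB) -> float:
    """Free VRAM minus the expected peak, in GiB; negative means an OOM is likely."""
    return free_bytes / 1024**3 - peak_gib


def report() -> None:
    """Print the current headroom if PyTorch with CUDA is available."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; run this inside the Hunyuan3D environment.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible.")
        return
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    print(
        f"free {free_bytes / 1024**3:.1f} GiB of {total_bytes / 1024**3:.1f} GiB, "
        f"headroom {headroom_gib(free_bytes):+.1f} GiB"
    )


if __name__ == "__main__":
    report()
```

A negative headroom with a monitor attached is the signal to close GPU-using desktop apps or switch to a headless session before retrying.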

Installation

The weights are not gated on Hugging Face — they download freely on the first pipeline call via huggingface_hub, with no click-through to accept (verified live: the model page reports gated: false). There is, however, a license you must comply with before deploying: see step 3.

1. Clone the official repository

git clone https://github.com/Tencent-Hunyuan/Hunyuan3D-2.1.git
cd Hunyuan3D-2.1

2. Install PyTorch and dependencies (Ada uses the stock cu124 wheel)

The README tests with Python 3.10 and torch==2.5.1+cu124. On the RTX 4070 (Ada Lovelace, sm_89) this is the right wheel as-is: the stock cu124 build already ships sm_89 kernels, so the standard cu124 index URL below is all you need, with no nightly or cu128 wheel. (Blackwell 50-series cards differ here: they need a newer cu128 wheel for sm_120 kernels.)

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
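After installing, a quick sanity check can confirm that the expected wheel is active and that the GPU reports Ada's compute capability. This is a sketch, not an official verification step: the expected values (`2.5.1`, CUDA `12.4`, capability `(8, 9)`) follow the versions named above, and the function names are illustrative.

```python
EXPECTED_TORCH_PREFIX = "2.5.1"
EXPECTED_CUDA = "12.4"
ADA_CAPABILITY = (8, 9)  # sm_89


def is_expected_build(torch_version: str, cuda_version) -> bool:
    """True for a torch 2.5.1 build compiled against CUDA 12.4."""
    return torch_version.startswith(EXPECTED_TORCH_PREFIX) and cuda_version == EXPECTED_CUDA


def check() -> None:
    """Print the installed build and, if a GPU is visible, its compute capability."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    print("torch", torch.__version__, "| CUDA", torch.version.cuda)
    print("expected build:", is_expected_build(str(torch.__version__), torch.version.cuda))
    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        print("device:", torch.cuda.get_device_name(0), "| capability:", capability)
        print("Ada sm_89:", capability == ADA_CAPABILITY)
    else:
        print("CUDA is not available; check the driver and the wheel you installed.")


if __name__ == "__main__":
    check()
```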

The in-tree texture extensions (hy3dpaint/custom_rasterizer, compile_mesh_painter.sh, and the Real-ESRGAN checkpoint) are only needed for the paint stage, which does not fit a 12GB card, so skip building them for the shape-only path. If you do build them later, the CUDA toolchain compiles the custom rasterizer for Ada sm_89 directly; no Blackwell-style TORCH_CUDA_ARCH_LIST override is needed.

3. Read the Tencent license before you deploy

The weights are governed by the Tencent Hunyuan 3D 2.1 Community License — the HF card lists license: other (tencent-hunyuan-community), not Apache 2.0. Verbatim from the license header: "THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW." It also requires a separate license from Tencent for any product whose monthly active users exceed 1 million in the preceding calendar month. The free, ungated download does not waive these terms — read the LICENSE in full before deploying anything user-facing.

Running

Use the official Python API from the Hunyuan3D-2.1 README. Run only the shape pipeline — do not instantiate Hunyuan3DPaintPipeline on a 12GB card. On the first call the shape DiT checkpoint (hunyuan3d-dit-v2-1/model.fp16.ckpt, ~6.9GB) downloads automatically.

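import sys

# Make the in-repo hy3dshape package importable (run this from the repo root)
sys.path.insert(0, './hy3dshape')
from hy3dshape.pipelines import Hunyuan3DDiTFlowMatchingPipeline

# Load the shape pipeline only; the first call downloads the DiT checkpoint
shape_pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained('tencent/Hunyuan3D-2.1')
# The pipeline returns a list of meshes; take the first one
mesh_untextured = shape_pipeline(image='assets/demo.png')[0]
# Write the untextured mesh as binary glTF
mesh_untextured.export('output.glb')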

For best results, give the pipeline a clean cutout — the repo ships a BackgroundRemover helper in hy3dshape/rembg.py you can run on the input image before passing it in. The pipeline outputs an untextured .glb mesh (glTF 2.0 binary); open the result in Blender, three.js, or any glTF 2.0 viewer.
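The cutout step might look like the sketch below. It makes several assumptions: that a `BackgroundRemover` instance can be called directly on a PIL image (check hy3dshape/rembg.py for the actual interface), that the shape pipeline accepts a PIL image as well as a file path, and that the script runs from the repo root. `needs_cutout` and `load_cutout` are hypothetical helper names.

```python
import sys

sys.path.insert(0, './hy3dshape')  # same path setup as the shape pipeline example


def needs_cutout(mode: str) -> bool:
    """An image without an alpha channel has not been cut out yet."""
    return mode not in ("RGBA", "LA")


def load_cutout(path: str):
    """Load an image and remove its background if it has no alpha channel."""
    from PIL import Image
    from hy3dshape.rembg import BackgroundRemover  # helper shipped in the repo

    image = Image.open(path)
    if needs_cutout(image.mode):
        # Assumption: BackgroundRemover instances are callable on a PIL image
        # and return an image with a transparent background.
        image = BackgroundRemover()(image.convert("RGB"))
    return image
```

With that in place, the call would become something like `shape_pipeline(image=load_cutout('photo.jpg'))[0]`. Images that already carry an alpha channel are passed through untouched.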

If you prefer a UI, run the official Gradio app with the low-VRAM flag. Texture generation still won't fit in 12GB, but shape generation works:

python3 gradio_app.py \
  --model_path tencent/Hunyuan3D-2.1 \
  --subfolder hunyuan3d-dit-v2-1 \
  --texgen_model_path tencent/Hunyuan3D-2.1 \
  --low_vram_mode

Results

VRAM usage: ~10GB peak for shape generation, cited verbatim from the official README ("It takes 10 GB VRAM for shape generation"). The 3.3B shape DiT ships as a single FP16 checkpoint (model.fp16.ckpt, 7,366,389,768 bytes per the HF file tree, i.e. about 7.4 GB or 6.9 GiB); the gap up to ~10GB is presumably the VAE and activations on top. This fits the RTX 4070's 12GB, but with little display headroom (see the Requirements callout). For empirical numbers on this exact model/GPU pair once a community submission lands, see /check/hunyuan-3d/rtx-4070.
Speed: intentionally omitted. Tencent publishes no per-GPU timings, and no RTX 4070 benchmark for Hunyuan3D-2.1 exists yet. The community timings that exist (e.g. in Issue #24) are full mesh+texture runs on different GPUs, not an RTX 4070 shape-only run, so quoting one here would mislead. Until a measurement taken on an RTX 4070 appears, submit timings via /contribute and look for them at /check/hunyuan-3d/rtx-4070.
Output format: .glb (glTF 2.0 binary). Universal: import to Blender, Unity, Unreal, or three.js, or convert to .obj, .stl, or .ply with trimesh in Python. trimesh does not write .fbx; export that from Blender or another DCC tool.
Quality notes: Image-to-3D quality is best when the input photo has a clean background and a single subject. The shape stage produces watertight geometry suitable for greyboxing, retopology sources, and downstream texturing in a DCC tool.
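Converting the exported mesh to another format can be sketched with trimesh, the library the pipeline's mesh objects come from. This example limits itself to .obj, .stl, and .ply. For .fbx, a DCC tool such as Blender is the usual route. `target_path` and `convert` are illustrative names, and `force="mesh"` flattens the loaded glTF scene into a single mesh, which discards any scene hierarchy.

```python
from pathlib import Path

# Formats this sketch allows; chosen because trimesh can write them directly.
SUPPORTED = {".obj", ".stl", ".ply"}


def target_path(src: str, ext: str) -> Path:
    """Return the output path for `src` with the new extension, validating the format."""
    ext = ext.lower()
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported format {ext!r}; choose one of {sorted(SUPPORTED)}")
    return Path(src).with_suffix(ext)


def convert(src: str, ext: str = ".obj") -> Path:
    """Load a .glb and re-export it in another mesh format."""
    import trimesh

    dst = target_path(src, ext)
    mesh = trimesh.load(src, force="mesh")  # flatten the glTF scene into one mesh
    mesh.export(str(dst))  # trimesh infers the format from the file extension
    return dst


if __name__ == "__main__":
    if Path("output.glb").exists():
        print("wrote", convert("output.glb", ".obj"))
```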

For the full benchmark data, see /check/hunyuan-3d/rtx-4070.

Troubleshooting

CUDA OOM during texture generation

Don't run Hunyuan3DPaintPipeline on the RTX 4070. Per the README, the texture stage alone needs 21GB and the combined shape+texture run needs 29GB — both far above the 12GB envelope. Generate the mesh here and texture it downstream on bigger hardware or in a non-AI workflow.

OOM during shape generation (with a monitor attached)

The 10GB shape peak is close to the usable VRAM of a 12GB desktop card once a display is attached (~10.5–11.3GB usable). If shape generation OOMs, close your browser and any GPU-using desktop apps to free VRAM, pass --low_vram_mode to the Gradio app, or run the job on a headless Linux session (~11.6GB usable). This headroom problem does not exist on 16GB+ cards.

Want to texture locally anyway? mmgp offloading

The texture stage needs 21GB, but the deepbeepmeep/Hunyuan3D-2GP memory-management layer (pip install mmgp) streams the paint pipeline through limited VRAM. In Issue #15 a community user confirmed this works on a 16GB Blackwell card after adapting the demo script. This is a community workaround, not official Tencent support. On a 12GB RTX 4070 it will be tighter still, and slower than on a card that fits the full 29GB combined peak. The RTX 4070's PCIe Gen4 x16 link also gives CPU-offload streaming roughly half the host bandwidth of a Gen5 card, which slows the offloaded texture pass further. For a no-compromise local mesh+texture pass you need a 24GB+ card, and per Issue #80 even 24GB can fall short, pushing the comfortable target toward 29GB+ (e.g. an RTX 5090 / A100 / L40S).
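The Gen4 versus Gen5 bandwidth gap can be worked out from the link parameters. The sketch below computes theoretical one-way x16 bandwidth (16 GT/s per lane for Gen4, 32 GT/s for Gen5, both with 128b/130b encoding) and the time to move 21GB across the link once. The single-pass transfer is purely illustrative: real offloading re-streams weights repeatedly and achieves less than the theoretical rate, so these are lower bounds, not predictions.

```python
def pcie_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-way bandwidth in GB/s for a 128b/130b-encoded PCIe link."""
    return gt_per_s * lanes * (128 / 130) / 8


GEN4_X16 = pcie_gb_per_s(16.0)  # RTX 4070 link
GEN5_X16 = pcie_gb_per_s(32.0)  # Gen5 cards


def transfer_seconds(gigabytes: float, bandwidth_gb_s: float) -> float:
    """Best-case time to move `gigabytes` across the link once."""
    return gigabytes / bandwidth_gb_s


if __name__ == "__main__":
    texture_gb = 21  # README texture-stage figure, used here as an illustrative payload
    for name, bw in [("Gen4 x16", GEN4_X16), ("Gen5 x16", GEN5_X16)]:
        print(f"{name}: {bw:.1f} GB/s, {transfer_seconds(texture_gb, bw):.2f} s per {texture_gb}GB pass")
```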

FlashAttention build failures

Hunyuan3D-2.1's shape diffusion uses PyTorch's default SDPA (scaled_dot_product_attention) backend, which needs no extra install and is fully supported on Ada sm_89. Ada is not affected by the Blackwell sm_120 FlashAttention wheel gap — prebuilt flash-attn wheels cover sm_89 — so if a dependency pulls in flash-attn and the build fails, it is an unrelated toolchain issue, not an architecture gap. The shape path runs fine on SDPA without flash-attn.
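To see which attention path your environment can use, you can check whether flash-attn is importable and whether PyTorch exposes SDPA. This is a diagnostic sketch only. It does not inspect which backend Hunyuan3D selects at runtime, and the function names are hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def attention_report() -> dict:
    """Summarise which attention-related packages are present."""
    report = {"flash_attn": has_module("flash_attn"), "torch": has_module("torch")}
    if report["torch"]:
        import torch.nn.functional as F

        # scaled_dot_product_attention ships with PyTorch 2.x and needs no extra install
        report["sdpa"] = hasattr(F, "scaled_dot_product_attention")
    return report


if __name__ == "__main__":
    for key, present in attention_report().items():
        print(f"{key}: {'available' if present else 'missing'}")
```

If `flash_attn` shows as missing while `sdpa` is available, the shape path described above should still run.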

Want more control over the mesh? Look at Hunyuan3D-Omni

The sibling model Hunyuan3D-Omni adds multi-modal control (point cloud / voxel / pose / bounding-box conditioning on top of image input) and its HF card states "It takes 10 GB VRAM for generation." Same install path (stock cu124 Ada wheel), same license. Useful if you want skeletal or voxel control over the generated mesh.

For texture and PBR options that exceed 12GB, see /check/hunyuan-3d/rtx-4070 and contribute your results via /contribute.