Hunyuan3D-2.1 on RTX 5070: Image-to-Mesh 3D Generation (Shape-Only)

What You'll Build

A working image-to-3D pipeline using Tencent's Hunyuan3D-2.1 shape model: drop in a reference photo, get back an untextured .glb mesh ready for any DCC tool (Blender, Unity, Unreal, three.js). Texture generation is skipped on purpose — see the box below.

Hardware data: RTX 5070 (12GB VRAM) · 10GB peak for shape generation · See benchmark data

⚠️ Why shape-only on 12GB? The official Hunyuan3D-2.1 README states: "It takes 10 GB VRAM for shape generation, 21GB for texture generation and 29GB for shape and texture generation in total." On a 12GB card the shape stage's 10GB peak fits with a little room to spare, but the texture stage's 21GB peak is far above the ceiling and will OOM. Even larger Blackwell cards fall short: the owner of a 16GB mobile RTX 5080 reports in Issue #15 being "able to generate models just fine on my mobile 5080 but unfortunately lack the VRAM for texture painting." If 16GB can't texture, 12GB certainly can't. Generate the mesh here, then texture it on a rented A100/L40S, in a tool like Substance Painter, or with the lighter Hunyuan3D-Omni control variant. If you must texture locally, see the mmgp-offload workaround in Troubleshooting.
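
The arithmetic behind that verdict can be sketched in a few lines. The snippet hard-codes the three README figures quoted above and lists which stages fit a given card. The function name and the reserve_gb parameter are illustrative assumptions, not part of Hunyuan3D.

```python
# VRAM peaks in GB, as quoted from the Hunyuan3D-2.1 README.
README_PEAK_GB = {"shape": 10, "texture": 21, "shape+texture": 29}


def stages_that_fit(vram_gb, reserve_gb=0.0):
    """Return the stages whose README peak fits in vram_gb minus a reserve.

    reserve_gb is a hypothetical allowance for the desktop/compositor.
    """
    budget = vram_gb - reserve_gb
    return [stage for stage, peak in README_PEAK_GB.items() if peak <= budget]


if __name__ == "__main__":
    for card_gb in (12, 16, 24, 32):
        fits = stages_that_fit(card_gb)
        print(f"{card_gb}GB card: {', '.join(fits) if fits else 'nothing fits'}")
```

Treat the result as a first-pass filter only: the README numbers are peaks for the model, and anything else using the GPU eats into the same budget.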

ℹ️ Meshes, not images. Hunyuan3D-2.1 produces 3D geometry (.glb meshes), not 2D pictures. It sits in our 3D vertical. The shape stage covered here outputs an untextured mesh; colour and material come from the separate texture stage, which is out of scope on this 12GB card.

Requirements

Component	Minimum	Tested
GPU	10GB VRAM (shape pipeline)	RTX 5070 (12GB)
RAM	16GB	—
Storage	~8GB for shape weights (~15GB if texture weights also download)	—
Software	Python 3.10, CUDA 12.8, PyTorch 2.7.0+cu128	—

The RTX 5070 is a Blackwell GB205 (sm_120) card: 6144 CUDA cores, ~672 GB/s memory bandwidth, 12GB GDDR7 on a 192-bit bus, 250W TGP. The sm_120 compute capability is what forces the cu128 PyTorch wheel in step 2.

⚠️ Watch your display headroom. The shape stage's 10GB peak leaves only ~0.5–1.3GB of slack on a 12GB card once a desktop/monitor is attached (a desktop card typically exposes ~10.5–11.3GB usable). On a headless Linux box (~11.6GB usable) there's more room. If you OOM mid-generation with a monitor plugged in, free VRAM by closing the browser/compositor, or run headless.
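
Once PyTorch is installed (step 2 below), you can check your actual headroom before launching a job. This is a minimal sketch, assuming the README's 10GB figure is the number to beat. The helper names are made up for illustration, and torch.cuda.mem_get_info() reports bytes, which the code converts to GiB, so small rounding differences against the README's "GB" are expected.

```python
SHAPE_PEAK_GB = 10.0  # README figure for shape generation


def headroom_gb(free_gb, peak_gb=SHAPE_PEAK_GB):
    """Slack left after the shape stage's peak; negative means an OOM is likely."""
    return free_gb - peak_gb


def query_free_vram_gb():
    """Free VRAM on the current CUDA device, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1024**3


if __name__ == "__main__":
    free = query_free_vram_gb()
    if free is None:
        print("No CUDA device visible to PyTorch.")
    else:
        print(f"free: {free:.1f}GB, headroom over shape peak: {headroom_gb(free):+.1f}GB")
```

Run it with your usual desktop apps open. If the headroom is negative or close to zero, close them or switch to a headless session before generating.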

Installation

1. Clone the official repository

git clone https://github.com/Tencent-Hunyuan/Hunyuan3D-2.1.git
cd Hunyuan3D-2.1

2. Install PyTorch with cu128 wheels (Blackwell override)

The README pins torch==2.5.1+cu124, which does not include Blackwell sm_120 kernels and fails at first inference on a 5070 with the error "no kernel image is available for execution on the device". Use the cu128 build instead. The shape pipeline needs only the base requirements. The in-tree texture extensions (custom_rasterizer, compile_mesh_painter.sh, Real-ESRGAN checkpoint) belong to the paint stage, which we don't run on a 12GB card, so you can skip them.

pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

The torch==2.7.0 + cu128 combination is the community-confirmed Blackwell 50-series path documented in Issue #22, which community user pengpeng opened to upgrade the repo for 50-series cards and in which PlanarFox posted a working FROM nvidia/cuda:12.8.0-devel Dockerfile. The same thread also documents TORCH_CUDA_ARCH_LIST="12.0" for anyone who additionally needs to build the texture extensions; the shape-only path here does not require it. This is community-aligned guidance, not official Tencent support.
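
Before running the pipeline, it's worth confirming that the wheel you installed actually targets your card. A minimal sketch: it compares the device's compute capability against the architectures compiled into the PyTorch build. The helper names are illustrative, and the check looks only for native sm_ kernels, so treat a False as a strong hint rather than proof.

```python
def build_targets_device(arch_list, capability):
    """True if the build ships native kernels for the device's compute capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_installed_torch():
    """Return (torch version, capability, supported?) or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    capability = torch.cuda.get_device_capability()
    supported = build_targets_device(torch.cuda.get_arch_list(), capability)
    return torch.__version__, capability, supported


if __name__ == "__main__":
    report = check_installed_torch()
    if report is None:
        print("PyTorch is missing or sees no CUDA device.")
    else:
        version, capability, supported = report
        print(f"torch {version}, capability {capability}, native kernels: {supported}")
```

On an RTX 5070 the capability should read (12, 0). If native kernels come back False, you are most likely still on the cu124 wheel.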

3. Read the Tencent license before you deploy

The weights are not gated on Hugging Face (they download freely on the first pipeline call via huggingface_hub, no click-through), but they are governed by the Tencent Hunyuan 3D 2.1 Community License — license: other, not Apache 2.0. Verbatim from the license header: "THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW." It also requires you to request a separate license from Tencent for any product whose monthly active users exceed 1 million. The free download does not waive these terms — read the LICENSE in full before deploying anything user-facing.

Running

Use the official Python API from the Hunyuan3D-2.1 README. Run only the shape pipeline — do not instantiate Hunyuan3DPaintPipeline on a 12GB card.

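import sys

# Make the in-repo hy3dshape package importable (run from the repository root).
sys.path.insert(0, './hy3dshape')
from hy3dshape.pipelines import Hunyuan3DDiTFlowMatchingPipeline

# Weights download from Hugging Face on the first call.
shape_pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained('tencent/Hunyuan3D-2.1')
# The pipeline returns a sequence of meshes; take the first.
mesh_untextured = shape_pipeline(image='assets/demo.png')[0]
# Write the untextured mesh as binary glTF.
mesh_untextured.export('output.glb')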

The pipeline outputs an untextured .glb mesh (glTF 2.0 binary). Open the result in Blender, three.js, or any glTF 2.0 viewer.
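
If you want a quick sanity check without opening a viewer, the GLB container is simple enough to inspect with the standard library. The sketch below follows the glTF 2.0 binary layout (a 12-byte header followed by a JSON chunk) and reports the version, lengths, and mesh count. The function name is made up for illustration, and it validates only the container, not the geometry.

```python
import json
import struct

GLB_MAGIC = 0x46546C67   # ASCII "glTF", read as a little-endian uint32
CHUNK_JSON = 0x4E4F534A  # ASCII "JSON"


def read_glb_summary(data):
    """Parse the GLB header and the JSON chunk from the raw bytes of a .glb file."""
    magic, version, declared_length = struct.unpack_from("<III", data, 0)
    if magic != GLB_MAGIC:
        raise ValueError("not a GLB file (bad magic)")
    chunk_length, chunk_type = struct.unpack_from("<II", data, 12)
    if chunk_type != CHUNK_JSON:
        raise ValueError("first chunk is not JSON")
    # The JSON chunk may be padded with trailing spaces; json.loads ignores them.
    document = json.loads(data[20:20 + chunk_length].decode("utf-8"))
    return {
        "version": version,
        "declared_length": declared_length,
        "actual_length": len(data),
        "mesh_count": len(document.get("meshes", [])),
    }
```

To use it, read output.glb in binary mode and pass the bytes to read_glb_summary. A healthy export should report version 2, matching declared and actual lengths, and at least one mesh.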

If you prefer a UI, run the official Gradio app with the low-VRAM flag. Texture generation still won't fit in 12GB, but shape generation works:

python3 gradio_app.py --model_path tencent/Hunyuan3D-2.1 \
  --subfolder hunyuan3d-dit-v2-1 \
  --texgen_model_path tencent/Hunyuan3D-2.1 \
  --low_vram_mode

Results

VRAM usage: 10GB peak for shape generation, cited verbatim from the official README. On a 12GB card this fits, but with little display headroom (see the Requirements callout). For empirical numbers on this exact model/GPU pair once a community submission lands, see /check/hunyuan-3d/rtx-5070.
Speed: intentionally omitted. Tencent publishes no per-GPU timings, and no RTX 5070 benchmark for Hunyuan3D-2.1 exists yet. The only community timings (in Issue #24) are for the RTX 4090 (Ada sm_89, ~70s mesh+texture) and RTX 3090 (Ampere sm_86, <100s mesh) — different architectures running full mesh+texture, not a 5070 shape-only run, so they don't belong here. Until a 5070-named measurement appears we route timings to /contribute and /check/hunyuan-3d/rtx-5070.
Output format: .glb (glTF 2.0 binary). Universal: import to Blender, Unity, Unreal, or three.js, or convert to .obj via trimesh in Python. trimesh has no .fbx exporter, so for .fbx use a DCC tool's exporter (for example Blender's).
Quality notes: Image-to-3D quality is best when the input photo has a clean background and a single subject.
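
The trimesh conversion mentioned under Output format can be sketched as follows. It assumes trimesh is available in your environment (the pipeline's mesh objects already rely on it) and that a single merged mesh is acceptable. The helper names are illustrative.

```python
from pathlib import Path


def target_path(src, fmt):
    """Output path next to src with the new extension, e.g. output.glb -> output.obj."""
    return Path(src).with_suffix("." + fmt.lstrip("."))


def convert_mesh(src, fmt="obj"):
    """Load a mesh file with trimesh and export it in another format."""
    import trimesh  # third-party; imported lazily so the path helper works without it

    # force="mesh" flattens a glTF scene into a single Trimesh object.
    mesh = trimesh.load(src, force="mesh")
    dst = target_path(src, fmt)
    mesh.export(str(dst))  # trimesh picks the exporter from the file extension
    return dst


if __name__ == "__main__":
    print(target_path("output.glb", "obj"))
```

Calling convert_mesh("output.glb") should write output.obj next to the source file. Flattening the scene discards any node hierarchy, which is fine for a single-object mesh like the one generated here.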

For the full benchmark data, see /check/hunyuan-3d/rtx-5070.

Troubleshooting

CUDA OOM during texture generation

Don't run Hunyuan3DPaintPipeline on the 5070. As cited in the README, the texture stage alone needs 21GB — far above the 12GB envelope. Generate the mesh here and texture it downstream on bigger hardware or in a non-AI workflow.

`no kernel image is available for execution on the device`

You installed the README's pinned torch==2.5.1+cu124 wheel, which does not include sm_120 kernels for Blackwell. Reinstall with the cu128 wheel per step 2 above (Issue #22 tracks the 50-series upgrade; Issue #122 is the Blackwell Docker thread).

OOM during shape generation (with a monitor attached)

The 10GB shape peak is close to the usable VRAM of a 12GB desktop card once a display is attached (~10.5–11.3GB usable). If shape generation OOMs, close your browser and any GPU-using desktop apps to free VRAM, pass --low_vram_mode to the Gradio app, or run the job on a headless Linux session (~11.6GB usable). This headroom problem does not exist on 16GB+ cards.
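
To see how close your own run gets to the limit, you can wrap the generation call and read back PyTorch's peak-allocation counter. This is a rough sketch: run_and_report_peak is a hypothetical helper, and torch.cuda.max_memory_allocated() counts only tensor memory managed by PyTorch, so the figure will read lower than what nvidia-smi shows for the whole process.

```python
def run_and_report_peak(job):
    """Run job() and return (result, peak_gib); peak_gib is None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        torch = None
    if torch is None or not torch.cuda.is_available():
        return job(), None
    torch.cuda.reset_peak_memory_stats()  # start counting from this point
    result = job()
    torch.cuda.synchronize()              # wait for queued GPU work to finish
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    return result, peak_gib


if __name__ == "__main__":
    # Stand-in job; in practice pass a callable that runs the shape pipeline.
    _, peak = run_and_report_peak(lambda: sum(range(10)))
    print("peak GiB:", peak)
```

In practice you would pass something like a lambda that calls shape_pipeline(image='assets/demo.png') and compare the reported peak with the free VRAM you started with.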

FlashAttention-2 errors / wheel-build failures

Hunyuan3D-2.1's shape diffusion uses PyTorch's default SDPA (scaled_dot_product_attention) backend, which has full Blackwell sm_120 support via the cu128 wheel and needs no extra install. If a dependency tree pulls in flash-attn for an unrelated reason and you hit Could not build wheels for flash-attn, note that the canonical FA2 wheel does not yet ship sm_120 kernels — tracked upstream at Dao-AILab/flash-attention#2168. Pin the dependency tree to skip flash-attn or run without it (SDPA is the default path).
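
A quick way to tell whether something in your environment pulled in flash-attn is to look for the module without importing it. A minimal sketch using only the standard library; flash_attn is the import name of the flash-attn package, and the helper name is illustrative.

```python
import importlib.util


def is_installed(module_name):
    """True if a top-level module can be found on the import path (not imported)."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("torch", "flash_attn"):
        print(f"{name}: {'installed' if is_installed(name) else 'not installed'}")
```

If flash_attn shows as not installed, the build error came from a dependency trying to install it rather than from anything the shape pipeline needs.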

Want to texture locally anyway? mmgp offloading

Texture generation needs 21GB, but the deepbeepmeep/Hunyuan3D-2GP memory-management layer (pip install mmgp) lets you stream the paint pipeline through limited VRAM. The mobile-5080 owner in Issue #15 confirmed this works for them on a 16GB Blackwell card after adapting the demo script (community workaround, not official Tencent support). On a 12GB 5070 it will be tighter still and slower than a card that fits the full 29GB combined peak — for a no-compromise local mesh+texture pass you need ≥29GB (e.g. an RTX 5090).

Want less VRAM up front? Look at Hunyuan3D-Omni

The sibling model Hunyuan3D-Omni adds multi-modal control (point cloud / voxel / pose / bounding-box conditioning on top of image input) and its HF card states "It takes 10 GB VRAM for generation." Same install (cu128 Blackwell wheel), same license. Useful if you want skeletal or voxel control over the generated mesh.