What You'll Build
A full image-to-3D pipeline using Tencent's Hunyuan3D-2.1 — drop in a reference photo, get back a fully textured .glb mesh with physically-based rendering (PBR) materials (albedo, metallic, roughness) ready for any DCC tool. Unlike the 16GB RTX 5060 Ti sibling — which is forced to ship the shape-only flow — the 32GB RTX 5090 is the first consumer NVIDIA card that fits Hunyuan3D-2.1's combined shape + texture peak in a single process, no offload, no two-machine workflow.
Hardware data: RTX 5090 (32GB VRAM) · 29GB peak for combined shape + texture · See benchmark data
ℹ️ First consumer card to fit the 29 GB combined pipeline. Per the official README: "It takes 10 GB VRAM for shape generation, 21GB for texture generation and 29GB for shape and texture generation in total." The 5090's 32 GB envelope clears the 29 GB peak with ~3 GB margin — the 4090 (24 GB) and 3090 (24 GB) physically can't, and the 5060 Ti (16 GB) is shape-only. This is the workflow Tencent designed; the smaller cards run a subset.
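The arithmetic behind that note can be sketched in a few lines. The stage peaks are the README figures quoted above and the card sizes are nominal VRAM capacities; the helper name stages_that_fit is hypothetical. Nominal capacity overstates what a process can actually allocate, so a positive margin here is a necessary condition, not a guarantee.

```python
# Stage peaks in GB, taken from the README sentence quoted above.
STAGE_PEAK_GB = {"shape": 10, "texture": 21, "shape+texture": 29}

# Nominal VRAM in GB for the cards named in this guide.
CARD_VRAM_GB = {"RTX 5090": 32, "RTX 4090": 24, "RTX 3090": 24, "RTX 5060 Ti": 16}


def stages_that_fit(vram_gb):
    """Return {stage: margin_gb} for every stage whose peak fits in vram_gb."""
    return {
        stage: vram_gb - peak
        for stage, peak in STAGE_PEAK_GB.items()
        if peak <= vram_gb
    }


if __name__ == "__main__":
    for card, vram in CARD_VRAM_GB.items():
        print(f"{card:12s} {vram:2d} GB -> {stages_that_fit(vram)}")
```

Only the 32 GB entry returns a margin for the combined shape+texture peak; the 24 GB and 16 GB entries return margins for individual stages only.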
⚠️ Blackwell sm_120 needs a different PyTorch wheel than the README documents. The repo pins torch==2.5.1+cu124, but the RTX 5090's sm_120 compute capability requires the CUDA 12.8 toolchain and PyTorch 2.7+. We swap the wheel in step 2 below. This is acknowledged community-workaround territory; see Issue #22.
Requirements
| Component | Minimum | Community-tested |
|---|---|---|
| GPU | 29GB VRAM (full shape + texture pipeline) | RTX 5090 (32GB): community reports of working 5090 runs in Issue #22 (PlanarFox) and Issue #122 (Gylfkxjyjdll); not yet measured firsthand |
| RAM | 32GB | — |
| Storage | ~20GB for weights + Real-ESRGAN checkpoint | — |
| Software | Python 3.10, CUDA 12.8, PyTorch 2.7.0+cu128 | — |
Installation
1. Clone the official repository
git clone https://github.com/Tencent-Hunyuan/Hunyuan3D-2.1.git
cd Hunyuan3D-2.1
2. Install PyTorch with cu128 wheels (Blackwell override)
The README pins torch==2.5.1+cu124, which does not include Blackwell sm_120 kernels and fails at first inference on a 5090 with the error "no kernel image is available for execution on the device". Use the cu128 build instead, and export TORCH_CUDA_ARCH_LIST="12.0" first so the in-tree custom rasterizer and mesh-painter extensions are compiled for the right architecture.
export TORCH_CUDA_ARCH_LIST="12.0"
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
cd hy3dpaint/custom_rasterizer
pip install -e .
cd ../..
cd hy3dpaint/DifferentiableRenderer
bash compile_mesh_painter.sh
cd ../..
wget https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth -P hy3dpaint/ckpt
The torch==2.7.0 + cu128 + TORCH_CUDA_ARCH_LIST="12.0" triplet is the community-confirmed RTX 5090 path documented in Issue #22 by PlanarFox (community contributor; reporter pengpeng followed up that they ran it on a "5090 mobile" with the same recipe) and in Issue #122 by Gylfkxjyjdll (full Dockerfile, also community-authored). Contributor DenisKochetov confirmed in the same thread that it works with CUDA 12.8. Note that GitHub's CONTRIBUTOR association does not mean membership of the Tencent maintainer team, so treat this as community-aligned guidance rather than official support.
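Before running inference, it can help to confirm that the installed wheel was actually built for the card. The sketch below is one way to do that, assuming PyTorch is importable; parse_arch, build_supports and report are hypothetical helper names. The check is deliberately conservative: it only looks for an exact sm_XY match and ignores same-major binary compatibility and PTX JIT.

```python
import os


def parse_arch(tag):
    """Turn an arch tag such as 'sm_120' into the capability tuple (12, 0)."""
    digits = "".join(ch for ch in tag.split("_", 1)[1] if ch.isdigit())
    return int(digits[:-1]), int(digits[-1])


def build_supports(arch_list, capability):
    """True if any compiled sm_* arch matches the device capability exactly."""
    return any(
        parse_arch(tag) == tuple(capability)
        for tag in arch_list
        if tag.startswith("sm_")
    )


def report():
    """Print the torch build, the arch-list env var, and whether they cover GPU 0."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    print("TORCH_CUDA_ARCH_LIST =", os.environ.get("TORCH_CUDA_ARCH_LIST"))
    if not torch.cuda.is_available():
        print("no CUDA device visible")
        return
    cap = torch.cuda.get_device_capability(0)
    print("device capability:", cap)
    print("build covers device:", build_supports(torch.cuda.get_arch_list(), cap))


if __name__ == "__main__":
    report()
```

On a 5090 the capability should print as (12, 0). If "build covers device" is False, the wheel is the likely cause of a "no kernel image" failure.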
3. Read the Tencent license before you deploy
The weights are not gated on Hugging Face (they download freely on the first pipeline call via huggingface_hub, no click-through), but they are governed by the Tencent Hunyuan 3D 2.1 Community License — license: other, not Apache 2.0. Verbatim from the license header: "THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW." Section 4 requires explicit Tencent approval for any product whose monthly active users exceed 1 million. The free download does not waive these terms — read the LICENSE in full before deploying anything user-facing.
Running
Use the official Python API from the Hunyuan3D-2.1 README. On the 5090 you can run both the shape and the paint pipelines in one process — no need for --low_vram_mode, no need to split the workload across two machines.
import sys
sys.path.insert(0, './hy3dshape')
sys.path.insert(0, './hy3dpaint')
from textureGenPipeline import Hunyuan3DPaintPipeline, Hunyuan3DPaintConfig
from hy3dshape.pipelines import Hunyuan3DDiTFlowMatchingPipeline
# Stage 1 — shape (Hunyuan3D-Shape-v2-1, 3.3B DiT)
shape_pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained('tencent/Hunyuan3D-2.1')
mesh_untextured = shape_pipeline(image='assets/demo.png')[0]
mesh_untextured.export('output_untextured.glb')
# Stage 2 — PBR texture (Hunyuan3D-Paint-v2-1, 2B)
paint_pipeline = Hunyuan3DPaintPipeline(Hunyuan3DPaintConfig(max_num_view=6, resolution=512))
mesh_textured = paint_pipeline(mesh_path='output_untextured.glb', image_path='assets/demo.png')
The paint stage produces a .glb with embedded PBR maps (albedo, metallic, roughness). Open it in Blender, three.js, Unity, Unreal, or any other glTF 2.0 viewer; metallic-roughness is the core glTF 2.0 material model, so no extension such as KHR_materials_pbrSpecularGlossiness is needed.
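To confirm that an exported file really carries PBR textures without opening a DCC tool, the GLB container can be inspected with the standard library alone. This is a minimal sketch. In glTF 2.0, albedo maps to baseColorTexture, and metallic and roughness share one metallicRoughnessTexture. Whether the paint stage fills both slots is an assumption to verify on your own output, and the function names are hypothetical.

```python
import json
import struct
import sys

GLB_MAGIC = b"glTF"
JSON_CHUNK = 0x4E4F534A  # ASCII "JSON" read as a little-endian uint32


def read_glb_json(data):
    """Return the JSON scene description stored in the first chunk of GLB bytes."""
    magic, version, _total_length = struct.unpack_from("<4sII", data, 0)
    if magic != GLB_MAGIC or version != 2:
        raise ValueError("not a glTF 2.0 binary file")
    chunk_len, chunk_type = struct.unpack_from("<II", data, 12)
    if chunk_type != JSON_CHUNK:
        raise ValueError("first chunk is not JSON")
    return json.loads(data[20:20 + chunk_len].decode("utf-8"))


def summarize_materials(gltf):
    """List, per material, which metallic-roughness texture slots are populated."""
    summary = []
    for mat in gltf.get("materials", []):
        pbr = mat.get("pbrMetallicRoughness", {})
        summary.append({
            "name": mat.get("name"),
            "base_color_texture": "baseColorTexture" in pbr,
            "metallic_roughness_texture": "metallicRoughnessTexture" in pbr,
        })
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of whatever .glb the paint stage wrote.
    with open(sys.argv[1], "rb") as fh:
        print(summarize_materials(read_glb_json(fh.read())))
```

A material that reports False for both slots is effectively untextured, which usually means the file came from the shape stage rather than the paint stage.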
If you prefer a UI, run the Gradio app — on a 5090 you don't need --low_vram_mode, but it doesn't hurt:
python3 gradio_app.py \
--model_path tencent/Hunyuan3D-2.1 \
--subfolder hunyuan3d-dit-v2-1 \
--texgen_model_path tencent/Hunyuan3D-2.1
Results
- VRAM usage: 29 GB peak for the combined shape + texture pipeline, cited verbatim from the official README ("It takes 10 GB VRAM for shape generation, 21GB for texture generation and 29GB for shape and texture generation in total."). That leaves ~3 GB of headroom on the 5090's 32 GB envelope — comfortable margin for a Blender background process or a co-located image-classification head, but not enough for a heavy second pipeline. For empirical numbers on this exact pair once a community submission lands, see /check/hunyuan-3d/rtx-5090.
- Output format: .glb (glTF 2.0 binary) with embedded PBR material maps. Universal: import to Blender, Unity, Unreal, three.js, or convert to .obj via trimesh in Python; for .fbx or .usd, exporting from a DCC tool such as Blender is the safer route.
- Speed: intentionally omitted. Tencent does not publish per-GPU timings, and no first-party RTX 5090 benchmark exists for Hunyuan3D-2.1 as of this writing. Community runs in Issue #24 name "~70s mesh+texture on 4090" (Ada sm_89, by community user nepfaff) and "less than 100s mesh + 7-10 mins mesh+texture on 3090" (Ampere sm_86, by jtydhr88, GitHub CONTRIBUTOR: community-aligned but not Tencent staff). Both are different architectures, both use different stage configurations, and neither belongs in a 5090-pinned Results line. The 5090's uplift over the 4090 (roughly 30% more compute, with a larger gain in memory bandwidth) suggests sub-60s end-to-end is plausible, but until a 5090-named measurement appears, we route timings to /check/.
For the full benchmark data, see /check/hunyuan-3d/rtx-5090.
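If you want to produce your own numbers for this card, a small measurement wrapper is enough. The sketch below records wall-clock seconds and, when PyTorch with CUDA is present, the peak memory reported by torch.cuda.max_memory_allocated. The name measured is hypothetical. The allocator peak excludes the CUDA context and non-PyTorch allocations, so it will read lower than nvidia-smi and is not directly comparable to the README figures.

```python
import time
from contextlib import contextmanager


def _torch_cuda():
    """Return the torch module if CUDA is usable, otherwise None."""
    try:
        import torch
    except ImportError:
        return None
    return torch if torch.cuda.is_available() else None


@contextmanager
def measured(label, results):
    """Store seconds and peak allocator GB for the wrapped block in results[label]."""
    torch = _torch_cuda()
    if torch is not None:
        torch.cuda.synchronize()            # finish earlier GPU work first
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    try:
        yield
    finally:
        peak_gb = None
        if torch is not None:
            torch.cuda.synchronize()        # wait for this block's GPU work
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
        results[label] = {
            "seconds": time.perf_counter() - start,
            "peak_gb": peak_gb,
        }


# Usage with the pipelines from the Running section (not executed here):
#   stats = {}
#   with measured("shape", stats):
#       mesh_untextured = shape_pipeline(image='assets/demo.png')[0]
#   print(stats)
```

Discard the first run when timing, since it includes weight download and model loading.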
Troubleshooting
CUDA error: an illegal memory access was encountered during texture stage
A community report against cu128 + driver 580 + torch 2.9.0 dev nightly on a 5090 is open at Issue #146 — reporter zapan-669 (community, no maintainer ack at time of writing) sees the shape stage complete cleanly but the paint pipeline crashes during rasterization. Workarounds reported in adjacent threads:
- Pin torch==2.7.0+cu128 (stable) rather than a dev nightly; the install command in step 2 above uses this combination.
- Make sure TORCH_CUDA_ARCH_LIST="12.0" is set in your environment before running pip install -e . in hy3dpaint/custom_rasterizer. The custom rasterizer kernel is compiled at install time and needs the right arch list.
- Restart the Python process between runs. Community contributor Gylfkxjyjdll reports in Issue #122: "With the 5090? Yes i can generate with texture. In the first run i hade something like memory leak. The vram increased with every run and failed. But after restarting the docker it worked as expected" (community user, no Tencent badge; included verbatim for the workaround pattern, not as Tencent guidance).
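The restart-between-runs workaround can be automated by launching each generation in its own interpreter, so the driver reclaims all VRAM when the child exits. This is a minimal sketch using only the standard library. generate_one.py is a hypothetical script wrapping the shape and paint calls from the Running section; it is not part of the Hunyuan3D-2.1 repo.

```python
import subprocess
import sys


def run_isolated(args, timeout=None):
    """Run one job in a fresh Python interpreter and return its exit code.

    GPU memory held by the child is released when the process exits, which
    sidesteps VRAM that grows across runs inside one long-lived process.
    """
    completed = subprocess.run([sys.executable, *args], timeout=timeout)
    return completed.returncode


if __name__ == "__main__":
    # Each command-line argument is treated as one input image path.
    for image in sys.argv[1:]:
        code = run_isolated(["generate_one.py", image])
        print(image, "ok" if code == 0 else f"failed with exit code {code}")
```

The cost is that the models reload for every image, so this trades throughput for stability.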
no kernel image is available for execution on the device
You installed the README's pinned torch==2.5.1+cu124 wheel, which does not include sm_120 kernels for Blackwell. Reinstall with the cu128 wheel per step 2 above.
FlashAttention-2 errors / wheel-build failures
Hunyuan3D-2.1's diffusion stages use diffusers' default SDPA (scaled_dot_product_attention) backend, which has full Blackwell sm_120 support via the cu128 PyTorch wheel and needs no extra install. If a dependency tree pulls in flash-attn for an unrelated reason and you hit Could not build wheels for flash-attn, note that the canonical FA2 wheel does not yet ship sm_120 kernels — tracked upstream at Dao-AILab/flash-attention#2168. Pin the dependency tree to skip flash-attn or run without it (SDPA is the default path).
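A quick way to see which situation you are in is to check whether flash-attn is present at all and whether the SDPA entry point exists in the installed PyTorch. The sketch below assumes the import name flash_attn for the flash-attn package; the helper names are hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """True if a top-level module can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


def sdpa_available():
    """True if PyTorch is importable and exposes scaled_dot_product_attention."""
    if not is_installed("torch"):
        return False
    import torch.nn.functional as F
    return hasattr(F, "scaled_dot_product_attention")


if __name__ == "__main__":
    print("flash_attn installed:", is_installed("flash_attn"))
    print("SDPA available:", sdpa_available())
```

If flash_attn is not installed and SDPA is available, nothing further is needed for the default path described above.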
Want lower VRAM (~10 GB)? Look at Hunyuan3D-Omni
If you want to free up most of the 5090's 32 GB for a parallel workload, the sibling model Hunyuan3D-Omni explicitly states "It takes 10 GB VRAM for generation." on its model card. It adds multi-modal control (point cloud / voxel / pose / bounding-box conditioning) on top of image input, uses the same install path, and the same license. Useful if you want to colocate the 3D pipeline with a second model (e.g. an LLM for prompt enhancement, an image-segmentation backbone for input preprocessing) on the same card.