What You'll Build
A working SenseNova U1 install on a 16 GB RTX 5060 Ti that does all four of the U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or separate visual encoder. The recipe pins the 8B-MoT dense variant (18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.9 GB on disk) with --vram_mode balanced to keep peak VRAM in the ~10–12 GB range that the official model card calls out as the consumer-GPU sweet spot.
Hardware data: RTX 5060 Ti (16GB VRAM) · Q4_K_S GGUF, --vram_mode balanced · See benchmark data
ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same
sensenovaorg: the denseSenseNova-U1-8B-MoT(18B total params, this recipe) and the sparse-MoESenseNova-U1-A3B-MoT(39B total / ~3B active). The A3B's smallest community quant —smthem/SenseNova-U1-A3B-MoT-SFT-ggufQ4_K_S — is 26.4 GB, so the MoE variant does not fit a 16 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (per model card "~10–12 GB consumer cards" guidance) | RTX 5060 Ti (16 GB) |
| RAM | 32 GB (--vram_mode balanced does async CPU↔GPU prefetch) | — |
| Storage | 14 GB (Q4_K_S GGUF) or 35.2 GB (full BF16 8 shards × ~4–5 GB) | — |
| Python | 3.11 | — |
| PyTorch | 2.8 + CUDA 12.8 wheel (cu128) | — |
| Software | uv package manager; ComfyUI optional | — |
The model is released under Apache 2.0 — commercial use is permitted.
Installation
1. Clone the upstream repo
git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1
2. Install with uv (default targets CUDA 12.8 / Blackwell)
uv sync
source .venv/bin/activate
Per docs/installation.md: "Python 3.11, torch 2.8, CUDA 12.8 (cu128)" is the recommended configuration. The default uv sync pulls cu128 wheels — exactly what RTX 50-series (sm_120 Blackwell) needs. If you have older NVIDIA drivers, edit [tool.uv.sources] in pyproject.toml to use cu126 before running uv sync.
3. Add GGUF support
uv pip install -e ".[gguf]"
This is what lets the t2i CLI load the smaller Q4 quantization rather than the full 35 GB BF16 checkpoint.
4. Download the Q4_K_S GGUF merger weights
huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
SenseNova-U1-8B-MoT-Q6_K.gguf \
--local-dir ./checkpoints
# Or for the tighter 14GB option:
huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
--local-dir ./checkpoints
The GGUF redistributor publishes a per-quant-tier file-size table: Q4_K_S = 13.9 GB, Q6_K = 16.0–16.1 GB, Q8_0 = 20 GB. On a 16 GB card, Q4_K_S is the safe choice with headroom for the understanding-side weights; Q6_K sits right at the card capacity and will rely on --vram_mode balanced aggressively. The upstream model card links this exact mirror in its "Memory-Efficient Mode" section.
5. (Optional) FlashAttention — only if uv sync didn't already pull it
uv sync --extra flash
Per docs/installation.md, "flash-attention is optional; the model transparently falls back to torch SDPA." This is important on Blackwell cards where FA2/FA3 wheels can lag — the SDPA fallback means you don't have to wait for sm_120-compatible kernels. Skip this step if uv sync already succeeded.
Running
Text-to-image with Q4 GGUF + balanced VRAM mode
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
--vram_mode balanced \
--prompt "A male peacock trying to attract a female" \
--width 2048 --height 2048 \
--cfg_scale 4.0 --num_steps 50 \
--output output.png
--vram_mode balanced does async CPU↔GPU prefetch of layer weights, and per the upstream README "Q4 GGUF + balanced is the recommended setup for ~10–12 GB consumer cards" — comfortably inside the 5060 Ti's 16 GB.
For tighter memory (e.g. if you're sharing the GPU with another process), drop to --vram_mode low, which does synchronous per-layer swap and is slower but cuts peak resident VRAM further.
Visual Q&A (understanding) on the same model
python examples/vqa/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--image examples/vqa/data/images/menu.jpg \
--question "What's the cheapest item on this menu?" \
--max_new_tokens 8192 \
--output answer.txt
The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts.
ComfyUI alternative
A community ComfyUI node — smthemex/ComfyUI_SenseNova_U1, maintained by the same author as the GGUF mirror — wraps the same checkpoints in a ComfyUI workflow. Its README documents "Test it use 8G Vram 36G Ram" (with aggressive CPU offload) and "If Vram >16G make prefetch_count =0" — so on a 16 GB card you can set prefetch_count=0 for full-GPU residency, or keep the default for headroom.
Results
- Speed: Not quoted here — the only cited timing in upstream docs (
docs/inference_infra.mdreferenced from the model card) names H100/H200 with TP2+CFG2 ("~9 s end-to-end for 2048×2048", "~0.15 s/step"). No comparable single-card consumer benchmark exists yet. The/check/sensenova-u1/rtx-5060-tipage will surface community-submitted timings as they land. - VRAM usage: Per the HF model card, Q4 GGUF +
--vram_mode balancedis the recommended setup for "~10–12 GB consumer cards." The Q4_K_S file itself is 13.9 GB on disk; runtime peak with the balanced mode's async prefetch sits in the documented 10–12 GB band, leaving ~4 GB headroom on a 16 GB 5060 Ti for the understanding-path activations. - Quality notes: The VAE-free / visual-encoder-free design — described on the model card as "models pixel-word information end-to-end" via the NEO-unify architecture — is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes the unified-VRAM budget realistic.
For the full benchmark data, see /check/sensenova-u1/rtx-5060-ti.
Troubleshooting
flash_attn install fails or the model crashes on first inference (Blackwell sm_120)
FlashAttention 2/3 wheels frequently lag on new GPU architectures. Per the upstream installation guide, flash-attention is optional here — the model "transparently falls back to torch SDPA" without it. Simply skip the --extra flash step (or uv sync without it) and the inference path will use PyTorch's built-in SDPA, which has full sm_120 support in torch 2.8 + cu128.
"Out of memory" with --vram_mode balanced on the 16 GB card
Two escalation paths, in order:
- Drop quantization tier: if you were using Q6_K (16.1 GB on disk, per the mirror), switch to Q4_K_S (13.9 GB). Q6_K sits right at the card's capacity and only works with aggressive prefetch.
- Switch to
--vram_mode low: synchronous per-layer CPU↔GPU swap; slower but cuts peak VRAM further. Used by the ComfyUI node's "8G Vram 36G Ram" test path.
I want the A3B MoE variant instead
It does not fit a 16 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.4 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all 39B total parameters must be resident because the MoE router picks experts at runtime. Until a smaller quant lands (Q3 / Q2_K), keep to the 8B-MoT dense variant on this card or request the A3B recipe via /contribute.
My downloaded image is washed-out or has CFG artifacts at default settings
Per the upstream README, the canonical default for the standard checkpoint is --cfg_scale 4.0 --num_steps 50. For the 8-step distillation variants (Q4_K_S files with 8step in the filename), use the model card's published 8-step preset rather than the 50-step config. Report unexpected quality regressions via the submission form.