How much VRAM does SenseNova U1 need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation steps below.

SenseNova U1 (8B-MoT) on RTX 5060 Ti: VAE-Free Unified Image Gen + Understanding via Q4 GGUF

What You'll Build

A working SenseNova U1 install on a 16 GB RTX 5060 Ti that does all four of the U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or separate visual encoder. The recipe pins the 8B-MoT dense variant (18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.9 GB on disk) with --vram_mode balanced to keep peak VRAM in the ~10–12 GB range that the official model card calls out as the consumer-GPU sweet spot.

Hardware data: RTX 5060 Ti (16GB VRAM) · Q4_K_S GGUF, --vram_mode balanced · See benchmark data

ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same sensenova org: the dense SenseNova-U1-8B-MoT (18B total params, this recipe) and the sparse-MoE SenseNova-U1-A3B-MoT (39B total / ~3B active). The A3B's smallest community quant — smthem/SenseNova-U1-A3B-MoT-SFT-gguf Q4_K_S — is 26.4 GB, so the MoE variant does not fit a 16 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (per model card "~10–12 GB consumer cards" guidance)	RTX 5060 Ti (16 GB)
RAM	32 GB (`--vram_mode balanced` does async CPU↔GPU prefetch)	—
Storage	14 GB (Q4_K_S GGUF) or 35.2 GB (full BF16 8 shards × ~4–5 GB)	—
Python	3.11	—
PyTorch	2.8 + CUDA 12.8 wheel (cu128)	—
Software	uv package manager; ComfyUI optional	—

The model is released under Apache 2.0 — commercial use is permitted.

Installation

1. Clone the upstream repo

git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1

2. Install with uv (default targets CUDA 12.8 / Blackwell)

uv sync
source .venv/bin/activate

Per docs/installation.md: "Python 3.11, torch 2.8, CUDA 12.8 (cu128)" is the recommended configuration. The default uv sync pulls cu128 wheels — exactly what RTX 50-series (sm_120 Blackwell) needs. If you have older NVIDIA drivers, edit [tool.uv.sources] in pyproject.toml to use cu126 before running uv sync.
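To confirm the environment matches that recommendation after uv sync, a small check along these lines may help. It is a minimal sketch: check_env is a hypothetical helper (not part of the SenseNova-U1 repo), and it assumes the RTX 5060 Ti reports compute capability (12, 0), i.e. sm_120.

```python
import sys


def check_env(python_version, torch_version, cuda_version, capability):
    """Return a list of warnings; an empty list means the setup matches the recipe."""
    warnings = []
    if tuple(python_version[:2]) != (3, 11):
        warnings.append(
            f"Python {python_version[0]}.{python_version[1]} found; 3.11 is recommended"
        )
    if not torch_version.startswith("2.8"):
        warnings.append(f"torch {torch_version} found; 2.8 is recommended")
    if cuda_version != "12.8":
        if capability == (12, 0):
            # Assumption: (12, 0) is what a Blackwell sm_120 card reports.
            warnings.append(f"CUDA build {cuda_version} found; sm_120 needs the cu128 wheel")
        else:
            warnings.append(f"CUDA build {cuda_version} found; cu128 is the default target")
    return warnings


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        cap = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
        problems = check_env(sys.version_info, torch.__version__, torch.version.cuda, cap)
        print("\n".join(problems) if problems else "environment matches the recipe")
```

Run it inside the activated .venv; on a CPU-only torch build torch.version.cuda is None, so the CUDA warning will fire.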

3. Add GGUF support

uv pip install -e ".[gguf]"

This is what lets the t2i CLI load the smaller Q4 quantization rather than the full 35 GB BF16 checkpoint.

4. Download the Q4_K_S GGUF merger weights

huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
  SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --local-dir ./checkpoints
# Or, for the larger ~16 GB Q6_K option (sits right at the card's capacity):
huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
  SenseNova-U1-8B-MoT-Q6_K.gguf \
  --local-dir ./checkpoints

The GGUF redistributor publishes a per-quant-tier file-size table: Q4_K_S = 13.9 GB, Q6_K = 16.0–16.1 GB, Q8_0 = 20 GB. On a 16 GB card, Q4_K_S is the safe choice with headroom for the understanding-side weights; Q6_K sits right at the card capacity and will rely on --vram_mode balanced aggressively. The upstream model card links this exact mirror in its "Memory-Efficient Mode" section.
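The size comparison above is simple subtraction, sketched here for any card size. The file sizes are the ones quoted in this section (upper bound used for Q6_K); file_vs_card is a hypothetical helper name. File size is only a rough proxy: with --vram_mode balanced part of the weights stay in CPU RAM, so runtime peak VRAM can be lower than the file size, while activations add to it.

```python
# On-disk sizes in GB as quoted in the text (Q6_K is listed as 16.0-16.1; 16.1 used).
QUANT_SIZES_GB = {"Q4_K_S": 13.9, "Q6_K": 16.1, "Q8_0": 20.0}


def file_vs_card(card_gb, sizes=QUANT_SIZES_GB):
    """Map each quant tier to card_gb minus file size; negative means the file alone exceeds VRAM."""
    return {name: round(card_gb - size, 1) for name, size in sizes.items()}


if __name__ == "__main__":
    for name, margin in file_vs_card(16).items():
        status = "fits with margin" if margin > 0 else "needs offload"
        print(f"{name}: {margin:+.1f} GB ({status})")
```

For a 16 GB card only Q4_K_S comes out positive, which matches the recommendation above.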

5. (Optional) FlashAttention — only if `uv sync` didn't already pull it

uv sync --extra flash

Per docs/installation.md, "flash-attention is optional; the model transparently falls back to torch SDPA." This is important on Blackwell cards where FA2/FA3 wheels can lag — the SDPA fallback means you don't have to wait for sm_120-compatible kernels. Skip this step if uv sync already succeeded.

Running

Text-to-image with Q4 GGUF + balanced VRAM mode

python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode balanced \
  --prompt "A male peacock trying to attract a female" \
  --width 2048 --height 2048 \
  --cfg_scale 4.0 --num_steps 50 \
  --output output.png

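--vram_mode balanced does async CPU↔GPU prefetch of layer weights. Per the upstream README, "Q4 GGUF + balanced is the recommended setup for ~10–12 GB consumer cards", which sits comfortably inside the 5060 Ti's 16 GB. Note that --cfg_scale 4.0 --num_steps 50 is the README default for the standard checkpoint; the file loaded here is an 8-step distillation (8step in the filename), so substitute the model card's published 8-step preset for those two flags.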

For tighter memory (e.g. if you're sharing the GPU with another process), drop to --vram_mode low, which does synchronous per-layer swap and is slower but cuts peak resident VRAM further.
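If you switch between the two modes often, the command can be assembled programmatically. The sketch below uses only the flags shown in this section; t2i_command is a hypothetical wrapper, and it accepts just the two modes named here (the CLI may support others).

```python
import shlex


def t2i_command(prompt, gguf_path, vram_mode="balanced", size=2048,
                cfg_scale=4.0, num_steps=50, output="output.png"):
    """Build the argv list for examples/t2i/inference.py from flags shown in this recipe."""
    if vram_mode not in ("balanced", "low"):
        # Assumption: only the modes mentioned in this recipe are validated here.
        raise ValueError(f"unexpected vram_mode: {vram_mode!r}")
    return [
        "python", "examples/t2i/inference.py",
        "--model_path", "sensenova/SenseNova-U1-8B-MoT",
        "--gguf_checkpoint", gguf_path,
        "--vram_mode", vram_mode,
        "--prompt", prompt,
        "--width", str(size), "--height", str(size),
        "--cfg_scale", str(cfg_scale), "--num_steps", str(num_steps),
        "--output", output,
    ]


if __name__ == "__main__":
    cmd = t2i_command(
        "A male peacock trying to attract a female",
        "./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf",
        vram_mode="low",
    )
    print(shlex.join(cmd))  # copy-pasteable shell line
```

Passing the returned list to subprocess.run(cmd, check=True) from the repo root would launch the run; the sketch only prints the command so it can be inspected first.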

Visual Q&A (understanding) on the same model

python examples/vqa/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --image examples/vqa/data/images/menu.jpg \
  --question "What's the cheapest item on this menu?" \
  --max_new_tokens 8192 \
  --output answer.txt

The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts.

ComfyUI alternative

A community ComfyUI node, smthemex/ComfyUI_SenseNova_U1 (maintained by the same author as the GGUF mirror), wraps the same checkpoints in a ComfyUI workflow. Its README documents "Test it use 8G Vram 36G Ram" (with aggressive CPU offload) and "If Vram >16G make prefetch_count =0". A 16 GB card sits at that threshold rather than above it, so keep the default prefetch setting for headroom and try prefetch_count=0 (full-GPU residency) only if memory allows.

Results

Speed: Not quoted here — the only cited timing in upstream docs (docs/inference_infra.md referenced from the model card) names H100/H200 with TP2+CFG2 ("~9 s end-to-end for 2048×2048", "~0.15 s/step"). No comparable single-card consumer benchmark exists yet. The /check/sensenova-u1/rtx-5060-ti page will surface community-submitted timings as they land.
VRAM usage: Per the HF model card, Q4 GGUF + --vram_mode balanced is the recommended setup for "~10–12 GB consumer cards." The Q4_K_S file itself is 13.9 GB on disk, but balanced mode keeps part of the weights in CPU RAM and prefetches them, so runtime peak sits in the documented 10–12 GB band. That leaves roughly 4–6 GB of headroom on a 16 GB 5060 Ti for the understanding-path activations.
Quality notes: The VAE-free / visual-encoder-free design — described on the model card as "models pixel-word information end-to-end" via the NEO-unify architecture — is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes the unified-VRAM budget realistic.
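Since no consumer-card measurements are quoted here, you may want to record peak VRAM on your own card. One way, sketched under the assumption that nvidia-smi is on PATH, is to poll its CSV query output from a second terminal while inference runs. The helper names are hypothetical; the nvidia-smi flags are standard, and values are reported in MiB.

```python
import subprocess
import time


def parse_memory_csv(text):
    """Parse 'used, total' rows (MiB) from nvidia-smi CSV output into int tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory on every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


def watch_peak(seconds=60, interval=1.0):
    """Poll GPU 0 for a fixed duration and return the highest used-memory reading in MiB."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        used, _total = query_gpu_memory()[0]
        peak = max(peak, used)
        time.sleep(interval)
    return peak
```

Calling watch_peak() just before starting the inference command gives a peak figure that includes any other processes on the GPU, so close other GPU applications first for a clean reading.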

For the full benchmark data, see /check/sensenova-u1/rtx-5060-ti.

Troubleshooting

`flash_attn` install fails or the model crashes on first inference (Blackwell sm_120)

FlashAttention 2/3 wheels frequently lag on new GPU architectures. Per the upstream installation guide, flash-attention is optional here: the model "transparently falls back to torch SDPA" without it. Skip the --extra flash step (that is, run plain uv sync) and the inference path will use PyTorch's built-in SDPA, which has full sm_120 support in torch 2.8 + cu128.

"Out of memory" with `--vram_mode balanced` on the 16 GB card

Two escalation paths, in order:

1. Drop quantization tier: if you were using Q6_K (16.1 GB on disk, per the mirror), switch to Q4_K_S (13.9 GB). Q6_K sits right at the card's capacity and only works with aggressive prefetch.
2. Switch to --vram_mode low: synchronous per-layer CPU↔GPU swap; slower but cuts peak VRAM further. Used by the ComfyUI node's "8G Vram 36G Ram" test path.
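The escalation order can be written down as a tiny decision function, which is handy if you script retries. This is only an illustration of the two steps above; next_step is a hypothetical helper and it knows nothing about the actual error raised.

```python
def next_step(quant, vram_mode):
    """Return the next (quant, vram_mode) to try after an out-of-memory error, or None."""
    if quant != "Q4_K_S":
        # Step 1: drop to the smallest published tier before touching the mode.
        return ("Q4_K_S", vram_mode)
    if vram_mode != "low":
        # Step 2: trade speed for lower peak VRAM.
        return ("Q4_K_S", "low")
    return None  # Nothing left to reduce within this recipe.


if __name__ == "__main__":
    config = ("Q6_K", "balanced")
    while config is not None:
        print(config)
        config = next_step(*config)
```

Starting from Q6_K with balanced mode, the loop walks through Q4_K_S balanced and then Q4_K_S low before stopping.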

I want the A3B MoE variant instead

It does not fit a 16 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.4 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all 39B total parameters must be resident because the MoE router picks experts at runtime. Until a smaller quant lands (Q3 / Q2_K), keep to the 8B-MoT dense variant on this card or request the A3B recipe via /contribute.
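A back-of-the-envelope calculation shows why only a much lower quant tier could help. The sketch divides file size by total parameter count to get average bits per weight, treating 1 GB as 10^9 bytes (an assumption about how the sizes are reported) and ignoring activations and file metadata.

```python
def bits_per_weight(size_gb, params_billion):
    """Average bits per parameter for a file (or memory budget) of size_gb gigabytes."""
    return size_gb * 8 / params_billion


if __name__ == "__main__":
    current = bits_per_weight(26.4, 39)  # published A3B Q4_K_S file
    budget = bits_per_weight(16.0, 39)   # what a 16 GB card could hold at most
    print(f"A3B Q4_K_S averages {current:.2f} bits/weight")
    print(f"A 16 GB budget allows at most {budget:.2f} bits/weight")
```

The published file works out to about 5.4 bits per weight, while a 16 GB budget caps the average at about 3.3 bits per weight before any activations, which is why only a Q3-class or smaller quant could plausibly fit.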

My generated image is washed-out or has CFG artifacts at default settings

Per the upstream README, the canonical default for the standard checkpoint is --cfg_scale 4.0 --num_steps 50. For the 8-step distillation variants (GGUF files with 8step in the filename, such as the Q4_K_S file used in this recipe), use the model card's published 8-step preset rather than the 50-step config. Report unexpected quality regressions via the submission form.