What You'll Build
A working SenseNova U1 install on a 16 GB RTX 4080 that does all four U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or separate visual encoder. The recipe pins the 8B-MoT dense variant (~18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.9 GB on disk) with --vram_mode balanced to keep peak VRAM in the ~10–12 GB range that the official model card calls out as the consumer-GPU sweet spot.
Hardware data: RTX 4080 (16GB VRAM) · Q4_K_S GGUF, --vram_mode balanced · See benchmark data
ℹ️ Unified gen + understanding, in the image vertical. SenseNova U1 is a multimodal model: the same checkpoint produces images, answers questions about images, edits images, and produces interleaved text+image output. We catalogue it under image to match the wider site's grouping (and our /check/ row for it); the model card is explicit that the NEO-unify architecture models language and visual information "end-to-end as a unified compound."
ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same sensenova org: the dense SenseNova-U1-8B-MoT (~18B total params, this recipe) and the sparse-MoE SenseNova-U1-A3B-MoT (~39B total / ~3B active). The A3B's smallest community quant, smthem/SenseNova-U1-A3B-MoT-SFT-gguf Q4_K_S, is 26.4 GB on disk, so the MoE variant does not fit a 16 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (per model card "~10–12 GB consumer cards" guidance) | RTX 4080 (16 GB) |
| RAM | 32 GB (--vram_mode balanced does async CPU↔GPU prefetch — system RAM holds non-resident layers) | — |
| Storage | 14 GB (Q4_K_S GGUF) or ~35 GB (full BF16 merger checkpoint) | — |
| Python | 3.11 | — |
| PyTorch | 2.8 + CUDA 12.6 or 12.8 wheel (Ada sm_89 supported on both) | — |
| Software | uv package manager; ComfyUI optional | — |
The model is released under Apache 2.0 — commercial use is permitted.
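Before installing, you can check the two requirements that are easy to test from the standard library. This is a minimal sketch: the thresholds are copied from the table above, the function names are made up for illustration, and it assumes the sizes are quoted in decimal GB.

```python
import shutil
import sys

# Thresholds copied from the requirements table above.
REQUIRED_PYTHON = (3, 11)
REQUIRED_FREE_GB = 14  # Q4_K_S GGUF; use ~35 for the full BF16 checkpoint


def python_ok(version=None):
    """True when the interpreter is in the Python 3.11 series the table lists."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == REQUIRED_PYTHON


def disk_ok(free_bytes, required_gb=REQUIRED_FREE_GB):
    """True when free space covers the download (assumes decimal GB)."""
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    free = shutil.disk_usage(".").free
    print(f"Python 3.11: {'ok' if python_ok() else 'mismatch'}")
    print(f"Free disk >= {REQUIRED_FREE_GB} GB: {'ok' if disk_ok(free) else 'too low'}")
```

GPU VRAM and system RAM are left out on purpose, since checking them portably needs tools beyond the standard library.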
ℹ️ PCIe Gen4 ×16 host bandwidth helps the offload path. The --vram_mode balanced path streams non-resident layers from system RAM over the PCIe bus during the forward pass; the model card describes the balanced mode as async prefetch that "overlaps H2D copy with compute". That host-to-device copy is gated by PCIe bandwidth, and the RTX 4080 runs a full Gen4 ×16 link. Any consumer-GPU VRAM/offload figure in this recipe is therefore a conservative floor on the 4080: the card's link is never the tighter one, even where the underlying number was first established on a narrower-link Ada card.
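To put the PCIe point in rough numbers, here is a small sketch of the theoretical ceiling for a Gen4 link (16 GT/s per lane, 128b/130b encoding). The 4 GB of streamed weights is a hypothetical figure chosen for illustration, not a measured property of SenseNova U1, and real host-to-device throughput lands below the theoretical peak.

```python
def pcie_gb_per_s(gt_per_s=16.0, lanes=16, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s (defaults: Gen4 x16)."""
    return gt_per_s * lanes * encoding / 8


def min_copy_seconds(gigabytes, lanes=16):
    """Lower bound on the time to move `gigabytes` from host to device."""
    return gigabytes / pcie_gb_per_s(lanes=lanes)


if __name__ == "__main__":
    streamed_gb = 4.0  # hypothetical amount of non-resident weights
    for lanes in (16, 8):
        print(
            f"x{lanes}: {pcie_gb_per_s(lanes=lanes):.1f} GB/s peak, "
            f">= {min_copy_seconds(streamed_gb, lanes):.3f} s to copy {streamed_gb} GB"
        )
```

Halving the lane count doubles the lower bound on copy time, so the async prefetch has less compute time to hide behind on a narrower link.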
Installation
1. Clone the upstream repo
git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1
2. Install with uv
uv sync
source .venv/bin/activate
docs/installation.md recommends "Python 3.11, torch 2.8, CUDA 12.8 (cu128)". The default uv sync pulls cu128 wheels, and the RTX 40-series (sm_89, Ada Lovelace) has full kernel coverage on both cu128 and cu126, so the default is fine. If you have older NVIDIA drivers, switch to cu126 by editing [tool.uv.sources] in pyproject.toml per the upstream guide before running uv sync. Unlike Blackwell GPUs (sm_120), the 4080 needs no special wheel selection: the default pip install torch already includes sm_89 kernels.
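After uv sync finishes, you can confirm which CUDA build you got and that the GPU reports as sm_89. This is a minimal sketch using standard PyTorch calls; the helper name describe_torch_env is made up for illustration, and it returns None when torch is not importable.

```python
def describe_torch_env():
    """Return the torch CUDA build and GPU compute capability, or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # e.g. "12.8" for cu128; None on CPU-only builds
        "gpu": None,
        "capability": None,
    }
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        # An Ada card such as the RTX 4080 should report (8, 9), i.e. sm_89.
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    print(describe_torch_env())
```

If cuda_build is None, the wheel is CPU-only and the [tool.uv.sources] selection is worth rechecking.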
3. Add GGUF support
uv pip install -e ".[gguf]"
This is what lets the t2i CLI load the smaller Q4 quantization rather than the full BF16 checkpoint. The upstream README lists "gguf>=0.10.0" "diffusers>=0.30.0" as the relevant pins.
4. Download the Q4_K_S GGUF merger weights
huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
--local-dir ./checkpoints
The GGUF redistributor publishes a per-quant-tier file table — for this card you want Q4_K_S = 13.9 GB on disk (the smallest stable quant); Q6_K = 16.1 GB sits at or above the card's capacity and only works with aggressive prefetch; Q8_0 = 20.0 GB and BF16 ~35 GB do not fit. The upstream model card links this exact mirror in its quantized-weights section ("GGUF weights for SenseNova-U1-8B-MoT-Merger are available at [🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf]").
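The tier choice above can be sketched as a crude first filter on file size. The sizes are the ones quoted in this recipe (BF16 is approximate), and the function name is made up for illustration. File size is only a proxy: with offload, resident VRAM can sit below the on-disk size, and activations need room on top of the weights.

```python
# On-disk sizes in GB as quoted in this recipe (BF16 is approximate).
QUANT_SIZES_GB = {"Q4_K_S": 13.9, "Q6_K": 16.1, "Q8_0": 20.0, "BF16": 35.0}


def tiers_under(capacity_gb, sizes=QUANT_SIZES_GB):
    """Names of quant tiers whose file is strictly smaller than the card's VRAM."""
    return [name for name, size in sizes.items() if size < capacity_gb]


if __name__ == "__main__":
    print(tiers_under(16))  # 16 GB card
    print(tiers_under(24))  # 24 GB card, for comparison
```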
5. (Optional) Install FlashAttention
uv sync --extra flash
Per docs/installation.md, flash-attention is optional — the model "transparently falls back to torch SDPA" without it. On Ada sm_89, FA2 wheels have full kernel coverage, so this step works out of the box if you want the speedup; you can also just skip it and rely on SDPA. Either path is fine on the 4080.
Running
Text-to-image with Q4 GGUF + balanced VRAM mode
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
--vram_mode balanced \
--prompt "A male peacock trying to attract a female" \
--width 2048 --height 2048 \
--cfg_scale 1.0 --num_steps 8 \
--output output.png
Per the upstream model card, the balanced mode does async prefetch that "overlaps H2D copy with compute", and a "Q4 GGUF + balanced is the recommended setup for ~10–12 GB consumer cards." On a 16 GB 4080 this leaves ~4 GB of headroom on top of that envelope for the understanding-side activations and any small surge during the prefill — and because the 4080's PCIe Gen4 ×16 link feeds the async H2D copy faster than a narrower-link card, the prefetch is less likely to stall the compute.
If you're sharing the GPU with another process (e.g. a desktop compositor + a browser eating ~1 GB), drop to --vram_mode low, which does synchronous per-layer CPU↔GPU swap — slower but cuts peak resident VRAM further.
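If you switch between the two VRAM modes often, a small wrapper keeps the rest of the flags fixed. This is a sketch only: the flags and paths are copied from the command above, and the helper name build_t2i_command is made up for illustration. It assembles the argument list without running anything.

```python
import shlex


def build_t2i_command(prompt, vram_mode="balanced", width=2048, height=2048,
                      output="output.png"):
    """Assemble the t2i invocation from this recipe as an argv list."""
    return [
        "python", "examples/t2i/inference.py",
        "--model_path", "sensenova/SenseNova-U1-8B-MoT",
        "--gguf_checkpoint", "./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf",
        "--vram_mode", vram_mode,
        "--prompt", prompt,
        "--width", str(width), "--height", str(height),
        "--cfg_scale", "1.0", "--num_steps", "8",  # 8-step distilled checkpoint
        "--output", output,
    ]


if __name__ == "__main__":
    cmd = build_t2i_command("A male peacock trying to attract a female", vram_mode="low")
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```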
Visual Q&A (understanding) on the same model
python examples/vqa/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--image examples/vqa/data/images/menu.jpg \
--question "What's the cheapest item on this menu?" \
--max_new_tokens 8192 \
--output answer.txt
The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts.
ComfyUI alternative
A community ComfyUI node, smthemex/ComfyUI_SenseNova_U1, maintained by the same author as the GGUF mirror, wraps the same checkpoints in a ComfyUI workflow. Its README documents "Test it use 8G Vram 36G Ram" (with aggressive CPU offload) and "If Vram >16G make prefetch_count =0". A 16 GB card sits right at that threshold, so you can try prefetch_count=0 for full-GPU residency, or keep the default for headroom.
Results
- Speed: Not quoted here. The only timing cited in the upstream docs (model card) names H100/H200 with TP2+CFG2 ("~9 s end-to-end" and "~0.15 s/step" for a 2048×2048 image). No comparable single-card RTX 4080 benchmark exists yet for SenseNova U1, and the model's CPU↔GPU layer-streaming behaviour under --vram_mode balanced makes per-card timing strongly dependent on host RAM speed and PCIe link width, so we will not extrapolate a number. The /check/sensenova-u1/rtx-4080 page will surface community-submitted timings as they land; contribute a measurement if you run it.
- VRAM usage: Per the HF model card, Q4 GGUF + --vram_mode balanced is the recommended setup for "~10–12 GB consumer cards." The Q4_K_S file itself is 13.9 GB on disk; runtime peak with the balanced mode's async prefetch sits in the documented 10–12 GB band, leaving ~4 GB headroom on a 16 GB 4080 for the understanding-path activations.
- Quality notes: The VAE-free / visual-encoder-free design (described on the model card as eliminating both the Visual Encoder and Variational Auto-Encoder via the NEO-unify architecture) is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes the unified-VRAM budget realistic.
For the full benchmark data, see /check/sensenova-u1/rtx-4080.
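If you want to contribute a timing, a wall-clock wrapper is enough to get a first number. This is a minimal sketch with a made-up helper name; the stand-in command only exists so the example runs anywhere, and you would swap in the t2i invocation from the Running section. The result includes model load time, so it is not directly comparable to the per-step figure on the model card.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command; replace with the full t2i inference command.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"{elapsed:.2f} s end-to-end")
```

When reporting, note the quant tier, the --vram_mode value, and the resolution next to the number, since all three change the result.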
Troubleshooting
flash_attn install fails or you'd rather skip it
FlashAttention is optional here. Per the upstream installation guide, the model "transparently falls back to torch SDPA" without it. Run uv sync without the --extra flash flag and the inference path will use PyTorch's built-in SDPA — which has full sm_89 (Ada) support in torch 2.8 on both cu126 and cu128. You won't see a runtime crash, just a small speed reduction in the attention layers.
"Out of memory" with --vram_mode balanced on the 16 GB card
Two escalation paths, in order:
- Drop quantization tier: if you were using Q6_K (16.1 GB on disk, per the mirror), switch to Q4_K_S (13.9 GB). Q6_K sits at the card's capacity and only works with very aggressive prefetch + a clean GPU (no other processes).
- Switch to --vram_mode low: synchronous per-layer CPU↔GPU swap; slower but cuts peak VRAM further. This is the same path the ComfyUI node uses to achieve the "8G Vram 36G Ram" footprint.
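The two escalation steps can be written down as a tiny ladder, which is handy if you script retries. This is a sketch of the order described above, not upstream behaviour; the function name is made up for illustration.

```python
def next_oom_step(quant, vram_mode):
    """Return the next (quant, vram_mode) to try after an OOM, or None when exhausted."""
    if quant == "Q6_K":
        return ("Q4_K_S", vram_mode)  # step 1: drop the quantization tier
    if vram_mode == "balanced":
        return (quant, "low")  # step 2: synchronous per-layer swap
    return None  # nothing left to try from this recipe


if __name__ == "__main__":
    state = ("Q6_K", "balanced")
    while state is not None:
        print(state)
        state = next_oom_step(*state)
```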
I want the A3B MoE variant instead
It does not fit a 16 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.4 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all 39B total parameters must be resident because the MoE router picks experts at runtime. Until a smaller quant lands (Q3 / Q2_K), keep to the 8B-MoT dense variant on this card or request the A3B recipe via /contribute.
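A back-of-the-envelope calculation shows how far the A3B is from fitting. This sketch assumes decimal GB, treats the ~39B parameter count as exact, and assumes every weight must be resident with no allowance for activations; all three are simplifications, and the function name is made up for illustration.

```python
def bits_per_param(size_gb, params_billion):
    """Average bits stored per parameter for a file of size_gb (decimal GB)."""
    return size_gb * 8 / params_billion


if __name__ == "__main__":
    current = bits_per_param(26.4, 39)  # published Q4_K_S file
    budget = bits_per_param(16.0, 39)   # ceiling if weights alone filled 16 GB
    print(f"A3B Q4_K_S today: {current:.2f} bits/param")
    print(f"Needed to fit 16 GB: under {budget:.2f} bits/param")
```

The gap between the two figures is why a different quant tier, rather than a settings change, is needed before the MoE variant is worth trying on this card.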
My generated image is washed-out or has CFG artifacts at default settings
Per the upstream README, the canonical default for the standard checkpoint is --cfg_scale 4.0 --num_steps 50 — but the file downloaded above is the 8-step distillation variant (filename contains 8step), so the canonical example uses --cfg_scale 1.0 --num_steps 8. If you switch to a non-distilled checkpoint (without 8step in the name), switch the inference flags back to --cfg_scale 4.0 --num_steps 50. Report unexpected quality regressions via the submission form.
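The rule above is easy to encode so the flags always match the checkpoint. This is a minimal sketch that keys off the filename, as the text does. The helper name is made up for illustration, and the second filename in the usage example is a hypothetical non-distilled checkpoint name.

```python
from pathlib import Path


def sampler_flags(checkpoint):
    """Pick --cfg_scale / --num_steps from whether the filename contains '8step'."""
    if "8step" in Path(checkpoint).name:
        return ["--cfg_scale", "1.0", "--num_steps", "8"]  # 8-step distilled variant
    return ["--cfg_scale", "4.0", "--num_steps", "50"]  # standard checkpoint defaults


if __name__ == "__main__":
    print(sampler_flags("./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf"))
    print(sampler_flags("./checkpoints/some-standard-checkpoint.gguf"))  # hypothetical name
```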