SenseNova U1 (8B-MoT) on RTX 4080: VAE-Free Unified Image Gen + Understanding via Q4 GGUF

What You'll Build

A working SenseNova U1 install on a 16 GB RTX 4080 that does all four U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or separate visual encoder. The recipe pins the 8B-MoT dense variant (~18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.9 GB on disk) with --vram_mode balanced to keep peak VRAM in the ~10–12 GB range that the official model card calls out as the consumer-GPU sweet spot.

Hardware data: RTX 4080 (16 GB VRAM) · Q4_K_S GGUF, --vram_mode balanced · See benchmark data

ℹ️ Unified gen + understanding, in the image vertical. SenseNova U1 is a multimodal model — the same checkpoint produces images, answers questions about images, edits images, and produces interleaved text+image output. We catalogue it under image to match the wider site's grouping (and our /check/ row for it); the model card is explicit that the NEO-unify architecture models language and visual information "end-to-end as a unified compound."

ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same sensenova org: the dense SenseNova-U1-8B-MoT (~18B total params, this recipe) and the sparse-MoE SenseNova-U1-A3B-MoT (~39B total / ~3B active). The A3B's smallest community quant — smthem/SenseNova-U1-A3B-MoT-SFT-gguf Q4_K_S — is 26.4 GB on disk, so the MoE variant does not fit a 16 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (per model card "~10–12 GB consumer cards" guidance)	RTX 4080 (16 GB)
RAM	32 GB (`--vram_mode balanced` does async CPU↔GPU prefetch — system RAM holds non-resident layers)	—
Storage	14 GB (Q4_K_S GGUF) or ~35 GB (full BF16 merger checkpoint)	—
Python	3.11	—
PyTorch	2.8 + CUDA 12.6 or 12.8 wheel (Ada sm_89 supported on both)	—
Software	uv package manager; ComfyUI optional	—

The model is released under Apache 2.0 — commercial use is permitted.

ℹ️ PCIe Gen4 ×16 host bandwidth helps the offload path. The --vram_mode balanced path streams non-resident layers from system RAM over the PCIe bus during the forward pass; the model card describes the balanced mode as async prefetch that "overlaps H2D copy with compute". That host-to-device copy is gated by PCIe bandwidth, and the RTX 4080 runs a full Gen4 ×16 link. Offload behaviour first established on a narrower-link Ada card therefore carries over to the 4080 as a conservative bound: the 4080's link is at least as wide, so the prefetch should stall no more often than it did there.
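As a rough sanity check on the bandwidth argument, the theoretical ceiling of a PCIe link can be worked out from its per-lane signalling rate. The sketch below assumes the standard PCIe figures (8, 16, and 32 GT/s per lane for Gen3, Gen4, and Gen5, all with 128b/130b encoding). The 2 GB streamed per pass is a purely illustrative number, not a SenseNova U1 measurement, and real-world throughput sits below the theoretical ceiling.

```python
# Theoretical PCIe throughput, ignoring protocol overhead beyond line encoding.
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}  # giga-transfers per second per lane
ENCODING_EFFICIENCY = 128 / 130            # 128b/130b encoding (Gen3 and later)


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Return the theoretical one-direction bandwidth in GB/s."""
    bits_per_second = GT_PER_LANE[gen] * 1e9 * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8 / 1e9


def min_transfer_seconds(gigabytes: float, gen: int, lanes: int) -> float:
    """Lower bound on the time needed to copy `gigabytes` host-to-device."""
    return gigabytes / pcie_bandwidth_gb_s(gen, lanes)


if __name__ == "__main__":
    streamed_gb = 2.0  # hypothetical amount streamed per forward pass
    for lanes in (16, 8):
        bw = pcie_bandwidth_gb_s(4, lanes)
        t = min_transfer_seconds(streamed_gb, 4, lanes)
        print(f"Gen4 x{lanes}: {bw:.1f} GB/s, {streamed_gb} GB in >= {t * 1000:.0f} ms")
```

A Gen4 ×16 link works out to roughly 31.5 GB/s in theory and a ×8 link to half that, which is why the same prefetch schedule has more slack on the 4080.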

Installation

1. Clone the upstream repo

git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1

2. Install with uv

uv sync
source .venv/bin/activate

Per docs/installation.md: "Python 3.11, torch 2.8, CUDA 12.8 (cu128)" is the recommended configuration. The default uv sync pulls cu128 wheels. RTX 40-series (sm_89, Ada Lovelace) has full kernel coverage on both cu128 and cu126, so the default is fine. If you have older NVIDIA drivers, you can switch to cu126 by editing [tool.uv.sources] in pyproject.toml per the upstream guide before running uv sync. Unlike Blackwell GPUs (sm_120), the 4080 needs no special wheel selection: the default pip install torch already includes sm_89 kernels.
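To confirm that the environment matches this configuration after uv sync, a short check along these lines may help. Both helpers are hypothetical and not part of the upstream repo; they only compare the values PyTorch reports through torch.__version__, torch.version.cuda, and torch.cuda.get_device_capability().

```python
from __future__ import annotations


def check_env(torch_version: str, cuda_version: str | None,
              capability: tuple[int, int]) -> list[str]:
    """Return human-readable warnings; an empty list means the setup matches."""
    warnings = []
    if not torch_version.startswith("2.8"):
        warnings.append(f"expected torch 2.8.x, found {torch_version}")
    if cuda_version not in ("12.6", "12.8"):
        warnings.append(f"expected a cu126 or cu128 wheel, found CUDA {cuda_version}")
    if capability != (8, 9):
        warnings.append(f"expected sm_89 (Ada), found sm_{capability[0]}{capability[1]}")
    return warnings


def check_local_install() -> list[str]:
    """Query the local PyTorch install; needs torch and a visible GPU."""
    import torch  # imported lazily so check_env stays usable without torch

    if not torch.cuda.is_available():
        return ["CUDA is not available to PyTorch"]
    return check_env(torch.__version__, torch.version.cuda,
                     torch.cuda.get_device_capability(0))
```

Running check_local_install() inside the activated .venv should return an empty list on a correctly configured 4080.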

3. Add GGUF support

uv pip install -e ".[gguf]"

This is what lets the t2i CLI load the smaller Q4 quantization rather than the full BF16 checkpoint. The upstream README lists "gguf>=0.10.0" and "diffusers>=0.30.0" as the relevant pins.

4. Download the Q4_K_S GGUF merger weights

huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
  SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --local-dir ./checkpoints

The GGUF redistributor publishes a per-quant-tier file table — for this card you want Q4_K_S = 13.9 GB on disk (the smallest stable quant); Q6_K = 16.1 GB sits at or above the card's capacity and only works with aggressive prefetch; Q8_0 = 20.0 GB and BF16 ~35 GB do not fit. The upstream model card links this exact mirror in its quantized-weights section ("GGUF weights for SenseNova-U1-8B-MoT-Merger are available at [🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf]").
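The tier choice above can be sketched as a simple size comparison. The helper below is hypothetical and deliberately strict: it compares on-disk size with total VRAM and ignores layer offload, so it reproduces the verdict for a 16 GB card but says nothing about smaller cards that rely on streaming from system RAM.

```python
from __future__ import annotations

# On-disk sizes in GB as listed for the 8B-MoT merger mirror (BF16 is approximate).
QUANT_SIZES_GB = {"Q4_K_S": 13.9, "Q6_K": 16.1, "Q8_0": 20.0, "BF16": 35.0}


def largest_quant_that_fits(vram_gb: float, margin_gb: float = 0.0) -> str | None:
    """Pick the largest file whose size stays under vram_gb - margin_gb."""
    budget = vram_gb - margin_gb
    candidates = [(size, name) for name, size in QUANT_SIZES_GB.items() if size < budget]
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(largest_quant_that_fits(16.0))  # the RTX 4080 case
```

The margin_gb parameter is an assumption added for illustration; raise it to reserve VRAM for activations or other processes.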

5. (Optional) Install FlashAttention

uv sync --extra flash

Per docs/installation.md, flash-attention is optional — the model "transparently falls back to torch SDPA" without it. On Ada sm_89, FA2 wheels have full kernel coverage, so this step works out of the box if you want the speedup; you can also just skip it and rely on SDPA. Either path is fine on the 4080.

Running

Text-to-image with Q4 GGUF + balanced VRAM mode

python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode balanced \
  --prompt "A male peacock trying to attract a female" \
  --width 2048 --height 2048 \
  --cfg_scale 1.0 --num_steps 8 \
  --output output.png

Per the upstream model card, the balanced mode does async prefetch that "overlaps H2D copy with compute", and a "Q4 GGUF + balanced is the recommended setup for ~10–12 GB consumer cards." On a 16 GB 4080 this leaves ~4 GB of headroom on top of that envelope for the understanding-side activations and any small surge during the prefill — and because the 4080's PCIe Gen4 ×16 link feeds the async H2D copy faster than a narrower-link card, the prefetch is less likely to stall the compute.

If you're sharing the GPU with another process (e.g. a desktop compositor + a browser eating ~1 GB), drop to --vram_mode low, which does synchronous per-layer CPU↔GPU swap — slower but cuts peak resident VRAM further.
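For batch runs it can be convenient to assemble the same command from Python. The sketch below is a hypothetical wrapper that only reuses the flags shown in the command above; it accepts the two VRAM modes this recipe discusses (balanced and low), and any other modes the upstream CLI may offer are outside its scope.

```python
import shlex


def build_t2i_command(prompt: str, output: str, vram_mode: str = "balanced",
                      width: int = 2048, height: int = 2048) -> list[str]:
    """Assemble the argv list for examples/t2i/inference.py."""
    if vram_mode not in ("balanced", "low"):
        raise ValueError(f"unsupported vram_mode for this recipe: {vram_mode}")
    return [
        "python", "examples/t2i/inference.py",
        "--model_path", "sensenova/SenseNova-U1-8B-MoT",
        "--gguf_checkpoint", "./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf",
        "--vram_mode", vram_mode,
        "--prompt", prompt,
        "--width", str(width), "--height", str(height),
        "--cfg_scale", "1.0", "--num_steps", "8",  # 8-step distilled settings
        "--output", output,
    ]


if __name__ == "__main__":
    prompts = ["A male peacock trying to attract a female", "A lighthouse at dusk"]
    for i, prompt in enumerate(prompts):
        cmd = build_t2i_command(prompt, f"output_{i}.png")
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Building an argv list rather than a shell string avoids quoting problems when prompts contain spaces or punctuation.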

Visual Q&A (understanding) on the same model

python examples/vqa/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --image examples/vqa/data/images/menu.jpg \
  --question "What's the cheapest item on this menu?" \
  --max_new_tokens 8192 \
  --output answer.txt

The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts.

ComfyUI alternative

A community ComfyUI node, smthemex/ComfyUI_SenseNova_U1, maintained by the same author as the GGUF mirror, wraps the same checkpoints in a ComfyUI workflow. Its README documents "Test it use 8G Vram 36G Ram" (with aggressive CPU offload) and "If Vram >16G make prefetch_count =0". A 16 GB card sits at that threshold rather than above it, so treat prefetch_count=0 (full-GPU residency) as something to test rather than assume, and keep the default if you need headroom.

Results

Speed: Not quoted here — the only cited timing in upstream docs (model card) names H100/H200 with TP2+CFG2 ("~9 s end-to-end" and "~0.15 s/step" for a 2048×2048 image). No comparable single-card RTX 4080 benchmark exists yet for SenseNova U1, and the model's CPU↔GPU layer-streaming behaviour under --vram_mode balanced makes per-card timing strongly dependent on host RAM speed and PCIe link width, so we will not extrapolate a number. The /check/sensenova-u1/rtx-4080 page will surface community-submitted timings as they land — contribute a measurement if you run it.
VRAM usage: Per the HF model card, Q4 GGUF + --vram_mode balanced is the recommended setup for "~10–12 GB consumer cards." The Q4_K_S file itself is 13.9 GB on disk, but balanced mode keeps only part of it resident and streams the remaining layers from system RAM, so runtime peak sits in the documented 10–12 GB band. That leaves ~4 GB of headroom on a 16 GB 4080 for the understanding-path activations.
Quality notes: The VAE-free / visual-encoder-free design — described on the model card as eliminating both the Visual Encoder and Variational Auto-Encoder via the NEO-unify architecture — is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes the unified-VRAM budget realistic.
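The headroom figure in the VRAM usage note is plain subtraction from the documented band. A minimal sketch, assuming the 10–12 GB band from the model card and a hypothetical allowance for other processes on the GPU:

```python
def headroom_gb(card_vram_gb: float,
                peak_band_gb: tuple = (10.0, 12.0),
                other_processes_gb: float = 0.0) -> tuple:
    """Return (worst-case, best-case) free VRAM in GB."""
    low_peak, high_peak = peak_band_gb
    available = card_vram_gb - other_processes_gb
    # Worst case uses the top of the band, best case the bottom.
    return (available - high_peak, available - low_peak)


if __name__ == "__main__":
    print(headroom_gb(16.0))                          # clean 16 GB card
    print(headroom_gb(16.0, other_processes_gb=1.0))  # compositor + browser
```

On a clean 16 GB card the worst case is the ~4 GB quoted above; with about 1 GB taken by other processes it drops to about 3 GB.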

For the full benchmark data, see /check/sensenova-u1/rtx-4080.

Troubleshooting

`flash_attn` install fails or you'd rather skip it

FlashAttention is optional here. Per the upstream installation guide, the model "transparently falls back to torch SDPA" without it. Run uv sync without the --extra flash flag and the inference path will use PyTorch's built-in SDPA — which has full sm_89 (Ada) support in torch 2.8 on both cu126 and cu128. You won't see a runtime crash, just a small speed reduction in the attention layers.

"Out of memory" with `--vram_mode balanced` on the 16 GB card

Two escalation paths, in order:

1. Drop quantization tier: if you were using Q6_K (16.1 GB on disk, per the mirror), switch to Q4_K_S (13.9 GB). Q6_K sits at the card's capacity and only works with very aggressive prefetch + a clean GPU (no other processes).
2. Switch to --vram_mode low: synchronous per-layer CPU↔GPU swap; slower but cuts peak VRAM further. This is the same path the ComfyUI node uses to achieve the "8G Vram 36G Ram" footprint.
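The two escalation steps can be expressed as a small fallback ladder. This is a hypothetical helper for scripting retries, not upstream behaviour; it encodes only the order described above.

```python
from __future__ import annotations


def next_fallback(quant: str, vram_mode: str) -> tuple[str, str] | None:
    """Return the next (quant, vram_mode) to try after an OOM, or None."""
    if quant != "Q4_K_S":
        return ("Q4_K_S", vram_mode)  # step 1: drop the quantization tier
    if vram_mode == "balanced":
        return ("Q4_K_S", "low")      # step 2: synchronous per-layer swap
    return None                       # nothing left to try in this recipe


if __name__ == "__main__":
    config = ("Q6_K", "balanced")
    while config is not None:
        print(config)
        config = next_fallback(*config)
```

Starting from Q6_K with balanced mode, the loop walks through both steps and stops once Q4_K_S with low mode has been tried.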

I want the A3B MoE variant instead

It does not fit a 16 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.4 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all 39B total parameters must be resident because the MoE router picks experts at runtime. Until a smaller quant lands (Q3 / Q2_K), keep to the 8B-MoT dense variant on this card or request the A3B recipe via /contribute.

My generated image is washed out or has CFG artifacts at default settings

Per the upstream README, the canonical default for the standard checkpoint is --cfg_scale 4.0 --num_steps 50 — but the file downloaded above is the 8-step distillation variant (filename contains 8step), so the canonical example uses --cfg_scale 1.0 --num_steps 8. If you switch to a non-distilled checkpoint (without 8step in the name), switch the inference flags back to --cfg_scale 4.0 --num_steps 50. Report unexpected quality regressions via the submission form.
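Because the right flags depend on the checkpoint type, a script can pick them from the filename. The sketch below is a hypothetical helper that relies only on the naming convention described above (8step in the filename marks the distilled variant); the non-distilled filename in the example is made up for illustration.

```python
from pathlib import Path


def sampling_flags(checkpoint_path: str) -> list:
    """Return the CLI sampling flags that match the checkpoint type."""
    is_distilled = "8step" in Path(checkpoint_path).name
    if is_distilled:
        return ["--cfg_scale", "1.0", "--num_steps", "8"]
    # Standard (non-distilled) checkpoint defaults per the upstream README.
    return ["--cfg_scale", "4.0", "--num_steps", "50"]


if __name__ == "__main__":
    print(sampling_flags("./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf"))
    print(sampling_flags("./checkpoints/hypothetical-standard-Q4_K_S.gguf"))
```

Checking only the file name, not the full path, avoids a false match when a parent directory happens to contain 8step.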