SenseNova U1 (8B-MoT) on RTX 5070 Ti: VAE-Free Unified Image Gen + Understanding via Q4 GGUF

What You'll Build

A working SenseNova U1 install on a 16 GB RTX 5070 Ti that does all four U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or separate visual encoder. The recipe pins the 8B-MoT dense variant (~18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.88 GB on disk) with --vram_mode balanced to keep peak VRAM in the ~10–12 GB range that the official model card calls out as the consumer-GPU sweet spot.

Hardware data: RTX 5070 Ti (16 GB VRAM) · Q4_K_S GGUF, --vram_mode balanced · See benchmark data

ℹ️ Unified gen + understanding, in the image vertical. SenseNova U1 is a multimodal model — the same checkpoint produces images, answers questions about images, edits images, and produces interleaved text+image output. We catalogue it under image to match the wider site's grouping (and our /check/ row for it); the model card is explicit that its NEO-unify architecture "eliminates both Visual Encoder (VE) and Variational Auto-Encoder (VAE) where pixel-word information are inherently and deeply correlated."

ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same sensenova org: the dense SenseNova-U1-8B-MoT (~18B total params, this recipe) and the sparse-MoE SenseNova-U1-A3B-MoT (~39B total / ~3B active). The A3B's smallest community quant — smthem/SenseNova-U1-A3B-MoT-SFT-gguf Q4_K_S — is 26.43 GB on disk, so the MoE variant does not fit a 16 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (per model card "~10–12 GB consumer cards" guidance)	RTX 5070 Ti (16 GB)
RAM	32 GB (`--vram_mode balanced` does async CPU↔GPU prefetch — system RAM holds non-resident layers)	—
Storage	14 GB (Q4_K_S GGUF) or ~35 GB (full BF16 merger checkpoint)	—
Python	3.11	—
PyTorch	2.8 + CUDA 12.8 wheel (cu128 — required for Blackwell sm_120)	—
Software	uv package manager; ComfyUI optional	—

The model is released under Apache 2.0 — commercial use is permitted.

Installation

1. Clone the upstream repo

git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1

2. Install with uv (default targets CUDA 12.8 / Blackwell)

uv sync
source .venv/bin/activate

Per the upstream installation guide, the recommended software stack is "Python 3.11, torch 2.8, CUDA 12.8 (cu128)." The default uv sync pulls cu128 wheels, which is exactly what RTX 50-series cards (sm_120 Blackwell) need. Older CUDA wheels ship no sm_120 kernels, so do not switch to cu126 on a 5070 Ti; if your driver is too old for cu128, update the driver instead. The same guide notes that if your driver does not support cu128 you can change [tool.uv.sources] / [[tool.uv.index]] in pyproject.toml to a cu126 index before running uv sync, but that fallback is only useful on GPUs that do not need sm_120 kernels.
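Before moving on, it can help to confirm that the installed wheel really carries sm_120 kernels. The following is a minimal sketch, not part of the upstream repo: the helper names are made up for illustration, and only the torch calls themselves (torch.version.cuda, torch.cuda.get_arch_list, torch.cuda.get_device_capability) are real PyTorch APIs.

```python
def has_blackwell_kernels(arch_list):
    """Return True if a PyTorch build's compiled arch list includes sm_120."""
    return "sm_120" in arch_list


def describe_torch_build():
    """Summarise the installed torch build, or return None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None
    info = {
        "torch": torch.__version__,
        "cuda_wheel": torch.version.cuda,         # CUDA version the wheel was built for
        "arch_list": torch.cuda.get_arch_list(),  # GPU targets compiled into the wheel
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    info["blackwell_ok"] = has_blackwell_kernels(info["arch_list"])
    return info


# Run inside the activated .venv; prints None if torch is not installed.
print(describe_torch_build())
```

If blackwell_ok comes back False on a 5070 Ti, the environment most likely resolved a non-cu128 wheel and is worth rebuilding before downloading any weights.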

3. Add GGUF support

uv pip install -e ".[gguf]"   # or: pip install "gguf>=0.10.0" "diffusers>=0.30.0"

This is what lets the inference scripts load a quantized .gguf file via the diffusers GGUF Linear layer instead of the full ~35 GB BF16 safetensors — the version pins above are the ones the model card lists for the optional GGUF extra.
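To check that the two version pins quoted above are actually satisfied in the active environment, something like the following could be used. It is a sketch built on the standard library's importlib.metadata; the helper names and the simple numeric version parsing are assumptions for illustration, not upstream tooling.

```python
from importlib import metadata

# Minimum versions quoted in this step for the optional GGUF extra.
REQUIRED = {"gguf": "0.10.0", "diffusers": "0.30.0"}


def version_tuple(version):
    """Turn '0.30.1' into (0, 30, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_pins(required=REQUIRED):
    """Map each package name to 'ok', 'too old (x)', or 'missing'."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if at_least(installed, minimum) else f"too old ({installed})"
    return report


print(check_pins())
```

The parser ignores pre-release suffixes such as rc1 or .dev0, which is good enough for a quick sanity check but not a substitute for a full version-specifier library.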

4. Download the Q4_K_S GGUF merger weights

huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
  SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --local-dir ./checkpoints

The GGUF redistributor publishes a per-quant-tier file table — for a 16 GB card you want SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf = 13.88 GB on disk (the smallest stable quant). The next tier up, SenseNova-U1-8B-MoT-8step-Q6_K.gguf = 16.06 GB, sits at or above the card's capacity and only works with aggressive prefetch; SenseNova-U1-8B-MoT-8step-Q8_0.gguf = 19.99 GB and the BF16 merger safetensors (35.10 GB) do not fit. The upstream model card links this exact mirror in its memory-efficient-inference section, crediting GitHub user @smthem for contributing the quantized GGUF weights.
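A multi-gigabyte download that was interrupted can look fine until the loader fails, so a quick check of the file header and size may save time. The sketch below relies on one fact about the format (GGUF files begin with the four magic bytes GGUF followed by a little-endian uint32 version) and on two assumptions: that the sizes listed above are decimal gigabytes, and that a 0.05 GB tolerance is reasonable. The function names are illustrative.

```python
import os
import struct

# On-disk sizes quoted above from the mirror's file table (assumed decimal GB).
LISTED_GB = {
    "SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf": 13.88,
    "SenseNova-U1-8B-MoT-8step-Q6_K.gguf": 16.06,
    "SenseNova-U1-8B-MoT-8step-Q8_0.gguf": 19.99,
}


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if len(head) < 8 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", head[4:8])
    return version


def size_matches(path, tolerance_gb=0.05):
    """Compare the file size with the listed size for its name.

    Returns None when the file name is not in LISTED_GB.
    """
    listed = LISTED_GB.get(os.path.basename(path))
    if listed is None:
        return None
    actual_gb = os.path.getsize(path) / 1e9
    return abs(actual_gb - listed) <= tolerance_gb


# Example call after the download finishes:
# path = "./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf"
# print(read_gguf_version(path), size_matches(path))
```

A False from size_matches usually points at a truncated download; a ValueError from read_gguf_version points at the wrong file type altogether.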

5. (Optional) Install FlashAttention

uv sync --extra flash

Per the upstream installation guide, flash-attn is declared as an optional extra ("without it the model transparently falls back to torch SDPA"), and once flash-attn is importable the runtime picks it automatically (--attn_backend auto). This matters on Blackwell sm_120, where FA2/FA3 wheels frequently lag the latest GPU architecture: the SDPA fallback (full sm_120 coverage in torch 2.8 + cu128) means you don't have to block on a matching flash-attn wheel. If this step fails to find or build a wheel, skip it; the plain uv sync install from step 2 is enough to run everything below.
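The "pick it automatically when importable" behaviour described above can be pictured roughly as follows. This is an illustrative sketch, not the upstream implementation: the function name and the backend labels "flash" and "sdpa" are assumptions, and only the auto value comes from the guide.

```python
import importlib.util


def pick_attention_backend(requested="auto"):
    """Resolve 'auto' to 'flash' when flash_attn is importable, else 'sdpa'.

    Any explicit request is passed through unchanged.
    """
    if requested != "auto":
        return requested
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash" if has_flash else "sdpa"


# Shows which path this environment would take under the assumed rule.
print(pick_attention_backend())
```

Running it inside the project's virtual environment is a quick way to see whether flash-attn is visible to Python at all.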

Running

Text-to-image with Q4 GGUF + balanced VRAM mode

python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode balanced \
  --prompt "A male peacock trying to attract a female" \
  --width 2048 --height 2048 \
  --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
  --output output.png

Per the model card, --vram_mode keeps the language-model layers resident in pinned CPU memory and streams them onto the GPU on demand during the forward pass; in balanced mode, "Async prefetch overlaps H2D copy with compute", and "a Q4 GGUF + balanced is the recommended setup for ~10–12 GB consumer cards." The 5070 Ti's PCIe Gen5 x16 link keeps this CPU↔GPU prefetch cheap (assuming the motherboard slot also runs at Gen5 x16), and the documented 10–12 GB runtime envelope leaves roughly 4–6 GB of the card's 16 GB free for the understanding-side activations and any small surge during prefill.

If you're sharing the GPU with another process (e.g. a desktop compositor + a browser eating ~1 GB), drop to --vram_mode low, which does a synchronous per-layer CPU↔GPU swap — slower but cuts peak resident VRAM further.
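The speed difference between the two modes comes down to whether the host-to-device copy is hidden behind compute. A toy timing model makes that concrete. It assumes uniform per-layer costs and a prefetch depth of one layer, neither of which is confirmed for the actual implementation, and the numbers are hypothetical.

```python
def low_mode_time(num_layers, copy_ms, compute_ms):
    """Synchronous swap: each layer waits for its copy, then computes."""
    return num_layers * (copy_ms + compute_ms)


def balanced_mode_time(num_layers, copy_ms, compute_ms):
    """One-layer-ahead prefetch: the next copy overlaps the current compute."""
    if num_layers == 0:
        return 0
    # The first copy and the last compute cannot be overlapped with anything.
    overlapped = (num_layers - 1) * max(copy_ms, compute_ms)
    return copy_ms + overlapped + compute_ms


# Hypothetical per-layer costs, chosen only to make the arithmetic visible.
layers, copy_ms, compute_ms = 40, 4, 6
print("low:     ", low_mode_time(layers, copy_ms, compute_ms), "ms per forward pass")
print("balanced:", balanced_mode_time(layers, copy_ms, compute_ms), "ms per forward pass")
```

In this model the benefit of prefetch shrinks as the copy time grows past the compute time, which is why a slower PCIe link or a busy bus narrows the gap between the two modes.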

💡 The downloaded file is the 8-step weight set, but the canonical default flags above (--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50) come from the model card's standard text-to-image example. See the Troubleshooting note below on the 8-step distillation path before switching to an 8-step preset.

Visual Q&A (understanding) on the same model

python examples/vqa/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --image examples/vqa/data/images/menu.jpg \
  --question "What's the cheapest item on this menu?" \
  --max_new_tokens 8192 \
  --output answer.txt

The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts.

ComfyUI alternative

A community ComfyUI node, smthemex/ComfyUI_SenseNova_U1, maintained by the same author as the GGUF mirror, wraps the same checkpoints in a ComfyUI workflow. Its README documents "Test it use 8G Vram 36G Ram" (with aggressive CPU offload) and "If Vram >16G make prefetch_count =0". A 16 GB card sits at that threshold rather than above it, so keep the default swap for headroom unless you have confirmed spare VRAM. Setting prefetch_count=0 for full-GPU residency is only plausible here with the Q4_K_S file (13.88 GB); the Q6_K file (16.06 GB) is already at or above the card's capacity before any activations are counted.

Results

Speed: Omitted. No RTX 5070 Ti benchmark exists for SenseNova U1 yet, and the only timing cited in the upstream docs names enterprise hardware: "~0.15 s/step" and "~9 s end-to-end" for a 2048×2048 image on H100 / H200 with the TP2 + CFG2 LightLLM + LightX2V serving stack (model card). That figure cannot be extrapolated to a single consumer 5070 Ti: the parallelism differs, and so does the memory-bandwidth class (the 5070 Ti's ~896 GB/s GDDR7 is a fraction of H100/H200 HBM bandwidth). The /check/sensenova-u1/rtx-5070-ti page will surface community-submitted single-card timings as they land; please contribute yours.
VRAM usage: Per the HF model card, Q4 GGUF + --vram_mode balanced is the recommended setup for "~10–12 GB consumer cards." The Q4_K_S file itself is 13.88 GB on disk, but balanced mode keeps part of the weights in system RAM and prefetches them asynchronously, so runtime peak sits in the documented 10–12 GB band, leaving roughly 4–6 GB of headroom on a 16 GB 5070 Ti for the understanding-path activations. See /check/sensenova-u1/rtx-5070-ti for live data.
Quality notes: The VAE-free / visual-encoder-free design — the model card's NEO-unify architecture "eliminates both Visual Encoder (VE) and Variational Auto-Encoder (VAE)" — is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes the unified-VRAM budget realistic.
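To see where your own run lands relative to the 10–12 GB band, VRAM use can be sampled with nvidia-smi while inference is in progress. The sketch below wraps the real nvidia-smi query flags; the function names are illustrative, and a single call gives an instantaneous reading rather than a true peak, so it would need to be polled during the run.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB rows from nvidia-smi CSV output into tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if nvidia-smi is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_memory_csv(out)


for used_mib, total_mib in query_gpu_memory():
    print(f"{used_mib} MiB used of {total_mib} MiB")
```

Note that nvidia-smi reports MiB, so a reading near 11,400 MiB corresponds to about 12 GB in the decimal units used for the file sizes above.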

For the full benchmark data, see /check/sensenova-u1/rtx-5070-ti.
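If you want to contribute a single-card timing, a small wrapper around the inference command is enough to get a wall-clock figure. This is a hedged sketch with illustrative function names; the resulting per-step number includes model load and decode overhead, so it is not directly comparable to the upstream "~0.15 s/step" figure.

```python
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return (return_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(argv)
    return result.returncode, time.perf_counter() - start


def seconds_per_step(elapsed_seconds, num_steps):
    """Average wall-clock seconds per step, including load overhead."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return elapsed_seconds / num_steps


# Example, reusing the text-to-image command from the Running section:
# code, elapsed = time_command(["python", "examples/t2i/inference.py", "--num_steps", "50"])
# print(code, round(elapsed, 1), round(seconds_per_step(elapsed, 50), 3))
```

Timing a second run separately from the first helps distinguish one-off costs, such as reading the GGUF file from disk, from steady-state generation time.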

Troubleshooting

`flash_attn` install fails or the model crashes on first inference (Blackwell sm_120)

FlashAttention 2/3 wheels frequently lag on new GPU architectures, and the 5070 Ti's sm_120 target is among the newest. The upstream installation guide declares flash-attn as an optional extra ("without it the model transparently falls back to torch SDPA"), so simply skip the uv sync --extra flash step and the inference path will use PyTorch's built-in SDPA, which has full sm_120 support in torch 2.8 + cu128. If the crash comes from a flash-attn build that is already installed, remove it from the environment: the runtime picks flash-attn whenever it is importable, so the SDPA fallback only applies once the package is gone. On the SDPA path you won't see a runtime crash, just a small speed reduction in the attention layers.

"Out of memory" with `--vram_mode balanced` on the 16 GB card

Two escalation paths, in order:

1. Drop quantization tier: if you were using SenseNova-U1-8B-MoT-8step-Q6_K.gguf (16.06 GB on disk, per the mirror), switch to SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf (13.88 GB). Q6_K sits at or above the card's capacity and only works with very aggressive prefetch on a clean GPU (no other processes).
2. Switch to --vram_mode low: a synchronous per-layer CPU↔GPU swap; slower but cuts peak VRAM further. This is the same offload path the ComfyUI node uses to reach its "8G Vram 36G Ram" footprint.
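The two escalation steps above can be written down as a small fallback ladder, which is handy if you script retries after an out-of-memory error. The function and list names are illustrative; the quant tiers and mode names are the ones used in this recipe.

```python
# Settings named in this recipe, ordered from most to least VRAM-hungry.
QUANT_TIERS = ["Q6_K", "Q4_K_S"]
VRAM_MODES = ["balanced", "low"]


def next_fallback(quant, vram_mode):
    """Return the next (quant, vram_mode) to try after an OOM, or None if exhausted."""
    q = QUANT_TIERS.index(quant)
    m = VRAM_MODES.index(vram_mode)
    if q + 1 < len(QUANT_TIERS):   # step 1: drop the quantization tier
        return QUANT_TIERS[q + 1], vram_mode
    if m + 1 < len(VRAM_MODES):    # step 2: switch to the slower swap mode
        return quant, VRAM_MODES[m + 1]
    return None


# Walk the ladder from the heaviest configuration down.
config = ("Q6_K", "balanced")
while config is not None:
    print(config)
    config = next_fallback(*config)
```

Once the ladder returns None, the remaining options are outside the model settings, such as closing other GPU processes or lowering the output resolution.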

I want the A3B MoE variant instead

It does not fit a 16 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.43 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all ~39B total parameters must be resident because the MoE router picks experts per token at runtime. Until a smaller quant lands (Q3 / Q2_K), keep to the 8B-MoT dense variant on this card or request the A3B recipe via /contribute.
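A back-of-the-envelope calculation shows how far the A3B is from fitting. The sketch below uses only the figures quoted above (26.43 GB, ~39B total parameters, a 16 GB card) and treats all sizes as decimal gigabytes; it ignores file metadata, mixed-precision tensors, and activation memory, so the results are rough bounds rather than predictions.

```python
def bits_per_weight(file_gb, params_billion):
    """Average bits per parameter implied by a file size in decimal GB."""
    return file_gb * 8 / params_billion


def max_bits_to_fit(budget_gb, params_billion):
    """Largest average bits per parameter whose weights alone fit the budget."""
    return budget_gb * 8 / params_billion


a3b_q4 = bits_per_weight(26.43, 39)
fit_16 = max_bits_to_fit(16, 39)
print(f"A3B Q4_K_S file: ~{a3b_q4:.2f} bits per parameter")
print(f"16 GB budget:    <= ~{fit_16:.2f} bits per parameter, before activations")
```

The published file works out to roughly 5.4 bits per parameter, while a 16 GB budget allows only about 3.3 for the weights alone, which is why a Q3-class or smaller quant would be the minimum needed for full residency.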

My image is washed-out or has CFG artifacts at the 8-step settings

The file downloaded above carries the 8-step weights, but the canonical default flags in the Running section (--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50) are the model card's settings for the standard checkpoint. The upstream docs/base_vs_distill.md shows the 8-step path uses --cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 8 instead. Note that the upstream 8-step preview checkpoint is marked deprecated and the maintainers flag a known issue in the 8-step LoRA — if 8-step output looks off, run the standard 50-step flags above, or report regressions via the submission form.
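Keeping the two flag sets side by side makes it harder to mix them up, for example running 8 steps with the 50-step CFG scale. A minimal sketch of a command builder follows; the flag values are the ones quoted in this recipe, while the preset names and the function itself are made up for illustration.

```python
import shlex

# Flag values quoted in this recipe: "standard" is the 50-step model-card example,
# "distill_8step" is the set attributed to docs/base_vs_distill.md.
PRESETS = {
    "standard": {"cfg_scale": 4.0, "cfg_norm": "none", "timestep_shift": 3.0, "num_steps": 50},
    "distill_8step": {"cfg_scale": 1.0, "cfg_norm": "none", "timestep_shift": 3.0, "num_steps": 8},
}


def build_t2i_command(preset, prompt, gguf_checkpoint, output="output.png",
                      size=2048, vram_mode="balanced"):
    """Assemble the text-to-image argv for one of the named presets."""
    argv = [
        "python", "examples/t2i/inference.py",
        "--model_path", "sensenova/SenseNova-U1-8B-MoT",
        "--gguf_checkpoint", gguf_checkpoint,
        "--vram_mode", vram_mode,
        "--prompt", prompt,
        "--width", str(size), "--height", str(size),
    ]
    for flag, value in PRESETS[preset].items():
        argv += [f"--{flag}", str(value)]
    argv += ["--output", output]
    return argv


cmd = build_t2i_command(
    "distill_8step",
    "A male peacock trying to attract a female",
    "./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf",
)
print(shlex.join(cmd))
```

Switching the first argument to "standard" reproduces the 50-step command from the Running section, so comparing the two outputs for the same prompt is a one-word change.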