SenseNova U1 (8B-MoT) on RTX 5070: VAE-Free Unified Image Gen + Understanding via Q4 GGUF + Layer Offload

image · intermediate · 12GB+ VRAM · Jun 5, 2026
prerequisites
  • NVIDIA RTX 5070 (12GB VRAM) or equivalent Blackwell card
  • Python 3.11
  • PyTorch 2.8 with the CUDA 12.8 wheel (cu128) — required for Blackwell sm_120
  • uv package manager
  • 32GB+ system RAM (layer-offload streams non-resident layers through host memory)
  • ~14GB free disk for the Q4_K_S GGUF checkpoint

What You'll Build

A working SenseNova U1 install on a 12 GB RTX 5070 that does all four U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or a separate visual encoder. The recipe pins the 8B-MoT dense variant (~18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.88 GB on disk) with the layer-offload --vram_mode that streams the language-model weights between CPU RAM and the GPU so the resident footprint fits a 12 GB card.

Hardware data: RTX 5070 (12GB VRAM) · Q4_K_S GGUF + layer-offload --vram_mode · See benchmark data

ℹ️ Unified gen + understanding, in the image vertical. SenseNova U1 is a multimodal model — the same checkpoint produces images, answers questions about images, edits images, and produces interleaved text+image output. We catalogue it under image to match the wider site's grouping (and our /check/ row for it); the model card is explicit that its NEO-unify architecture "eliminates both Visual Encoder (VE) and Variational Auto-Encoder (VAE) where pixel-word information are inherently and deeply correlated."

⚠️ Tight fit on 12 GB — lead with the aggressive offload mode. A 12 GB desktop card with a display attached only exposes ~10.5–11.3 GB usable. The Q4 GGUF + balanced combination that the model card recommends for "~10–12 GB consumer cards" sits at the very top of that band, so on a 12 GB card with a monitor it can OOM. This recipe therefore leads with the more aggressive low mode (and a community-measured sub-12 GB ComfyUI path) and demotes balanced to "use it if you have spare VRAM." See Running.
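
One way to make the low-versus-balanced choice before launching is to read the card's free memory and compare it with a cut-off. The sketch below is illustrative only: the helper names and the 11.5 GiB threshold are assumptions made for this example, not values from the model card. The `nvidia-smi --query-gpu=memory.free` query is the standard NVIDIA tool and reports MiB.

```python
import subprocess
from typing import Optional

# Assumed cut-off for this sketch only; it is not a figure from the model card.
BALANCED_MIN_FREE_GIB = 11.5


def parse_free_mib(smi_output: str) -> int:
    """Parse the first GPU's free memory (MiB) from nvidia-smi CSV output."""
    first_line = smi_output.strip().splitlines()[0]
    return int(first_line.strip())


def choose_vram_mode(free_mib: int) -> str:
    """Return 'balanced' only when free VRAM clears the assumed cut-off."""
    free_gib = free_mib / 1024
    return "balanced" if free_gib >= BALANCED_MIN_FREE_GIB else "low"


def query_free_mib() -> Optional[int]:
    """Ask nvidia-smi for free VRAM; return None if it cannot be read."""
    cmd = ["nvidia-smi", "--query-gpu=memory.free",
           "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return parse_free_mib(out.stdout)
    except (FileNotFoundError, subprocess.CalledProcessError,
            IndexError, ValueError):
        return None
```

Calling `choose_vram_mode(query_free_mib())` on a desktop 5070 with a monitor attached would be expected to return `low`; tune the threshold once you have measured peak usage on your own machine.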

ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same sensenova org: the dense SenseNova-U1-8B-MoT (~18B total params, this recipe) and the sparse-MoE SenseNova-U1-A3B-MoT (~39B total / ~3B active). The A3B's smallest community quant — smthem/SenseNova-U1-A3B-MoT-SFT-gguf Q4_K_S — is 26.43 GB on disk, so the MoE variant does not fit a 12 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (Q4 GGUF + layer offload; see Running for the mode) | RTX 5070 (12 GB) |
| RAM | 32 GB+ (offload holds non-resident layers in pinned host memory; the community ComfyUI node documents a 36 GB-RAM footprint) | |
| Storage | 14 GB (Q4_K_S GGUF) | |
| Python | 3.11 | |
| PyTorch | 2.8 + CUDA 12.8 wheel (cu128, required for Blackwell sm_120) | |
| Software | uv package manager; ComfyUI optional | |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

1. Clone the upstream repo

git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1

2. Install with uv (default targets CUDA 12.8 / Blackwell)

uv sync
source .venv/bin/activate

Per the upstream installation guide, the recommended software stack is "Python 3.11, torch 2.8, CUDA 12.8 (cu128)." The default uv sync pulls cu128 wheels, which is exactly what RTX 50-series cards (sm_120 Blackwell) need. The same guide notes that if your driver does not support cu128 you can change [tool.uv.sources] / [[tool.uv.index]] in pyproject.toml to a cu126 index before running uv sync. Treat that as a fallback for older GPUs: older CUDA wheels ship no sm_120 kernels, so on the RTX 5070 the fix for a driver that rejects cu128 is a driver update, not a downgrade to cu126.
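
After the sync it can be worth confirming that the environment really matches the recommended stack. The following is a minimal sketch, not part of the upstream repo: `check_stack` and `inspect_installed_stack` are hypothetical helper names, while `torch.__version__`, `torch.version.cuda`, and `torch.cuda.get_device_capability` are standard PyTorch attributes. An RTX 50-series card is expected to report compute capability (12, 0).

```python
def check_stack(torch_version, cuda_version, capability):
    """Return a list of problems; an empty list means the stack matches."""
    problems = []
    if not torch_version.startswith("2.8."):
        problems.append(f"torch {torch_version} is not on the 2.8 line")
    if cuda_version != "12.8":
        problems.append(f"wheel targets CUDA {cuda_version}, expected 12.8")
    if capability is not None and tuple(capability) != (12, 0):
        major, minor = capability
        problems.append(f"GPU reports sm_{major}{minor}, recipe targets sm_120")
    return problems


def inspect_installed_stack():
    """Check the live install; returns None when torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    capability = None
    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
    return check_stack(torch.__version__, torch.version.cuda, capability)
```

Run `inspect_installed_stack()` inside the activated .venv; an empty list means nothing was flagged.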

3. Add GGUF support

uv pip install -e ".[gguf]"   # or: pip install "gguf>=0.10.0" "diffusers>=0.30.0"

This is what lets the inference scripts load a quantized .gguf file via the diffusers GGUF Linear layer instead of the full BF16 safetensors — the version pins above are the ones the model card lists for the optional GGUF extra. The base --model_path is still required for the tokenizer, config, and non-LM weights.

4. Download the Q4_K_S GGUF merger weights

huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
  SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --local-dir ./checkpoints

The GGUF redistributor publishes a per-quant-tier file table — for a 12 GB card you want the smallest stable quant, SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf = 13.88 GB on disk. Every larger tier (SenseNova-U1-8B-MoT-8step-Q6_K.gguf = 16.06 GB, SenseNova-U1-8B-MoT-8step-Q8_0.gguf = 19.99 GB, and the BF16 merger safetensors at 35.10 GB) needs much more residency and is not worth fighting on a 12 GB card. The upstream model card links this exact mirror in its memory-efficient-inference section, crediting GitHub user @smthem for contributing the quantized GGUF weights — so the redistributor citation is endorsed by the canonical org itself.
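
The tier comparison above reduces to simple arithmetic. The sketch below uses the on-disk sizes quoted in this step and treats file size as a rough proxy for the weight footprint, which is an assumption (activations and the non-LM weights loaded from the base model are ignored). The helper names are hypothetical.

```python
# On-disk sizes in GB, as quoted from the redistributor's file table.
TIER_SIZES_GB = {
    "Q4_K_S": 13.88,
    "Q6_K": 16.06,
    "Q8_0": 19.99,
    "BF16": 35.10,
}

CARD_VRAM_GB = 12.0  # nominal RTX 5070 capacity


def smallest_tier(sizes):
    """Name of the tier with the smallest file."""
    return min(sizes, key=sizes.get)


def tiers_fitting_fully_resident(sizes, vram_gb):
    """Tiers whose file alone is smaller than the VRAM budget."""
    return [name for name, gb in sizes.items() if gb < vram_gb]


def overflow_gb(sizes, tier, vram_gb):
    """How far a tier's file size exceeds the VRAM budget, in GB."""
    return round(sizes[tier] - vram_gb, 2)
```

With these numbers no tier fits fully resident, and even Q4_K_S overshoots the nominal 12 GB by 1.88 GB, which is why the recipe pairs the smallest quant with layer offload rather than relying on quantization alone.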

5. (Optional) Install FlashAttention

uv sync --extra flash

Per the upstream installation guide, flash-attn is declared as an optional extra — "without it the model transparently falls back to torch SDPA" — and once flash-attn is importable the runtime picks it automatically (--attn_backend auto). This matters on Blackwell sm_120, where FA2/FA3 wheels frequently lag the latest GPU arch: the SDPA fallback (full sm_120 coverage in torch 2.8 + cu128) means you don't have to block on a matching flash-attn wheel. Skip this step if uv sync already succeeded — you do not need flash-attn to run on the RTX 5070.

Running

Text-to-image with Q4 GGUF + layer offload

On a 12 GB RTX 5070 with a display attached, lead with --vram_mode low — the model card describes it as a "Synchronous per-layer CPU↔GPU swap" whose role is the "Lowest VRAM footprint". It keeps the language-model layers in pinned host memory and streams them onto the GPU on demand, which is what drops the resident weight footprint under the ~11 GB the card actually exposes.

python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode low \
  --prompt "A male peacock trying to attract a female" \
  --width 2048 --height 2048 \
  --cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 8 \
  --output output.png

Per the model card, --vram_mode exists to "keep the language-model layers resident on CPU pinned memory and stream them onto the GPU on-demand during forward". The RTX 5070 sits on a PCIe Gen5 x16 link, so each CPU↔GPU swap is comparatively cheap, which softens the synchronous penalty of low. An independent data point for how low the footprint can go: the community ComfyUI node smthemex/ComfyUI_SenseNova_U1 (which wraps the smthem/SenseNova-U1-8B-MoT-Merger-gguf mirror that the canonical card itself links) documents "Test it use 8G Vram 36G Ram" for the 8B-MoT GGUF path with aggressive offload, comfortably under the 5070's 12 GB. (The GGUF file downloaded in step 4 is the 8-step quant; the --num_steps 8 and --cfg_scale 1.0 flags above are the model card's 8-step settings.)

If you have more VRAM free — a headless box, or a second card driving the display — switch up to --vram_mode balanced, which the model card describes as "Async prefetch overlaps H2D copy with compute" and recommends together with a Q4 GGUF for "~10–12 GB consumer cards": it recovers speed by overlapping the copy with compute, but its peak sits at the top of the 12 GB band, so it's the right call only when you genuinely have the headroom.
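
The trade-off between the two modes can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the repository's scheduler: it assumes every layer has the same copy and compute time, and that the prefetching mode looks ahead by exactly one layer. The function names and the numbers in the usage note are made up for illustration.

```python
def sync_swap_ms(n_layers, copy_ms, compute_ms):
    """Synchronous swap: copy layer i, run it, then start on layer i+1."""
    return n_layers * (copy_ms + compute_ms)


def prefetch_swap_ms(n_layers, copy_ms, compute_ms):
    """Async prefetch: copy layer i+1 while layer i is computing."""
    if n_layers == 0:
        return 0.0
    # The first copy and the last compute cannot be hidden behind anything.
    overlapped = (n_layers - 1) * max(copy_ms, compute_ms)
    return copy_ms + overlapped + compute_ms


def peak_resident_layers(prefetch):
    """Layers held on the GPU at once under the one-layer-lookahead assumption."""
    return 2 if prefetch else 1
```

For four layers with a 2 ms copy and a 3 ms compute, the synchronous path takes 20 ms and the prefetching path 14 ms, but the prefetching path holds two layers on the GPU at a time instead of one. That extra resident buffer is the intuition behind balanced being faster yet heavier on VRAM.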

Visual Q&A (understanding) on the same model

python examples/vqa/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode low \
  --image examples/vqa/data/images/menu.jpg \
  --question "What's the cheapest item on this menu?" \
  --max_new_tokens 8192 \
  --output answer.txt

The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts. All four scripts accept the same --gguf_checkpoint and --vram_mode low flags on a 12 GB card.
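
Because the scripts share the same model flags, a small wrapper can keep them in one place. The sketch below only assembles the two command lines already shown in this section; the function names are hypothetical, and the editing and interleave scripts are left out because their file names are not listed here.

```python
import shlex

MODEL_PATH = "sensenova/SenseNova-U1-8B-MoT"
GGUF_PATH = "./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf"


def shared_flags(vram_mode="low"):
    """Flags common to the example scripts on a 12 GB card."""
    return ["--model_path", MODEL_PATH,
            "--gguf_checkpoint", GGUF_PATH,
            "--vram_mode", vram_mode]


def t2i_command(prompt, output="output.png", vram_mode="low"):
    """Text-to-image command with the 8-step settings used above."""
    return ["python", "examples/t2i/inference.py", *shared_flags(vram_mode),
            "--prompt", prompt, "--width", "2048", "--height", "2048",
            "--cfg_scale", "1.0", "--cfg_norm", "none",
            "--timestep_shift", "3.0", "--num_steps", "8",
            "--output", output]


def vqa_command(image, question, output="answer.txt", vram_mode="low"):
    """Visual Q&A command mirroring the example above."""
    return ["python", "examples/vqa/inference.py", *shared_flags(vram_mode),
            "--image", image, "--question", question,
            "--max_new_tokens", "8192", "--output", output]


def as_shell(argv):
    """Render an argument list as a copy-pasteable shell line."""
    return shlex.join(argv)
```

Pass the returned list to `subprocess.run` from the repo root, or print `as_shell(...)` to inspect the exact command before running it.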

ComfyUI alternative

The same community ComfyUI node, smthemex/ComfyUI_SenseNova_U1, wraps these checkpoints in a workflow with a prefetch_count swap knob. On the 12 GB 5070, leave layer swapping enabled at its default: that is the node's documented "Test it use 8G Vram 36G Ram" path. The node's note "If Vram >16G make prefetch_count =0" describes disabling the swap on larger cards and does not apply here.

Results

  • Speed: Omitted — no RTX 5070 benchmark exists for SenseNova U1 yet, and the only timing the upstream docs publish names enterprise hardware: "~0.15 s/step" and "~9 s end-to-end" for a 2048×2048 image on H100 / H200 with the TP2 + CFG2 LightLLM + LightX2V serving stack (model card). That figure cannot be forward-extrapolated to a single consumer RTX 5070 — different parallelism, an HBM memory-bandwidth class far above the 5070's ~672 GB/s GDDR7, and a layer-offload path the datacenter stack doesn't use. The /check/sensenova-u1/rtx-5070 page will surface community-submitted single-card timings as they land — please contribute yours.
  • VRAM usage: The recipe path is Q4_K_S GGUF + --vram_mode low, the model card's "Lowest VRAM footprint" mode. The Q4_K_S file is 13.88 GB on disk, but with layer streaming the resident weight footprint drops well under that — the community ComfyUI node measures the 8B-MoT GGUF path at "8G Vram 36G Ram", which fits the 5070's 12 GB with display headroom. min_vram_gb here is the streaming runtime envelope, not the on-disk file size. See /check/sensenova-u1/rtx-5070 for live data once a community 5070 measurement lands.
  • Quality notes: The VAE-free / visual-encoder-free design — the model card's NEO-unify architecture "eliminates both Visual Encoder (VE) and Variational Auto-Encoder (VAE)" — is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes a single unified-VRAM budget realistic on a 12 GB card.
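
Since no RTX 5070 timing exists yet, a single-card measurement is easy to collect yourself. The helper below is a minimal sketch with hypothetical names: it times one full command with a wall clock, so the result includes model loading and offload setup and is only an upper bound on the per-step cost.

```python
import subprocess
import sys
import time

# Interpreter of the currently active environment (for example the repo's .venv).
PYTHON = sys.executable


def time_command(argv):
    """Run a command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def seconds_per_step(total_seconds, num_steps):
    """Average wall-clock seconds per step, load time included."""
    return total_seconds / num_steps
```

For example, `time_command([PYTHON, "examples/t2i/inference.py", ...])` with the flags from Running gives a total you can divide by `--num_steps`; report the mode, resolution, and whether a display was attached alongside the number.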

For the full benchmark data, see /check/sensenova-u1/rtx-5070.

Troubleshooting

flash_attn install fails or the model crashes on first inference (Blackwell sm_120)

FlashAttention 2/3 wheels frequently lag on new GPU architectures, and the RTX 5070's sm_120 target is among the newest. The upstream installation guide declares flash-attn as an optional extra — "without it the model transparently falls back to torch SDPA" — so simply skip the uv sync --extra flash step and the inference path will use PyTorch's built-in SDPA, which has full sm_120 support in torch 2.8 + cu128. You won't see a runtime crash, just a small speed reduction in the attention layers. The FA2 sm_120 wheel gap is tracked upstream at Dao-AILab#2168.

"Out of memory" on the 12 GB card

Two escalation paths, in order:

  1. Stay on --vram_mode low, not balanced. balanced recovers speed by prefetching, but its peak sits at the top of the model card's "~10–12 GB consumer cards" band — which a 12 GB card with a display (~11 GB usable) can't always absorb. low is the "Lowest VRAM footprint" mode and is the correct default for the 5070.
  2. Make sure you have enough system RAM. Layer offload moves weight residency off the GPU and into pinned host memory — the community ComfyUI node documents a 36 GB-RAM footprint for the 8 GB-VRAM path. If you have only 16 GB system RAM, the offload buffers themselves can become the bottleneck.
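
The second check can be automated. The sketch below reads total physical RAM through POSIX `os.sysconf` (available on Linux) and compares it with the 32 GB minimum and the 36 GB community-documented footprint quoted in this recipe. The helper names and the 10% slack, which allows for a nominal 32 GB machine reporting slightly under 32 GiB, are assumptions.

```python
import os

RECIPE_MIN_RAM_GB = 32   # minimum stated in this recipe
COMMUNITY_RAM_GB = 36    # footprint documented by the ComfyUI node


def total_ram_gib():
    """Total physical RAM in GiB, or None where sysconf is unavailable."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1024**3


def ram_verdict(ram_gib, slack=0.9):
    """Classify installed RAM against the two figures, with assumed slack."""
    if ram_gib < RECIPE_MIN_RAM_GB * slack:
        return "below the recipe minimum"
    if ram_gib < COMMUNITY_RAM_GB * slack:
        return "meets the minimum, below the community-documented footprint"
    return "ok"
```

A 16 GB machine lands in the first bucket, which matches the warning above about offload buffers becoming the bottleneck.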

I want the A3B MoE variant instead

It does not fit a 12 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.43 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all ~39B total parameters must be resident because the MoE router picks experts per token at runtime. Keep to the 8B-MoT dense variant on this card, or request the A3B recipe via /contribute.

My image is washed-out or has CFG artifacts

The Q4_K_S file downloaded above carries the 8-step weights, so the Running commands use the model card's 8-step settings (--cfg_scale 1.0 --num_steps 8). The upstream docs/base_vs_distill.md compares the 8-step distilled path against the standard checkpoint; the 8-step preview model is flagged with known issues. If 8-step output looks off, the model card's standard text-to-image preset is --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 — slower, but the reference quality. Report regressions via the submission form.
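
The two sampler presets quoted above differ only in guidance scale and step count, which is easy to see side by side. This is a small sketch with hypothetical names (`PRESETS`, `preset_flags`); the values are copied from this recipe, and whether the standard preset should be paired with a different checkpoint is not stated here, so check docs/base_vs_distill.md before switching.

```python
# Sampler settings as quoted in this recipe.
PRESETS = {
    "8step": {"cfg_scale": 1.0, "cfg_norm": "none",
              "timestep_shift": 3.0, "num_steps": 8},
    "standard": {"cfg_scale": 4.0, "cfg_norm": "none",
                 "timestep_shift": 3.0, "num_steps": 50},
}


def preset_flags(name):
    """Turn a preset into the CLI flags used by the inference script."""
    flags = []
    for key, value in PRESETS[name].items():
        flags += [f"--{key}", str(value)]
    return flags


def changed_keys(a, b):
    """Settings whose values differ between two presets."""
    return sorted(k for k in PRESETS[a] if PRESETS[a][k] != PRESETS[b][k])
```

`changed_keys("8step", "standard")` returns `["cfg_scale", "num_steps"]`, so those are the only two flags to edit when moving between presets.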