
SenseNova U1 (8B-MoT) on RTX 4070: VAE-Free Unified Image Gen + Understanding via Q4 GGUF + Layer Offload

image · intermediate · 12GB+ VRAM · Jun 7, 2026
Prerequisites
  • NVIDIA RTX 4070 (12GB VRAM) or equivalent Ada Lovelace card
  • Python 3.11
  • PyTorch 2.8 with a recent CUDA wheel (cu128 default, or cu126 — both ship sm_89 kernels for Ada)
  • uv package manager
  • 32GB+ system RAM (layer-offload streams non-resident layers through host memory)
  • ~14GB free disk for the Q4_K_S GGUF checkpoint

What You'll Build

A working SenseNova U1 install on a 12 GB RTX 4070 that does all four U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or a separate visual encoder. The recipe pins the 8B-MoT dense variant (~18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.88 GB on disk) with the layer-offload --vram_mode that streams the language-model weights between CPU RAM and the GPU so the resident footprint fits a 12 GB card.

Hardware data: RTX 4070 (12GB VRAM) · Q4_K_S GGUF + layer-offload --vram_mode · See benchmark data

ℹ️ Unified gen + understanding, in the image vertical. SenseNova U1 is a multimodal model — the same checkpoint produces images, answers questions about images, edits images, and produces interleaved text+image output. We catalogue it under image to match the wider site's grouping (and our /check/ row for it); the model card is explicit that its NEO-unify architecture "eliminates both Visual Encoder (VE) and Variational Auto-Encoder (VAE) where pixel-word information are inherently and deeply correlated."

⚠️ Tight fit on 12 GB — lead with the aggressive offload mode. A 12 GB desktop card with a display attached only exposes roughly 10.5–11.3 GB usable. The Q4 GGUF + balanced combination that the model card recommends for "~10–12 GB consumer cards" sits at the very top of that band, so on a 12 GB card with a monitor it can OOM. This recipe therefore leads with the more aggressive low mode (and a community-measured sub-12 GB ComfyUI path) and demotes balanced to "use it if you have spare VRAM." See Running.
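The mode choice above can be sketched as a small preflight check. This is a minimal sketch, not part of the upstream repo: the function names and the 11.5 GB floor for balanced are assumptions chosen to sit above the 10.5–11.3 GB a display-attached card typically exposes, and should be tuned against your own measurements.

```python
def pick_vram_mode(free_gb: float, balanced_floor_gb: float = 11.5) -> str:
    """Return 'balanced' only when free VRAM clears the floor, else 'low'.

    balanced_floor_gb is a hypothetical threshold, not an upstream value.
    """
    return "balanced" if free_gb >= balanced_floor_gb else "low"


def free_vram_gb(device: int = 0) -> float:
    """Query free VRAM in decimal GB via PyTorch (needs a CUDA-enabled torch)."""
    import torch  # imported lazily so pick_vram_mode works without torch

    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1e9
```

On the target machine, `pick_vram_mode(free_vram_gb())` gives a suggested value for --vram_mode; a 4070 driving a monitor would normally land on low.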

ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same sensenova org: the dense SenseNova-U1-8B-MoT (~18B total params, this recipe) and the sparse-MoE SenseNova-U1-A3B-MoT (~39B total / ~3B active). The A3B's smallest community quant — smthem/SenseNova-U1-A3B-MoT-SFT-gguf Q4_K_S — is 26.43 GB on disk, so the MoE variant does not fit a 12 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (Q4 GGUF + layer offload; see Running for the mode) | RTX 4070 (12 GB)
RAM | 32 GB+ (offload holds non-resident layers in pinned host memory; the community ComfyUI node documents a 36 GB-RAM footprint) |
Storage | 14 GB (Q4_K_S GGUF) |
Python | 3.11 |
PyTorch | 2.8 + CUDA 12.8 wheel (cu128 default; cu126 also fine, both ship Ada sm_89 kernels) |
Software | uv package manager; ComfyUI optional |

The model is released under Apache 2.0 — commercial use is permitted.

ℹ️ PCIe Gen4 ×16 host bandwidth on the offload path. The --vram_mode offload streams non-resident language-model layers from system RAM over the PCIe bus during the forward pass. The RTX 4070 runs a full Gen4 ×16 link — half the host bandwidth of a Gen5 card — so the synchronous low-mode swap is workable but its throughput on the offloaded portion is lower than on a Gen5 board. The fit on 12 GB is unaffected (offload is about residency, not bandwidth); only the offloaded streaming speed is gated by the Gen4 link.
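To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch using theoretical one-direction PCIe ×16 rates (16 GT/s per lane for Gen4, 32 GT/s for Gen5, 128b/130b encoding). It assumes, as a deliberately pessimistic bound, that the whole 13.88 GB Q4_K_S file crosses the bus once per forward pass; in practice only the offloaded language-model layers stream, and real links run below the theoretical rate.

```python
def pcie_x16_gb_per_s(gt_per_s: float) -> float:
    """Theoretical one-direction bandwidth of a x16 link in GB/s."""
    lanes = 16
    encoding = 128 / 130  # 128b/130b line encoding used by Gen3 and later
    return gt_per_s * lanes * encoding / 8  # 8 bits per byte


GEN4_GB_S = pcie_x16_gb_per_s(16.0)  # about 31.5 GB/s
GEN5_GB_S = pcie_x16_gb_per_s(32.0)  # about 63.0 GB/s


def min_stream_seconds(streamed_gb: float, link_gb_s: float) -> float:
    """Lower bound on the time to push streamed_gb across the link once."""
    return streamed_gb / link_gb_s


Q4_K_S_FILE_GB = 13.88  # on-disk size; an upper bound on bytes streamed per pass
print(f"Gen4: >= {min_stream_seconds(Q4_K_S_FILE_GB, GEN4_GB_S):.2f} s per full pass")
print(f"Gen5: >= {min_stream_seconds(Q4_K_S_FILE_GB, GEN5_GB_S):.2f} s per full pass")
```

The ratio between the two links is exactly 2×, which is the "half the host bandwidth" figure; the absolute times are only bounds under the stated assumption.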

Installation

1. Clone the upstream repo

git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1

2. Install with uv

uv sync
source .venv/bin/activate

Per the upstream installation guide, the recommended software stack is "Python 3.11, torch 2.8, CUDA 12.8 (cu128)." The default uv sync pulls cu128 wheels, and the RTX 4070 (sm_89 Ada Lovelace) has full kernel coverage on both cu128 and cu126 — so the default is fine. If your driver does not support cu128, the same guide says you can change [tool.uv.sources] / [[tool.uv.index]] in pyproject.toml to a cu126 index before running uv sync. Unlike Blackwell GPUs (sm_120), the RTX 4070 needs no special wheel selection — the default uv sync already includes sm_89 kernels.
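If you want to confirm that the installed wheel can actually run on the card, a check along these lines may help. It is a sketch under assumptions: the helper names are hypothetical, and it relies on the CUDA rule that a binary built for a lower minor version of the same major architecture (for example sm_86) also runs on an sm_89 device, so a wheel does not have to list sm_89 literally. PTX-only entries such as compute_90 are ignored here.

```python
import re


def parse_sm(arch: str):
    """Turn 'sm_89' into (8, 9); return None for other entries like 'compute_90'."""
    match = re.fullmatch(r"sm_(\d+)", arch)
    if match is None:
        return None
    number = int(match.group(1))
    return number // 10, number % 10


def runs_natively(capability, arch_list) -> bool:
    """True if arch_list holds a binary with the same major and an equal or lower minor."""
    dev_major, dev_minor = capability
    for arch in arch_list:
        parsed = parse_sm(arch)
        if parsed is None:
            continue
        major, minor = parsed
        if major == dev_major and minor <= dev_minor:
            return True
    return False
```

Inside the activated venv, `runs_natively(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())` should return True on an RTX 4070, whose capability is (8, 9).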

3. Add GGUF support

uv pip install -e ".[gguf]"   # or: pip install "gguf>=0.10.0" "diffusers>=0.30.0"

This is what lets the inference scripts load a quantized .gguf file via the diffusers GGUF Linear layer instead of the full BF16 safetensors — the version pins above are the ones the model card lists for the optional GGUF extra. The base --model_path is still required for the tokenizer, config, and non-LM weights.

4. Download the Q4_K_S GGUF merger weights

huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
  SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --local-dir ./checkpoints

The GGUF redistributor publishes a per-quant-tier file table. For a 12 GB card you want the smallest stable quant, SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf, at 13.88 GB on disk. Every larger tier (SenseNova-U1-8B-MoT-8step-Q6_K.gguf = 16.06 GB, SenseNova-U1-8B-MoT-8step-Q8_0.gguf = 19.99 GB, and the BF16 merger safetensors at 35.10 GB) needs a much larger resident footprint and is not worth fighting on a 12 GB card. The upstream model card links this exact mirror in its memory-efficient-inference section and credits GitHub user @smthem for contributing the quantized GGUF weights, so the mirror is endorsed by the canonical org itself.
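The tier sizes quoted above make the case for offload on their own. The sketch below simply compares each on-disk size with the card's nominal 12 GB; it treats file size as a proxy for weight bytes and ignores activations and runtime overhead, so the real gap is larger.

```python
# On-disk sizes in GB, as quoted from the redistributor's file table.
TIERS_GB = {"Q4_K_S": 13.88, "Q6_K": 16.06, "Q8_0": 19.99, "BF16": 35.10}
CARD_GB = 12.0  # nominal VRAM; usable VRAM with a display attached is lower


def overflow_gb(file_gb: float, card_gb: float = CARD_GB) -> float:
    """GB by which the weights alone exceed the card if loaded fully resident."""
    return round(file_gb - card_gb, 2)


for tier, size_gb in TIERS_GB.items():
    print(f"{tier:7s} {size_gb:6.2f} GB on disk, {overflow_gb(size_gb):+.2f} GB vs 12 GB card")
```

Even the smallest tier overshoots the card, which is why the recipe pairs Q4_K_S with --vram_mode rather than loading the file fully resident.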

5. (Optional) Install FlashAttention

uv sync --extra flash

Per the upstream installation guide, flash-attn is declared as an optional extra — "without it the model transparently falls back to torch SDPA" — and once flash-attn is importable the runtime picks it automatically (--attn_backend auto). On Ada sm_89 the FlashAttention 2 prebuilt wheels ship working kernels, so this step succeeds out of the box if you want the attention speedup. You do not need it to run on the RTX 4070 — the SDPA fallback is fine — so skip it if uv sync already succeeded and you'd rather not build the extra.

Running

Text-to-image with Q4 GGUF + layer offload

On a 12 GB RTX 4070 with a display attached, lead with --vram_mode low — the model card describes it as a "Synchronous per-layer CPU↔GPU swap" whose role is the "Lowest VRAM footprint". It keeps the language-model layers in pinned host memory and streams them onto the GPU on demand, which is what drops the resident weight footprint under the ~11 GB the card actually exposes.

python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode low \
  --prompt "A male peacock trying to attract a female" \
  --width 2048 --height 2048 \
  --cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 8 \
  --output output.png

Per the model card, --vram_mode exists to "keep the language-model layers resident on CPU pinned memory and stream them onto the GPU on-demand during forward". On the RTX 4070's PCIe Gen4 ×16 bus this CPU↔GPU swap works fine, though it streams at about half the host bandwidth of a Gen5 board, so the synchronous low penalty is somewhat heavier than on a newer card. An independent data point for how low the resident footprint can go comes from the community ComfyUI node smthemex/ComfyUI_SenseNova_U1, which wraps the smthem/SenseNova-U1-8B-MoT-Merger-gguf mirror that the canonical card itself links: it documents "Test it use 8G Vram 36G Ram" for the 8B-MoT GGUF path with aggressive offload, comfortably under the 4070's 12 GB. (The downloaded GGUF file is the 8-step quant; the --num_steps 8 and --cfg_scale 1.0 flags above are the model card's 8-step settings.)

If you have more VRAM free — a headless box, or a second card driving the display — switch up to --vram_mode balanced, which the model card describes as "Async prefetch overlaps H2D copy with compute" and recommends together with a Q4 GGUF for "~10–12 GB consumer cards": it recovers speed by overlapping the copy with compute, but its peak sits at the top of the 12 GB band, so it's the right call only when you genuinely have the headroom. On the Gen4 link the prefetch overlap also has less bandwidth to hide behind than on a Gen5 board.
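If you script your runs, the command above can be assembled programmatically so that switching between low and balanced is a single argument. This is a convenience sketch, not an upstream API: build_t2i_argv is a hypothetical helper, and it only emits the flags already shown in this recipe.

```python
import shlex

MODEL_PATH = "sensenova/SenseNova-U1-8B-MoT"
GGUF_PATH = "./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf"


def build_t2i_argv(prompt, vram_mode="low", width=2048, height=2048,
                   output="output.png"):
    """Return the text-to-image command as an argv list (8-step settings)."""
    return [
        "python", "examples/t2i/inference.py",
        "--model_path", MODEL_PATH,
        "--gguf_checkpoint", GGUF_PATH,
        "--vram_mode", vram_mode,
        "--prompt", prompt,
        "--width", str(width), "--height", str(height),
        "--cfg_scale", "1.0", "--cfg_norm", "none",
        "--timestep_shift", "3.0", "--num_steps", "8",
        "--output", output,
    ]


PROMPT = "A male peacock trying to attract a female"
low_cmd = build_t2i_argv(PROMPT)                            # display-attached 4070
balanced_cmd = build_t2i_argv(PROMPT, vram_mode="balanced")  # headless, spare VRAM
print(shlex.join(low_cmd))
```

Passing the list to subprocess.run avoids shell-quoting problems with prompts that contain quotes or spaces.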

Visual Q&A (understanding) on the same model

python examples/vqa/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode low \
  --image examples/vqa/data/images/menu.jpg \
  --question "What's the cheapest item on this menu?" \
  --max_new_tokens 8192 \
  --output answer.txt

The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts. All four scripts accept the same --gguf_checkpoint and --vram_mode low flags on a 12 GB card.

ComfyUI alternative

The same community ComfyUI node — smthemex/ComfyUI_SenseNova_U1 — wraps these checkpoints in a workflow with a prefetch_count swap knob. For the 12 GB 4070 you keep the default layer-swap on (its documented "Test it use 8G Vram 36G Ram" path); the node's own note that "If Vram >16G make prefetch_count =0" is the disable-swap case for larger cards, so on a 4070 you leave swapping enabled.

Results

  • Speed: Omitted — no RTX 4070 benchmark exists for SenseNova U1 yet, and the only timing the upstream docs publish names enterprise hardware: "~0.15 s/step" and "~9 s end-to-end" for a 2048×2048 image on H100 / H200 with the TP2 + CFG2 LightLLM + LightX2V serving stack (model card). That figure cannot be forward-extrapolated to a single consumer RTX 4070 — different parallelism, an HBM memory-bandwidth class far above the 4070's ~504 GB/s GDDR6X, and a layer-offload path the datacenter stack doesn't use. The /check/sensenova-u1/rtx-4070 page will surface community-submitted single-card timings as they land — please contribute yours.
  • VRAM usage: The recipe path is Q4_K_S GGUF + --vram_mode low, the model card's "Lowest VRAM footprint" mode. The Q4_K_S file is 13.88 GB on disk, but with layer streaming the resident weight footprint drops well under that: the community ComfyUI node documents the 8B-MoT GGUF path at "8G Vram 36G Ram", which fits the 4070's 12 GB with display headroom. The min_vram_gb figure for this recipe therefore describes the streaming runtime envelope, not the on-disk file size. See /check/sensenova-u1/rtx-4070 for live data once a community 4070 measurement lands.
  • Quality notes: The VAE-free / visual-encoder-free design — the model card's NEO-unify architecture "eliminates both Visual Encoder (VE) and Variational Auto-Encoder (VAE)" — is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes a single unified-VRAM budget realistic on a 12 GB card.

For the full benchmark data, see /check/sensenova-u1/rtx-4070.
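Since no single-card timing exists yet, a simple wall-clock harness is one way to collect a number worth submitting. The sketch below is an assumption about how you might measure it, not a documented benchmark procedure: the helper names are hypothetical, and because the timer wraps the whole process (model load included), the per-step figure is only an upper bound.

```python
import subprocess
import sys
import time


def venv_python_argv(script, *args):
    """Build an argv that reuses the interpreter running this script (the active venv)."""
    return [sys.executable, script, *args]


def time_command(argv) -> float:
    """Run argv to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def seconds_per_step_upper_bound(wall_seconds: float, num_steps: int) -> float:
    """Wall time divided by steps; includes load time, so it overstates s/step."""
    return wall_seconds / num_steps
```

For example, wrap the text-to-image invocation with `venv_python_argv("examples/t2i/inference.py", ...)`, pass it to `time_command`, and report the wall time together with the resolution, the --vram_mode, and the step count.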

Troubleshooting

flash_attn install fails or you'd rather skip it

FlashAttention is optional here. Per the upstream installation guide, the model "transparently falls back to torch SDPA" without it. Run uv sync without the --extra flash flag and the inference path will use PyTorch's built-in SDPA — which has full sm_89 (Ada) support in torch 2.8 on both cu126 and cu128. You won't see a runtime crash, just a small speed reduction in the attention layers. (On Ada the FA2 prebuilt wheels do ship working sm_89 kernels, so uv sync --extra flash also succeeds if you want the speedup — unlike on the newest Blackwell cards, there is no kernel-availability gap to work around here.)

"Out of memory" on the 12 GB card

Two escalation paths, in order:

  1. Stay on --vram_mode low, not balanced. balanced recovers speed by prefetching, but its peak sits at the top of the model card's "~10–12 GB consumer cards" band — which a 12 GB card with a display (~11 GB usable) can't always absorb. low is the "Lowest VRAM footprint" mode and is the correct default for the 4070.
  2. Make sure you have enough system RAM. Layer offload moves weight residency off the GPU and into pinned host memory — the community ComfyUI node documents a 36 GB-RAM footprint for the 8 GB-VRAM path. If you have only 16 GB system RAM, the offload buffers themselves can become the bottleneck.
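The system-RAM check in step 2 can be automated before launching a run. This is a minimal sketch assuming a Linux host (it reads physical memory through os.sysconf); the function names and verdict strings are hypothetical, and the two thresholds are the 32 GB minimum and the 36 GB ComfyUI reference quoted in this recipe. Decimal GB are used because the OS reports slightly less than the nominal installed capacity.

```python
import os


def total_ram_gb() -> float:
    """Physical RAM in decimal GB (Linux: page size times physical pages)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9


def ram_verdict(total_gb: float, minimum_gb: float = 32.0,
                comfy_reference_gb: float = 36.0) -> str:
    """Classify host RAM against the recipe's minimum and the ComfyUI data point."""
    if total_gb < minimum_gb:
        return "below minimum"
    if total_gb < comfy_reference_gb:
        return "meets minimum"
    return "meets ComfyUI reference"
```

Calling `ram_verdict(total_ram_gb())` on a 16 GB machine returns "below minimum", which is the case where the offload buffers are likely to be the limiting factor.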

I want the A3B MoE variant instead

It does not fit a 12 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.43 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all ~39B total parameters must be resident because the MoE router picks experts per token at runtime. Keep to the 8B-MoT dense variant on this card, or request the A3B recipe via /contribute.
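The active-versus-total distinction is easy to see with the numbers above. The sketch below derives an average bytes-per-parameter figure from the file size (an approximation, since a quantized file mixes precisions) and shows what the footprint would be if only the active parameters had to be resident, which is the intuition that does not hold for MoE inference.

```python
A3B_Q4_FILE_GB = 26.43   # smallest published A3B quant, on disk
TOTAL_PARAMS_B = 39.0    # approximate total parameters, in billions
ACTIVE_PARAMS_B = 3.0    # approximate active parameters per token, in billions
CARD_GB = 12.0


def avg_bytes_per_param(file_gb: float, params_b: float) -> float:
    """Average storage per parameter implied by the file size."""
    return file_gb / params_b


per_param = avg_bytes_per_param(A3B_Q4_FILE_GB, TOTAL_PARAMS_B)
naive_active_gb = ACTIVE_PARAMS_B * per_param  # what "3B active" wrongly suggests
resident_gb = A3B_Q4_FILE_GB                   # every expert must be loadable

print(f"naive active-only estimate: {naive_active_gb:.2f} GB")
print(f"weights that must be resident: {resident_gb:.2f} GB (card: {CARD_GB:.0f} GB)")
```

The roughly 2 GB active-only estimate describes per-token compute, not memory; the 26.43 GB that must be addressable is what rules the A3B out on this card.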

My image is washed-out or has CFG artifacts

The Q4_K_S file downloaded above carries the 8-step weights, so the Running commands use the model card's 8-step settings (--cfg_scale 1.0 --num_steps 8). The upstream docs/base_vs_distill.md compares the 8-step distilled path against the standard checkpoint; the 8-step preview model is flagged with known issues. If 8-step output looks off, the model card's standard text-to-image preset is --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 — slower, but the reference quality. Report regressions via the submission form.
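Keeping the two sampler presets side by side makes it harder to mix them up, for example by running 8 steps with a CFG scale of 4.0. The sketch below only encodes the flag values quoted above; the PRESETS table and preset_flags helper are hypothetical names. Whether the standard preset should be paired with the standard (non-distilled) weights rather than the 8-step GGUF is not stated here, so treat that pairing as an assumption to check against the model card.

```python
PRESETS = {
    # 8-step distilled settings used by the Running commands
    "8step": {"cfg_scale": 1.0, "cfg_norm": "none",
              "timestep_shift": 3.0, "num_steps": 8},
    # model card's standard text-to-image preset
    "standard": {"cfg_scale": 4.0, "cfg_norm": "none",
                 "timestep_shift": 3.0, "num_steps": 50},
}


def preset_flags(name: str) -> list:
    """Expand a preset into CLI flags, e.g. ['--cfg_scale', '1.0', ...]."""
    flags = []
    for key, value in PRESETS[name].items():
        flags += [f"--{key}", str(value)]
    return flags


print(" ".join(preset_flags("8step")))
print(" ".join(preset_flags("standard")))
```

The standard preset runs 50 steps instead of 8, so expect the sampling portion of a run to take several times longer.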