SenseNova U1 (8B-MoT) on RTX 4060 Ti 16GB: VAE-Free Unified Image Gen + Understanding via Q4 GGUF

What You'll Build

A working SenseNova U1 install on a 16 GB RTX 4060 Ti that does all four U1 tasks from one model — text-to-image at up to 2048×2048, visual Q&A on images, image editing, and interleaved text+image generation — without a VAE or separate visual encoder. The recipe pins the 8B-MoT dense variant (~18B total parameters: ~8B understanding + ~8B generation merger), running at Q4_K_S GGUF (13.9 GB on disk) with --vram_mode balanced to keep peak VRAM in the ~10–12 GB range that the official model card calls out as the consumer-GPU sweet spot.

Hardware data: RTX 4060 Ti (16GB VRAM) · Q4_K_S GGUF, --vram_mode balanced · Benchmark data: /check/sensenova-u1/rtx-4060-ti-16gb

ℹ️ Unified gen + understanding, in the image vertical. SenseNova U1 is a multimodal model — the same checkpoint produces images, answers questions about images, edits images, and produces interleaved text+image output. We catalogue it under image to match the wider site's grouping (and our /check/ row for it); the model card is explicit that the NEO-unify architecture models "pixel-word information end-to-end" via a single backbone.

ℹ️ One model name, two very different variants. The U1 release ships two siblings from the same sensenova org: the dense SenseNova-U1-8B-MoT (~18B total params, this recipe) and the sparse-MoE SenseNova-U1-A3B-MoT (~39B total / ~3B active). The A3B's smallest community quant — smthem/SenseNova-U1-A3B-MoT-SFT-gguf Q4_K_S — is 26.4 GB on disk, so the MoE variant does not fit a 16 GB card at any currently-published quant. This recipe is only valid for the dense 8B-MoT row.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (per model card "~10–12 GB consumer cards" guidance)	RTX 4060 Ti (16 GB)
RAM	32 GB (`--vram_mode balanced` does async CPU↔GPU prefetch — system RAM holds non-resident layers)	—
Storage	14 GB (Q4_K_S GGUF) or ~33 GB (full BF16 merger checkpoint)	—
Python	3.11	—
PyTorch	2.8 + CUDA 12.6 or 12.8 wheel (Ada sm_89 supported on both)	—
Software	uv package manager; ComfyUI optional	—

The model is released under Apache 2.0 — commercial use is permitted.

Installation

1. Clone the upstream repo

git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1

2. Install with uv

uv sync
source .venv/bin/activate

Per docs/installation.md, "Python 3.11, torch 2.8, CUDA 12.8 (cu128)" is the recommended configuration. The default uv sync pulls cu128 wheels. RTX 40-series cards (sm_89, Ada Lovelace) have full kernel coverage on both cu128 and cu126, so the default is fine. If you have older NVIDIA drivers, you can switch to cu126 by editing [tool.uv.sources] in pyproject.toml per the upstream guide before running uv sync. Unlike Blackwell GPUs (sm_120), the 4060 Ti needs no special wheel selection: the default pip install torch already includes sm_89 kernels.
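
To confirm which CUDA build and GPU architecture the environment actually sees after uv sync, a quick check along these lines may help. It is a minimal sketch: sm_tag and describe_gpu are hypothetical helper names, and the only PyTorch calls used are torch.cuda.is_available(), torch.cuda.get_device_capability(), and torch.version.cuda.

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability pair as an sm_XY tag."""
    return f"sm_{major}{minor}"


def describe_gpu() -> str:
    """Report the CUDA build and architecture of GPU 0, if one is visible."""
    try:
        import torch  # imported lazily so sm_tag works without torch installed
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "torch is installed but no CUDA device is visible"
    major, minor = torch.cuda.get_device_capability(0)
    return f"CUDA {torch.version.cuda}, {sm_tag(major, minor)}"


if __name__ == "__main__":
    # An RTX 4060 Ti (Ada Lovelace) is expected to report sm_89.
    print(describe_gpu())
```

Run it inside the activated .venv. If it reports that no CUDA device is visible, the driver or wheel selection is worth revisiting before downloading any weights.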

3. Add GGUF support

uv pip install -e ".[gguf]"

This is what lets the t2i CLI load the smaller Q4 quantization rather than the full BF16 checkpoint. The upstream README lists "gguf>=0.10.0" and "diffusers>=0.30.0" as the relevant pins.
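
If you want to verify that the installed packages satisfy those pins, the following sketch compares versions numerically using only the standard library. The helper names (version_tuple, meets_pin, check_pins) are hypothetical, the package names are taken from the quoted pins, and the parser only handles plain dotted versions such as 0.10.0.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.10.0' into (0, 10, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_pin(installed: str, minimum: str) -> bool:
    """Compare numerically, so that 0.10.0 ranks above 0.9.1."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_pins(pins: dict) -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in pins.items():
        try:
            report[name] = meets_pin(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    print(check_pins({"gguf": "0.10.0", "diffusers": "0.30.0"}))
```

The numeric comparison matters here because a plain string comparison would rank "0.9.1" above "0.10.0".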

4. Download the Q4_K_S GGUF merger weights

huggingface-cli download smthem/SenseNova-U1-8B-MoT-Merger-gguf \
  SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --local-dir ./checkpoints

The GGUF redistributor publishes a per-quant-tier file table — for this card you want Q4_K_S = 13.9 GB on disk (the smallest stable quant); Q6_K = 16.0 GB sits at or above the card's capacity and only works with aggressive prefetch; Q8_0 = 20 GB and BF16 ~33 GB do not fit. The upstream model card links this exact mirror in its quantized-weights section ("GGUF weights for SenseNova-U1-8B-MoT-Merger … are available at: smthem/SenseNova-U1-8B-MoT-Merger-gguf").
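
A truncated or failed download (for example, an HTML error page saved under the .gguf name) is easy to catch before the first inference run. The sketch below checks the file header, relying on the fact that GGUF files begin with the four ASCII bytes GGUF followed by a little-endian uint32 format version. The function names are hypothetical and the check says nothing about whether the tensors inside are intact.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the header is not GGUF."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The magic is followed by a little-endian uint32 format version.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def summarize(path):
    """One-line report: size in decimal GB plus the GGUF version, if any."""
    size_gb = Path(path).stat().st_size / 1e9
    version = read_gguf_version(path)
    status = f"GGUF v{version}" if version is not None else "not a GGUF file"
    return f"{path}: {size_gb:.1f} GB, {status}"
```

Calling summarize("./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf") after the download should report a size close to the mirror's 13.9 GB figure; a much smaller number suggests an interrupted transfer.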

5. (Optional) Install FlashAttention

uv sync --extra flash

Per docs/installation.md, flash-attention is optional — the model "transparently falls back to torch SDPA" without it. On Ada sm_89, FA2 wheels have full kernel coverage, so this step works out of the box if you want the speedup; you can also just skip it and rely on SDPA. Either path is fine on the 4060 Ti.

Running

Text-to-image with Q4 GGUF + balanced VRAM mode

python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --gguf_checkpoint ./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf \
  --vram_mode balanced \
  --prompt "A male peacock trying to attract a female" \
  --width 2048 --height 2048 \
  --cfg_scale 1.0 --num_steps 8 \
  --output output.png

Per the upstream README, --vram_mode balanced does "async prefetch [that] overlaps H2D copy with compute," and "Q4 GGUF + balanced is the recommended setup for ~10–12 GB consumer cards." On a 16 GB 4060 Ti this leaves ~4 GB of headroom on top of that envelope for the understanding-side activations and any small surge during the prefill.
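
For more than one prompt, the same command can be assembled from Python. This is a minimal sketch that only uses the flags shown in the command above; build_t2i_command and run_batch are hypothetical helpers, not part of the upstream repo, and the sketch assumes it is run from the repo root inside the activated environment.

```python
import subprocess
import sys


def build_t2i_command(prompt, output, gguf_checkpoint,
                      vram_mode="balanced", width=2048, height=2048,
                      cfg_scale=1.0, num_steps=8):
    """Assemble the argv list for examples/t2i/inference.py."""
    return [
        sys.executable, "examples/t2i/inference.py",
        "--model_path", "sensenova/SenseNova-U1-8B-MoT",
        "--gguf_checkpoint", gguf_checkpoint,
        "--vram_mode", vram_mode,
        "--prompt", prompt,
        "--width", str(width), "--height", str(height),
        "--cfg_scale", str(cfg_scale), "--num_steps", str(num_steps),
        "--output", output,
    ]


def run_batch(prompts, gguf_checkpoint):
    """Run prompts one at a time; each invocation starts a new process."""
    for index, prompt in enumerate(prompts):
        command = build_t2i_command(prompt, f"output_{index:03d}.png", gguf_checkpoint)
        subprocess.run(command, check=True)  # raises if the script exits non-zero
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain quotes or spaces. Because every call launches the script again, the weights are reloaded per prompt, so this is convenient rather than fast.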

If you're sharing the GPU with another process (e.g. a desktop compositor + a browser eating ~1 GB), drop to --vram_mode low, which does synchronous per-layer CPU↔GPU swap — slower but cuts peak resident VRAM further.
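
To see how much VRAM other processes already hold before choosing a mode, nvidia-smi can be queried from Python. The sketch below is illustrative: the function names are hypothetical, and the 1024 MiB threshold is an assumption derived from the ~1 GB example above, not an upstream recommendation.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse 'used, total' rows (MiB) into a list of integer pairs."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def suggest_vram_mode(used_mib, busy_threshold_mib=1024):
    """Suggest 'low' when other processes already hold about 1 GB or more.

    The threshold is an assumption for illustration; tune it for your setup.
    """
    return "low" if used_mib >= busy_threshold_mib else "balanced"


def current_suggestion():
    """Query the first GPU via nvidia-smi and return a suggested --vram_mode."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    used_mib, _total_mib = parse_memory_csv(output)[0]
    return suggest_vram_mode(used_mib)
```

Run current_suggestion() before loading the model, since the measurement is only meaningful while the inference process itself is not yet using the card.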

Visual Q&A (understanding) on the same model

python examples/vqa/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT \
  --image examples/vqa/data/images/menu.jpg \
  --question "What's the cheapest item on this menu?" \
  --max_new_tokens 8192 \
  --output answer.txt

The unified architecture (NEO-unify, per the model card) means VQA, image editing, and interleaved text+image generation all run from the same checkpoint — see examples/editing/ and examples/interleave/ in the repo for the other two task scripts.

ComfyUI alternative

A community ComfyUI node, smthemex/ComfyUI_SenseNova_U1, maintained by the same author as the GGUF mirror, wraps the same checkpoints in a ComfyUI workflow. Its README documents "Test it use 8G Vram 36G Ram" (with aggressive CPU offload) and "If Vram >16G make prefetch_count =0". Read literally, that threshold is more than 16 GB, so a 16 GB card sits right at the boundary: keep the default prefetch_count for headroom, and try prefetch_count=0 (full-GPU residency) only when nothing else is using the GPU.

Results

Speed: Not quoted here — the only cited timing in upstream docs (model card) names H100/H200 with TP2+CFG2 ("~9 s end-to-end for 2048×2048", "~0.15 s/step"). No comparable single-card RTX-class benchmark exists yet for SenseNova U1. The /check/sensenova-u1/rtx-4060-ti-16gb page will surface community-submitted timings as they land.
VRAM usage: Per the HF model card, Q4 GGUF + --vram_mode balanced is the recommended setup for "~10–12 GB consumer cards." The Q4_K_S file itself is 13.9 GB on disk; runtime peak with the balanced mode's async prefetch sits in the documented 10–12 GB band, leaving ~4 GB headroom on a 16 GB 4060 Ti for the understanding-path activations.
Quality notes: The VAE-free / visual-encoder-free design — described on the model card as modeling "pixel-word information end-to-end" via the NEO-unify architecture — is the headline differentiator vs. Flux/SD-class diffusion models. The same checkpoint does generation, understanding, editing, and interleaved output without swapping weights, which is what makes the unified-VRAM budget realistic.

For the full benchmark data, see /check/sensenova-u1/rtx-4060-ti-16gb.

Troubleshooting

`flash_attn` install fails or you'd rather skip it

FlashAttention is optional here. Per the upstream installation guide, the model "transparently falls back to torch SDPA" without it. Run uv sync without the --extra flash flag and the inference path will use PyTorch's built-in SDPA — which has full sm_89 (Ada) support in torch 2.8 on both cu126 and cu128. You won't see a runtime crash, just a small speed reduction in the attention layers.

"Out of memory" with `--vram_mode balanced` on the 16 GB card

Two escalation paths, in order:

1. Drop quantization tier: if you were using Q6_K (16.0 GB on disk, per the mirror), switch to Q4_K_S (13.9 GB). Q6_K sits at the card's capacity and only works with very aggressive prefetch + a clean GPU (no other processes).
2. Switch to --vram_mode low: synchronous per-layer CPU↔GPU swap; slower but cuts peak VRAM further. This is the same path the ComfyUI node uses to achieve the "8G Vram 36G Ram" footprint.

I want the A3B MoE variant instead

It does not fit a 16 GB card at any currently-published quant. The smallest community GGUF is smthem/SenseNova-U1-A3B-MoT-SFT-gguf at Q4_K_S = 26.4 GB. The A3B's "~3B active params" describes inference compute, not VRAM — all 39B total parameters must be resident because the MoE router picks experts at runtime. Until a smaller quant lands (Q3 / Q2_K), keep to the 8B-MoT dense variant on this card or request the A3B recipe via /contribute.
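
The size comparison behind that conclusion can be written out as a small table check. This is a rough sketch using only the on-disk sizes quoted in this recipe; the names are hypothetical, and comparing file size to card capacity is a coarse rule of thumb that ignores activations and CPU offload.

```python
CARD_VRAM_GB = 16.0  # RTX 4060 Ti 16GB

# On-disk sizes in GB as quoted in this recipe (the BF16 figure is approximate).
FILE_SIZES_GB = {
    "8B-MoT Q4_K_S": 13.9,
    "8B-MoT Q6_K": 16.0,
    "8B-MoT Q8_0": 20.0,
    "8B-MoT BF16": 33.0,
    "A3B-MoT Q4_K_S": 26.4,
}


def fits_below_capacity(size_gb, vram_gb=CARD_VRAM_GB):
    """Rule of thumb only: the weight file must be smaller than the card."""
    return size_gb < vram_gb


def report(vram_gb=CARD_VRAM_GB):
    """Return {name: True/False} for every file in the table."""
    return {name: fits_below_capacity(size, vram_gb)
            for name, size in FILE_SIZES_GB.items()}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name:16s} {'fits' if ok else 'does not fit'}")
```

With a 16 GB budget only the 8B-MoT Q4_K_S entry passes; Q6_K lands exactly on the limit, and the A3B file is over it by more than 10 GB.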

My generated image is washed-out or has CFG artifacts at default settings

Per the upstream README, the default for the standard checkpoint is --cfg_scale 4.0 --num_steps 50. The file downloaded above, however, is the 8-step distilled variant (its filename contains 8step), so it should be run with --cfg_scale 1.0 --num_steps 8, as in the example command. Running the distilled file with the standard defaults is the likely cause of these symptoms. If you switch to the non-distilled Q4_K_S.gguf (without 8step in the name), switch the inference flags back to --cfg_scale 4.0 --num_steps 50. Report unexpected quality regressions via the submission form.
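
One way to avoid mismatched flags is to derive them from the checkpoint filename. The sketch below encodes the two flag pairs described above; sampler_flags is a hypothetical helper, and it assumes the 8step marker in the filename reliably identifies the distilled variant.

```python
from pathlib import Path


def sampler_flags(gguf_path):
    """Return CLI flags matching the checkpoint type implied by its filename."""
    name = Path(gguf_path).name.lower()
    if "8step" in name:
        # 8-step distilled checkpoint
        cfg_scale, num_steps = 1.0, 8
    else:
        # standard (non-distilled) checkpoint
        cfg_scale, num_steps = 4.0, 50
    return ["--cfg_scale", str(cfg_scale), "--num_steps", str(num_steps)]


if __name__ == "__main__":
    print(sampler_flags("./checkpoints/SenseNova-U1-8B-MoT-8step-Q4_K_S.gguf"))
```

The returned list can be appended directly to an argv list passed to subprocess.run, so the flags always follow the file actually being loaded.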