Foundation-1 on RTX 4070 Ti SUPER: Structured Music Sample Generation

What You'll Build

A local, offline pipeline that turns structured tag prompts (instrument → timbre → FX → key → bars → BPM) into tempo-synced, bar-aligned music loops on your RTX 4070 Ti SUPER. Foundation-1 is a fine-tune of stabilityai/stable-audio-open-1.0 trained for music-production workflows; the RC Stable Audio Tools fork handles the BPM/bar timing alignment automatically.

Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · ~7 GB VRAM during generation per the HuggingFace model card (wide headroom on a 16 GB card) · See benchmark data

ℹ️ Not a text-to-speech model. Foundation-1 is in our tts vertical because the catalogue groups all audio models together, but it generates one-shot music samples — bar-locked instrumental loops — not speech. It does not synthesize voices, words, or any spoken audio. For speech synthesis on this GPU, see Kokoro, VoxCPM, or Qwen3-TTS instead. Per its own HuggingFace card: "Foundation-1 is a specialized model for music sample generation, not a general-purpose music generator."

⚠️ Licensing — read before shipping, and don't conflate the two licenses. The two halves of this project are licensed separately:

Weights (Foundation_1.safetensors) — Stability AI Community License. The model is a fine-tune of Stable Audio Open 1.0; per the repo's LICENSE.md it permits "Free use for research and non-commercial purposes" and "Limited Commercial use for entities with annual revenues below USD $1M", while "An enterprise license is required for commercial use by entities with annual revenues exceeding USD $1M." The model card restates this: "It is available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M."

Code (the RC Stable Audio Tools fork) — MIT-licensed.

If you're a hobbyist or under the $1M threshold you're clear; otherwise contact Stability AI for an enterprise license before publishing or selling outputs. The permissive MIT terms cover the tooling only — they do not loosen the weight license.

Why the 16 GB RTX 4070 Ti SUPER is over-provisioned for this model — and how to use the spare

Foundation-1's typical generation footprint is ~7 GB per the model card, against the 4070 Ti SUPER's 16 GB, so over half the card sits idle during a normal run. The 4070 Ti SUPER shares its 16 GB VRAM tier and Ada Lovelace (sm_89) architecture with the larger RTX 4080, so the install path and VRAM floor are identical on both. Compared with an 8 GB card, the difference is what you can do with the ~9 GB you don't need:

Colocate a second model. Keep a small LLM (e.g. a 7B Q4 quant, ~5 GB) or an ASR model (Whisper-small/medium) resident alongside Foundation-1 to build a prompt-to-loop or transcribe-to-loop pipeline without unloading either model between calls.
Run a DAW with GPU plugins or a browser-based DAW alongside generation without juggling memory the way 8 GB cards must.
Batch generation. Queue multiple seeds/prompts back-to-back — the headroom absorbs the transient allocation spikes that an 8 GB card can't.

You do not need any low-VRAM trick on this card; the default BF16 path fits with room to spare. The optional INT4 mode (below) exists only for sharing the card with a much larger model.
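
The headroom arithmetic above can be sketched as a small budget check. This is a rough illustration only: the 16 GB and ~7 GB figures come from this guide, while the function name and the 1 GB safety margin are assumptions made for the example, not measured values.

```python
# Figures quoted in this guide: 16 GB card, ~7 GB for Foundation-1 (model card).
CARD_VRAM_GB = 16.0
FOUNDATION_1_VRAM_GB = 7.0


def remaining_vram_gb(colocated_gb, safety_margin_gb=1.0):
    """Return the VRAM left after Foundation-1, colocated models, and a margin.

    safety_margin_gb is an assumed cushion for the desktop and transient
    allocation spikes; tune it for your own setup.
    """
    used = FOUNDATION_1_VRAM_GB + sum(colocated_gb) + safety_margin_gb
    return CARD_VRAM_GB - used


# Example: a ~5 GB 7B Q4 LLM kept resident next to Foundation-1.
print(remaining_vram_gb([5.0]))  # 3.0
```

A negative result means the planned combination does not fit and one model would have to be unloaded or quantized.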

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (per HF card)	RTX 4070 Ti SUPER (16GB, ~9 GB headroom)
RAM	16 GB system RAM	—
Storage	2.43 GB for the weights, plus several GB for the venv (the CUDA PyTorch wheels dominate)	—
Python	3.10 (3.11+ may fail SciPy resolution per the RC fork README)	—
PyTorch	2.4+ with CUDA wheel (default `pip install torch` is fine on Ada Lovelace sm_89)	—
Software	RC Stable Audio Tools fork or ComfyUI custom node	—

Installation

This recipe follows the canonical workflow recommended on the Foundation-1 model card — the RC Stable Audio Tools fork, which auto-handles BPM/bar timing alignment. For a ComfyUI alternative, see Troubleshooting.

1. Clone the RC Stable Audio Tools fork

git clone https://github.com/RoyalCities/RC-stable-audio-tools.git
cd RC-stable-audio-tools

2. Create a Python 3.10 virtual environment

Linux / macOS:

python3.10 -m venv venv
source venv/bin/activate

Windows:

py -3.10 -m venv venv
venv\Scripts\activate

3. Install stable-audio-tools and the fork

pip install stable-audio-tools
pip install .

4. (Windows only) Force a CUDA torch wheel

Windows venvs sometimes resolve to the CPU-only torch wheel, which makes Gradio fall back to CPU silently. If python -c "import torch; print(torch.cuda.is_available())" prints False, reinstall torch from the CUDA 12.1 channel, following the RC fork's documented Windows path. On Linux the default pip install torch already resolves to a CUDA build, and because the RTX 4070 Ti SUPER is Ada Lovelace (sm_89) it needs no special Blackwell/cu128 wheel:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

5. Download Foundation-1 weights

Place both files inside a single subfolder of models/ (the RC fork requires each model's checkpoint and its config to live in the same subfolder):

mkdir -p models/Foundation-1
cd models/Foundation-1
curl -L -o Foundation_1.safetensors \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/Foundation_1.safetensors
curl -L -o model_config.json \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/model_config.json
cd ../..
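
Because the fork expects the checkpoint and its config side by side, a quick check after downloading can catch a missing or zero-byte file before you launch. The sketch below is a hypothetical helper (the function name is made up here); it only confirms that both files from this step exist and are non-empty, and it does not validate their contents.

```python
from pathlib import Path

# File names used in the download step of this guide.
REQUIRED_FILES = ("Foundation_1.safetensors", "model_config.json")


def missing_model_files(model_dir):
    """Return the required file names that are absent or empty in model_dir."""
    folder = Path(model_dir)
    return [
        name
        for name in REQUIRED_FILES
        if not (folder / name).is_file() or (folder / name).stat().st_size == 0
    ]


# Run from the repository root, after the curl commands have finished.
missing = missing_model_files("models/Foundation-1")
print("OK" if not missing else f"Missing or empty: {missing}")
```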

The safetensors file is 2.43 GB (HF Files tab). This release ships only the 16-bit weights. Per the card: "Unlike prior releases where both 32-bit and 16-bit models were provided, this release includes only the 16-bit version." The card adds that this comes with "no quality loss, while reducing the model footprint."

Running

Launch the Gradio UI, pointing at the Foundation-1 checkpoint and config you just downloaded:

python run_gradio.py \
  --model-config models/Foundation-1/model_config.json \
  --ckpt-path models/Foundation-1/Foundation_1.safetensors

The Gradio interface opens in your browser. (Running bare python run_gradio.py with an empty models/ folder instead launches the fork's HuggingFace downloader UI, where you can fetch a model by repo id; after downloading you restart to get the full UI.) Foundation-1 uses a layered tag prompt schema documented on its model card:

[Instrument Family / Sub-Family], [Timbre], [Musical Behavior / Notation], [FX], [Key], [Bars], [BPM]

A working example prompt from the card's Audio Showcase:

Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Sub Bass,
Bass, Upper Mids, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich,
Overdriven, Crisp, Deep, Clean, Pitch Bend, 303, 8 Bars, 140 BPM, E minor
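
If you generate prompts from a script rather than typing them, the layered schema can be assembled programmatically. The following is a minimal sketch, assuming the tag order shown in the schema line above and the bar and BPM values this guide lists as supported; build_prompt and its parameters are hypothetical names, not part of the RC fork.

```python
# Values this guide lists as supported by Foundation-1.
SUPPORTED_BARS = (4, 8)
SUPPORTED_BPMS = (100, 110, 120, 128, 130, 140, 150)


def build_prompt(instrument, timbres, behaviors, fx, key, bars, bpm):
    """Join tag groups in schema order and validate the timing tags.

    instrument, timbres, behaviors, and fx are lists of tag strings;
    key is a string such as "E minor".
    """
    if bars not in SUPPORTED_BARS:
        raise ValueError(f"bars must be one of {SUPPORTED_BARS}, got {bars}")
    if bpm not in SUPPORTED_BPMS:
        raise ValueError(f"bpm must be one of {SUPPORTED_BPMS}, got {bpm}")
    tags = [*instrument, *timbres, *behaviors, *fx, key, f"{bars} Bars", f"{bpm} BPM"]
    return ", ".join(tags)


print(build_prompt(
    ["Bass", "FM Bass"], ["Gritty", "Warm"], ["Pitch Bend"],
    ["Medium Reverb"], "E minor", 8, 140,
))
# Bass, FM Bass, Gritty, Warm, Pitch Bend, Medium Reverb, E minor, 8 Bars, 140 BPM
```

Rejecting unsupported bar counts and tempos up front avoids spending a generation on a prompt the model was not trained to follow.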

Supported loop structures: 4 or 8 bars; supported BPMs: 100, 110, 120, 128, 130, 140, 150. The RC fork's BPM/bar selector locks generation duration to the prompt's musical structure automatically — per the card, "an 8-bar loop at 100 BPM ≈ 19 seconds" of output. Generations are saved as stereo .wav loops (and the fork auto-converts each sample to .MID), aligned to the bar/BPM grid above.
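
The card's "≈ 19 seconds" figure follows from simple tempo arithmetic. As a sketch, assuming 4/4 time (four beats per bar, which is consistent with the card's number but not stated in this guide), the loop length is bars × beats per bar × 60 / BPM:

```python
def loop_seconds(bars, bpm, beats_per_bar=4):
    """Length in seconds of a loop with the given bar count and tempo.

    beats_per_bar=4 assumes 4/4 time.
    """
    return bars * beats_per_bar * 60.0 / bpm


print(round(loop_seconds(8, 100), 1))  # 19.2
print(round(loop_seconds(4, 140), 1))  # 6.9
```

The RC fork does this alignment for you; the calculation is mainly useful when driving the model from a front-end that asks for a duration in seconds.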

Results

Speed: The model card reports generation time on an RTX 3090 only: "On an RTX 3090, generation time is approximately ~7–8 seconds per sample." The RTX 3090 is an Ampere card (sm_86, 24 GB, 936 GB/s); the RTX 4070 Ti SUPER is Ada Lovelace (sm_89, 16 GB, ~672 GB/s). The two differ in both architecture and memory bandwidth, so the 3090 figure cannot be carried over as a 4070 Ti SUPER number. No generation-time measurement on an RTX 4070 Ti SUPER has been published yet, and we won't estimate one. Once a community benchmark lands it will appear at /check/foundation-1/rtx-4070-ti-super; please contribute yours via the submission form.
VRAM usage: ~7 GB during generation per the HF card: "Typical VRAM usage during generation is approximately ~7 GB." and "For reliable operation, a GPU with at least 8 GB of VRAM is recommended." On the RTX 4070 Ti SUPER's 16 GB that leaves roughly half the card free — comfortable enough to colocate a second model (see the headroom section above).
Output: stereo .wav loops aligned to the requested bar count and BPM. Per the model card limitations, percussion and drum sounds are out of scope for this release; the 10 instrument families covered are Synth, Keys, Bass, Bowed Strings, Mallet, Wind, Guitar, Brass, Vocal, and Plucked Strings.

For the full benchmark data, see /check/foundation-1/rtx-4070-ti-super.

Troubleshooting

Gradio launches but reports `torch.cuda.is_available() == False`

Either you didn't activate the venv before launching, or pip install torch resolved to the CPU wheel (Windows is the common culprit). Re-run step 4 to force the CUDA 12.1 channel, then verify:

python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device visible')"

Should print True NVIDIA GeForce RTX 4070 Ti SUPER (or similar).
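
To tell a CPU-only wheel apart from a driver problem, it helps to print the CUDA version the wheel was built against as well. The helper below is an illustrative sketch (cuda_report is a made-up name); it relies on torch.version.cuda being None for CPU-only builds and degrades gracefully when torch is not installed at all.

```python
import importlib.util


def cuda_report():
    """Return a one-line description of the CUDA state PyTorch sees."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    if torch.version.cuda is None:
        return f"torch {torch.__version__}: CPU-only wheel, reinstall from a CUDA channel"
    if not torch.cuda.is_available():
        return (
            f"torch {torch.__version__} (CUDA {torch.version.cuda} build): "
            "no GPU visible, check the NVIDIA driver"
        )
    return f"torch {torch.__version__}: {torch.cuda.get_device_name(0)}"


print(cuda_report())
```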

Dependency resolution failures on Python 3.11+

The RC fork's README says: "Use Python 3.10. Newer versions (e.g. 3.11+) can fail dependency resolution due to pinned packages (notably older SciPy wheels)." Use a Python 3.10 venv as in step 2.
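
A quick way to confirm which interpreter the active venv uses is to check it from Python itself. This is a small sketch with a hypothetical helper name; it only compares the major and minor version against the 3.10 recommendation quoted above.

```python
import sys


def is_supported_python(version_info=sys.version_info):
    """True when the interpreter is Python 3.10.x, as the README recommends."""
    return tuple(version_info[:2]) == (3, 10)


print(sys.version.split()[0], "supported:", is_supported_python())
```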

Prefer ComfyUI over Gradio

Two community ComfyUI custom nodes exist:

Saganaki22/ComfyUI-Foundation-1 — auto-downloads weights into ComfyUI/models/stable_audio/Foundation-1/, ships example workflows. Install via ComfyUI Manager (recommended) or git clone into ComfyUI/custom_nodes/ then python install.py. The install script uses pip install stable-audio-tools --no-deps to protect your ComfyUI environment from the upstream's aggressive pandas==2.0.2 pin.
SanDiegoDude/scg_Foundation-1-comfyUI — install via ComfyUI Manager recommended; weights land at ComfyUI/models/audio_checkpoints/Foundation-1/.

The same ~7 GB VRAM envelope and 8 GB minimum apply regardless of front-end. On the RTX 4070 Ti SUPER 16GB, both nodes have plenty of headroom.

Want to share the card with a much larger model? Enable INT4 / Low-VRAM Mode (TorchAO)

You don't need this on a 16 GB card — the default BF16 path fits comfortably — but the RC fork ships an optional INT4 weight-only mode you can use if you want to run Foundation-1 alongside a much larger model. Install:

# Windows (pinned, recommended)
pip install torchao==0.12.0
# Linux
pip install torchao

The fork notes INT4 mode "can be very slow on Windows because Triton fast-kernels are usually unavailable (falls back to slower paths)." If TorchAO isn't installed, the INT4 toggle stays hidden in the UI.
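
If the toggle does not appear, first confirm that TorchAO is installed in the same venv that launches Gradio. The sketch below uses only the standard library; installed_version is a hypothetical helper and does not reproduce the fork's own detection logic.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("torchao")
print(f"torchao {version}" if version else "torchao not installed")
```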

Prompt produces drift or incoherent phrases

Per the model card's Limitations section, if the generation duration is shorter than the musical structure implied by the prompt (for example requesting an 8-bar loop but generating only 5 seconds), the model may produce less coherent musical phrases. The RC fork handles this alignment automatically — if you're using bare stable-audio-tools or a third-party UI, set the audio duration to match the prompt's bar/BPM structure. Also: keep prompts in the documented tag order, use 1–3 timbre descriptors, and always include both Bars and BPM.

Percussion or drum prompts produce garbage

This is by design. The card states: "Percussion and drum sounds are outside the scope of this release." Use a different tool (e.g. a drum sample library or a percussion-specific model) for drum loops.

No widely-reported issues on RTX 4070 Ti SUPER specifically — if you hit one, report it via the submission form.