Foundation-1 on RTX 5070: Structured Music Sample Generation

What You'll Build

A local, offline pipeline that turns structured tag prompts (instrument → timbre → FX → key → bars → BPM) into tempo-synced, bar-aligned music loops on your RTX 5070. Foundation-1 is a fine-tune of stabilityai/stable-audio-open-1.0 trained for music-production workflows; the RC Stable Audio Tools fork handles the BPM/bar timing alignment automatically.

Hardware data: RTX 5070 (12 GB VRAM) · ~7 GB VRAM during generation per the HuggingFace model card (comfortable headroom on a 12 GB card) · Benchmark data: /check/foundation-1/rtx-5070

ℹ️ Not a text-to-speech model. Foundation-1 sits in our tts vertical because the catalogue groups all audio models together, but it generates one-shot music samples — bar-locked instrumental loops — not speech. It does not synthesize voices, words, or any spoken audio. For speech synthesis on this GPU, see Kokoro, VoxCPM, or Qwen3-TTS instead. The model's own HuggingFace card states plainly in its Limitations section that this is a specialized model for music sample generation, not a general-purpose music generator.

⚠️ Licensing — read before shipping. Foundation-1 weights are released under the Stability AI Community License. Per the HuggingFace model card, the weights are available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M; above that threshold you must refer to the repository license file for full terms. If you're a hobbyist or under the $1M threshold you're clear; otherwise contact Stability AI for a commercial license before publishing or selling outputs. (The RC fork code is MIT; the constraint is on the model weights.)

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (per HF card)	RTX 5070 (12 GB, comfortable headroom)
RAM	16 GB system RAM	—
Storage	~3 GB (2.43 GB weights + venv + deps)	—
Python	3.10 (3.11+ may fail SciPy resolution per the RC fork README)	—
PyTorch	2.7+ with the CUDA 12.8 (cu128) wheel for Blackwell (sm_120)	—
Software	RC Stable Audio Tools fork or ComfyUI custom node	—

Installation

This recipe follows the canonical workflow recommended on the Foundation-1 model card — the RC Stable Audio Tools fork, which auto-handles BPM/bar timing alignment. For a ComfyUI alternative, see Troubleshooting.

1. Clone the RC Stable Audio Tools fork

git clone https://github.com/RoyalCities/RC-stable-audio-tools.git
cd RC-stable-audio-tools

2. Create a Python 3.10 virtual environment

The RC fork README is explicit: use Python 3.10, because 3.11+ can fail dependency resolution on the fork's pinned packages (notably older SciPy wheels).

Linux / macOS:

python3.10 -m venv venv
source venv/bin/activate

Windows:

py -3.10 -m venv venv
venv\Scripts\activate

3. Install stable-audio-tools and the fork

pip install stable-audio-tools
pip install .

4. Replace torch with the Blackwell-compatible (cu128) wheel

The RC fork's install line pins CUDA 12.1 (whl/cu121). That wheel does not ship sm_120 kernels, so on the RTX 5070 it fails with the error "no kernel image is available for execution on the device". The RTX 5070 is Blackwell (GB205, sm_120), which needs the CUDA 12.8 build of PyTorch. Reinstall torch from the cu128 channel:

pip uninstall -y torch torchvision torchaudio
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

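(If a stable PyTorch 2.7+ build is published for cu128 on your platform, you can drop the --pre flag and replace the nightly URL with the stable cu128 index, https://download.pytorch.org/whl/cu128. Whichever wheel you install, confirm that it includes sm_120 kernels before relying on it.)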
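
One way to confirm that the reinstalled wheel actually carries Blackwell kernels is to inspect the architecture list compiled into the build. The sketch below is illustrative and not part of the RC fork: `has_arch` and `report` are hypothetical helper names, and the check relies on `torch.cuda.get_arch_list()`, which reports the architectures the installed binary was built for.

```python
"""Check whether the installed PyTorch build includes sm_120 (Blackwell) kernels."""


def has_arch(arch_list, target="sm_120"):
    """Return True if `target` appears in a list such as ['sm_90', 'sm_120']."""
    return target in arch_list


def report():
    """Print the compiled architecture list, or a note if torch is missing."""
    try:
        import torch  # imported lazily so has_arch() works without torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    archs = torch.cuda.get_arch_list()  # empty list when CUDA is unavailable
    print("torch", torch.__version__, "CUDA build:", torch.version.cuda)
    print("compiled architectures:", archs)
    print("sm_120 present:", has_arch(archs))


if __name__ == "__main__":
    report()
```

If `sm_120 present` comes back False on a machine where CUDA is otherwise available, the environment most likely still has a cu121 (or other pre-Blackwell) wheel installed.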

5. Download Foundation-1 weights

The release ships exactly two files: the 16-bit model checkpoint and its config. Per the model card, only the 16-bit version is provided in this release, with no quality loss versus the previous 32-bit release. Place both inside a single subfolder of models/:

mkdir -p models/Foundation-1
cd models/Foundation-1
curl -L -o Foundation_1.safetensors \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/Foundation_1.safetensors
curl -L -o model_config.json \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/model_config.json
cd ../..

The safetensors file is 2.43 GB (HF Files tab; 2,426,992,388 bytes).
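
A truncated download is easy to miss until the loader fails, so it can help to compare the file's size on disk with the byte count quoted above. This is a minimal sketch under the assumption that the checkpoint sits at the path used in step 5; `size_matches` and `describe` are hypothetical helper names, and a size match is only a sanity check, not a cryptographic verification.

```python
"""Compare a downloaded file's size with the byte count listed on the HF Files tab."""
import os

EXPECTED_BYTES = 2_426_992_388  # size quoted above for Foundation_1.safetensors


def size_matches(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file at `path` is exactly `expected_bytes` long."""
    return os.path.getsize(path) == expected_bytes


def describe(n_bytes):
    """Format a byte count in decimal GB and binary GiB."""
    return f"{n_bytes / 1e9:.2f} GB ({n_bytes / 2**30:.2f} GiB)"


if __name__ == "__main__":
    ckpt = "models/Foundation-1/Foundation_1.safetensors"
    if os.path.exists(ckpt):
        print("size ok:", size_matches(ckpt), "-", describe(os.path.getsize(ckpt)))
    else:
        print("checkpoint not found at", ckpt)
```

The GB/GiB split explains why a file manager that reports binary units may show about 2.26 for the same file.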

Running

Launch the Gradio UI, pointing at the Foundation-1 checkpoint and config you just downloaded:

python run_gradio.py \
  --model-config models/Foundation-1/model_config.json \
  --ckpt-path models/Foundation-1/Foundation_1.safetensors

The Gradio interface opens in your browser. Foundation-1 uses a layered tag prompt schema documented on its model card:

[Instrument Family / Sub-Family], [Timbre], [Musical Behavior / Notation], [FX], [Key], [Bars], [BPM]

A working example prompt from the card:

Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Sub Bass,
Bass, Upper Mids, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich,
Overdriven, Crisp, Deep, Clean, Pitch Bend, 303, 8 Bars, 140 BPM, E minor
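
If you generate many prompts, assembling them programmatically keeps the tag order consistent. The helper below is a hypothetical sketch, not an API from the RC fork: `build_prompt` follows the schema line above (key before bars and BPM), whereas the card's example places the key last, and which order the model prefers is not verified here. The grouping of example tags into timbre, behavior, and FX is likewise an assumption. The allowed bar and BPM values are the ones this guide lists as supported.

```python
"""Assemble a Foundation-1 style tag prompt and validate its bar/BPM tags."""

SUPPORTED_BARS = (4, 8)
SUPPORTED_BPMS = (100, 110, 120, 128, 130, 140, 150)


def build_prompt(instrument, timbres, behavior, fx, key, bars, bpm):
    """Join tags in schema order: instrument, timbre, behavior, FX, key, bars, BPM.

    `instrument`, `timbres`, `behavior`, and `fx` are lists of tag strings.
    """
    if bars not in SUPPORTED_BARS:
        raise ValueError(f"bars must be one of {SUPPORTED_BARS}, got {bars}")
    if bpm not in SUPPORTED_BPMS:
        raise ValueError(f"bpm must be one of {SUPPORTED_BPMS}, got {bpm}")
    tags = [*instrument, *timbres, *behavior, *fx, key, f"{bars} Bars", f"{bpm} BPM"]
    return ", ".join(tags)


if __name__ == "__main__":
    print(build_prompt(
        instrument=["Bass", "FM Bass"],
        timbres=["Gritty", "Warm"],
        behavior=["Pitch Bend"],
        fx=["Medium Reverb", "Low Distortion"],
        key="E minor",
        bars=8,
        bpm=140,
    ))
```

Rejecting unsupported bar counts and tempos up front is a design choice: it surfaces a typo before a generation run is spent on it.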

Supported loop structures: 4 or 8 bars; supported BPMs: 100, 110, 120, 128, 130, 140, 150. The RC fork's BPM/bar selector locks generation duration to the prompt's musical structure automatically. The card's example is an 8-bar loop at 100 BPM, which works out to roughly 19 seconds of output (in 4/4 time, 32 beats at 0.6 seconds each is 19.2 seconds). The underlying Stable Audio Open base outputs 44.1 kHz stereo; Foundation-1 is constrained to the bar/BPM grid above.
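
The bar/BPM arithmetic behind that 19-second figure can be sketched as follows. The sketch assumes 4/4 time (four beats per bar), which this guide does not state explicitly but which reproduces the card's example; `loop_seconds` and `loop_samples` are hypothetical helper names, not functions from the RC fork.

```python
"""Convert a bar count and tempo into loop length in seconds and samples."""

SAMPLE_RATE = 44_100  # Stable Audio Open output rate, per the text above


def loop_seconds(bars, bpm, beats_per_bar=4):
    """Length of `bars` bars at `bpm`, assuming `beats_per_bar` beats per bar."""
    return bars * beats_per_bar * 60.0 / bpm


def loop_samples(bars, bpm, beats_per_bar=4, sample_rate=SAMPLE_RATE):
    """Same length expressed as a whole number of samples per channel."""
    return round(loop_seconds(bars, bpm, beats_per_bar) * sample_rate)


if __name__ == "__main__":
    # Print the 8-bar loop length for each supported tempo.
    for bpm in (100, 110, 120, 128, 130, 140, 150):
        print(f"8 bars @ {bpm} BPM: {loop_seconds(8, bpm):.2f} s")
```

For the card's example, `loop_seconds(8, 100)` gives 19.2 seconds, which at 44.1 kHz is 846,720 samples per channel.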

Results

Speed: No RTX 5070 measurement has been published yet, so no firm number is quoted here. The model card reports generation taking approximately 7–8 seconds per sample on an RTX 3090 (Ampere, 24 GB). That is a different and older architecture than the Blackwell RTX 5070, so the figure does not transfer directly and is given here only as loose context. Help close the gap: run it and contribute a measurement via /contribute, and check /check/foundation-1/rtx-5070 for live benchmark data as it lands.
VRAM usage: ~7 GB during generation per the HF card, which recommends a GPU with at least 8 GB of VRAM for reliable operation. On the RTX 5070's 12 GB, the ~7 GB peak leaves a few GB of working headroom even after the desktop and display take their share of VRAM, so you won't be juggling memory the way 8 GB cards must.
Output: stereo .wav loops aligned to the requested bar count and BPM. Per the model card limitations, percussion and drum sounds are out of scope for this release; the 10 instrument families covered are Synth, Keys, Bass, Bowed Strings, Mallet, Wind, Guitar, Brass, Vocal, and Plucked Strings.

For the full benchmark data, see /check/foundation-1/rtx-5070.
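
If you plan to contribute a measurement, a small timing harness makes the numbers repeatable. This is a generic sketch, not tooling from the RC fork or this site: `benchmark` and `summarize` are hypothetical helpers, and `generate` stands for whatever callable runs one generation in your setup. Passing `torch.cuda.synchronize` as `sync` is an assumption about how you would wire it to a GPU workload.

```python
"""Time repeated calls to a generation function and summarize the results."""
import statistics
import time


def benchmark(generate, runs=5, warmup=1, sync=None):
    """Return per-run wall-clock seconds for `generate()`.

    `sync` is an optional callable (for example torch.cuda.synchronize) invoked
    before each clock read so queued GPU work is included in the timing.
    """
    for _ in range(warmup):
        generate()  # warm-up runs are not timed
    timings = []
    for _ in range(runs):
        if sync:
            sync()
        start = time.perf_counter()
        generate()
        if sync:
            sync()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report median, min, and max; the median resists a single slow run."""
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call into your own generation code.
    print(summarize(benchmark(lambda: time.sleep(0.01), runs=3)))
```

For peak memory alongside timing, `torch.cuda.max_memory_allocated()` reports the peak bytes allocated by PyTorch tensors, which may read lower than the total VRAM figure shown by driver tools.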

Troubleshooting

`RuntimeError: CUDA error: no kernel image is available for execution on the device`

You installed the default pip install torch wheel (or the RC README's whl/cu121 line) — neither ships sm_120 kernels for the RTX 5070's Blackwell (GB205) architecture. Reinstall from the cu128 channel per step 4 above, then verify:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

This should print True NVIDIA GeForce RTX 5070 (or similar).

Dependency resolution failures on Python 3.11+

The RC fork's README explicitly notes that 3.11+ can fail dependency resolution due to pinned packages (notably older SciPy wheels). Use a Python 3.10 venv as in step 2.

Prefer ComfyUI over Gradio

Two community ComfyUI custom nodes exist:

Saganaki22/ComfyUI-Foundation-1: downloads weights into ComfyUI/models/stable_audio/Foundation-1/ and ships example workflows. Install via ComfyUI Manager (recommended), or git clone into ComfyUI/custom_nodes/ and then run python install.py. The install script uses pip install stable-audio-tools --no-deps to protect your ComfyUI environment from the upstream's pandas==2.0.2 pin (which has no Python 3.13 wheel and breaks the install).
SanDiegoDude/scg_Foundation-1-comfyUI: install via ComfyUI Manager (recommended). The SCG Foundation-1 Loader node downloads the model on first use and caches it at ComfyUI/models/audio_checkpoints/Foundation-1/.

The same ~7 GB VRAM envelope and 8 GB minimum apply regardless of front-end. On the RTX 5070's 12 GB, both nodes have headroom.

Optional: INT4 / Low-VRAM Mode (TorchAO)

You don't need this on a 12 GB card — the default 16-bit path fits comfortably at ~7 GB peak — but the RC fork ships an optional INT4 weight-only mode (via TorchAO) you can enable if you want to run Foundation-1 alongside a much larger model. Install:

# Windows (pinned, recommended)
pip install torchao==0.12.0
# Linux
pip install torchao

The fork warns that INT4 can be very slow on Windows because the Triton fast-kernels are usually unavailable and it falls back to slower paths. If TorchAO isn't installed or isn't compatible with your environment, the INT4 toggle stays hidden/disabled in the UI.

Prompt produces drift or incoherent phrases

Per the model card's Limitations section, if the generation duration is shorter than the musical structure implied by the prompt (e.g. requesting an 8-bar loop but generating only 5 seconds), the model may produce less coherent musical phrases. The RC fork handles this alignment automatically — if you're using bare stable-audio-tools or a third-party UI, set the audio duration to match the prompt's bar/BPM structure. Also: keep prompts in the documented tag order, use 1–3 timbre descriptors, and always include both Bars and BPM.
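
When driving the model outside the RC fork, a pre-flight check can catch a duration that is too short for the prompt. The sketch below is a hypothetical helper pair, not part of stable-audio-tools: it reads the Bars and BPM tags from the prompt text with regular expressions and assumes 4/4 time when computing the required length.

```python
"""Check that a planned duration covers the bars/BPM named in a tag prompt."""
import re


def parse_bars_bpm(prompt):
    """Extract (bars, bpm) from tags like '8 Bars' and '140 BPM'."""
    bars = re.search(r"\b(\d+)\s*Bars?\b", prompt, flags=re.IGNORECASE)
    bpm = re.search(r"\b(\d+)\s*BPM\b", prompt, flags=re.IGNORECASE)
    if not bars or not bpm:
        raise ValueError("prompt must include both a Bars tag and a BPM tag")
    return int(bars.group(1)), int(bpm.group(1))


def duration_covers_prompt(prompt, seconds, beats_per_bar=4):
    """True if `seconds` is at least the loop length implied by the prompt."""
    bars, bpm = parse_bars_bpm(prompt)
    needed = bars * beats_per_bar * 60.0 / bpm
    return seconds >= needed


if __name__ == "__main__":
    prompt = "Bass, FM Bass, Gritty, 303, 8 Bars, 140 BPM, E minor"
    print(parse_bars_bpm(prompt))
    print(duration_covers_prompt(prompt, 5.0))
```

With the prompt above, a 5-second duration fails the check because 8 bars at 140 BPM need roughly 13.7 seconds.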

Percussion or drum prompts produce garbage

By design. The card lists percussion and drum sounds as explicitly out of scope for this release. Use a different tool (e.g. a drum sample library or a percussion-specific model) for drum loops.

No widely reported issues on RTX 5070 specifically. If you hit one, report it via the submission form.