What You'll Build
A local, offline pipeline that turns structured tag prompts (instrument → timbre → FX → key → bars → BPM) into tempo-synced, bar-aligned music loops on your RTX 5060 Ti. Foundation-1 is a fine-tune of stabilityai/stable-audio-open-1.0 trained for music-production workflows; the RC Stable Audio Tools fork handles the BPM/bar timing alignment automatically.
Hardware data: RTX 5060 Ti (16 GB VRAM) · ~7 GB VRAM during generation per the HuggingFace model card · See benchmark data
ℹ️ Not a text-to-speech model. Foundation-1 is in our tts vertical because the catalogue groups all audio models together, but it generates one-shot music samples (bar-locked instrumental loops), not speech. It does not synthesize voices, words, or any spoken audio. For speech synthesis on this GPU, see Kokoro, VoxCPM, or Qwen3-TTS instead. Per its own HuggingFace card: "specialized model for music sample generation, not a general-purpose music generator."
⚠️ Licensing — read before shipping. Foundation-1 weights are released under the Stability AI Community License. The HuggingFace card states the model "is available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M. For revenues exceeding USD $1M, please refer to the repository license file for full terms." If you're a hobbyist or under the $1M threshold you're clear; otherwise contact Stability AI for a commercial license before publishing or selling outputs. (The RC fork code is MIT; the constraint is on the model weights.)
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM (per HF card) | RTX 5060 Ti (16 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~3 GB (2.43 GB weights + venv + deps) | — |
| Python | 3.10 (3.11+ may fail SciPy resolution) | — |
| PyTorch | 2.7+ with CUDA 12.8 wheel for Blackwell (sm_120) | — |
| Software | RC Stable Audio Tools fork or ComfyUI custom node | — |
Installation
This recipe follows the canonical workflow recommended on the Foundation-1 model card — the RC Stable Audio Tools fork, which auto-handles BPM/bar timing alignment. For a ComfyUI alternative, see Troubleshooting.
1. Clone the RC Stable Audio Tools fork
git clone https://github.com/RoyalCities/RC-stable-audio-tools.git
cd RC-stable-audio-tools
2. Create a Python 3.10 virtual environment
Linux / macOS:
python3.10 -m venv venv
source venv/bin/activate
Windows:
py -3.10 -m venv venv
venv\Scripts\activate
3. Install stable-audio-tools and the fork
pip install stable-audio-tools
pip install .
4. Replace torch with the Blackwell-compatible wheel
The RC fork's Windows install line pins CUDA 12.1 (whl/cu121). That wheel does not ship sm_120 kernels and will fail on the RTX 5060 Ti with a "no kernel image is available for execution on the device" error. Reinstall torch from the CUDA 12.8 channel (PyTorch Blackwell guidance):
pip uninstall -y torch torchvision torchaudio
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
(If a stable PyTorch 2.7+ build with cu128 is available for your platform, drop the --pre flag and point --index-url at the stable cu128 channel instead of the nightly URL.)
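To confirm the reinstall worked before downloading weights, a small check along these lines can compare the GPU's compute capability with the architectures compiled into the installed wheel. This is a sketch, not part of the RC fork: the helper names `wheel_supports` and `describe_install` are made up for illustration, while `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are standard PyTorch calls. It ignores PTX forward compatibility, so treat a "missing kernels" verdict as a strong hint rather than proof.

```python
def wheel_supports(arch_list, capability):
    """Return True if the wheel's compiled arch list contains the GPU's sm_XY target."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def describe_install():
    """Summarize the installed torch build and whether it targets the local GPU."""
    import torch  # imported lazily so wheel_supports() works without torch installed

    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA not available"
    archs = torch.cuda.get_arch_list()          # e.g. ['sm_80', ..., 'sm_120']
    cap = torch.cuda.get_device_capability(0)   # (major, minor) of GPU 0
    verdict = "ok" if wheel_supports(archs, cap) else "missing kernels"
    return (
        f"torch {torch.__version__} (CUDA {torch.version.cuda}), "
        f"device sm_{cap[0]}{cap[1]}: {verdict}"
    )
```

Inside the activated venv, `print(describe_install())` should report an sm_120 device with an "ok" verdict on the RTX 5060 Ti if the cu128 wheel is in place.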
5. Download Foundation-1 weights
Place both files inside a single subfolder of models/:
mkdir -p models/Foundation-1
cd models/Foundation-1
curl -L -o Foundation_1.safetensors \
https://huggingface.co/RoyalCities/Foundation-1/resolve/main/Foundation_1.safetensors
curl -L -o model_config.json \
https://huggingface.co/RoyalCities/Foundation-1/resolve/main/model_config.json
cd ../..
The safetensors file is 2.43 GB (HF Files tab).
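A failed or redirected download can leave an HTML error page or a truncated file under the expected name. The sketch below reads only the safetensors header to catch that early. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name `read_safetensors_header` is hypothetical and not part of any library.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Parse the JSON header of a .safetensors file without loading any tensors."""
    path = Path(path)
    size = path.stat().st_size
    with path.open("rb") as fh:
        prefix = fh.read(8)
        if len(prefix) < 8:
            raise ValueError(f"{path} is too small to be a safetensors file")
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > size - 8:
            raise ValueError(f"{path} looks truncated or is not a safetensors file")
        header = json.loads(fh.read(header_len).decode("utf-8"))
    return header, size
```

Calling `read_safetensors_header("models/Foundation-1/Foundation_1.safetensors")` from the repository root should return a header dictionary and a size in the region of the 2.43 GB listed on the Files tab; a `ValueError` suggests the download should be repeated.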
Running
Launch the Gradio UI, pointing at the Foundation-1 checkpoint and config you just downloaded:
python run_gradio.py \
--model-config models/Foundation-1/model_config.json \
--ckpt-path models/Foundation-1/Foundation_1.safetensors
The Gradio interface opens in your browser. Foundation-1 uses a layered tag prompt schema documented on its model card:
[Instrument Family / Sub-Family], [Timbre], [Musical Behavior / Notation], [FX], [Key], [Bars], [BPM]
A working example prompt from the card:
Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Sub Bass,
Bass, Upper Mids, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich,
Overdriven, Crisp, Deep, Clean, Pitch Bend, 303, 8 Bars, 140 BPM, E minor
Supported loop structures: 4 or 8 bars; supported BPMs: 100, 110, 120, 128, 130, 140, 150. The RC fork's BPM/bar selector locks generation duration to the prompt's musical structure automatically — for an 8-bar loop at 100 BPM that's roughly 19 seconds of output (model card). The underlying Stable Audio Open base outputs 44.1 kHz stereo up to 47 seconds; Foundation-1 is constrained to the bar/BPM grid above.
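If you generate prompts from a script rather than typing them into the UI, a helper like the following can keep the bar count and BPM on the supported grid. It is a minimal sketch: `build_prompt` is a hypothetical name, the supported values are copied from the paragraph above, and placing the key before bars and BPM follows the schema line rather than anything enforced by the model.

```python
SUPPORTED_BARS = (4, 8)
SUPPORTED_BPMS = (100, 110, 120, 128, 130, 140, 150)


def build_prompt(tags, key, bars, bpm):
    """Join descriptive tags with the key, bar count, and BPM into one prompt string."""
    if bars not in SUPPORTED_BARS:
        raise ValueError(f"bars must be one of {SUPPORTED_BARS}, got {bars}")
    if bpm not in SUPPORTED_BPMS:
        raise ValueError(f"bpm must be one of {SUPPORTED_BPMS}, got {bpm}")
    # Drop empty tags and stray whitespace, keep the caller's tag order.
    parts = [t.strip() for t in tags if t.strip()]
    parts += [key, f"{bars} Bars", f"{bpm} BPM"]
    return ", ".join(parts)
```

For instance, `build_prompt(["Bass", "FM Bass", "Acid"], "E minor", 8, 140)` yields `Bass, FM Bass, Acid, E minor, 8 Bars, 140 BPM`, while an off-grid tempo such as 125 raises a `ValueError` instead of silently producing a prompt the model was not trained on.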
Results
- Speed: The model card reports generation takes "approximately ~7–8 seconds per sample" on an RTX 3090. No comparable-GPU number has been published for the RTX 5060 Ti yet; check /check/foundation-1/rtx-5060-ti for live benchmark data as it lands.
- VRAM usage: ~7 GB during generation per the HF card ("Typical VRAM usage during generation is approximately ~7 GB. For reliable operation, a GPU with at least 8 GB of VRAM is recommended"). On the 16 GB 5060 Ti that leaves roughly 9 GB of headroom for parallel jobs or running alongside a DAW.
- Output: mono/stereo .wav loops aligned to the requested bar count and BPM. Per the model card limitations, percussion and drum sounds are out of scope for this release; the 10 instrument families covered are Synth, Keys, Bass, Bowed Strings, Mallet, Wind, Guitar, Brass, Vocal, and Plucked Strings.
For the full benchmark data, see /check/foundation-1/rtx-5060-ti.
Troubleshooting
RuntimeError: CUDA error: no kernel image is available for execution on the device
You installed the default pip install torch wheel (or the RC README's whl/cu121 line) — neither ships sm_120 kernels for Blackwell. Reinstall from whl/nightly/cu128 per step 4 above. Tracking: PyTorch Blackwell guidance.
Dependency resolution failures on Python 3.11+
The RC fork's README explicitly notes 3.11+ "can fail dependency resolution due to pinned SciPy wheels." Use a Python 3.10 venv as in step 2.
Prefer ComfyUI over Gradio
Two community ComfyUI custom nodes exist:
- Saganaki22/ComfyUI-Foundation-1: auto-downloads weights into ComfyUI/models/stable_audio/Foundation-1/ and ships example workflows. Install via git clone into ComfyUI/custom_nodes/, then run python install.py (which installs stable-audio-tools --no-deps to protect your ComfyUI environment from the upstream's aggressive pandas==2.0.2 / numpy==1.23 pins).
- SanDiegoDude/scg_Foundation-1-comfyUI: install via ComfyUI Manager (recommended); weights land at ComfyUI/models/audio_checkpoints/Foundation-1/.
Same ~7 GB VRAM and 8 GB minimum apply regardless of front-end.
Prompt produces drift or incoherent phrases
Per the model card's Limitations section, if generation duration doesn't match the prompt's bar/BPM structure (e.g. requesting an 8-bar loop but capping output at 5 seconds), output coherence degrades. The RC fork handles this alignment automatically. If you're using bare stable-audio-tools or a third-party UI, set the audio duration in seconds manually to bars × 4 × (60 / BPM), where 4 is the number of beats per bar in 4/4 time. Also: keep prompts in the documented tag order, use 1–3 timbre descriptors, and always include both Bars and BPM.
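The duration formula is simple enough to script when you drive generation outside the RC fork. The sketch below assumes 4/4 time, which is what the factor of 4 in the formula implies; `loop_seconds` is an illustrative name, not a stable-audio-tools function.

```python
def loop_seconds(bars, bpm, beats_per_bar=4):
    """Duration in seconds of `bars` bars at `bpm`, given `beats_per_bar` beats per bar."""
    if bars <= 0 or bpm <= 0:
        raise ValueError("bars and bpm must be positive")
    # One beat lasts 60 / bpm seconds; a loop has bars * beats_per_bar beats.
    return bars * beats_per_bar * 60.0 / bpm
```

An 8-bar loop at 100 BPM comes out at 19.2 seconds, in line with the "roughly 19 seconds" figure from the model card, and a 4-bar loop at 150 BPM at 6.4 seconds. Since 8 bars at 100 BPM is the longest combination on the supported grid, every supported loop fits well inside the base model's 47-second limit.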
Percussion or drum prompts produce garbage
By design. The card lists percussion as explicitly out-of-scope for this release. Use a different tool (e.g. a drum sample library or a percussion-specific model) for drum loops.
No widely-reported issues on RTX 5060 Ti specifically — if you hit one, report it via the submission form.