
Foundation-1 on RTX 4060 Ti 16GB: Structured Music Sample Generation

tts · intermediate · 8GB+ VRAM · May 21, 2026
prerequisites
  • NVIDIA RTX 4060 Ti 16GB or any GPU with ≥ 8 GB VRAM (model card recommends 8 GB minimum)
  • Python 3.10 (3.11+ may fail dependency resolution per the RC fork README)
  • Git, ~3 GB free disk for weights + dependencies

What You'll Build

A local, offline pipeline that turns structured tag prompts (instrument → timbre → musical behavior → FX → key → bars → BPM) into tempo-synced, bar-aligned music loops on your RTX 4060 Ti 16GB. Foundation-1 is a fine-tune of stabilityai/stable-audio-open-1.0 trained for music-production workflows; the RC Stable Audio Tools fork handles the BPM/bar timing alignment automatically.

Hardware data: RTX 4060 Ti 16GB · ~7 GB VRAM during generation per the HuggingFace model card (~9 GB headroom on a 16 GB card) · See benchmark data

ℹ️ Not a text-to-speech model. Foundation-1 is in our tts vertical because the catalogue groups all audio models together, but it generates one-shot music samples — bar-locked instrumental loops — not speech. It does not synthesize voices, words, or any spoken audio. For speech synthesis on this GPU, see Kokoro, VoxCPM, or Qwen3-TTS instead. Per its own HuggingFace card: "Foundation-1 is a specialized model for music sample generation, not a general-purpose music generator."

⚠️ Licensing — read before shipping. Foundation-1 weights are released under the Stability AI Community License. The HuggingFace card states the model "is available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M." If you're a hobbyist or under the $1M threshold you're clear; otherwise contact Stability AI for a commercial license before publishing or selling outputs. (The RC fork code is MIT; the constraint is on the model weights.)

Requirements

| Component | Minimum | Tested |
|-----------|---------|--------|
| GPU | 8 GB VRAM (per HF card) | RTX 4060 Ti 16GB (9 GB headroom) |
| RAM | 16 GB system RAM | |
| Storage | ~3 GB (2.43 GB weights + venv + deps) | |
| Python | 3.10 (3.11+ may fail SciPy resolution per the RC fork README) | |
| PyTorch | 2.4+ with CUDA wheel (default pip install torch is fine on Ada Lovelace sm_89) | |
| Software | RC Stable Audio Tools fork or ComfyUI custom node | |

Installation

This recipe follows the canonical workflow recommended on the Foundation-1 model card — the RC Stable Audio Tools fork, which auto-handles BPM/bar timing alignment. For a ComfyUI alternative, see Troubleshooting.

1. Clone the RC Stable Audio Tools fork

git clone https://github.com/RoyalCities/RC-stable-audio-tools.git
cd RC-stable-audio-tools

2. Create a Python 3.10 virtual environment

Linux / macOS:

python3.10 -m venv venv
source venv/bin/activate

Windows:

py -3.10 -m venv venv
venv\Scripts\activate

3. Install stable-audio-tools and the fork

pip install stable-audio-tools
pip install .

4. (Windows only) Force a CUDA torch wheel

Windows venvs sometimes resolve to the CPU-only torch wheel, which makes Gradio silently fall back to CPU. If python -c "import torch; print(torch.cuda.is_available())" prints False, reinstall torch from the CUDA 12.1 channel, following the RC fork's documented Windows path. On Linux, the default pip install torch already resolves to a CUDA build. The 4060 Ti 16GB is Ada Lovelace (sm_89), so no special Blackwell/cu128 wheel is needed:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

5. Download Foundation-1 weights

Place both files inside a single subfolder of models/:

mkdir -p models/Foundation-1
cd models/Foundation-1
curl -L -o Foundation_1.safetensors \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/Foundation_1.safetensors
curl -L -o model_config.json \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/model_config.json
cd ../..

The safetensors file is 2.43 GB (HF Files tab).
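A truncated download is easy to miss, so a quick pre-flight check can help before launching. The sketch below only confirms that both files sit in the same folder and that the weights file is roughly the 2.43 GB shown on the HF Files tab. The helper name check_model_dir, the 10% tolerance, and the reading of "GB" as decimal gigabytes are assumptions made for illustration, not part of the RC fork.

```python
from pathlib import Path

# File names come from the download step above; the size is the 2.43 GB
# listed on the HF Files tab. Tolerance and decimal-GB unit are assumptions.
REQUIRED_FILES = ("Foundation_1.safetensors", "model_config.json")
EXPECTED_WEIGHTS_GB = 2.43


def check_model_dir(model_dir, tolerance=0.10):
    """Return a list of human-readable problems; an empty list means OK."""
    model_dir = Path(model_dir)
    problems = []
    for name in REQUIRED_FILES:
        if not (model_dir / name).is_file():
            problems.append(f"missing file: {name}")
    weights = model_dir / REQUIRED_FILES[0]
    if weights.is_file():
        size_gb = weights.stat().st_size / 1e9  # bytes to decimal GB
        if abs(size_gb - EXPECTED_WEIGHTS_GB) > tolerance * EXPECTED_WEIGHTS_GB:
            problems.append(
                f"{weights.name} is {size_gb:.2f} GB, expected about "
                f"{EXPECTED_WEIGHTS_GB} GB (possibly a truncated download)"
            )
    return problems


if __name__ == "__main__":
    # Run from the RC-stable-audio-tools checkout root.
    for problem in check_model_dir("models/Foundation-1"):
        print("WARNING:", problem)
```

An empty result only means the files look plausible; it does not verify checksums.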

Running

Launch the Gradio UI, pointing at the Foundation-1 checkpoint and config you just downloaded:

python run_gradio.py \
  --model-config models/Foundation-1/model_config.json \
  --ckpt-path models/Foundation-1/Foundation_1.safetensors

The Gradio interface opens in your browser. Foundation-1 uses a layered tag prompt schema documented on its model card:

[Instrument Family / Sub-Family], [Timbre], [Musical Behavior / Notation], [FX], [Key], [Bars], [BPM]

A working example prompt from the card:

Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Sub Bass,
Bass, Upper Mids, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich,
Overdriven, Crisp, Deep, Clean, Pitch Bend, 303, 8 Bars, 140 BPM, E minor

Supported loop structures: 4 or 8 bars; supported BPMs: 100, 110, 120, 128, 130, 140, 150. The RC fork's BPM/bar selector automatically locks generation duration to the prompt's musical structure: an 8-bar loop at 100 BPM, for example, is 19.2 seconds of output (8 bars × 4 beats × 60 / 100). The underlying Stable Audio Open base outputs 44.1 kHz stereo up to 47 seconds; Foundation-1 is constrained to the bar/BPM grid above.
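If you generate many prompts, it can help to assemble them programmatically so the bar and BPM values always stay on the supported grid. The following is a minimal sketch, not part of the RC fork: build_prompt is a hypothetical helper that joins tag groups in the order of the schema line above and rejects unsupported bar counts or tempos. Note that the card's own example orders some tags differently (key last, FX before timbre), so treat the ordering here as a convention rather than a hard requirement.

```python
# Values taken from the supported grid listed above.
SUPPORTED_BARS = (4, 8)
SUPPORTED_BPMS = (100, 110, 120, 128, 130, 140, 150)


def build_prompt(instrument, timbre, behavior, fx, key, bars, bpm):
    """Join tag groups in schema order and validate bars/BPM.

    instrument, timbre, behavior and fx are lists of tag strings;
    key is a single string such as "E minor".
    """
    if bars not in SUPPORTED_BARS:
        raise ValueError(f"bars must be one of {SUPPORTED_BARS}, got {bars}")
    if bpm not in SUPPORTED_BPMS:
        raise ValueError(f"bpm must be one of {SUPPORTED_BPMS}, got {bpm}")
    tags = [*instrument, *timbre, *behavior, *fx, key,
            f"{bars} Bars", f"{bpm} BPM"]
    return ", ".join(tags)


if __name__ == "__main__":
    print(build_prompt(
        instrument=["Bass", "FM Bass"],
        timbre=["Warm", "Gritty"],
        behavior=["Pitch Bend"],
        fx=["Medium Reverb", "Low Distortion"],
        key="E minor",
        bars=8,
        bpm=140,
    ))
```

The returned string can be pasted into the Gradio prompt box as-is.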

Results

  • Speed: The model card reports generation takes "approximately ~7–8 seconds per sample" on an RTX 3090 (Ampere, 24 GB, 936 GB/s memory bandwidth). No measured figure exists yet for the RTX 4060 Ti 16GB (Ada Lovelace sm_89, 16 GB, 288 GB/s), which has ~31% of the 3090's memory bandwidth and is meaningfully slower on memory-bound workloads. If generation scaled purely with memory bandwidth, that ratio would put it at roughly 23–26 seconds per sample; treat this as an unmeasured rough estimate and plan on longer per-sample times than the 3090. Live benchmark data will appear at /check/foundation-1/rtx-4060-ti-16gb once a community benchmark lands.
  • VRAM usage: ~7 GB during generation per the HF card ("Typical VRAM usage during generation is approximately ~7 GB. For reliable operation, a GPU with at least 8 GB of VRAM is recommended"). The 4060 Ti 16GB's 9 GB of headroom is comfortable — you can keep a browser, a DAW with GPU plugins, or a second small inference job running alongside generation without juggling memory the way 8 GB cards must.
  • Output: mono/stereo .wav loops aligned to the requested bar count and BPM. Per the model card limitations, percussion and drum sounds are out of scope for this release; the 10 instrument families covered are Synth, Keys, Bass, Bowed Strings, Mallet, Wind, Guitar, Brass, Vocal, and Plucked Strings.
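The bandwidth-based speed estimate in the Speed bullet can be reproduced with simple arithmetic. The sketch below assumes generation time scales inversely with memory bandwidth, which is a simplification (real workloads are also limited by compute), so the result is only a rough envelope and not a measurement. The function name is hypothetical.

```python
# Figures quoted above: RTX 3090 at 936 GB/s, RTX 4060 Ti 16GB at 288 GB/s,
# and 7-8 seconds per sample on the 3090 per the model card.
RTX_3090_GBS = 936.0
RTX_4060_TI_GBS = 288.0


def scale_by_bandwidth(ref_seconds, ref_gbs, target_gbs):
    """Scale a reference time assuming a purely bandwidth-bound workload."""
    return ref_seconds * ref_gbs / target_gbs


if __name__ == "__main__":
    ratio = RTX_4060_TI_GBS / RTX_3090_GBS
    low = scale_by_bandwidth(7.0, RTX_3090_GBS, RTX_4060_TI_GBS)
    high = scale_by_bandwidth(8.0, RTX_3090_GBS, RTX_4060_TI_GBS)
    print(f"bandwidth ratio: {ratio:.0%}")
    print(f"estimated seconds per sample: {low:.1f} to {high:.1f}")
```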

For the full benchmark data, see /check/foundation-1/rtx-4060-ti-16gb.

Troubleshooting

Gradio launches but reports torch.cuda.is_available() == False

Either you didn't activate the venv before launching, or pip install torch resolved to the CPU wheel (Windows is the common culprit). Re-run step 4 to force the CUDA 12.1 channel, then verify:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

This should print True NVIDIA GeForce RTX 4060 Ti (or similar).
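For a slightly fuller picture than the one-liner, the sketch below collects the torch version, the CUDA version the wheel was built against, and the detected GPU's name, VRAM, and compute capability. It uses only standard torch.cuda calls; the cuda_report helper itself is a hypothetical name, and it returns early if torch is not installed in the active environment.

```python
def cuda_report():
    """Return a dict describing the torch/CUDA setup of the active venv."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["device_name"] = props.name
        report["total_vram_gib"] = round(props.total_memory / 1024**3, 1)
        report["compute_capability"] = f"{props.major}.{props.minor}"
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A cuda_build of None together with cuda_available False points to the CPU-only wheel described above; on a 4060 Ti the compute capability should read 8.9.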

Dependency resolution failures on Python 3.11+

The RC fork's README explicitly notes 3.11+ "can fail dependency resolution due to pinned packages (notably older SciPy wheels)." Use a Python 3.10 venv as in step 2.
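To catch a wrong interpreter before pip spends time resolving dependencies, a small guard can be run inside the venv first. This is a sketch under the assumption that only 3.10 is known-good, as the recipe states; python_ok is a hypothetical helper, not something the fork provides.

```python
import sys


def python_ok(version_info=sys.version_info):
    """Return True only for Python 3.10.x, the version this recipe uses."""
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if python_ok():
        print(f"Python {found}: OK for this recipe")
    else:
        print(f"Python {found}: recreate the venv with Python 3.10")
```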

Prefer ComfyUI over Gradio

Two community ComfyUI custom nodes exist:

  • Saganaki22/ComfyUI-Foundation-1: auto-downloads weights into ComfyUI/models/stable_audio/Foundation-1/ and ships example workflows. Install via ComfyUI Manager (recommended), or git clone into ComfyUI/custom_nodes/ and then run python install.py. The install script uses pip install stable-audio-tools --no-deps to protect your ComfyUI environment from upstream's strict pandas==2.0.2 pin; that pandas version has no Python 3.13 wheel, which is the primary reason for the --no-deps workaround.
  • SanDiegoDude/scg_Foundation-1-comfyUI: installing via ComfyUI Manager is recommended; weights land at ComfyUI/models/audio_checkpoints/Foundation-1/.

Same ~7 GB VRAM envelope and 8 GB minimum apply regardless of front-end. On the 4060 Ti 16GB, both nodes have plenty of headroom.

Want even more VRAM headroom? Enable INT4 / Low-VRAM Mode (TorchAO)

You don't need this on a 16 GB card — the default BF16 path fits comfortably with ~9 GB of free VRAM — but the RC fork ships an optional INT4 weight-only mode you can use if you want to run Foundation-1 alongside a much larger model. Install:

# Windows (pinned, recommended)
pip install torchao==0.12.0
# Linux
pip install torchao

The fork notes INT4 can be "very slow on Windows because Triton fast-kernels are usually unavailable (falls back to slower paths)." If TorchAO isn't installed, the INT4 toggle stays hidden in the UI.
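If the INT4 toggle does not appear, it is worth confirming that TorchAO is actually importable from the venv that launches Gradio. The sketch below uses only the Python standard library; torchao_version is a hypothetical helper name, and how the fork itself detects TorchAO is not documented here, so this is an independent check rather than a mirror of its logic.

```python
from importlib import metadata, util


def torchao_version():
    """Return the installed torchao version string, or None if absent."""
    if util.find_spec("torchao") is None:
        return None
    try:
        return metadata.version("torchao")
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a source checkout).
        return "unknown"


if __name__ == "__main__":
    version = torchao_version()
    if version is None:
        print("torchao not found in this environment")
    else:
        print(f"torchao {version} is installed")
```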

Prompt produces drift or incoherent phrases

Per the model card's Limitations section, output coherence degrades if the generation duration doesn't match the prompt's bar/BPM structure (e.g. requesting an 8-bar loop but capping output at 5 seconds). The RC fork handles this alignment automatically. If you're using bare stable-audio-tools or a third-party UI, set the audio duration manually to bars × 4 × (60 / BPM) seconds, where 4 is the number of beats per bar. Also keep prompts in the documented tag order, use 1–3 timbre descriptors, and always include both Bars and BPM.
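For front-ends that need the duration set by hand, the formula can be wrapped in a small helper. This is a minimal sketch assuming four beats per bar, as the formula implies, and the 44.1 kHz sample rate of the Stable Audio Open base; both function names are hypothetical.

```python
def loop_duration_seconds(bars, bpm, beats_per_bar=4):
    """Duration of a loop: bars x beats per bar x seconds per beat."""
    return bars * beats_per_bar * 60.0 / bpm


def loop_sample_count(bars, bpm, sample_rate=44100, beats_per_bar=4):
    """Number of audio frames for the loop at the given sample rate."""
    return round(loop_duration_seconds(bars, bpm, beats_per_bar) * sample_rate)


if __name__ == "__main__":
    # Print the whole supported grid so values can be copied into a UI.
    for bars in (4, 8):
        for bpm in (100, 110, 120, 128, 130, 140, 150):
            seconds = loop_duration_seconds(bars, bpm)
            print(f"{bars} bars @ {bpm} BPM: {seconds:.3f} s")
```

The longest supported combination, 8 bars at 100 BPM, comes to 19.2 seconds, well inside the base model's 47-second ceiling.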

Percussion or drum prompts produce garbage

By design. The card lists percussion as explicitly out-of-scope for this release. Use a different tool (e.g. a drum sample library or a percussion-specific model) for drum loops.

No widely-reported issues on RTX 4060 Ti 16GB specifically — if you hit one, report it via the submission form.