How much VRAM does Foundation-1 need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

Foundation-1 on RTX 3060 Ti: Structured Music Sample Generation at the 8 GB Floor

What You'll Build

A local, offline pipeline that turns structured tag prompts (instrument → timbre → FX → key → bars → BPM) into tempo-synced, bar-aligned music loops on your RTX 3060 Ti. Foundation-1 is a fine-tune of stabilityai/stable-audio-open-1.0 trained for music-production workflows; the RC Stable Audio Tools fork handles the BPM/bar timing alignment automatically.

Hardware data: RTX 3060 Ti (8 GB VRAM, Ampere GA104-200 sm_86, 256-bit GDDR6 ~448 GB/s — Wikipedia GeForce 30 series, ASUS Dual RTX 3060 Ti) · ~7 GB VRAM during generation per the HuggingFace model card · See benchmark data

ℹ️ Not a text-to-speech model. Foundation-1 sits in our tts vertical because the catalogue groups all audio models together, but it generates one-shot music samples — bar-locked instrumental loops — not speech. It does not synthesize voices, words, or any spoken audio. For speech synthesis on this GPU, see Kokoro, VoxCPM, or Qwen3-TTS instead. The model's own HuggingFace card states plainly in its Limitations section that this is a specialized model for music sample generation, not a general-purpose music generator.

⚠️ Tight VRAM — your 8 GB card sits at the model's recommended floor. The HuggingFace card states: "Typical VRAM usage during generation is approximately ~7 GB. For reliable operation, a GPU with at least 8 GB of VRAM is recommended." That leaves roughly 1 GB of headroom on the RTX 3060 Ti — enough to run, but expect to close other GPU-using apps (browser hardware acceleration, OBS, a DAW with GPU plugins) before generation. The RTX 3060 Ti's faster Ampere compute does not change this: VRAM usage is a function of the model's weights and activations, not the GPU's speed, so the ~7 GB envelope is identical to any other 8 GB card. If you hit OOM, the RC fork ships an optional INT4 low-VRAM mode via TorchAO — see Troubleshooting.

⚠️ Licensing — read before shipping. Foundation-1 weights are released under the Stability AI Community License. The HuggingFace card states the model is available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M; for revenues exceeding USD $1M, refer to the repository's LICENSE.md for full terms. If you're a hobbyist or under the $1M threshold you're clear; otherwise contact Stability AI for an enterprise license before publishing or selling outputs. The RC fork code is MIT; the constraint is on the model weights. The weights are not gated — there is no click-through access request on the HuggingFace page — but the license terms above still apply regardless.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (per HF card)	RTX 3060 Ti (8 GB) — at the floor
RAM	16 GB system RAM	—
Storage	~3 GB (2.43 GB weights + venv + deps)	—
Python	3.10 (3.11+ may fail SciPy resolution per the RC fork README)	—
PyTorch	2.4+ with a CUDA wheel (the default CUDA build already ships Ampere sm_86 kernels)	—
Software	RC Stable Audio Tools fork or ComfyUI custom node	—

Installation

This recipe follows the canonical workflow recommended on the Foundation-1 model card — the RC Stable Audio Tools fork, which auto-handles BPM/bar timing alignment. For a ComfyUI alternative, see Troubleshooting.

1. Clone the RC Stable Audio Tools fork

git clone https://github.com/RoyalCities/RC-stable-audio-tools.git
cd RC-stable-audio-tools

2. Create a Python 3.10 virtual environment

The RC fork README is explicit: use Python 3.10, because 3.11+ can fail dependency resolution on the fork's pinned packages (notably older SciPy wheels).

Linux / macOS:

python3.10 -m venv venv
source venv/bin/activate

Windows:

py -3.10 -m venv venv
venv\Scripts\activate
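Once the venv is active, a quick guard can confirm the interpreter really is 3.10 before you install anything. This is a minimal sketch: the function name `is_supported_python` is hypothetical, and the only fact it encodes is the 3.10 recommendation quoted above.

```python
import sys

# The RC fork README recommends Python 3.10; 3.11+ may fail dependency resolution.
SUPPORTED = (3, 10)


def is_supported_python(version_info=None):
    """Return True when the (major, minor) pair matches SUPPORTED."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == SUPPORTED


if __name__ == "__main__":
    found = "%d.%d" % tuple(sys.version_info[:2])
    if is_supported_python():
        print(f"Python {found}: OK")
    else:
        print(f"Python {found}: expected 3.10, recreate the venv with python3.10")
```

Run it with the venv's own `python` so it reports the interpreter that pip will actually install into.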

3. Install stable-audio-tools and the fork

pip install stable-audio-tools
pip install .

4. Confirm (or force) a CUDA torch wheel

The RTX 3060 Ti is Ampere (GA104-200, compute capability sm_86). On Linux, the default pip install torch resolves to a CUDA build that already includes the sm_86 kernels the 3060 Ti needs, so no special wheel selection is required; even an older cu121 wheel covers sm_86. (Blackwell RTX 50-series GPUs, by contrast, need a cu128 build for sm_120 kernels.) Windows venvs sometimes resolve to the CPU-only torch wheel, which makes Gradio fall back to CPU silently. If python -c "import torch; print(torch.cuda.is_available())" prints False, reinstall torch from the CUDA channel the RC fork documents:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Linux users can usually skip this step — pip install torch resolves to a CUDA build by default.

5. Download Foundation-1 weights

The release ships exactly two files: the 16-bit model checkpoint and its config. Per the model card, only the 16-bit version is provided in this release (prior releases bundled both a 32-bit and a 16-bit model). Place both inside a single subfolder of models/:

mkdir -p models/Foundation-1
cd models/Foundation-1
curl -L -o Foundation_1.safetensors \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/Foundation_1.safetensors
curl -L -o model_config.json \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/model_config.json
cd ../..

The safetensors file is 2.43 GB (2,426,992,388 bytes, HF Files tab).
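Because an interrupted curl download can leave a truncated file, it may be worth comparing the checkpoint's size against the byte count quoted above. The sketch below assumes the path used in this step; `size_matches` is a hypothetical helper name. A size match only rules out truncation, not corruption.

```python
import os

# Byte count quoted above from the HF Files tab.
EXPECTED_BYTES = 2_426_992_388


def size_matches(path, expected_bytes=EXPECTED_BYTES):
    """Return True when the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


if __name__ == "__main__":
    ckpt = os.path.join("models", "Foundation-1", "Foundation_1.safetensors")
    if size_matches(ckpt):
        print("checkpoint size OK")
    else:
        print("checkpoint missing or wrong size, re-download it")
```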

Running

Before launching, close other GPU-using apps (a browser with hardware acceleration, OBS, a DAW with GPU plugins) — at the 8 GB floor, idle GPU memory matters.

Launch the Gradio UI, pointing at the Foundation-1 checkpoint and config you just downloaded:

python run_gradio.py \
  --model-config models/Foundation-1/model_config.json \
  --ckpt-path models/Foundation-1/Foundation_1.safetensors

The Gradio interface opens in your browser. Foundation-1 uses a layered tag prompt schema documented on its model card:

[Instrument Family / Sub-Family], [Timbre], [Musical Behavior / Notation], [FX], [Key], [Bars], [BPM]

A working example prompt from the card's Audio Showcase:

Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Sub Bass,
Bass, Upper Mids, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich,
Overdriven, Crisp, Deep, Clean, Pitch Bend, 303, 8 Bars, 140 BPM, E minor

Supported loop structures: 4 or 8 bars; supported BPMs: 100, 110, 120, 128, 130, 140, 150. The RC fork's BPM/bar selector locks generation duration to the prompt's musical structure automatically — the card gives the example that an 8-bar loop at 100 BPM works out to roughly 19 seconds of output (model card). The underlying Stable Audio Open base outputs 44.1 kHz stereo; Foundation-1 is constrained to the bar/BPM grid above, and the RC fork auto-extracts a .MID from each generated sample.
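The duration arithmetic behind that example is simple enough to sketch. The snippet below assumes 4/4 time (four beats per bar), which is an assumption on our part but does reproduce the card's "8 bars at 100 BPM is roughly 19 seconds" figure. `loop_seconds` is a hypothetical helper, not part of the RC fork.

```python
SUPPORTED_BARS = (4, 8)
SUPPORTED_BPMS = (100, 110, 120, 128, 130, 140, 150)


def loop_seconds(bars, bpm, beats_per_bar=4):
    """Loop length in seconds: total beats times seconds per beat.

    beats_per_bar=4 (4/4 time) is an assumption, not taken from the model card.
    """
    if bars not in SUPPORTED_BARS:
        raise ValueError(f"bars must be one of {SUPPORTED_BARS}, got {bars}")
    if bpm not in SUPPORTED_BPMS:
        raise ValueError(f"bpm must be one of {SUPPORTED_BPMS}, got {bpm}")
    return bars * beats_per_bar * 60.0 / bpm


if __name__ == "__main__":
    for bpm in SUPPORTED_BPMS:
        print(f"8 bars at {bpm} BPM: {loop_seconds(8, bpm):.1f} s")
```

If you drive the model through bare stable-audio-tools or a third-party UI, this is the duration to set by hand; the RC fork does the equivalent for you.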

Results

Speed: No RTX 3060 Ti measurement has been published yet, so no firm number is quoted here. The model card reports roughly 7–8 seconds per sample on an RTX 3090, which shares the 3060 Ti's Ampere architecture (both sm_86) but is a far larger GPU: ~936 GB/s memory bandwidth (vs the 3060 Ti's ~448 GB/s) and 10,496 CUDA cores (vs 4,864). With roughly half the bandwidth and less than half the cores, the 3060 Ti should be expected to run slower than the 3090's 7–8 s; that figure is given here only as loose same-architecture context. Check /check/foundation-1/rtx-3060-ti for live benchmark data as it lands.
VRAM usage: ~7 GB during generation per the HF card ("Typical VRAM usage during generation is approximately ~7 GB. For reliable operation, a GPU with at least 8 GB of VRAM is recommended"). On the RTX 3060 Ti that leaves ~1 GB headroom — workable, but you cannot run another GPU-using app alongside generation without risking OOM. The 3060 Ti's faster Ampere compute does not buy back any VRAM here.
Output: stereo .wav loops aligned to the requested bar count and BPM. Per the model card limitations, percussion and drum sounds are out of scope for this release; the 10 instrument families covered are Synth, Keys, Bass, Bowed Strings, Mallet, Wind, Guitar, Brass, Vocal, and Plucked Strings.

For the full benchmark data, see /check/foundation-1/rtx-3060-ti.

Troubleshooting

`torch.cuda.OutOfMemoryError` during generation

Most common on 8 GB cards with a browser or DAW holding GPU memory. Fixes in order of effort:

Close other GPU-using apps. Run nvidia-smi before launching to see what's already resident — you want at least 7.5 GB free.
Disable hardware acceleration in your browser, and any GPU-accelerated plugins in your DAW, while generating.
Enable the RC fork's optional INT4 / Low-VRAM Mode (TorchAO) — documented in the RC fork README. Install:
```
# Windows (pinned, recommended)
pip install torchao==0.12.0
# Linux
pip install torchao
```
The fork warns INT4 can be "very slow on Windows because Triton fast-kernels are usually unavailable (falls back to slower paths)", so try it only if the first two fixes don't resolve the OOM. (This is a Triton-kernel-availability caveat, not an Ampere-specific one: sm_86 is fully supported by the standard CUDA path.) If TorchAO isn't installed or isn't compatible with your environment, the INT4 toggle stays hidden/disabled in the UI.
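The nvidia-smi check from the first fix can be scripted so you get a yes/no answer before launching. This is a sketch under stated assumptions: it reads the 7.5 GB target as 7680 MiB, relies on nvidia-smi's `--query-gpu=memory.free,memory.total` CSV output, and uses hypothetical helper names (`parse_free_mib`, `query_nvidia_smi`).

```python
import subprocess

# The 7.5 GB target above, read here as 7.5 * 1024 MiB (an assumption).
NEEDED_MIB = 7680


def parse_free_mib(csv_text):
    """Parse 'memory.free, memory.total' rows (MiB, no header, no units)."""
    rows = []
    for line in csv_text.strip().splitlines():
        free, total = (int(field) for field in line.split(","))
        rows.append((free, total))
    return rows


def query_nvidia_smi():
    """Return nvidia-smi's CSV output; needs the NVIDIA driver on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        rows = parse_free_mib(query_nvidia_smi())
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc:
        print(f"could not query nvidia-smi: {exc}")
    else:
        for index, (free, total) in enumerate(rows):
            verdict = "OK" if free >= NEEDED_MIB else "close other GPU apps first"
            print(f"GPU {index}: {free} / {total} MiB free, {verdict}")
```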

Gradio launches but reports `torch.cuda.is_available() == False`

Either you didn't activate the venv before launching, or pip install torch resolved to the CPU wheel (Windows is the common culprit). Re-run step 4 to force the CUDA channel, then verify:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

This should print True NVIDIA GeForce RTX 3060 Ti (or similar). Because the RTX 3060 Ti is Ampere (sm_86), no Blackwell-style cu128 wheel hunt is needed: the default CUDA build (and even an older cu121 build) already includes the right kernels.

Dependency resolution failures on Python 3.11+

The RC fork's README explicitly notes that 3.11+ can fail dependency resolution due to pinned packages (notably older SciPy wheels). Use a Python 3.10 venv as in step 2.

Prefer ComfyUI over Gradio

Two community ComfyUI custom nodes exist:

Saganaki22/ComfyUI-Foundation-1 — auto-downloads weights into ComfyUI/models/stable_audio/Foundation-1/ on first use and ships example workflows. Install via ComfyUI Manager (recommended) or git clone into ComfyUI/custom_nodes/ then python install.py. The install script uses pip install stable-audio-tools --no-deps to protect your ComfyUI environment from the upstream's pandas==2.0.2 pin.
SanDiegoDude/scg_Foundation-1-comfyUI — install via ComfyUI Manager (recommended); the loader node downloads the model on first use and caches it at ComfyUI/models/audio_checkpoints/Foundation-1/.

The same ~7 GB VRAM envelope and 8 GB minimum apply regardless of front-end — the 3060 Ti is still at the floor.

Prompt produces drift or incoherent phrases

Per the model card's Limitations section, if the generation duration is shorter than the musical structure implied by the prompt (for example requesting an 8-bar loop but generating only 5 seconds), the model may produce less coherent musical phrases. The RC fork handles this alignment automatically — if you're using bare stable-audio-tools or a third-party UI, set the audio duration to match the prompt's bar/BPM structure. Also: keep prompts in the documented tag order, use 1–3 timbre descriptors, and always include both Bars and BPM.
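The prompt rules above (documented tag order, 1–3 timbre descriptors, Bars and BPM always present) can be enforced with a small builder. This is an illustrative sketch, not part of the RC fork: `build_prompt` is a hypothetical helper, and which tag belongs in which slot in the usage example is our own reading of the schema.

```python
SUPPORTED_BARS = (4, 8)
SUPPORTED_BPMS = (100, 110, 120, 128, 130, 140, 150)


def build_prompt(instrument, timbres, behavior, fx, key, bars, bpm):
    """Join tags in the documented order:
    instrument, timbre, behavior, FX, key, bars, BPM.

    instrument, timbres, behavior and fx are lists of tag strings.
    """
    if not instrument:
        raise ValueError("at least one instrument tag is required")
    if not 1 <= len(timbres) <= 3:
        raise ValueError("use 1 to 3 timbre descriptors")
    if bars not in SUPPORTED_BARS:
        raise ValueError(f"bars must be one of {SUPPORTED_BARS}")
    if bpm not in SUPPORTED_BPMS:
        raise ValueError(f"bpm must be one of {SUPPORTED_BPMS}")
    parts = [*instrument, *timbres, *behavior, *fx, key,
             f"{bars} Bars", f"{bpm} BPM"]
    return ", ".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        instrument=["Bass", "FM Bass"],
        timbres=["Gritty", "Warm"],
        behavior=["Pitch Bend"],
        fx=["Medium Reverb"],
        key="E minor",
        bars=8,
        bpm=140,
    ))
```

Paste the resulting string into the Gradio prompt box; rejecting unsupported bar counts and BPMs up front avoids prompts the model was not trained on.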

Percussion or drum prompts produce garbage

By design. The card lists percussion and drum sounds as explicitly out-of-scope for this release. Use a different tool (e.g. a drum sample library or a percussion-specific model) for drum loops.

No widely reported issues on the RTX 3060 Ti specifically. If you hit one, report it via the submission form.