Foundation-1 on RX 7800 XT: Structured Music Sample Generation on ROCm

tts · intermediate · 8GB+ VRAM · Jun 18, 2026

This intermediate recipe sets up Foundation-1 on the RX 7800 XT, needing about 8 GB of VRAM.

prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or any ROCm-supported card with ≥ 8 GB VRAM (model card recommends 8 GB minimum)
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • Python 3.10 (3.11+ may fail dependency resolution per the RC fork README)
  • Git, ~3 GB free disk for weights + dependencies

What You'll Build

A local, offline pipeline that turns structured tag prompts (instrument → timbre → musical behavior → FX → key → bars → BPM) into tempo-synced, bar-aligned music loops on your Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Foundation-1 is a fine-tune of stabilityai/stable-audio-open-1.0 trained for music-production workflows; the RC Stable Audio Tools fork handles the BPM/bar timing alignment automatically. The model is pure PyTorch, so it runs through PyTorch's native attention path on ROCm with no special kernels.

Hardware data: RX 7800 XT (16GB VRAM) · BF16/FP16 · PyTorch SDPA on ROCm 7.2 · ~7 GB usage per the HuggingFace model card (comfortable headroom on a 16 GB card) · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack, so you install PyTorch built for ROCm: there is no cu121/cu124/cu128 wheel, no xformers install, and no FP8/FP4 path. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, and INT4 only), and Foundation-1's ~7 GB footprint sits well under the 16 GB ceiling, so you simply run the native BF16/FP16 weights. Attention goes through PyTorch's scaled-dot-product attention (SDPA), which is what stable-audio-tools uses on a PyTorch backend; no FlashAttention build is needed. If a guide tells you to pip install xformers, build a flash-attn wheel, or pick a cu12x torch wheel for this card, it was written for the wrong vendor.

ℹ️ Not a text-to-speech model. Foundation-1 is in our tts vertical because the catalogue groups all audio models together, but it generates one-shot music samples — bar-locked instrumental loops — not speech. It does not synthesize voices, words, or any spoken audio. For speech synthesis on this GPU, see Kokoro, VoxCPM, or Qwen3-TTS instead. Per its own HuggingFace card, it is "Structured text-to-sample generation for modern music production" — a specialized model for music sample generation, not a general-purpose music generator.

⚠️ Split license — read before shipping. The code and the weights carry two different licenses; do not conflate them.

  • Code (the RC Stable Audio Tools fork) is MIT-licensed — see the RC fork repository.
  • Weights (Foundation-1 model) are released under the Stability AI Community License. The HuggingFace card states the model "is available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M."

If you're a hobbyist or under the $1M revenue threshold you're clear; otherwise contact Stability AI for a commercial license before publishing or selling outputs. The MIT code permission does not extend the weights' usage terms, and the Stability license on the weights does not restrict the MIT-licensed code.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (ROCm-supported AMD card, per HF card) | RX 7800 XT (16 GB, ~9 GB headroom)
RAM | 16 GB system RAM |
Storage | ~3 GB (2.43 GB weights + venv + deps) |
Driver | AMD ROCm 7.2.x on Linux |
Python | 3.10 (3.11+ may fail SciPy resolution per the RC fork README) |
PyTorch | 2.4+ built for ROCm (whl/rocm7.2 index, NOT a CUDA wheel) |
Software | RC Stable Audio Tools fork or ComfyUI custom node |

Installation

This recipe follows the canonical workflow recommended on the Foundation-1 model card — the RC Stable Audio Tools fork, which auto-handles BPM/bar timing alignment — but installs PyTorch for ROCm instead of CUDA. For a ComfyUI alternative, see Troubleshooting.

1. Clone the RC Stable Audio Tools fork

git clone https://github.com/RoyalCities/RC-stable-audio-tools.git
cd RC-stable-audio-tools

2. Create a Python 3.10 virtual environment

python3.10 -m venv venv
source venv/bin/activate

3. Install PyTorch for ROCm (do this BEFORE the package install)

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Install torch from the ROCm wheel index first so the later pip install doesn't pull a CPU/CUDA build:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in that index moves over time (6.3 → 6.4 → 7.x). Read the current stable line at pytorch.org/get-started/locally (select ROCm) before running. AMD also ships its own Radeon-tuned wheels at repo.radeon.com if you prefer the AMD-recommended build; the upstream whl/rocm7.2 wheel above is the simplest canonical path on a supported card.

Confirm you got the ROCm build, not a CUDA or CPU one:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

It should print a +rocm7.2-style version suffix and True (ROCm masquerades as the cuda device namespace under HIP).
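
If the one-liner's output is ambiguous, a slightly longer check can help. The following is a minimal sketch, not part of the RC fork: the helper name classify_build is hypothetical, and it assumes only that torch.version.hip and torch.version.cuda are None when a build lacks that backend.

```python
"""Report whether the installed PyTorch is a ROCm, CUDA, or CPU build."""


def classify_build(version, hip_version, cuda_version):
    """Classify a build from torch.__version__, torch.version.hip, torch.version.cuda.

    hip_version and cuda_version are expected to be None when the build
    lacks that backend (hypothetical helper, for illustration only).
    """
    if hip_version is not None or "+rocm" in version:
        return "rocm"
    if cuda_version is not None or "+cu" in version:
        return "cuda"
    return "cpu"


def main():
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    kind = classify_build(
        torch.__version__,
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )
    print(f"torch {torch.__version__} -> {kind} build")
    if torch.cuda.is_available():
        # On ROCm the AMD GPU is exposed through the torch.cuda namespace.
        print("device 0:", torch.cuda.get_device_name(0))
    else:
        print("no GPU is visible to torch")


if __name__ == "__main__":
    main()
```

On this card you would expect the script to report a rocm build and name the Radeon device; a cuda or cpu result means step 3 needs to be repeated.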

4. Install stable-audio-tools and the fork

pip install stable-audio-tools
pip install .

If this step replaces your ROCm torch with a CPU or CUDA wheel (it can, because of dependency pins), re-run step 3 to reinstall the ROCm build, then re-check torch.cuda.is_available().

5. Download Foundation-1 weights

Place both files inside a single subfolder of models/:

mkdir -p models/Foundation-1
cd models/Foundation-1
curl -L -o Foundation_1.safetensors \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/Foundation_1.safetensors
curl -L -o model_config.json \
  https://huggingface.co/RoyalCities/Foundation-1/resolve/main/model_config.json
cd ../..

The safetensors file is 2.43 GB (2,426,992,388 bytes, per the x-linked-size header and the HF Files tab). This release ships only the 16-bit weights. Per the card: "Unlike prior releases where both 32-bit and 16-bit models were provided, this release includes only the 16-bit version. There is no quality loss, while reducing the model footprint." On RDNA3 these 16-bit weights load and run as BF16/FP16 directly, so there is no quantization step and no FP8 path to chase.
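
A truncated download is an easy mistake to miss, so it can be worth comparing the file size on disk against the byte count quoted above. This is a small sketch under the assumption that the file sits at the path used in step 5; check_size is a hypothetical helper name, and a size match does not replace a proper checksum.

```python
import os

# Byte count quoted in this recipe for Foundation_1.safetensors.
EXPECTED_BYTES = 2_426_992_388


def check_size(path, expected=EXPECTED_BYTES):
    """Return (ok, actual_bytes); a missing file is reported as 0 bytes."""
    actual = os.path.getsize(path) if os.path.exists(path) else 0
    return actual == expected, actual


if __name__ == "__main__":
    ok, actual = check_size("models/Foundation-1/Foundation_1.safetensors")
    if ok:
        print("size matches the expected byte count")
    else:
        print(f"size mismatch: {actual} bytes on disk, expected {EXPECTED_BYTES}")
```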

Running

Launch the Gradio UI, pointing at the Foundation-1 checkpoint and config you just downloaded:

python run_gradio.py \
  --model-config models/Foundation-1/model_config.json \
  --ckpt-path models/Foundation-1/Foundation_1.safetensors

The Gradio interface opens in your browser. Foundation-1 uses a layered tag prompt schema documented on its model card:

[Instrument Family / Sub-Family], [Timbre], [Musical Behavior / Notation], [FX], [Key], [Bars], [BPM]

A working example prompt from the card's Audio Showcase:

Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Sub Bass,
Bass, Upper Mids, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich,
Overdriven, Crisp, Deep, Clean, Pitch Bend, 303, 8 Bars, 140 BPM, E minor

Supported loop structures: 4 or 8 bars; supported BPMs: 100, 110, 120, 128, 130, 140, 150. The RC fork's BPM/bar selector locks generation duration to the prompt's musical structure automatically — per the card, "an 8-bar loop at 100 BPM ≈ 19 seconds" of output. The underlying Stable Audio Open base outputs 44.1 kHz stereo; Foundation-1 is constrained to the bar/BPM grid above.
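
If you generate prompts from a script rather than typing them, a small helper can keep the tag order and the supported bar/BPM values straight. The sketch below is illustrative only: build_prompt and its parameter names are hypothetical, it follows the schema order shown above (key before bars and BPM), and the tags in the usage example are taken from the card's showcase prompt.

```python
SUPPORTED_BARS = (4, 8)
SUPPORTED_BPMS = (100, 110, 120, 128, 130, 140, 150)


def build_prompt(instrument, timbre, behavior, fx, key, bars, bpm):
    """Join tag groups in the documented schema order.

    instrument, timbre, behavior and fx are lists of tag strings;
    key is a single string such as "E minor".
    """
    if bars not in SUPPORTED_BARS:
        raise ValueError(f"bars must be one of {SUPPORTED_BARS}, got {bars}")
    if bpm not in SUPPORTED_BPMS:
        raise ValueError(f"bpm must be one of {SUPPORTED_BPMS}, got {bpm}")
    tags = [*instrument, *timbre, *behavior, *fx, key, f"{bars} Bars", f"{bpm} BPM"]
    return ", ".join(tags)


if __name__ == "__main__":
    print(
        build_prompt(
            instrument=["Bass", "FM Bass"],
            timbre=["Gritty", "Warm"],
            behavior=["Pitch Bend"],
            fx=["Medium Reverb"],
            key="E minor",
            bars=8,
            bpm=140,
        )
    )
```

Rejecting unsupported values up front is a design choice for this sketch; the model may still accept other values, but they fall outside the grid the card documents.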

Because the model is pure PyTorch, attention runs through PyTorch's built-in scaled-dot-product attention (SDPA) on ROCm — the default, no flag or extra package required. Do not install xformers or a flash-attn wheel for this; on RDNA3 they are the wrong path and SDPA is what the stack already uses.

Results

  • Speed: The model card reports generation time on an RTX 3090 only: "On an RTX 3090, generation time is approximately ~7–8 seconds per sample." The RTX 3090 is an NVIDIA Ampere card on CUDA, so that figure does not transfer to the RDNA3 RX 7800 XT on ROCm. No generation-time measurement for the RX 7800 XT was found, so the speed figure is omitted rather than borrowed from a different GPU and stack. Once a community benchmark lands it will appear at /check/foundation-1/rx-7800-xt; please contribute yours via the submission form.
  • VRAM usage: ~7 GB during generation per the HF card: "Typical VRAM usage during generation is approximately ~7 GB. For reliable operation, a GPU with at least 8 GB of VRAM is recommended." On the RX 7800 XT's 16 GB that leaves roughly 9 GB free — comfortable enough to keep a generation session running without memory pressure.
  • Output: stereo .wav loops aligned to the requested bar count and BPM. Per the model card limitations, percussion and drum sounds are out of scope for this release; the 10 instrument families covered are Synth, Keys, Bass, Bowed Strings, Mallet, Wind, Guitar, Brass, Vocal, and Plucked Strings.
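
Since no 7800 XT timing is published, you may want to record your own. The helper below is a rough sketch: measure is a hypothetical name, it has to run in the same Python process as the generation call (so it wraps a scripted call, not the Gradio UI), and generate_fn in the comment stands in for whatever generation function you use. torch.cuda.max_memory_allocated counts only PyTorch's tensor allocations, so expect it to read lower than a system-level VRAM monitor.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, seconds, peak_vram_gb or None)."""
    try:
        import torch
        gpu = torch.cuda.is_available()
    except ImportError:
        torch, gpu = None, False

    if gpu:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()

    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if gpu:
        # Wait for queued GPU work so the clock covers the whole run.
        torch.cuda.synchronize()
    seconds = time.perf_counter() - start

    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if gpu else None
    return result, seconds, peak_gb


# Usage sketch (generate_fn is a placeholder, not a real API):
# audio, seconds, peak_gb = measure(generate_fn, prompt="...")
```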

For the full benchmark data, see /check/foundation-1/rx-7800-xt.

Troubleshooting

Gradio launches but reports torch.cuda.is_available() == False

On ROCm, torch.cuda.is_available() returning False almost always means a CPU-only or CUDA torch wheel got installed instead of the ROCm build, commonly because the pip install . in step 4 pulled a different torch via a dependency pin. Reinstall the ROCm wheel:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

It should print a +rocm7.2 suffix and True. (Under HIP, ROCm exposes the GPU through the cuda device namespace, so torch.cuda.* is the correct API even on AMD.) If it still reports False, confirm your system ROCm stack is installed (rocminfo should list gfx1101) and that your user is in the render/video groups.
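
The group check can be scripted as well. This is a minimal, Linux-only sketch; both function names are hypothetical. It inspects the groups of the current process, which matters because a freshly added group membership does not take effect until you log in again.

```python
import os


def missing_gpu_groups(current_groups, required=("render", "video")):
    """Return the required group names that are absent from current_groups."""
    return [name for name in required if name not in current_groups]


def current_group_names():
    """Names of the groups the running process belongs to (Unix only)."""
    import grp  # Unix-only module, imported lazily

    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A gid without a name entry; skip it.
            pass
    return names


if __name__ == "__main__":
    missing = missing_gpu_groups(current_group_names())
    if missing:
        print("not a member of:", ", ".join(missing))
    else:
        print("render and video group membership looks fine")
```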

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack ships kernels for both, but occasionally a library or prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101 with a "no kernel image is available" / missing-gfx1101-kernel error. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python run_gradio.py \
  --model-config models/Foundation-1/model_config.json \
  --ckpt-path models/Foundation-1/Foundation_1.safetensors

This is a legacy fallback, not a default — current PyTorch on the stable ROCm wheel runs Foundation-1 natively on gfx1101 without it. Only reach for it if a specific library refuses to load on the 7800 XT's gfx1101 target.

Dependency resolution failures on Python 3.11+

The RC fork's README explicitly notes to use Python 3.10: "Newer versions (e.g. 3.11+) can fail dependency resolution due to pinned packages (notably older SciPy wheels)." Use a Python 3.10 venv as in step 2.

Prefer ComfyUI over Gradio

Two community ComfyUI custom nodes exist:

  • Saganaki22/ComfyUI-Foundation-1 — auto-downloads weights into ComfyUI/models/stable_audio/Foundation-1/, ships example workflows. Install via ComfyUI Manager (recommended) or git clone into ComfyUI/custom_nodes/ then python install.py. The install script uses pip install stable-audio-tools --no-deps to protect your ComfyUI environment from the upstream's aggressive pandas==2.0.2 pin.
  • SanDiegoDude/scg_Foundation-1-comfyUI — install via ComfyUI Manager recommended; weights land at ComfyUI/models/audio_checkpoints/Foundation-1/.

If you go the ComfyUI route on this card, make sure ComfyUI's own PyTorch is the ROCm build (--index-url https://download.pytorch.org/whl/rocm7.2), exactly as in step 3 — the same ROCm-not-CUDA rule applies. The same ~7 GB VRAM envelope and 8 GB minimum apply regardless of front-end; on the 16 GB 7800 XT both nodes have comfortable headroom.

Want to share the card with a larger model? Enable INT4 / Low-VRAM Mode (TorchAO)

You do not need this on a 16 GB card — the default BF16 path fits Foundation-1's ~7 GB footprint with room to spare — but the RC fork ships an optional INT4 weight-only mode (via TorchAO) you can use if you want to run Foundation-1 alongside a larger model:

pip install torchao

The fork notes INT4 inference "can be very slow on Windows because Triton fast-kernels are usually unavailable (falls back to slower paths)." On Linux/ROCm the Triton-ROCm backend is present, but INT4 weight-only is still a memory-sharing convenience, not a speed win; on a 7800 XT you'd normally leave it off and run BF16. Because RDNA3 has no FP8/FP4 hardware, INT4 weight-only is the only sub-16-bit weight format that makes sense here. If TorchAO isn't installed, the INT4 toggle stays hidden in the UI.

Prompt produces drift or incoherent phrases

Per the model card's Limitations section, if generation duration doesn't match the prompt's bar/BPM structure (e.g. requesting an 8-bar loop but capping output at 5 seconds), output coherence degrades. The RC fork handles this alignment automatically. If you're using bare stable-audio-tools or a third-party UI, set the audio duration manually to bars × 4 × (60 / BPM) seconds, where 4 is the number of beats per bar; 8 bars at 100 BPM gives 8 × 4 × 0.6 = 19.2 s, matching the card's ≈ 19 seconds. Also: keep prompts in the documented tag order, use 1–3 timbre descriptors, and always include both Bars and BPM.
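
For front-ends where you set the duration by hand, the formula is simple enough to script. The sketch below assumes four beats per bar, which is what the "× 4" in the formula implies; loop_seconds is a hypothetical helper name.

```python
def loop_seconds(bars, bpm, beats_per_bar=4):
    """Duration in seconds of a loop of `bars` bars at `bpm` beats per minute."""
    if bars <= 0 or bpm <= 0:
        raise ValueError("bars and bpm must be positive")
    return bars * beats_per_bar * 60.0 / bpm


if __name__ == "__main__":
    # Print the duration for every bar/BPM combination the recipe lists.
    for bars in (4, 8):
        for bpm in (100, 110, 120, 128, 130, 140, 150):
            print(f"{bars} bars @ {bpm} BPM -> {loop_seconds(bars, bpm):.2f} s")
```

An 8-bar loop at 100 BPM comes out at 19.2 seconds, in line with the card's "≈ 19 seconds".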

Percussion or drum prompts produce garbage

By design. The card lists "Percussion and drum sounds are outside the scope of this release." Use a different tool (e.g. a drum sample library or a percussion-specific model) for drum loops.

No widely-reported issues on RX 7800 XT specifically — if you hit one, report it via the submission form.

common questions
How much VRAM does Foundation-1 need?

About 7 GB during generation per the model card, which recommends a GPU with at least 8 GB; that 8 GB minimum is what this recipe targets.

Which GPUs is Foundation-1 tested on?

RX 7800 XT (16 GB).

How hard is this setup?

Intermediate — follow the steps above.