ACE-Step 1.5 XL on RTX 5070: Text-to-Music Generation via the 8 GB Optimization Path

What You'll Build

A working text-to-music pipeline that turns a text prompt + optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single RTX 5070, either through the official Gradio app or as a ComfyUI custom node — using the model's documented memory-optimized launch path so it fits the 12 GB card comfortably.

Hardware data: RTX 5070 (12 GB VRAM) · text-to-music, lyric-aligned vocals, top-10 supported languages per HF model card · See benchmark data

⚠️ On a 12 GB card, use the optimization flags; they are not optional. At default precision ACE-Step peaks at 11.7 GB (measured on a 12 GB RTX 3060, see Results). A 12 GB RTX 5070 with a display attached exposes only roughly 10.5–11.3 GB of usable VRAM, so the default path would OOM. Launch with the --cpu_offload --torch_compile --overlapped_decode combination that the ACE-Step team documents as an 8 GB floor; that is the path this recipe installs and runs.
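The budget argument in the warning above can be checked with a few lines of arithmetic. This is only a sketch: the figures are the ones quoted in this recipe, the helper name fits is made up for illustration, and treating the documented 8 GB floor as the optimized peak is an assumption.

```python
def fits(peak_gb: float, usable_gb: float) -> bool:
    """Return True when a peak allocation fits inside the usable VRAM."""
    return peak_gb <= usable_gb


DEFAULT_PEAK_GB = 11.7          # default-precision peak cited in this recipe (RTX 3060)
OPTIMIZED_PEAK_GB = 8.0         # assumption: the documented 8 GB floor taken as the peak
USABLE_RANGE_GB = (10.5, 11.3)  # rough usable VRAM with a display attached

for usable in USABLE_RANGE_GB:
    print(
        f"usable={usable} GB | default fits: {fits(DEFAULT_PEAK_GB, usable)} | "
        f"optimized fits: {fits(OPTIMIZED_PEAK_GB, usable)}"
    )
```

At both ends of the usable range the default peak does not fit and the optimized path does, which is the whole case for the flags.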

ℹ️ Not a TTS model. ACE-Step generates music — instruments and lyric-aligned vocals — from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you are in the right place.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM with the optimization flags (the path this recipe uses)	RTX 5070 (12 GB)
RAM	16 GB (CPU offload streams weights through system RAM)	—
Storage	~8.3 GB for the 3.5B weights + DCAE VAE + vocoder + UMT5-base text encoder	8.28 GB on disk
Software	Python 3.10, PyTorch with CUDA (cu128 wheels for Blackwell sm_120), ComfyUI (optional)	—

The RTX 5070 is a Blackwell GB205 (sm_120) card: 6144 CUDA cores, ~672 GB/s memory bandwidth, 12 GB GDDR7 on a 192-bit bus, 250 W TGP. Two facts shape this recipe: (1) its 12 GB envelope does not clear the ~11.7 GB default-precision peak once a display is attached, so the optimization flags are required; and (2) sm_120 has native FP8 tensor cores and needs the cu128 PyTorch wheel, so the install follows the standard Blackwell setup.

Installation

1. Clone the repo and create the conda environment

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step

2. Install PyTorch with Blackwell (sm_120) kernels

The official README installs a cu126 PyTorch wheel. On a Blackwell RTX 5070 (sm_120), use the CUDA 12.8 index so the wheel ships sm_120 kernels:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

(For other PyTorch installation options, refer to the official PyTorch website.)
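After installing the wheel, it may be worth confirming that the build actually lists sm_120 before the first generation. The sketch below relies on torch.cuda.get_arch_list(), which reports the architectures the installed build was compiled for; the helper names has_arch and check_torch_build are illustrative, not part of ACE-Step.

```python
def has_arch(arch_list, target="sm_120"):
    """Return True if the target architecture appears in the build's arch list."""
    return target in arch_list


def check_torch_build():
    """Describe whether the installed PyTorch build carries sm_120 kernels."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment"
    archs = torch.cuda.get_arch_list()  # empty list on CPU-only builds
    if has_arch(archs):
        return f"OK: torch {torch.__version__} includes sm_120 kernels"
    return f"torch {torch.__version__} lacks sm_120 (built for: {archs})"


if __name__ == "__main__":
    print(check_torch_build())
```

If the report says sm_120 is missing, reinstalling from the cu128 index is the first thing to try.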

3. Install the package

pip install -e .

This pulls in diffusers, transformers, accelerate, and the project's audio dependencies. Weights for ACE-Step/ACE-Step-v1-3.5B are downloaded automatically from HuggingFace on first launch (to ~/.cache/ace-step/checkpoints unless you pass --checkpoint_path).
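To see whether the first-launch download has completed, you can total the size of the checkpoint directory and compare it with the roughly 8.3 GB listed under Requirements. This is a minimal sketch using only the standard library; dir_size_gb is a hypothetical helper, and it reports decimal gigabytes, which may differ slightly from how the on-disk figure was measured.

```python
from pathlib import Path


def dir_size_gb(root):
    """Sum the sizes of all files under root, in decimal gigabytes (1e9 bytes)."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return 0.0
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / 1e9


if __name__ == "__main__":
    cache = "~/.cache/ace-step/checkpoints"  # default location named in this recipe
    print(f"{cache}: {dir_size_gb(cache):.2f} GB")
```

A result well below the expected size usually means the download was interrupted or went to a custom --checkpoint_path.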

4. (Optional) ComfyUI custom node

If you would rather drive the model from a ComfyUI workflow:

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git

Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/ — the folder layout is documented in the custom node README (needs ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base subdirectories).
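A quick way to catch a misplaced download is to check for the four subdirectories named above before starting ComfyUI. The following is a sketch under the assumption that those four names are the complete required layout; missing_subdirs is an illustrative helper and the custom node README remains the authority.

```python
from pathlib import Path

# Subdirectory names taken from the paragraph above.
REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_subdirs(model_dir):
    """Return the required subdirectory names that are absent under model_dir."""
    root = Path(model_dir).expanduser()
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_subdirs("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    print("layout OK" if not missing else f"missing: {missing}")
```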

Running

Gradio app (the 12 GB launch — optimization flags on)

This is the path you want on a 12 GB RTX 5070. The three flags together are what the ACE-Step team documents as the 8 GB floor:

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

The official README credits these flags with having "Reduced Max VRAM to 8GB, making it more compatible with consumer devices." It describes --cpu_offload as "Offload model weights to CPU to save GPU memory"; this flag is the heaviest hitter, streaming transformer layers from system RAM into VRAM on demand. --overlapped_decode pipelines VAE decoding with diffusion. The PCIe Gen5 x16 link on the 5070 keeps the offload-streaming overhead manageable.
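If you start the app from a script (for example a launcher or a service wrapper), building the argument list in one place keeps the three flags from being dropped by accident. This is a small sketch that only assembles the command shown above; build_launch_cmd and its low_vram parameter are hypothetical names, and nothing here is an ACE-Step API.

```python
import shlex


def build_launch_cmd(port=7865, low_vram=True):
    """Assemble the acestep launch command as an argument list for subprocess."""
    cmd = ["acestep"]
    if low_vram:
        # The three flags this recipe treats as mandatory on a 12 GB card.
        for flag in ("--torch_compile", "--cpu_offload", "--overlapped_decode"):
            cmd += [flag, "true"]
    cmd += ["--port", str(port)]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_launch_cmd()))
```

The resulting list can be handed to subprocess.run(...) from inside the activated ace_step environment.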

Then open http://localhost:7865. In the Text2Music tab, enter descriptive tags (style, mood, instruments) and optional lyrics (use [verse], [chorus], [bridge] structure tags), set an audio duration, and generate. The app returns a downloadable audio file.
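When lyrics come from a script or a spreadsheet, a small formatter can produce the tagged layout described above. This sketch assumes only the three structure tags named in this recipe; the model may accept others, so treat the allow-list and the format_lyrics name as illustrative.

```python
# Structure tags named in this recipe; the allow-list is an assumption.
ALLOWED_TAGS = {"verse", "chorus", "bridge"}


def format_lyrics(sections):
    """Turn (tag, lines) pairs into a lyrics string with [tag] headers."""
    blocks = []
    for tag, lines in sections:
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unknown structure tag: {tag!r}")
        blocks.append("\n".join([f"[{tag}]", *lines]))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(format_lyrics([
        ("verse", ["Neon city lights"]),
        ("chorus", ["We ride tonight"]),
    ]))
```

The output can be pasted into the lyrics box or passed as the lyrics argument when calling the model from Python.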

Python API (programmatic)

ACE-Step is a music model: the HuggingFace card's pipeline_tag is text-to-audio and the card carries no copy-pasteable code snippet. Do not reach for a generic diffusers image/text pipeline, and do not expect an .images[0] attribute or a .text2music() method; neither exists. The canonical entry point, per the repo's infer.py, is the ACEStepPipeline class; text2music is only the Gradio tab label and the internal task-string default. Instantiate the class, then call the instance directly:

from acestep.pipeline_ace_step import ACEStepPipeline

pipe = ACEStepPipeline(
    checkpoint_dir="",          # "" → auto-download to ~/.cache/ace-step/checkpoints
    dtype="bfloat16",
    torch_compile=True,
    cpu_offload=True,           # required on a 12 GB card
    overlapped_decode=True,
)

pipe(
    audio_duration=60,
    prompt="upbeat synthwave, driving bass, retro 80s",
    lyrics="[verse]\nNeon city lights\n[chorus]\nWe ride tonight",
    infer_step=60,
    guidance_scale=15.0,
    save_path="output.wav",
)

The generated audio is written to save_path. See infer.py for the full argument list (scheduler type, CFG type, seeds, guidance interval, ERG flags).
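Because results vary a lot from run to run, it is common to render several takes of the same prompt and keep the best one. The sketch below only prepares keyword-argument dictionaries using the parameter names from the call above, each with its own save_path. build_jobs is a hypothetical helper, and whether unseeded calls actually differ between takes is an assumption; check infer.py for the seed arguments if you need reproducibility.

```python
from pathlib import Path


def build_jobs(prompt, lyrics, takes=3, out_dir="takes", audio_duration=60):
    """Return one kwargs dict per take, each writing to a distinct file."""
    out = Path(out_dir)
    return [
        dict(
            audio_duration=audio_duration,
            prompt=prompt,
            lyrics=lyrics,
            infer_step=60,
            guidance_scale=15.0,
            save_path=str(out / f"take_{i:02d}.wav"),
        )
        for i in range(1, takes + 1)
    ]


if __name__ == "__main__":
    for job in build_jobs("upbeat synthwave, driving bass, retro 80s", "[verse]\nNeon city lights"):
        print(job["save_path"])
        # With a pipeline instance available: pipe(**job)
```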

Results

Speed: No RTX 5070 benchmark is cited yet, so no speed figure is claimed for this card. The RTX 5070 differs materially from the nearer Blackwell siblings on both memory bandwidth and compute, so a figure from another card would not transfer honestly — speed is omitted by design. Once a community benchmark lands it will appear at /check/acestep-1-5-xl/rtx-5070. If you run ACE-Step on an RTX 5070, please contribute your numbers via the submission form.
VRAM usage: The official minimum is 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, confirmed by the ACE-Step team (member xushengyuan) in HF discussion #4: "The minimum VRAM requirement for full-length generation is now just 8 GB. We tested it on an RTX 4060, and it delivers decent performance beyond our expectations (1.16it/s)." That is the path this recipe installs, and it fits the RTX 5070 with several GB to spare. For contrast, default half precision (no flags) peaks at 11.7 GB / 12 GB on an RTX 3060, per community user akande in the same thread (May 2025): "For me it runs on my 3060 and consumes 11.7GB / 12GB vram. […] Because i don't use any arguments other then --port to start." On a 12 GB RTX 5070 with a display attached (~10.5–11.3 GB usable), that 11.7 GB default peak would OOM — which is exactly why the optimization flags are the lead path here. See /check/acestep-1-5-xl/rtx-5070.
Quality notes: Performs best in the top 10 supported languages; rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).

For the full benchmark data, see /check/acestep-1-5-xl/rtx-5070.

Troubleshooting

Out of memory at 12 GB without the flags

The RTX 5070's 12 GB does not clear the cited 11.7 GB default-precision peak once a desktop session is using a slice of VRAM. Always launch with the optimization combination:

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

cpu_offload is the heaviest hitter — it streams transformer layers from RAM into VRAM on demand. Combined with overlapped_decode (which pipelines VAE decoding with diffusion) you hit the official 8 GB floor that the ACE-Step team measured on an RTX 4060 per HF discussion #4. Because cpu_offload moves weights through system RAM, make sure you have at least 16 GB of RAM free.
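The 16 GB guideline can be checked before launching. This is a rough sketch: available_ram_gb reads free physical pages through os.sysconf, which works on Linux only and does not count reclaimable cache, so it tends to under-report. Both function names are illustrative.

```python
import os


def enough_ram(available_gb, required_gb=16.0):
    """Return True when the available RAM meets the recipe's 16 GB guideline."""
    return available_gb >= required_gb


def available_ram_gb():
    """Linux-only estimate of free physical RAM in GiB; None where unsupported."""
    try:
        pages = os.sysconf("SC_AVPHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 2**30


if __name__ == "__main__":
    free = available_ram_gb()
    if free is None:
        print("could not read free RAM on this platform")
    else:
        print(f"free RAM: {free:.1f} GiB, enough for cpu_offload: {enough_ram(free)}")
```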

The HuggingFace Quick-start snippet does not work

ACE-Step's HF card has pipeline_tag: text-to-audio and ships no usable inference snippet — if you paste a generic diffusers pipeline(...) call (or an .images[0]-style snippet from a similarly-templated card) it will not produce music. Use the ACEStepPipeline class from the repo's infer.py (shown in the Python API section above): instantiate ACEStepPipeline(...) and call it directly with prompt / lyrics / audio_duration / save_path. There is no .text2music() method and no .images[0] attribute. The Gradio CLI (acestep) is the simpler path if you do not need the library API.

acestep command not found after pip install -e .

The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you are probably in a different env — re-activate with conda activate ace_step and verify with which acestep. The GitHub repo README documents the entry point.
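The same check can be run from Python, which also shows which interpreter (and therefore which environment) is active. This is a minimal sketch using shutil.which from the standard library; find_cli is an illustrative name.

```python
import shutil
import sys


def find_cli(name="acestep"):
    """Return the full path of the command if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    print(f"interpreter: {sys.executable}")
    print(f"acestep:     {path or 'not found on PATH'}")
```

If the interpreter path does not point inside the ace_step environment, the wrong environment is active.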

Blackwell (sm_120) kernel errors at first inference

If the first generation call crashes with a CUDA kernel or "no kernel image is available" error, your PyTorch build lacks sm_120 kernels. Reinstall the cu128 wheel (Installation step 2) — the README's default cu126 command does not always carry Blackwell kernels. ACE-Step's attention path uses the standard PyTorch backends, so there is no FlashAttention-2 sm_120 wheel gap to work around here.

Generations sound unstructured past ~5 minutes

This is a documented limitation, not a bug. The model card calls it out under "Limitations": the model loses long-range structural coherence beyond ~5 minutes. Either keep the requested audio duration inside that window or use the repaint/extend operations on shorter segments and stitch them.
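If you go the segment route, it helps to plan the segment lengths up front. The sketch below only does that planning arithmetic; it does not call the repaint/extend operations. The 300-second cap is this recipe's approximate ~5 minute figure, used here as an assumption, and split_duration is a hypothetical helper.

```python
def split_duration(total_s, max_segment_s=300):
    """Split a target length in seconds into segments no longer than max_segment_s."""
    if total_s <= 0 or max_segment_s <= 0:
        raise ValueError("durations must be positive")
    full, rest = divmod(total_s, max_segment_s)
    return [max_segment_s] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(split_duration(420))  # a 7-minute target split at the 5-minute cap
```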

If you hit something not covered here, please report via the submission form so we can add it to the catalogue.