How much VRAM does ACE-Step 1.5 XL need?

About 12 GB at default precision, which is the minimum this recipe targets. With the memory optimizations enabled, the official floor is 8 GB.

How hard is this setup?

Intermediate. Follow the steps below.

ACE-Step 1.5 XL on RTX 5060 Ti: Text-to-Music Generation in ComfyUI

What You'll Build

A working text-to-music pipeline that turns a prompt + optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single RTX 5060 Ti, either through the official Gradio app or as a ComfyUI custom node.

Hardware data: RTX 5060 Ti (16 GB VRAM) · text-to-music, lyric-aligned vocals, 17 languages per HF model card · See benchmark data

ℹ️ Not a TTS model. ACE-Step generates music — instruments and lyric-aligned vocals — from a text description. It's filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you're in the right place.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM with optimizations enabled, 12 GB at default precision	RTX 5060 Ti (16 GB)
RAM	16 GB	—
Storage	~8 GB for the 3.5B weights + VAE + vocoder + UMT5-base text encoder	—
Software	Python 3.10, PyTorch built against CUDA 12.8 or newer (needed for Blackwell GPUs such as the 5060 Ti), ComfyUI (optional)	—
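
Before downloading anything, a quick storage check can save a failed first launch. The sketch below uses only the standard library; the 8 GB threshold comes from the table above, and the function names are illustrative rather than part of ACE-Step.

```python
import shutil

# Approximate download size from the requirements table (weights + VAE + vocoder + text encoder).
REQUIRED_GB = 8


def free_gb(path="."):
    """Return free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", required_gb=REQUIRED_GB):
    """True if the filesystem at `path` has at least `required_gb` GB free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"Free space: {free_gb():.1f} GB; enough for the weights: {has_room()}")
```

Run it from the drive where the Hugging Face cache or the ComfyUI models folder lives, since that is where the weights end up.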

Installation

1. Clone the repo and create the conda environment

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step

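2. Install PyTorch (Windows GPU users only; Linux can skip)

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

The cu126 wheels do not include kernels for RTX 50-series (Blackwell) cards, so the 5060 Ti needs a PyTorch build compiled against CUDA 12.8 or newer; hence the cu128 index here.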
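
After installing, it can help to confirm that the PyTorch build actually sees the GPU and ships kernels for it. The following is a minimal sketch, not part of ACE-Step: `arch_supported` and `report` are hypothetical helper names, and the exact-match check is a simplification (it ignores PTX forward compatibility). RTX 50-series cards are expected to report compute capability (12, 0), which PyTorch lists as `sm_120`.

```python
def arch_supported(capability, arch_list):
    """Return True if a (major, minor) compute capability has matching kernels.

    arch_list uses PyTorch's naming, e.g. ["sm_80", "sm_90", "sm_120"].
    Exact match only: a simplification that ignores PTX forward compatibility.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    """Print what the installed PyTorch build sees; return False if the GPU is unusable."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return False
    if not torch.cuda.is_available():
        print("PyTorch is installed, but CUDA is not available.")
        return False
    capability = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print(f"torch {torch.__version__}, built for CUDA {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {capability}")
    ok = arch_supported(capability, archs)
    print(f"Kernels for this GPU in the build: {ok} ({archs})")
    return ok


if __name__ == "__main__":
    report()
```

If the last line prints False, the installed wheel was built for older architectures and should be replaced with one built against a newer CUDA toolkit.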

3. Install the package

pip install -e .

This pulls in diffusers, transformers, accelerate, and the project's audio dependencies. Weights for ACE-Step/ACE-Step-v1-3.5B are downloaded automatically from Hugging Face on first launch.

4. (Optional) ComfyUI custom node

If you'd rather drive the model from a ComfyUI workflow:

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git

Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/. The folder layout is documented in the custom node README and needs four subdirectories: ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base.
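
A misnamed or missing subdirectory is an easy mistake when copying weights by hand. This small sketch checks the layout described above; the subdirectory names are the ones listed in this recipe, while the helper name `missing_subdirs` and the relative path in the usage section are assumptions to adjust for your install.

```python
from pathlib import Path

# Subdirectory names as listed in this recipe (per the custom node README).
REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_subdirs(model_dir):
    """Return the required subdirectories that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script runs from the directory that contains ComfyUI/.
    model_dir = Path("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    missing = missing_subdirs(model_dir)
    if missing:
        print(f"Missing in {model_dir}: {', '.join(missing)}")
    else:
        print("Layout OK")
```

The check only looks at directory names; it does not verify that the files inside are complete.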

Running

Gradio app (official)

acestep --port 7865 --bf16 true

Then open http://localhost:7865. Enter a text prompt (style, mood, instruments) and optional lyrics; the app returns a downloadable audio file.

Memory-optimized launch (if you want headroom for other work)

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

These three flags together are what get the model down to the official 8 GB floor — useful if the 5060 Ti is also driving a display or running other workloads.
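
The choice between the two launch modes can be written down as a simple rule. The sketch below is illustrative only: the flags and the 8 GB / 12 GB figures come from this recipe, but the threshold policy and the `launch_args` helper are assumptions, not something the ACE-Step project ships.

```python
import shlex

# Figures from this recipe: 12 GB at default precision, 8 GB with all optimizations.
DEFAULT_PRECISION_GB = 12
OPTIMIZED_FLOOR_GB = 8


def launch_args(free_vram_gb, port=7865):
    """Build an acestep argument list for the VRAM you can spare (in GB)."""
    if free_vram_gb < OPTIMIZED_FLOOR_GB:
        raise ValueError(
            f"{free_vram_gb} GB is below the {OPTIMIZED_FLOOR_GB} GB floor"
        )
    args = ["acestep", "--port", str(port)]
    if free_vram_gb < DEFAULT_PRECISION_GB:
        # Not enough room for the default launch: enable all three optimizations.
        args += [
            "--torch_compile", "true",
            "--cpu_offload", "true",
            "--overlapped_decode", "true",
        ]
    else:
        args += ["--bf16", "true"]
    return args


if __name__ == "__main__":
    for gb in (16, 10):
        print(f"{gb} GB -> {shlex.join(launch_args(gb))}")
```

On a 16 GB 5060 Ti with nothing else running this picks the plain launch; if a display or another workload leaves only around 10 GB free, it falls back to the optimized flags.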

Diffusers (programmatic)

import torch
from diffusers import DiffusionPipeline

# Load the checkpoint in bfloat16 and place it on the GPU.
pipe = DiffusionPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    torch_dtype=torch.bfloat16,  # weight dtype
    device_map="cuda",
)

(This only loads the pipeline. See the HF model card for the music-pipeline call signature; the README's image-pipeline snippet is a templating artifact.)

Results

Speed: No 5060 Ti benchmark cited yet. For comparison, the official GitHub README reports that an RTX 4090 generates one minute of music in 1.74 s (RTF 34.48×, 27 steps) and an RTX 3090 in 4.70 s (RTF 12.76×). On paper the 5060 Ti has lower compute throughput and memory bandwidth than the RTX 3090, so it is more likely to land somewhat below the 3090 figure than between the two cards; expect the same order of magnitude. Once a community measurement lands it'll appear at /check/acestep-1-5-xl/rtx-5060-ti.
VRAM usage: Cited 11.7 GB / 12 GB on an RTX 3060 at default half precision per HF discussion #4 (user akande, May 2025). Official minimum is 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, confirmed by the ACE-Step team (user xushengyuan) in the same thread: "The minimum VRAM requirement for full-length generation is now just 8 GB. We tested it on an RTX 4060, and it delivers decent performance beyond our expectations (1.16 it/s)." Plenty of headroom on a 16 GB 5060 Ti at default precision.
Quality notes: Performs best in the top 10 supported languages; rare instruments may render imperfectly; outputs >5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).
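
The real-time factor (RTF) quoted above is just seconds of audio divided by seconds of compute, so the README figures can be checked and extended to longer clips. A minimal worked example follows, with illustrative function names; it assumes generation time scales linearly with clip length, which is a simplification.

```python
def rtf(audio_seconds, generation_seconds):
    """Real-time factor: seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds


def generation_time(audio_seconds, rtf_value):
    """Estimated seconds to render a clip at a given real-time factor.

    Assumes time scales linearly with clip length (a simplification).
    """
    return audio_seconds / rtf_value


if __name__ == "__main__":
    # README figures: seconds needed to render one minute of audio at 27 steps.
    for gpu, secs in (("RTX 4090", 1.74), ("RTX 3090", 4.70)):
        factor = rtf(60, secs)
        estimate = generation_time(240, factor)
        print(f"{gpu}: RTF {factor:.2f}x, 4-minute song in about {estimate:.1f} s")
```

The computed factors match the README's 34.48× and 12.76× to within rounding, and a 4-minute song works out to roughly 7 s on the 4090 and 19 s on the 3090 under the linear-scaling assumption.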

For the full benchmark data, see /check/acestep-1-5-xl/rtx-5060-ti.

Troubleshooting

Out of memory on a 12 GB card at default precision

Early reports on the HF discussion thread hit OOM on a 12 GB RTX 3060 before the May 2025 memory-optimization update. The fix is the launch flag combination above:

acestep --torch_compile true --cpu_offload true --overlapped_decode true

cpu_offload does most of the work: it keeps model components in system RAM and moves each one onto the GPU only while it is in use. Combined with overlapped_decode, which decodes the latent audio in overlapping chunks rather than in one pass, this reaches the official 8 GB floor.

`acestep` command not found after `pip install -e .`

The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you're probably in a different env: re-activate with conda activate ace_step and verify with which acestep (where acestep on Windows). The GitHub repo README documents the entry point.
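
The same check can be done from Python, which also shows which interpreter is active and so makes a wrong-environment problem obvious. This is a small standard-library sketch; `locate` is a hypothetical helper name.

```python
import shutil
import sys


def locate(command="acestep"):
    """Return the full path of `command` on PATH, or None if it cannot be found."""
    return shutil.which(command)


if __name__ == "__main__":
    path = locate()
    print(f"Python in use: {sys.executable}")
    if path:
        print(f"acestep found at: {path}")
    else:
        print("acestep is not on PATH in this environment")
```

If the interpreter path does not point inside the ace_step environment, activate it and run the check again.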

Generations sound unstructured past ~5 minutes

This is a documented limitation, not a bug. The model card calls it out under "Limitations": the model loses long-range structural coherence beyond ~5 minutes. Either keep the requested duration inside that window, or use the repaint/extend operations on shorter segments and stitch them.

Vocals sound coarse / lyrics are mispronounced

This is the model's "Vocal Quality" limitation (per the HF card): synthesis is functional but lacks fine nuance, especially for low-resource languages outside the top 10. For polished vocals, the RapMachine LoRA fine-tune is one option; see the official GitHub repo for the LoRA loading documentation.

If you hit something not covered here, please report via the submission form so we can add it to the catalogue.