What You'll Build
A working text-to-music pipeline that turns a text prompt + optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single RTX 4060 Ti 16GB, either through the official Gradio app or as a ComfyUI custom node.
Hardware data: RTX 4060 Ti 16GB · text-to-music, lyric-aligned vocals, top-10 supported languages per HF model card · See benchmark data
ℹ️ Not a TTS model. ACE-Step generates music (instruments and lyric-aligned vocals) from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you're in the right place.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM with optimization flags, 12 GB at default precision | RTX 4060 Ti (16 GB) |
| RAM | 16 GB | — |
| Storage | ~8 GB for the 3.5B weights + VAE + vocoder + UMT5-base text encoder | — |
| Software | Python 3.10, PyTorch with CUDA 12.6, ComfyUI (optional) | — |
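The VRAM thresholds in the table can be turned into a quick pre-flight check. The sketch below is illustrative only: `recommend_launch_mode`, `detected_vram_gb`, the mode labels, and the 0.5 GB tolerance are hypothetical names and values chosen for this example, not part of ACE-Step. The 8 GB and 12 GB cut-offs are simply the figures from the table. The tolerance is there because drivers usually report slightly less than a card's nominal capacity.

```python
from typing import Optional

# Thresholds taken from the requirements table (GB of VRAM).
MIN_VRAM_OPTIMIZED_GB = 8.0
MIN_VRAM_DEFAULT_GB = 12.0
# Assumption: allow for cards reporting a little under their nominal size.
TOLERANCE_GB = 0.5


def recommend_launch_mode(total_vram_gb: float) -> str:
    """Map total VRAM to a launch mode (hypothetical labels for illustration)."""
    effective = total_vram_gb + TOLERANCE_GB
    if effective < MIN_VRAM_OPTIMIZED_GB:
        return "unsupported"
    if effective < MIN_VRAM_DEFAULT_GB:
        return "optimized"  # needs the memory-optimization flags
    return "default"


def detected_vram_gb(device_index: int = 0) -> Optional[float]:
    """Return total VRAM of a CUDA device in GiB, or None if unavailable."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / (1024 ** 3)


if __name__ == "__main__":
    vram = detected_vram_gb()
    if vram is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        print(f"{vram:.1f} GiB VRAM -> {recommend_launch_mode(vram)}")
```

Run it inside the environment you plan to use for ACE-Step, since it reports what that environment's PyTorch build can see.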
Installation
1. Clone the repo and create the conda environment
git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step
2. Install PyTorch (Windows GPU users only — Linux can skip)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
On Linux, the default pip install torch wheel already bundles CUDA support and ships sm_89 (Ada Lovelace) kernels, so no special wheel selection is required for the 4060 Ti. On Windows, the default PyPI wheel is CPU-only, which is why the cu126 index is needed there to get a CUDA-enabled build.
3. Install the package
pip install -e .
This pulls in diffusers, transformers, accelerate, and the project's audio dependencies. Weights for ACE-Step/ACE-Step-v1-3.5B are downloaded automatically from HuggingFace on first launch.
4. (Optional) ComfyUI custom node
If you would rather drive the model from a ComfyUI workflow:
cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git
Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/ — the folder layout is documented in the custom node README (needs ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base subdirectories).
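Before starting ComfyUI, you may want to confirm that the weights folder has the expected layout. The following is a minimal sketch that only checks for the four subdirectory names listed above. `missing_weight_dirs` is a hypothetical helper written for this guide, not part of the custom node, and it does not validate the files inside each folder.

```python
from pathlib import Path
from typing import List

# Subdirectory names as listed in the text above.
REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_weight_dirs(model_root: Path) -> List[str]:
    """Return the required subdirectories that are absent under model_root."""
    return [name for name in REQUIRED_SUBDIRS if not (model_root / name).is_dir()]


if __name__ == "__main__":
    # Path relative to the directory that contains your ComfyUI checkout.
    root = Path("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    missing = missing_weight_dirs(root)
    if missing:
        print(f"Missing weight folders: {missing}")
    else:
        print("All weight folders present.")
```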
Running
Gradio app (official)
acestep --port 7865 --bf16 true
Then open http://localhost:7865. Enter a text prompt (style, mood, instruments) and optional lyrics; the app returns a downloadable audio file.
Memory-optimized launch (free headroom for other workloads)
acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865
These three flags together are what get the model down to the official 8 GB floor — useful if the 4060 Ti is also driving a display, running a browser, or sharing the card with other inference workloads. On a 16 GB card you do not strictly need them, but they cut peak resident VRAM substantially.
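If you launch the app from a script rather than by hand, the two launch variants above can be assembled programmatically. This is a sketch under stated assumptions: `build_acestep_command` and its `low_vram` switch are hypothetical names invented for this example, and the only flags used are the ones shown in the commands above.

```python
import shlex
from typing import List


def build_acestep_command(port: int = 7865, low_vram: bool = False) -> List[str]:
    """Assemble the acestep launch command as an argument list."""
    cmd = ["acestep", "--port", str(port)]
    if low_vram:
        # The three memory-optimization flags from the launch line above.
        cmd += [
            "--torch_compile", "true",
            "--cpu_offload", "true",
            "--overlapped_decode", "true",
        ]
    else:
        # Default launch as shown in the Gradio section.
        cmd += ["--bf16", "true"]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_acestep_command(low_vram=True)))
```

Returning a list rather than a single string means the result can be passed straight to `subprocess.run` without shell quoting concerns.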
Diffusers (programmatic)
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    dtype=torch.bfloat16,
    device_map="cuda",
)
(See the HF model card for the music-pipeline call signature — the README's generic image-pipeline snippet is a template artifact.)
Results
- Speed: No RTX 4060 Ti 16GB benchmark cited yet. For comparison, the official ACE-Step GitHub README reports an RTX 4090 generates one minute of music in 1.74 s (RTF 34.48×, 27 steps) and an RTX 3090 in 4.70 s (RTF 12.76×). The 4060 Ti shares Ada Lovelace architecture (sm_89) with the 4090 but has ~1/3 the FP16 throughput, so expect generation times closer to the 3090's order of magnitude or slightly slower. The ACE-Step maintainer (xushengyuan) separately confirmed in HF discussion #4 that an RTX 4060 (the 4060 Ti's smaller sibling) sustains 1.16 it/s with the memory-optimization flags enabled. Once a community measurement on the 4060 Ti 16GB lands it will appear at /check/acestep-1-5-xl/rtx-4060-ti-16gb.
- VRAM usage: Cited 11.7 GB / 12 GB on an RTX 3060 at default half precision per user akande in HF discussion #4 (May 2025): "For me it runs on my 3060 and consumes 11.7GB / 12GB vram... I don't use any arguments other than --port to start." Official minimum is 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, confirmed by the ACE-Step team (user xushengyuan) in the same thread: "The minimum VRAM requirement for full-length generation is now just 8 GB. We tested it on an RTX 4060." A separate community user in the same thread confirms working operation on an RTX 4060 Ti after the memory-optimization update. The 16 GB envelope of the 4060 Ti 16GB leaves ~4 GB of comfortable headroom at default precision over the 11.7 GB peak observed on the 3060.
- Quality notes: Performs best in the top 10 supported languages; rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).
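The speed and VRAM figures above come down to simple arithmetic, sketched here so you can plug in your own measurements. The real-time factor (RTF) is seconds of audio produced per second of generation, so 60 s of audio in 1.74 s gives roughly 34.48×. The function names are hypothetical, and the time estimate assumes generation time scales linearly with clip length, which the cited sources do not confirm.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """RTF as used above: seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds


def estimated_generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock time for a clip, assuming linear scaling (assumption)."""
    return audio_seconds / rtf


def vram_headroom_gb(card_gb: float, observed_peak_gb: float) -> float:
    """Free VRAM left over once the observed peak is subtracted."""
    return card_gb - observed_peak_gb


if __name__ == "__main__":
    # Figures quoted in the Results section.
    rtf_4090 = real_time_factor(60, 1.74)
    rtf_3090 = real_time_factor(60, 4.70)
    print(f"RTX 4090 RTF: {rtf_4090:.2f}x, RTX 3090 RTF: {rtf_3090:.2f}x")

    # A 4-minute song at the 3090's RTF, as a rough upper-range reference.
    print(f"4 min at 3090 speed: {estimated_generation_seconds(240, rtf_3090):.1f} s")

    # 16 GB card versus the 11.7 GB peak reported on the RTX 3060.
    print(f"Headroom: {vram_headroom_gb(16.0, 11.7):.1f} GB")
```

Once you have timed a generation on your own 4060 Ti, pass the clip length and the measured time to `real_time_factor` to get a figure comparable with the README's.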
For the full benchmark data, see /check/acestep-1-5-xl/rtx-4060-ti-16gb.
Troubleshooting
Out of memory at default precision
The 16 GB envelope of the 4060 Ti 16GB should clear the cited 11.7 GB peak comfortably, but if you are also running a desktop session or other CUDA workloads, enable the launch-flag combination:
acestep --torch_compile true --cpu_offload true --overlapped_decode true
cpu_offload is the heaviest hitter — it streams transformer layers from RAM into VRAM on demand. Combined with overlapped_decode (which pipelines VAE decoding with diffusion) you hit the official 8 GB floor that the ACE-Step team measured on the RTX 4060 (the 4060 Ti's smaller sibling — same Ada Lovelace generation, half the VRAM) per HF discussion #4.
acestep command not found after pip install -e .
The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you're probably in a different env: re-activate with conda activate ace_step and verify with which acestep (or where acestep in a Windows Command Prompt). The GitHub repo README documents the entry point.
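The same two checks (is the right conda env active, and is the entry point on PATH) can be scripted. This sketch relies on `CONDA_DEFAULT_ENV`, which conda sets when an environment is activated, and on the standard library's `shutil.which`. `diagnose_entry_point` itself is a hypothetical helper, and the env name `ace_step` is the one created in the installation steps.

```python
import os
import shutil
from typing import List, Mapping, Optional


def diagnose_entry_point(
    command: str = "acestep",
    expected_env: str = "ace_step",
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return human-readable findings about the conda env and the entry point."""
    environ = os.environ if environ is None else environ
    findings = []

    # conda exports the active environment's name in CONDA_DEFAULT_ENV.
    active_env = environ.get("CONDA_DEFAULT_ENV")
    if active_env != expected_env:
        findings.append(
            f"Active conda env is {active_env!r}, expected {expected_env!r}; "
            f"run: conda activate {expected_env}"
        )

    # Look the command up on the PATH from the supplied environment.
    resolved = shutil.which(command, path=environ.get("PATH"))
    if resolved is None:
        findings.append(f"{command} not found on PATH")
    else:
        findings.append(f"{command} resolves to {resolved}")
    return findings


if __name__ == "__main__":
    for line in diagnose_entry_point():
        print(line)
```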
Generations sound unstructured past ~5 minutes
This is a documented limitation, not a bug. The model card calls it out under "Limitations" — the model loses long-range structural coherence beyond ~5 minutes. Either keep prompts inside that window or use the repaint/extend operations on shorter segments and stitch them.
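If you do plan to stitch a longer piece from shorter segments, it helps to work out the segment boundaries first. The sketch below only does that arithmetic and does not call ACE-Step or its repaint/extend operations. `plan_segments` is a hypothetical helper, and the 240-second default is an assumption chosen to stay inside the ~4 minute figure quoted in the introduction.

```python
from typing import List, Tuple


def plan_segments(
    total_seconds: float, max_segment_seconds: float = 240.0
) -> List[Tuple[float, float]]:
    """Split a target duration into consecutive (start, end) windows in seconds."""
    if total_seconds <= 0 or max_segment_seconds <= 0:
        raise ValueError("durations must be positive")
    total = float(total_seconds)
    segments = []
    start = 0.0
    while start < total:
        # The final segment is shortened so it ends exactly at the target.
        end = min(start + max_segment_seconds, total)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    for start, end in plan_segments(600):
        print(f"segment {start:.0f}s to {end:.0f}s")
```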
Vocals sound coarse / lyrics are mispronounced
This is the model's "Vocal Quality" limitation (per the HF card): synthesis is functional but lacks fine nuance, especially for low-resource languages outside the top 10. For polished vocals, the model's RapMachine LoRA fine-tune is one option; see the official GitHub repo for the LoRA loading documentation.
If you hit something not covered here, please report via the submission form so we can add it to the catalogue.