ACE-Step 1.5 XL on RTX 3060: Text-to-Music Generation via the 8 GB Optimization Path

tts · intermediate · 8GB+ VRAM · Jun 14, 2026

This intermediate recipe sets up ACE-Step 1.5 XL on the RTX 3060, needing about 8 GB of VRAM.

prerequisites
  • NVIDIA RTX 3060 (12GB VRAM) — the 12 GB GA106 variant, not the 8 GB cut-down
  • Python 3.10 (conda recommended)
  • ComfyUI installed (optional, for the node workflow)
  • ~9 GB free disk for model weights

What You'll Build

A working text-to-music pipeline that turns a text prompt + optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single RTX 3060 12 GB, either through the official Gradio app or as a ComfyUI custom node — using the model's documented memory-optimized launch path so it fits the 12 GB card comfortably.

Hardware data: RTX 3060 (12 GB VRAM) · text-to-music, lyric-aligned vocals, top-10 supported languages per HF model card · See benchmark data

⚠️ On a 12 GB card, lead with the optimization flags. At default precision ACE-Step peaks at 11.7 GB — and that figure was measured on a 12 GB RTX 3060 itself (community user akande, see Results). A 12 GB RTX 3060 with a display attached only exposes roughly 10.5–11.3 GB of usable VRAM, so the default path is right at the edge and can OOM once the desktop is using a slice of the card. This recipe leads with the --cpu_offload --torch_compile --overlapped_decode combination that the ACE-Step team documents as an 8 GB floor — that is the path it installs and runs. Default precision is kept as a "headroom permitting" note below.
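One way to see which side of that edge a given machine sits on is to read the card's current memory usage before launching. The sketch below parses one line of output from nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits and compares the free amount with the 11.7 GB peak quoted above. The helper names and the 0.2 GB safety margin are illustrative assumptions, not part of ACE-Step, and the MiB-to-GiB conversion treats the quoted GB figures loosely.

```python
import subprocess

DEFAULT_PEAK_GB = 11.7  # default-precision peak quoted in this recipe


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into GiB floats."""
    used_mib, total_mib = (float(part) for part in line.split(","))
    return used_mib / 1024, total_mib / 1024


def needs_offload(used_gb, total_gb, peak_gb=DEFAULT_PEAK_GB, margin_gb=0.2):
    """True when free VRAM cannot hold the default peak plus a small margin."""
    return (total_gb - used_gb) < (peak_gb + margin_gb)


def query_first_gpu():
    """Ask nvidia-smi for GPU 0's memory usage (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_line(out.splitlines()[0])
```

On the target machine, needs_offload(*query_first_gpu()) returning True suggests using the optimization flags; with a desktop session holding about 0.7 GB of a 12 GB card it returns True.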

ℹ️ Not a TTS model. ACE-Step generates music — instruments and lyric-aligned vocals — from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you are in the right place.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM with the optimization flags (the path this recipe uses) | RTX 3060 (12 GB)
RAM | 16 GB (CPU offload streams weights through system RAM) | -
Storage | ~8.3 GB for the 3.5B weights + DCAE VAE + vocoder + UMT5-base text encoder | 8.28 GB on disk
Software | Python 3.10, PyTorch with CUDA, ComfyUI (optional) | -

The four weight components on the HF Files tab total ~8.3 GB on disk: the ace_step_transformer diffusion model (6.61 GB), the music_dcae_f8c8 autoencoder (0.31 GB), the music_vocoder (0.21 GB), and the umt5-base text encoder (1.13 GB plus tokenizer files).
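As a quick arithmetic check, the four sizes above can be summed and compared with the free space on the download drive. This is a minimal sketch: the function names are hypothetical, and the 9 GB threshold simply mirrors the prerequisite listed earlier.

```python
import shutil

# Component sizes in GB, as listed in the paragraph above.
WEIGHT_SIZES_GB = {
    "ace_step_transformer": 6.61,
    "music_dcae_f8c8": 0.31,
    "music_vocoder": 0.21,
    "umt5-base": 1.13,
}


def total_weights_gb(sizes=WEIGHT_SIZES_GB):
    """Sum the component sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def has_room(path=".", required_gb=9.0):
    """True when the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


print(total_weights_gb())  # 8.26, before the small tokenizer files
```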

The RTX 3060 is an Ampere GA106 (sm_86) card — 3584 CUDA cores, 112 third-generation Tensor cores, 360 GB/s memory bandwidth (GDDR6 on a 192-bit bus at 15 Gbps), 12 GB on the GA106-300 die, PCIe Gen4 x16, 170 W TGP (TechPowerUp GPU spec database). Two facts shape this recipe: (1) its 12 GB envelope is right at the edge of the 11.7 GB default-precision peak once a display is attached, so the optimization flags are the lead path; and (2) Ampere sm_86 ships full FlashAttention-2 and stock kernel coverage in the default PyTorch wheel, so no special CUDA-12.8 wheel and no attention-implementation override is required. Note that Ampere sm_86 has no FP8 tensor cores (FP8 acceleration first shipped on Ada sm_89 / Hopper sm_90), but that is moot here — this recipe's escape hatch is CPU-offload, not an FP8 weight path, so no FP8 dequantization penalty applies.
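The 360 GB/s figure follows directly from the bus width and per-pin data rate quoted above. A one-function sketch of that arithmetic (the function name is illustrative):

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bus pins x Gbit/s per pin, over 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


print(memory_bandwidth_gbs(192, 15))  # 360.0
```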

Installation

1. Clone the repo and create the conda environment

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step

2. Install PyTorch (only if overriding system CUDA)

The default pip install torch already ships with sm_86 (Ampere) kernels — no special wheel selection is required for the RTX 3060. If you need to pin a CUDA version (e.g. on Windows, or to match an existing driver), the README's stock command targets the cu126 index:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Adjust cu126 to your CUDA version, or refer to the official PyTorch website. Unlike Blackwell GPUs (sm_120), the RTX 3060 needs no special wheel — the standard install already carries Ampere kernels and FlashAttention-2 works out of the box on sm_86.
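To confirm that the installed wheel actually sees the card as sm_86, a short check along these lines can help. It is a sketch that assumes GPU index 0 is the RTX 3060; the helper names are hypothetical, and the torch import is guarded so the script also runs in an environment without PyTorch.

```python
def is_sm86(capability):
    """True for compute capability 8.6, which is what the RTX 3060 reports."""
    return tuple(capability) == (8, 6)


def describe_gpu():
    """Return (name, capability) for GPU 0, or None without CUDA-enabled torch."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0)


info = describe_gpu()
if info is None:
    print("No CUDA-enabled PyTorch found in this environment")
else:
    name, cap = info
    print(f"{name}: sm_{cap[0]}{cap[1]}, matches RTX 3060: {is_sm86(cap)}")
```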

3. Install the package

pip install -e .

This installs the acestep command-line entry point along with diffusers, transformers, accelerate, and the project's audio dependencies. Weights for ACE-Step/ACE-Step-v1-3.5B are downloaded automatically from HuggingFace on first launch (to ~/.cache/ace-step/checkpoints unless you pass --checkpoint_path).

4. (Optional) ComfyUI custom node

If you would rather drive the model from a ComfyUI workflow:

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git

Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/ — the folder layout is documented in the custom node README (needs ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base subdirectories).
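Before starting ComfyUI it can be worth confirming that all four subdirectories are present and non-empty. The sketch below checks only the folder names listed above; the function name is hypothetical, and the exact files expected inside each folder are left to the custom node README.

```python
from pathlib import Path

REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_components(model_dir):
    """Return the required subdirectories that are absent or empty."""
    root = Path(model_dir)
    missing = []
    for name in REQUIRED_SUBDIRS:
        sub = root / name
        if not sub.is_dir() or not any(sub.iterdir()):
            missing.append(name)
    return missing
```

For example, missing_components("ComfyUI/models/TTS/ACE-Step-v1-3.5B") returns an empty list when the layout is complete.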

Running

Gradio app (the 12 GB launch — optimization flags on)

This is the path you want on a 12 GB RTX 3060. The three flags together are what the ACE-Step team documents as the 8 GB floor:

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

Per the official README, the 2025-05-10 Memory Optimization Update "Reduced Max VRAM to 8GB, making it more compatible with consumer devices" via exactly this launch combination. --cpu_offload offloads model weights to CPU to save GPU memory (the heaviest hitter — it streams transformer layers from system RAM into VRAM on demand), and --overlapped_decode uses overlapped decoding to speed up inference by pipelining VAE decode with diffusion. The RTX 3060's PCIe Gen4 x16 host link carries the CPU↔GPU offload traffic — the offloaded portion streams from system RAM rather than living resident in VRAM, so the path fits in 8 GB while system RAM holds the spilled weights.
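If you start the app from a script rather than a shell, the same command line can be assembled programmatically. This is a minimal sketch using only the flags shown above; build_launch_argv and launch are hypothetical helper names, and launch assumes the acestep entry point is on PATH.

```python
import subprocess


def build_launch_argv(optimized=True, port=7865):
    """Build the acestep command line used in this recipe."""
    argv = ["acestep"]
    if optimized:
        argv += [
            "--torch_compile", "true",
            "--cpu_offload", "true",
            "--overlapped_decode", "true",
        ]
    argv += ["--port", str(port)]
    return argv


def launch(optimized=True, port=7865):
    """Start the Gradio app as a child process and return the Popen handle."""
    return subprocess.Popen(build_launch_argv(optimized, port))
```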

Then open http://localhost:7865. In the Text2Music tab, enter descriptive tags (style, mood, instruments) and optional lyrics (use [verse], [chorus], [bridge] structure tags), set an audio duration, and generate. The app returns a downloadable audio file.
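The structure tags are plain bracketed lines, so lyrics can be assembled from sections rather than typed by hand. The helper below is an illustrative sketch: it accepts only the three tags named in this recipe (the model may accept others), and format_lyrics is a hypothetical name.

```python
ALLOWED_TAGS = {"verse", "chorus", "bridge"}  # only the tags named in this recipe


def format_lyrics(sections):
    """Join (tag, lines) pairs into bracket-tagged lyrics, one line per row."""
    blocks = []
    for tag, lines in sections:
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unsupported structure tag: {tag}")
        blocks.append("\n".join([f"[{tag}]", *lines]))
    return "\n".join(blocks)


lyrics = format_lyrics([
    ("verse", ["Neon city lights"]),
    ("chorus", ["We ride tonight"]),
])
print(lyrics)
```

The resulting string can be pasted into the lyrics box or passed as the lyrics argument of the Python API.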

Default precision (headroom permitting)

If the RTX 3060 is headless or you have closed other GPU workloads, you can run default half precision without the offload flags:

acestep --port 7865 --bf16 true

The catch is that the 11.7 GB peak (see Results) sits right at the edge of the 12 GB card's usable envelope: fine on a headless box, risky with a desktop session attached. When in doubt, use the optimized launch above.

Python API (programmatic)

ACE-Step is a music model: the HuggingFace card's pipeline_tag is text-to-audio, and the card ships no copy-pasteable code snippet. Do not reach for a generic diffusers image/text pipeline, and do not expect an .images[0] attribute or a .text2music() method; neither exists. The canonical entry point, per the repo's acestep/pipeline_ace_step.py, is the ACEStepPipeline class, which you instantiate and then call directly. The only method with "text2music" in its name, text2music_diffusion_process, is an internal diffusion step; elsewhere "text2music" is just the Gradio tab label and the default task value of __call__:

from acestep.pipeline_ace_step import ACEStepPipeline

pipe = ACEStepPipeline(
    checkpoint_dir=None,         # None → auto-download to ~/.cache/ace-step/checkpoints
    dtype="bfloat16",
    torch_compile=True,
    cpu_offload=True,            # lead path on a 12 GB card
    overlapped_decode=True,
)

pipe(
    audio_duration=60,
    prompt="upbeat synthwave, driving bass, retro 80s",
    lyrics="[verse]\nNeon city lights\n[chorus]\nWe ride tonight",
    infer_step=60,
    guidance_scale=15.0,
    save_path="output.wav",
)

The generated audio is written to save_path. See infer.py and the __call__ signature in pipeline_ace_step.py for the full argument list (scheduler type, CFG type, seeds, guidance interval, ERG flags).
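Since seeds are among the call arguments, a small sweep that renders the same prompt under several seeds is a natural next step. The sketch below only builds the keyword arguments, so it runs without the model. It assumes the seed argument is named manual_seeds and accepts a list of integers; verify that against the __call__ signature before relying on it. seed_sweep_kwargs is a hypothetical helper.

```python
def seed_sweep_kwargs(base_kwargs, seeds, stem="output"):
    """Return one kwargs dict per seed, each writing to its own WAV file."""
    runs = []
    for seed in seeds:
        kwargs = dict(base_kwargs)  # copy so the caller's dict is untouched
        # ASSUMPTION: argument name and list form; check pipeline_ace_step.py.
        kwargs["manual_seeds"] = [seed]
        kwargs["save_path"] = f"{stem}_seed{seed}.wav"
        runs.append(kwargs)
    return runs


base = {
    "audio_duration": 60,
    "prompt": "upbeat synthwave, driving bass, retro 80s",
    "infer_step": 60,
    "guidance_scale": 15.0,
}
for run in seed_sweep_kwargs(base, [1, 2, 3]):
    print(run["save_path"])  # with a loaded pipeline: pipe(**run)
```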

Results

  • Speed: No RTX 3060 benchmark is published yet (the live /check/acestep-1-5-xl/rtx-3060 verdict is unknown). The official ACE-Step GitHub README and HF card publish a per-device RTF (real-time factor) table that names the RTX 4090, RTX 3090, NVIDIA A100, and M2 Max, but not the RTX 3060, which has far fewer CUDA cores and under 40% of the 3090's memory bandwidth (360 GB/s vs 936 GB/s). The 3060 is the weakest card in the 12 GB tier, so we do not extrapolate a generation-time figure from the much faster cards in that table. If you measure generation time on an RTX 3060, please contribute it via the submission form and it will appear at /check/acestep-1-5-xl/rtx-3060.
  • VRAM usage: The official minimum is 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, confirmed by the ACE-Step team (org member xushengyuan, Shengyuan Xu, a paper author) in HF discussion #4: "The minimum VRAM requirement for full-length generation is now just 8 GB. We tested it on an RTX 4060, and it delivers decent performance beyond our expectations (1.16it/s)." That is the path this recipe installs, and it fits the RTX 3060's 12 GB with several GB to spare. For contrast, default half precision (no flags) peaks at 11.7 GB on the RTX 3060 itself, per community user akande in the same thread (May 2025): "For me it runs on my 3060 and consumes 11.7GB / 12GB vram. Maybe it runs in half precision out of the box? Because i don't use any arguments other then --port to start." That is a first-hand community measurement on the exact card this recipe targets. On a 12 GB RTX 3060 with a display attached (~10.5–11.3 GB usable), the 11.7 GB default peak sits right at the edge, which is why the optimization flags are the lead path here. See /check/acestep-1-5-xl/rtx-3060.
  • Quality notes: Performs best in the top 10 supported languages (17 are listed on the HF card); rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).
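If you do time a run for the submission form, a real-time factor is easy to compute alongside the raw seconds. The sketch below assumes RTF means seconds of audio produced per second of wall-clock time, so higher is faster; check that this matches the convention of the README table before comparing numbers. Both helper names are hypothetical.

```python
import time


def real_time_factor(audio_seconds, wall_seconds):
    """Audio seconds generated per wall-clock second (higher is faster)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

With a loaded pipeline, something like _, elapsed = timed(pipe, audio_duration=60, prompt="...", save_path="out.wav") followed by real_time_factor(60, elapsed) gives one data point. The first run with torch_compile enabled includes compilation time, so it is worth reporting a later run as well.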

For the full benchmark data, see /check/acestep-1-5-xl/rtx-3060.

Troubleshooting

Out of memory at 12 GB without the flags

The RTX 3060's 12 GB does not clear the cited 11.7 GB default-precision peak with comfortable margin once a desktop session is using a slice of VRAM — and that 11.7 GB figure was measured on a 3060 specifically (see Results), so this is the expected behaviour on this exact card, not an extrapolation. Launch with the optimization combination:

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

cpu_offload is the heaviest hitter — it streams transformer layers from RAM into VRAM on demand. Combined with overlapped_decode (which pipelines VAE decoding with diffusion) you reach the official 8 GB floor that the ACE-Step team measured on an RTX 4060 per HF discussion #4. Because cpu_offload moves weights through system RAM over the 3060's PCIe Gen4 link, make sure you have at least 16 GB of RAM free; the offloaded streaming is the trade for fitting the smaller envelope. On Windows, --torch_compile additionally requires pip install triton-windows, per the GitHub README.
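On Linux, the amount of RAM actually available can be read from /proc/meminfo before launching with cpu_offload. This is a Linux-only sketch with hypothetical function names; the 16 GB threshold is the figure from the paragraph above.

```python
def mem_available_gb(meminfo_text):
    """Extract MemAvailable from /proc/meminfo text and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 ** 2)
    raise ValueError("MemAvailable not found")


def enough_ram(meminfo_text, required_gb=16.0):
    """True when available system RAM meets the recipe's threshold."""
    return mem_available_gb(meminfo_text) >= required_gb


sample = "MemTotal:       32768000 kB\nMemAvailable:   16777216 kB\n"
print(mem_available_gb(sample))  # 16.0
```

On the target machine, pass the real file contents instead of the sample: enough_ram(open("/proc/meminfo").read()).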

The HuggingFace Quick-start snippet does not work

ACE-Step's HF card has pipeline_tag: text-to-audio and ships no usable inference snippet — if you paste a generic diffusers pipeline(...) call (or an .images[0]-style snippet from a similarly-templated card) it will not produce music. Use the ACEStepPipeline class from the repo's pipeline_ace_step.py (shown in the Python API section above): instantiate ACEStepPipeline(...) and call it directly with prompt / lyrics / audio_duration / save_path. There is no .text2music() method and no .images[0] attribute. The Gradio CLI (acestep) is the simpler path if you do not need the library API.

acestep command not found after pip install -e .

The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you are probably in a different env — re-activate with conda activate ace_step and verify with which acestep. The GitHub repo README documents the entry point and command-line arguments.
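The same check can be made from Python, which also works on Windows where the which command is not available. A minimal sketch using the standard library (locate_entry_point is a hypothetical name):

```python
import shutil
import sys


def locate_entry_point(name="acestep"):
    """Return the resolved path of a console script on PATH, or None."""
    return shutil.which(name)


path = locate_entry_point()
if path is None:
    print(f"acestep not on PATH; current interpreter is {sys.executable}")
else:
    print(f"acestep found at {path}")
```

If the interpreter path printed does not point inside the ace_step environment, the wrong environment is active.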

Generations sound unstructured past ~5 minutes

This is a documented limitation, not a bug. The model card calls it out under "Limitations": the model loses long-range structural coherence beyond ~5 minutes. Either keep the requested audio duration inside that window, or use the repaint/extend operations on shorter segments and stitch them.
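For the stitching step, the simplest approach is to append the segment files end to end. The sketch below does that with Python's standard wave module. It assumes the segments are PCM WAV files with identical channel count, sample width, and sample rate, and it does no crossfading, so seams may be audible. concat_wavs is a hypothetical helper, not an ACE-Step function.

```python
import wave


def concat_wavs(paths, out_path):
    """Append PCM WAV files end to end; all inputs must share one format."""
    params = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has format {fmt}, expected {params}")
            chunks.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(params[0])
        dst.setsampwidth(params[1])
        dst.setframerate(params[2])
        for chunk in chunks:
            dst.writeframes(chunk)
    return out_path
```

For example, concat_wavs(["part1.wav", "part2.wav"], "song.wav") writes the two segments back to back.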

If you hit something not covered here, please report via the submission form so we can add it to the catalogue.

common questions
How much VRAM does ACE-Step 1.5 XL need?

About 8 GB — the minimum this recipe targets.

Which GPUs is ACE-Step 1.5 XL tested on?

RTX 3060 (12 GB).

How hard is this setup?

Intermediate — follow the steps above.