
ACE-Step 1.5 XL on RTX 4070 Ti Super: Text-to-Music Generation in ComfyUI

tts · intermediate · 12GB+ VRAM · Jun 1, 2026
prerequisites
  • NVIDIA RTX 4070 Ti Super (16GB VRAM) or any 12GB+ consumer card; 8 GB possible with optimization flags
  • Python 3.10 (conda recommended)
  • ComfyUI installed (optional, for the node workflow)
  • ~8 GB free disk for model weights

What You'll Build

A working text-to-music pipeline that turns a text prompt + optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single RTX 4070 Ti Super, either through the official Gradio app or as a ComfyUI custom node.

Hardware data: RTX 4070 Ti Super (16GB VRAM) · text-to-music, lyric-aligned vocals, top-10 supported languages per HF model card · See benchmark data

ℹ️ Not a TTS model. ACE-Step generates music — instruments and lyric-aligned vocals — from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you're in the right place.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM with optimization flags, 12 GB at default precision | RTX 4070 Ti Super (16 GB)
RAM | 16 GB |
Storage | ~8 GB for the 3.5B transformer + DCAE + vocoder + UMT5-base text encoder |
Software | Python 3.10, PyTorch with CUDA, ComfyUI (optional) |

The four weight components on the HF Files tab total ~8.3 GB on disk: the ace_step_transformer diffusion model (6.61 GB), the music_dcae_f8c8 autoencoder (0.31 GB), the music_vocoder (0.21 GB), and the umt5-base text encoder (1.13 GB).
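As a quick sanity check on the disk budget, the sketch below sums the four component sizes quoted above and compares the total against free space in the home directory. The 1 GB safety margin and the use of decimal gigabytes are assumptions made for illustration, not figures from the model card.

```python
import shutil
from pathlib import Path

# Component sizes in GB, copied from the figures quoted above.
WEIGHTS_GB = {
    "ace_step_transformer": 6.61,
    "music_dcae_f8c8": 0.31,
    "music_vocoder": 0.21,
    "umt5-base": 1.13,
}


def total_weights_gb(sizes=WEIGHTS_GB):
    """Sum the per-component sizes (GB)."""
    return sum(sizes.values())


def has_room(free_bytes, needed_gb, margin_gb=1.0):
    """True if free space covers the weights plus a margin (1 GB is an assumption)."""
    return free_bytes >= (needed_gb + margin_gb) * 1e9


if __name__ == "__main__":
    needed = total_weights_gb()
    free = shutil.disk_usage(Path.home()).free
    print(f"weights: {needed:.2f} GB, free: {free / 1e9:.1f} GB, "
          f"enough room: {has_room(free, needed)}")
```

The sum comes to 8.26 GB, which is where the ~8.3 GB figure comes from.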

Installation

1. Clone the repo and create the conda environment

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step

2. Install PyTorch (Windows GPU users only — Linux can skip)

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

On Linux, the default pip install torch wheel is CUDA-enabled and already ships with sm_89 (Ada Lovelace) kernels, so no special wheel selection is required for the RTX 4070 Ti Super. On Windows, the default PyPI wheel is CPU-only, so the cu126 index above is what pulls in a CUDA build.
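To confirm which build actually landed in the environment, a small check like the following can help. It is a minimal sketch: arch_tag and report_gpu are illustrative helper names, and the script only reports what the installed PyTorch sees. An Ada Lovelace card such as the RTX 4070 Ti Super should report compute capability 8.9, that is, sm_89.

```python
def arch_tag(capability):
    """Format a (major, minor) compute capability as an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report_gpu():
    """Print what the installed PyTorch build sees; return the arch tag or None."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return None
    if not torch.cuda.is_available():
        print(f"torch {torch.__version__}: CUDA not available "
              "(CPU-only wheel or driver problem)")
        return None
    tag = arch_tag(torch.cuda.get_device_capability(0))
    print(f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
          f"{torch.cuda.get_device_name(0)} ({tag})")
    return tag


if __name__ == "__main__":
    report_gpu()
```

If the script reports that CUDA is not available on Windows, reinstalling from the cu126 index in step 2 is the first thing to try.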

3. Install the package

pip install -e .

This installs the acestep command-line entry point along with diffusers, transformers, accelerate, and the project's audio dependencies. Weights for ACE-Step/ACE-Step-v1-3.5B download automatically from HuggingFace on first launch (to ~/.cache/ace-step/checkpoints by default).

4. (Optional) ComfyUI custom node

If you would rather drive the model from a ComfyUI workflow:

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git

Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/. Per the custom node README the folder must contain four subdirectories: ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base.
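A quick way to verify the layout before starting ComfyUI is to check for the four subdirectories. This is a sketch that assumes it is run from the directory containing ComfyUI/; missing_components is an illustrative helper, not part of the custom node.

```python
from pathlib import Path

# The four subdirectories named in the custom node README.
REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_components(model_dir):
    """Return the required subdirectory names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Relative path from the text above; adjust if ComfyUI lives elsewhere.
    model_dir = Path("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    missing = missing_components(model_dir)
    if missing:
        print(f"missing under {model_dir}: {', '.join(missing)}")
    else:
        print("all four components present")
```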

Running

Gradio app (official)

acestep --port 7865 --bf16 true

Then open http://localhost:7865. In the Text2Music tab, enter descriptive tags (style, mood, instruments), optional lyrics with structure markers like [verse] / [chorus], set the audio duration, and click Generate. The app returns a downloadable audio file. On a 16 GB RTX 4070 Ti Super the default bf16 path runs comfortably without any memory flags.
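To make the two text inputs concrete, the sketch below assembles a tag string and a lyrics block with [verse] and [chorus] markers, ready to paste into the Text2Music tab. The tags, the lyric lines, and the blank line between sections are illustrative assumptions; build_lyrics is a hypothetical helper, not part of ACE-Step.

```python
def build_lyrics(sections):
    """Join (marker, lines) pairs into lyric text with [marker] headers."""
    blocks = []
    for marker, lines in sections:
        blocks.append(f"[{marker}]\n" + "\n".join(lines))
    # Blank line between sections is a layout assumption for readability.
    return "\n\n".join(blocks)


# Illustrative inputs only: style, mood, and instruments as comma-separated tags.
tags = "synthwave, nostalgic, analog synths, female vocals"
lyrics = build_lyrics([
    ("verse", ["City lights are fading slow", "I keep driving, nowhere to go"]),
    ("chorus", ["Hold on to the night", "Hold on to the night"]),
])

if __name__ == "__main__":
    print("Tags:", tags)
    print(lyrics)
```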

Memory-optimized launch (free headroom for other workloads)

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

These three flags together are what bring resident VRAM down to the official 8 GB floor, which is useful if the RTX 4070 Ti Super is also driving a display, running a browser, or sharing the card with other inference workloads. On a 16 GB card you do not need them for a single generation, but they cut peak resident VRAM substantially. Each flag is documented under Command Line Arguments in the GitHub README: --torch_compile optimizes the model with torch.compile(), --cpu_offload offloads model weights to CPU to save GPU memory, and --overlapped_decode uses overlapped decoding to speed up inference. The RTX 4070 Ti Super's PCIe Gen4 x16 host link should keep CPU↔GPU offload streaming relatively cheap, but offloading still adds transfer overhead, and we have not measured its cost on this card.
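The choice between the two launch commands can be sketched as a small decision on free VRAM. The thresholds come from the figures cited in this recipe (the ~11.7 GB community-reported default peak and the official 8 GB floor); the 1 GB headroom and the helper names are assumptions for illustration. torch.cuda.mem_get_info() is used only if PyTorch and a CUDA device are present.

```python
DEFAULT_PEAK_GB = 11.7  # community-reported peak with no flags beyond --port
HEADROOM_GB = 1.0       # assumed safety margin, not a measured value

MEMORY_FLAGS = [
    "--torch_compile", "true",
    "--cpu_offload", "true",
    "--overlapped_decode", "true",
]


def launch_args(free_vram_gb, port=7865):
    """Pick the default or memory-optimized acestep command for the free VRAM."""
    args = ["acestep", "--port", str(port)]
    if free_vram_gb >= DEFAULT_PEAK_GB + HEADROOM_GB:
        return args + ["--bf16", "true"]
    # Below the default peak: fall back to the three memory flags.
    # The official floor for this path is 8 GB; less than that is untested here.
    return args + MEMORY_FLAGS


def detect_free_vram_gb():
    """Return free VRAM in GB on the current CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


if __name__ == "__main__":
    free = detect_free_vram_gb()
    if free is None:
        print("No CUDA device visible to PyTorch; cannot pick flags automatically")
    else:
        print(" ".join(launch_args(free)))
```

The returned list can be passed directly to subprocess.run if you want to start the app from a script.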

Library / programmatic use

To integrate ACE-Step into your own Python project, install it directly from the repo and follow the inference scripts in the source tree (the README documents pip install git+https://github.com/ace-step/ACE-Step.git). The repository — not the auto-generated HF Hub snippet — is the source of truth for the call signature; see Troubleshooting below.

Results

  • Speed: No RTX 4070 Ti Super benchmark is published yet (the live /check/acestep-1-5-xl/rtx-4070-ti-super verdict is unknown). The official ACE-Step GitHub README publishes a per-device throughput table that names the RTX 4090 (RTF 34.48×, 1.74 s to render 1 minute of audio at 27 steps) and the RTX 3090 (RTF 12.76×, 4.70 s), but not the RTX 4070 Ti Super, which has fewer CUDA cores and lower memory bandwidth than either. Because the 4070 Ti Super is not a close compute-sibling of the 4090 (24 GB, substantially higher FP16 throughput) or of the 3090 (older Ampere architecture), we do not extrapolate a 4070 Ti Super figure from them. If you measure generation time on a 4070 Ti Super, please contribute it via the submission form and it will appear at /check/acestep-1-5-xl/rtx-4070-ti-super.
  • VRAM usage: At default half precision (no flags beyond --port), community user akande reported an ~11.7 GB peak on a 12 GB RTX 3060 in HF discussion #4 (May 2025): "For me it runs on my 3060 and consumes 11.7GB / 12GB vram. Maybe it runs in half precision out of the box? Because i don't use any arguments other then --port to start." The RTX 4070 Ti Super's 16 GB envelope clears that peak with roughly 4 GB of headroom. The official minimum drops to 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, as confirmed in the same thread by xushengyuan (Shengyuan Xu, an ACE-Step team member and paper author): "The minimum VRAM requirement for full-length generation is now just 8 GB. We tested it on an RTX 4060, and it delivers decent performance beyond our expectations (1.16it/s)."
  • Quality notes: Performs best in the top 10 supported languages (19 are supported in total per the GitHub README); rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).
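The relationship between the published RTF values and render times, and the VRAM headroom figure, can be checked with a few lines of arithmetic. This sketch assumes RTF means seconds of audio produced per second of compute, which is consistent with the README pairs quoted above; it only reproduces the cited numbers and does not estimate anything for the 4070 Ti Super.

```python
def seconds_per_audio_minute(rtf):
    """Wall-clock seconds needed to render 60 s of audio at a given RTF."""
    return 60.0 / rtf


# RTF values at 27 steps, as quoted from the official README.
PUBLISHED_RTF = {"RTX 4090": 34.48, "RTX 3090": 12.76}

# 16 GB card minus the ~11.7 GB community-reported default-precision peak.
headroom_gb = 16.0 - 11.7

if __name__ == "__main__":
    for gpu, rtf in PUBLISHED_RTF.items():
        print(f"{gpu}: {seconds_per_audio_minute(rtf):.2f} s per minute of audio")
    print(f"VRAM headroom on a 16 GB card: {headroom_gb:.1f} GB")
```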

For the full benchmark data, see /check/acestep-1-5-xl/rtx-4070-ti-super.

Troubleshooting

Out of memory while sharing the card with other workloads

The 16 GB envelope of the RTX 4070 Ti Super clears the cited ~11.7 GB default-precision peak comfortably for a standalone run, but if you are also running a desktop session or other CUDA workloads, enable the launch-flag combination:

acestep --torch_compile true --cpu_offload true --overlapped_decode true

cpu_offload is likely the heaviest hitter: it keeps model weights in system RAM and moves them into VRAM on demand. Combined with torch_compile and overlapped_decode, it reaches the official 8 GB floor that the ACE-Step team reported for the RTX 4060 in HF discussion #4. On Windows, torch_compile additionally requires pip install triton-windows, per the GitHub README.

HF Quick Start snippet doesn't match the real API

ACE-Step is a music-generation model, but the HuggingFace Hub auto-generates a generic diffusers snippet from the pipeline tag — it does not reflect a runnable music-generation call. The authoritative inference entry point is the acestep command-line / Gradio app (the repository's setup.py registers acestep = acestep.gui:main as the console script), or the ComfyUI custom node. There is no .text2music() Python method to call; for programmatic use, follow the inference code in the official GitHub repository rather than copy-pasting the Hub snippet.

acestep command not found after pip install -e .

The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you're probably in a different env: re-activate with conda activate ace_step and verify with which acestep (where acestep on Windows). The GitHub repo README documents the entry point and command-line arguments.
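The same check can be done from Python on any platform, which also shows which interpreter and conda environment are active. This is a minimal sketch; find_cli is an illustrative helper, and CONDA_DEFAULT_ENV is only set when a conda environment has been activated.

```python
import os
import shutil
import sys


def find_cli(name="acestep"):
    """Return the full path of a command on PATH, or None if it is not found."""
    return shutil.which(name)


if __name__ == "__main__":
    print("python prefix:", sys.prefix)
    print("conda env:", os.environ.get("CONDA_DEFAULT_ENV", "(not set)"))
    path = find_cli()
    if path:
        print("acestep found at:", path)
    else:
        print("acestep is not on PATH; try `conda activate ace_step`")
```

If the reported conda env is not ace_step, the entry point was most likely installed into a different environment.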

Generations sound unstructured past ~5 minutes

This is a documented limitation, not a bug. The model card calls it out under "Limitations": the model loses long-range structural coherence beyond ~5 minutes. Either keep the requested audio duration inside that window or use the repaint/extend operations on shorter segments and stitch them.

If you hit something not covered here, please report via the submission form so we can add it to the catalogue.