ACE-Step is an open-source music generation model combining speed, coherence, and control, generating 4-minute tracks in 20 seconds with fine-grained control.
ACE-Step is a next-generation open-source foundation model for music generation, built to overcome the trade-offs among speed, coherence, and controllability in current approaches. LLM-based models align lyrics accurately but infer slowly, while diffusion models synthesize quickly but struggle with long-range musical structure; ACE-Step aims to offer the best of both worlds.
Designed for music creators, AI developers, and content producers, ACE-Step integrates diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer to deliver fast, musically coherent results. During training it leverages MERT and m-hubert for semantic representation alignment (REPA), improving the alignment of lyrics, melody, and rhythm.
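To make the architecture concrete, the sketch below shows the core idea in minimal PyTorch: audio is compressed into a latent sequence by an autoencoder (DCAE in ACE-Step's case), and a linear-attention transformer denoises those latents. Everything here (class names, dimensions, the elu-kernel attention variant, and the crude timestep conditioning) is an illustrative assumption, not ACE-Step's actual implementation.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Kernelized linear attention: O(n) in sequence length via the
    phi(Q) (phi(K)^T V) factorization, with phi = elu + 1."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        # positive feature map so the kernel trick is valid
        q, k = nn.functional.elu(q) + 1, nn.functional.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k, v)            # summarize keys/values once
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)   # per-query readout, O(n)
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))

class LatentDenoiser(nn.Module):
    """Transformer that predicts the denoising target for compressed audio latents."""
    def __init__(self, latent_dim: int = 64, dim: int = 256, depth: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, dim)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), LinearAttention(dim)) for _ in range(depth)
        )
        self.proj_out = nn.Linear(dim, latent_dim)

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        h = self.proj_in(z_t) + t[:, None, None]  # crude additive timestep conditioning
        for block in self.blocks:
            h = h + block(h)                      # pre-norm residual blocks
        return self.proj_out(h)

model = LatentDenoiser()
z_t = torch.randn(2, 512, 64)   # (batch, latent frames, latent channels)
t = torch.rand(2)               # diffusion timesteps in [0, 1)
print(model(z_t, t).shape)      # torch.Size([2, 512, 64])
```

The O(n) attention is the point of the design: several minutes of audio, even after heavy latent compression, is a long sequence, and quadratic softmax attention over it would dominate inference cost.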
The model generates up to 4 minutes of high-quality music in just 20 seconds on an A100 GPU, roughly 15× faster than LLM-based baselines, while preserving acoustic fidelity and offering fine-grained control. Users can perform tasks such as voice cloning, lyric editing, track remixing, and singing-to-accompaniment generation, making it a strong base for creative workflows.
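As a usage illustration, invoking a text-and-lyrics-to-music pipeline might look like the following. The import path, class name, and every parameter here are assumptions made for the example, not ACE-Step's verified API; consult the repository for the real entry point.

```python
# Hypothetical invocation sketch: the module path, class name, and all
# parameters below are assumed for illustration and may not match the
# repository's actual interface.
from acestep.pipeline_ace_step import ACEStepPipeline  # assumed module path

pipeline = ACEStepPipeline(checkpoint_dir="./checkpoints")  # assumed argument

audio = pipeline(
    prompt="melancholic piano ballad, slow tempo, female vocals",  # style tags
    lyrics="[verse]\nCity lights are fading out...",               # structured lyrics
    audio_duration=240.0,  # seconds; up to ~4 minutes per the description above
)  # assumed to return (or save) the generated waveform
```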
Rather than being a fixed pipeline, ACE-Step is positioned as a flexible, general-purpose architecture for building sub-models and creative tools. It sets a new benchmark for scalable, expressive music AI, marking a potential “Stable Diffusion moment” for audio generation.
Kimi-Audio: A universal open-source audio foundation model handling ASR, AQA, AAC, and more. Pre-trained on 13M hours of audio for SOTA performance. Features a hybrid architecture and low-latency inference.