AI Music Generation

ACE-Step

1.5 XL

Complete Manual

State-of-the-Art

Open Source

48kHz Stereo

TECHNICAL DOCUMENTATION | 2026 Edition | XL Model Update

Navigation

Contents

01 Overview & Architecture: Decoupled LM + DiT + VAE system with XL enhancements
02 Core Generation Parameters: Prompts, lyrics, duration, steps, guidance
03 Advanced Parameters: XL model variants and developer settings
04 Developer Advice: XL inference strategy and batch generation
05 Tips & Best Practices: XL tag crafting, lyrics, multilingual input
06 Demo Prompts: Electronic, rock, ambient, C-pop, K-pop examples
07 Quick Reference: Command cheat sheet and resources
08 What's New in XL: Migration guide from Base to XL

Chapter 01

Overview & Architecture

Decoupled system design for maximum flexibility with XL enhancements

ACE-Step 1.5 XL Architecture

Enhanced components for superior quality and extended duration

Language Model (LM): Composer Agent based on Qwen3
Handles prompt understanding, structure planning, and lyric formatting with improved comprehension

Diffusion Transformer (DiT): Acoustic Renderer, XL Enhanced
Generates high-fidelity 48kHz stereo audio with enhanced quality, coherence, and up to 360s duration

1D VAE: Pure Waveform-Domain Autoencoder
Operates directly on the waveform rather than on mel spectrograms, for higher audio quality and faithful waveform reconstruction

Text-to-Music: Up to 6 minutes (360s), extended from 240s
Audio-to-Audio: Style transfer, remixing with XL precision
Cover Generation: Recreate existing songs with XL quality
Audio Repainting: Localized editing with better coherence
Track Extraction: Separate stems with improved separation
Vocal-to-BGM: Convert vocals to instrumental
LoRA Fine-tuning: Custom model training
Enhanced Coherence: Better structure retention for long-form
Improved Vocal Quality: Clearer articulation and pitch stability

Rivals Commercial Alternatives

Open-source with professional-grade output quality

ACE-Step 1.5 XL delivers quality comparable to:

SUNO | Udio | HeartMuLa

Open Source + Professional Quality

Full control, no subscriptions, local deployment

Chapter 02

Core Generation Parameters

Essential settings for music generation control

Core Generation Parameters

Essential settings for controlling XL music generation

Parameter | Description | Default | Range/Options
prompt / tags | Style description (comma-separated) | Required | e.g., "electronic, 120bpm, female voice"
lyrics | Structured lyrics with tags | Optional | See Lyric Structure section
duration | Output length in seconds | 60 | 10-360 seconds (XL)
steps / num_steps | Inference steps | 35-60 | 8 (turbo) to 150
seed | Random seed for reproducibility | Random | -1 for random, or specific integer
scheduler | Sampling scheduler | euler | euler, heun, er_sde
guidance_type | CFG type | apg | cfg, apg, cfg_star
guidance_scale / cfg | Main guidance strength | 3.5-12 | 1-15 (lower = more creative)
tag_guidance_scale | Style adherence strength | 5 | 1-15
lyric_guidance_scale | Lyric adherence strength | 1.5 | 0.1-10
guidance_interval | When to apply guidance | 0.5 | 0.0-1.0
granularity_scale | Reduces artifacts | 10 | Integer
instrumental | Generate without vocals | false | Boolean

Pro Tip: Lower guidance_scale values give more creative freedom, higher values enforce stricter adherence to prompts. For XL, 3.5-4.5 is the sweet spot (lower than base model).
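As a rough illustration of how the ranges in the table above could be checked before a generation request is sent, the sketch below defines a hypothetical `CoreParams` dataclass. The class, its `validate` method, and the concrete defaults chosen for `steps` and `guidance_scale` are assumptions made for this example and are not part of the ACE-Step API; only the parameter names and ranges come from the table.

```python
from dataclasses import dataclass

# Hypothetical container: only the parameter names and ranges come from the
# table above. This is not the ACE-Step API.
@dataclass
class CoreParams:
    prompt: str
    lyrics: str = ""
    duration: int = 60            # seconds, 10-360 on XL
    steps: int = 50               # assumed value inside the 35-60 default band
    seed: int = -1                # -1 means random
    scheduler: str = "euler"
    guidance_type: str = "apg"
    guidance_scale: float = 4.0   # assumed value inside the 3.5-4.5 XL sweet spot
    instrumental: bool = False

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the settings are in range."""
        problems = []
        if not self.prompt.strip():
            problems.append("prompt is required")
        if not 10 <= self.duration <= 360:
            problems.append("duration must be 10-360 seconds")
        if not 8 <= self.steps <= 150:
            problems.append("steps must be 8-150")
        if self.scheduler not in {"euler", "heun", "er_sde"}:
            problems.append("unknown scheduler")
        if self.guidance_type not in {"cfg", "apg", "cfg_star"}:
            problems.append("unknown guidance_type")
        if not 1 <= self.guidance_scale <= 15:
            problems.append("guidance_scale must be 1-15")
        return problems
```

Calling `CoreParams(prompt="electronic, 120bpm, female voice").validate()` returns an empty list, while an empty prompt or an out-of-range value adds one message per problem.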

Chapter 03

Advanced Parameters & Model Variants

Developer settings and model selection guide

XL Model Variants

Choose the right model for your use case

acestep-v15-xl-base

50-60 STEPS

Full features including extract, repaint, Lego. Enhanced long-form coherence.

Best For: Maximum flexibility with XL quality | VRAM: 24GB recommended

acestep-v15-xl-sft

50-60 STEPS

Highest quality, no editing features. Superior vocal and instrumental clarity.

Best For: Production-quality music generation | VRAM: 24GB recommended

acestep-v15-xl-turbo

12-16 STEPS

Fast iteration with XL architecture. Reduced CFG needed (3.0-5.0).

Best For: Quick experiments with XL benefits | VRAM: 20GB minimum

acestep-v15-xl-turbo-rl

COMING SOON

RL-optimized XL turbo variant. Best speed/quality ratio for XL.

Best For: Production speed with XL quality
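The variant notes above can be condensed into a small selection helper. This is only a sketch: the function name `pick_xl_variant`, its arguments, and the precedence of the rules (editing needs first, then speed or VRAM, then quality) are assumptions for illustration. The model names, step ranges, and the 24GB figure are taken from the variant descriptions; the unreleased turbo-rl variant is left out.

```python
# Step ranges as listed for each released XL variant above.
XL_VARIANT_STEPS = {
    "acestep-v15-xl-base": (50, 60),
    "acestep-v15-xl-sft": (50, 60),
    "acestep-v15-xl-turbo": (12, 16),
}

def pick_xl_variant(vram_gb: float, need_editing: bool = False,
                    fast_iteration: bool = False) -> tuple:
    """Return (model_name, (min_steps, max_steps)) for a use case.

    Assumed precedence: editing features win, then speed or limited VRAM,
    then the highest-quality variant.
    """
    if need_editing:
        name = "acestep-v15-xl-base"      # extract, repaint, Lego
    elif fast_iteration or vram_gb < 24:
        name = "acestep-v15-xl-turbo"     # fewer steps, lower VRAM floor
    else:
        name = "acestep-v15-xl-sft"       # production-quality generation
    return name, XL_VARIANT_STEPS[name]
```

For example, `pick_xl_variant(24)` selects the sft variant with a 50-60 step range, and `pick_xl_variant(20)` falls back to turbo.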

XL-Specific Advanced Parameters

Fine-tune XL performance and memory usage

Parameter | Description | XL Default | Range/Options
xl_attention_mode | Attention optimization | flash | flash, standard, memory_efficient
xl_cascade_stages | Multi-resolution generation | 3 | 2-4
xl_coherence_weight | Long-form structure strength | 1.2 | 0.5-2.0
bf16 | Use bfloat16 for faster inference | true | Boolean
torch_compile | Optimize with torch.compile() | false | Boolean
cpu_offload | Offload to CPU to save VRAM | false | Boolean
overlapped_decode | Speed up inference | false | Boolean
device_id | GPU device ID | 0 | Integer
shift | Dynamic shift for distillation | 4-5 | 3 (turbo) to 6
denoise | Audio-to-audio strength | 0.5 | 0.25-1.0

Chapter 04

Developer Advice

Optimize XL generation quality with proven techniques

The XL Golden Balance

CFG vs Steps Relationship (XL-Optimized)

Key Principles

CFG 3.5-4.5 is the XL sweet spot (Lower than base)

Going above 5.5 introduces artifacts (More sensitive than base)

Compensate low CFG with higher steps (60-80) for refinement

XL responds better to subtle guidance than base

Shift Parameter

shift: 4.5 # Sweet spot for XL compositional structure
shift: 5.0 # For complex arrangements (orchestral, jazz)
shift: 4.0 # For simpler structures (pop, electronic)

Sampler Recommendations (XL-Tested)

Recommended: Sampler: er_sde, Scheduler: linear_quadratic

Alternative: euler with shift: 5.0

Avoid: DPM++ family, uni_pc - These produce more artifacts in XL than base

Pro Tip: Use shift: 4.5-5.0 for better compositional structure. Shift affects arrangement coherence, not just sound quality.
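The key principles on this slide can be expressed as a quick sanity check on a CFG/steps pair. The sketch below is illustrative only: `check_golden_balance` and `SHIFT_BY_STRUCTURE` are hypothetical names, and the reading of "low CFG" as anything below 4.0 is an assumption, since the slide does not give a threshold. The 3.5-4.5 sweet spot, the 5.5 artifact limit, the 60-80 step range, and the shift values come from the text above.

```python
# Shift values quoted in the Shift Parameter notes above.
SHIFT_BY_STRUCTURE = {
    "simple": 4.0,    # pop, electronic
    "default": 4.5,   # general XL sweet spot
    "complex": 5.0,   # orchestral, jazz
}

def check_golden_balance(cfg: float, steps: int) -> list[str]:
    """Return advisory notes for a CFG/steps pair; an empty list means no concerns."""
    notes = []
    if cfg > 5.5:
        notes.append("CFG above 5.5 tends to introduce artifacts on XL")
    elif not 3.5 <= cfg <= 4.5:
        notes.append("CFG is outside the 3.5-4.5 XL sweet spot")
    # Assumption: "low CFG" is read here as anything below 4.0.
    if cfg < 4.0 and steps < 60:
        notes.append("low CFG: consider 60-80 steps for refinement")
    return notes
```

A pair such as CFG 4.0 with 55 steps produces no notes, while CFG 3.5 with 50 steps produces only the suggestion to raise the step count.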

XL Duration Recommendations & Batch Strategy

Optimize for XL's extended capabilities

Duration | Quality | Use Case | CFG | Steps
60-90s | Excellent | Short tracks, loops | 4.5 | 45
90-180s | Optimal | Full songs | 4.0 | 50-60
180-300s | Very Good | Extended tracks | 3.5-4.0 | 60-80
300-360s | Good* | Epic/ambient | 3.5 | 70-80

XL-Specific Warnings: Avoid CFG > 6.0 (harsh artifacts appear faster). Avoid Steps < 30 (unless using turbo model). Avoid very short durations < 45s (XL is optimized for longer form).
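The duration table above works as a lookup from target length to suggested CFG and steps. The following sketch encodes it that way. The function name `settings_for_duration` and the treatment of the shared boundaries (90s, 180s, and 300s are assigned to the longer bracket) are assumptions; the values themselves are copied from the table.

```python
# (min_s, max_s, quality, (cfg_low, cfg_high), (steps_low, steps_high))
DURATION_TABLE = [
    (60, 90, "Excellent", (4.5, 4.5), (45, 45)),
    (90, 180, "Optimal", (4.0, 4.0), (50, 60)),
    (180, 300, "Very Good", (3.5, 4.0), (60, 80)),
    (300, 360, "Good", (3.5, 3.5), (70, 80)),
]

def settings_for_duration(seconds: int) -> dict:
    """Look up the table row for a target duration in seconds.

    Assumption: each bracket is half-open, so a boundary value such as 90
    belongs to the longer bracket; 360 is included in the last row.
    """
    for low, high, quality, cfg, steps in DURATION_TABLE:
        if low <= seconds < high or (high == 360 and seconds == 360):
            return {"quality": quality, "cfg": cfg, "steps": steps}
    raise ValueError("duration outside the 60-360s range covered by the table")
```

For a 150s track this returns the "Optimal" row with CFG 4.0 and 50-60 steps. Durations under 60s raise an error because the table does not cover them.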

Recommended XL Workflow

batch_size: 6-10 # Lower than base due to VRAM
duration: 150-180s # XL's sweet spot
steps: 50-60
guidance_scale: 4.0

Success Rate (XL): Usually 1 excellent result per 2-3 generations (Improved from base)
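The success rate above gives a rough way to size a batch. Reading "1 excellent result per 2-3 generations" as a per-generation rate between 1/3 and 1/2, and assuming (this is an assumption, not a measured property) that generations succeed independently, the chance of at least one excellent result in a batch of n is 1 - (1 - p)^n. A small worked example:

```python
def p_at_least_one(per_generation_rate: float, batch_size: int) -> float:
    """Probability that a batch contains at least one excellent result,
    assuming each generation succeeds independently at the same rate."""
    if not 0.0 <= per_generation_rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    return 1.0 - (1.0 - per_generation_rate) ** batch_size

# Rates implied by "1 excellent result per 2-3 generations".
for rate in (1 / 3, 1 / 2):
    print(f"rate={rate:.2f}: batch of 6 -> {p_at_least_one(rate, 6):.3f}")
# rate=0.33: batch of 6 -> 0.912
# rate=0.50: batch of 6 -> 0.984
```

Under these assumptions a batch of 6 already yields at least one excellent result more than 90% of the time, which is consistent with the lower end of the recommended batch size.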

Chapter 05

Tips & Best Practices

Master XL-optimized prompt engineering

XL Tag Crafting Examples

Optimized prompts for XL's enhanced understanding

Electronic (XL-Optimized)

"progressive trance, female ethereal vocals, lush pads, plucky arpeggios, deep bass, 138bpm, cathedral reverb, emotional breakdown, euphoric buildup"

Cinematic Rock (XL Excels Here)

"cinematic rock, powerful male baritone, orchestral strings, heavy drums, electric guitar, epic, anthemic, 85bpm, wide stereo, dynamic range, emotional crescendo"

Jazz Fusion (Complex - XL Advantage)

"jazz fusion, saxophone lead, electric piano, walking bass, brushed drums, 140bpm, smoky atmosphere, sophisticated, improvisational feel, vintage tone"

XL-Specific Tag Benefits

Better compound understanding: "melancholic yet hopeful" - XL captures both moods

Enhanced spatial tags: "wide stereo field, intimate center vocal" - XL renders more accurately

Improved temporal tags: "slow build, sudden drop" - XL follows structure better
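Tag prompts like the ones above are plain comma-separated strings, so they are easy to assemble programmatically. The helper below is a minimal sketch; `build_tags` is a hypothetical name, and the whitespace trimming and case-insensitive de-duplication are conveniences added for this example rather than documented model requirements.

```python
from typing import Iterable, Optional

def build_tags(descriptors: Iterable[str], bpm: Optional[int] = None) -> str:
    """Join style descriptors into a comma-separated tag prompt.

    Blank entries and case-insensitive duplicates are dropped; the tempo,
    if given, is appended in the "138bpm" form used in the examples above.
    """
    seen = set()
    tags = []
    for raw in descriptors:
        tag = raw.strip()
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            tags.append(tag)
    if bpm is not None:
        tags.append(f"{bpm}bpm")
    return ", ".join(tags)
```

For instance, `build_tags(["progressive trance", "lush pads", "deep bass"], bpm=138)` gives `"progressive trance, lush pads, deep bass, 138bpm"`.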

Lyric Structure Tags

Control song structure with bracket notation

[intro]: Opening section
[verse]: Main vocal sections
[verse 1] / [verse 2]: Numbered verses
[pre-chorus]: Build-up to chorus
[chorus]: Main hook/refrain
[bridge]: Contrasting middle
[hook]: Catchy phrase
[refrain]: Recurring line
[interlude]: Instrumental break
[breakdown]: Breakdown section (XL handles these better than base)
[outro] / [post-outro]: Extended outros
[ad-lib]: Improvised vocals
[inst]: Instrumental section

XL Lyric Structure Best Practices

Always use brackets [ ] for structure tags

Tags are case-insensitive but standardize for readability

Empty lines between sections help the model parse structure

Repetition markers like "(Repeat x2)" are supported
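The best practices above (bracketed tags, consistent casing, empty lines between sections) can be applied mechanically when lyrics are assembled from parts. The sketch below is one possible helper, not an ACE-Step utility: `format_lyrics`, `KNOWN_TAGS`, and the decision to reject unknown tags are assumptions. The accepted tag names come from the list above, and the optional " - note" suffix mirrors the annotated form "[intro - piano only]" used in the Cinematic Pop demo.

```python
import re

# Tag names from the Lyric Structure Tags list above.
KNOWN_TAGS = {
    "intro", "verse", "pre-chorus", "chorus", "bridge", "hook", "refrain",
    "interlude", "breakdown", "outro", "post-outro", "ad-lib", "inst",
}

# name, optional number ("verse 2"), optional annotation ("intro - piano only")
_TAG_RE = re.compile(r"^(?P<name>[a-z]+(?:-[a-z]+)?)(?: \d+)?(?: - .+)?$")

def format_lyrics(sections) -> str:
    """Render (tag, lines) pairs as bracket-tagged lyrics.

    Tags are lowercased and wrapped in brackets, and sections are separated
    by one empty line. Unknown tag names raise ValueError (an assumption of
    this sketch; the model itself may be more permissive).
    """
    blocks = []
    for tag, lines in sections:
        name = tag.strip().strip("[]").lower()
        match = _TAG_RE.match(name)
        if not match or match.group("name") not in KNOWN_TAGS:
            raise ValueError(f"unknown structure tag: {tag!r}")
        blocks.append("\n".join([f"[{name}]", *lines]))
    return "\n\n".join(blocks)
```

Passing `[("Verse 1", ["In the silence of the night"]), ("[CHORUS]", ["I am stronger now"])]` produces a `[verse 1]` block, an empty line, and a `[chorus]` block.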

Chapter 06

Demo Prompts

XL-optimized examples showcasing capabilities

Cinematic Pop & Progressive Rock

Complete working examples for popular genres

Cinematic Pop (XL Showcase)

Tags

cinematic pop, powerful female vocals, orchestral strings, piano, dramatic drums, emotional, 95bpm, wide stereo field, dynamic range, intimate verses, explosive chorus

Lyrics Sample

[intro - piano only]

[verse 1]

In the silence of the night

I hear echoes of your voice

[pre-chorus]

And I'm rising, slowly rising

[chorus]

I am stronger now, stronger now!

Through the storm I found my way

XL Settings: Duration: 180s, Steps: 55, CFG: 4.0, Shift: 4.5

Progressive Rock (XL Complexity)

Tags

progressive rock, complex time signatures, virtuoso guitar solo, hammond organ, thunderous bass, precise drums, 7/8 time, 145bpm, vintage analog tone, dynamic shifts, epic arrangement

XL Settings: Duration: 300s, Steps: 70, CFG: 3.8, Shift: 5.0

Ambient & Multilingual Examples

Long-form and international music generation

Ambient Soundscape (XL Long-Form)

Tags

ambient, atmospheric, ethereal pads, field recordings, no drums, glacial pace, deep sub frequencies, evolving textures, meditative, 60bpm, binaural, spatial audio, healing frequencies

XL Settings: Duration: 360s, Steps: 80, CFG: 3.0, Shift: 3.5 | Note: XL's extended coherence shines in long ambient pieces

Mandarin C-Pop Ballad (XL Enhanced)

Tags

c-pop ballad, emotional female vocals, piano, strings, cinematic, 72bpm, mandarin chinese, tear-jerking, powerful vocal runs, dramatic builds

Lyrics Sample

[verse 1]

[zh]zou3 guo4 jie1 jiao3

[zh]kan4 guo4 xi1 yang2

[chorus]

[zh]wo3 ai4 ni3 ai4 de shen1

XL Settings: Duration: 195s, Steps: 55, CFG: 4.3 | XL Advantage: Better tonal accuracy for Mandarin

Chapter 07

Quick Reference

Command cheat sheet and essential links

XL Command Cheat Sheet

Quick settings optimized for XL models

Task | Key Settings (XL-Optimized)
Quick Draft | Model: xl-turbo, Steps: 12, CFG: 4.0, Duration: 60s
High Quality | Model: xl-sft, Steps: 55, CFG: 4.0, Duration: 180s
Maximum Quality | Model: xl-sft, Steps: 70, CFG: 3.8, Duration: 240s
Long Form | Model: xl-sft, Steps: 70, CFG: 3.5, Duration: 300-360s
Instrumental | Lyrics: [inst], instrumental flag, Duration: 180s
Style Transfer | Audio2Audio, Denoise: 0.5, New tags, CFG: 4.0
Lyric Change | Audio2Audio, Same audio, New lyrics, Denoise: 0.4
Batch (24GB) | Batch size: 6, Duration: 150s, Steps: 55
Batch (16GB) | Batch size: 2, cpu_offload: True, Duration: 120s

Memory Profiles

# High VRAM (40GB+)
model = "acestep-v15-xl-sft"
batch_size = 10
cpu_offload = False
bf16 = True
torch_compile = True

# Standard VRAM (24GB)
model = "acestep-v15-xl-sft"
batch_size = 4
cpu_offload = False
bf16 = True

# Low VRAM (16GB)
model = "acestep-v15-xl-turbo"
batch_size = 2
cpu_offload = True
bf16 = True
overlapped_decode = True
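The three memory profiles above can be selected automatically from the available VRAM. This is a sketch under stated assumptions: `profile_for_vram` is a hypothetical helper, and treating each profile's VRAM figure as a minimum threshold (so that a 32GB card uses the 24GB profile) is an interpretation, not something the profiles state. The settings inside each profile are copied from above.

```python
# (minimum VRAM in GB, settings) ordered from largest to smallest tier.
MEMORY_PROFILES = [
    (40, {"model": "acestep-v15-xl-sft", "batch_size": 10,
          "cpu_offload": False, "bf16": True, "torch_compile": True}),
    (24, {"model": "acestep-v15-xl-sft", "batch_size": 4,
          "cpu_offload": False, "bf16": True}),
    (16, {"model": "acestep-v15-xl-turbo", "batch_size": 2,
          "cpu_offload": True, "bf16": True, "overlapped_decode": True}),
]

def profile_for_vram(vram_gb: float) -> dict:
    """Return a copy of the first profile whose VRAM threshold is met."""
    for min_gb, profile in MEMORY_PROFILES:
        if vram_gb >= min_gb:
            return dict(profile)
    raise ValueError("less than 16GB VRAM: no XL profile is listed")
```

With 24GB this returns the standard profile (sft, batch size 4). With 16GB it returns the turbo profile with CPU offload enabled.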

Resources

Essential links and documentation

ACE-Step 1.5 XL Official - Updated XL documentation and samples

ace-step/ACE-Step-1.5-XL - XL model code and notebooks

graydient/ACE-Step-1.5-XL - Model checkpoints - XL variants

Detailed XL workflow guides - XL-specific node configurations

Base vs XL benchmarks - Quality metrics and VRAM requirements

This manual synthesizes official documentation with community-tested best practices for XL model optimization.

Chapter 08

What's New in ACE-Step 1.5 XL

Major improvements and migration guide

XL Model Improvements

Architecture enhancements and quality improvements

Feature | v1.5 Base | v1.5 XL | Improvement
Max Duration | 240s (4 min) | 360s (6 min) | +50% longer
Model Size | Standard | XL (Larger DiT) | Better quality
Long-form Coherence | Good | Excellent | Less drift
Vocal Clarity | Good | Superior | Better articulation
Stereo Imaging | Standard | Enhanced | Wider soundstage
Genre Understanding | Strong | Stronger | More nuanced

Additional Quality Improvements

Reduced artifacts in high-frequency content

Better lyric-to-melody alignment for sung vocals

Improved compositional structure for 3+ minute tracks

Enhanced instrument separation in dense mixes

More stable pitch across vocal performances

Better dynamics between quiet and loud sections

Migration Guide: Base to XL

Updated defaults and breaking changes

Parameter | v1.5 Base | v1.5 XL | Notes
steps | 27-50 | 35-60 | XL benefits from more steps
guidance_scale | 4.0-15 | 3.5-12 | Lower CFG often better
shift | 3 (turbo) | 4-5 | Higher for better structure
duration sweet spot | 90-120s | 120-180s | XL excels at longer form

Recommended Settings for XL

# Optimal XL Configuration
model = "acestep-v15-xl-sft"
steps = 50
guidance_scale = 4.5
shift = 4.5
duration = 180  # 3 minutes
sampler = "er_sde"
scheduler = "linear_quadratic"

Breaking Changes

Memory Requirements: Base model: 16GB VRAM minimum. XL model: 24GB VRAM recommended. Use cpu_offload=True for lower VRAM (slower).

Inference Time: XL is approximately 1.3x slower than base. Use turbo models for faster iteration.
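When porting saved Base presets, the XL ranges from the migration table can be applied as a first pass. The sketch below simply clamps out-of-range values to the nearest XL bound. `migrate_to_xl` and `XL_RANGES` are hypothetical names, and clamping is an assumed strategy rather than an official migration procedure; it does not tune values toward the 3.5-4.5 CFG sweet spot, which still has to be done by ear.

```python
# XL ranges from the "v1.5 XL" column of the migration table.
XL_RANGES = {
    "steps": (35, 60),
    "guidance_scale": (3.5, 12.0),
    "shift": (4.0, 5.0),
}

def migrate_to_xl(base_settings: dict) -> dict:
    """Return a copy of a Base preset with steps, guidance_scale, and shift
    clamped into the XL ranges. Other keys are passed through unchanged."""
    xl = dict(base_settings)
    for key, (low, high) in XL_RANGES.items():
        if key in xl:
            xl[key] = min(max(xl[key], low), high)
    return xl
```

A Base preset with 27 steps and shift 3 comes back with 35 steps and shift 4.0, while its duration and tags are left as they were.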

Version Comparison Summary

Choose the right model for your needs

Feature | v1.5 Base | v1.5 XL
Max Duration | 240s | 360s
Sweet Spot | 90-120s | 120-180s
VRAM (min) | 16GB | 20GB
VRAM (recommended) | 20GB | 24GB
CFG Range | 4.0-6.0 | 3.5-4.5
Steps Range | 27-50 | 35-60
Coherence | Good | Excellent
Vocal Quality | Good | Superior
Artifact Reduction | Standard | Enhanced
Batch Size (24GB) | 8 | 4-6
Inference Speed | 1x | 0.77x

Use Base (v1.5) when:

Limited VRAM (16-20GB)

Need maximum speed

Shorter tracks (< 120s)

Quick iterations

Use XL (v1.5 XL) when:

24GB+ VRAM available

Quality is priority

Longer tracks (120s+)

Complex arrangements

Final production
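The two checklists above amount to a short decision rule. A minimal sketch follows, assuming a hypothetical `choose_model_family` helper; the 20GB floor and the 120s threshold come from the comparison table and lists, while the order in which the conditions are tested is an assumption.

```python
def choose_model_family(vram_gb: float, duration_s: int,
                        quality_first: bool = True) -> str:
    """Pick "v1.5 Base" or "v1.5 XL" from the criteria listed above."""
    # XL lists 20GB as its minimum VRAM in the comparison table.
    if vram_gb < 20:
        return "v1.5 Base"
    # Shorter tracks where speed matters more than quality favor Base.
    if duration_s < 120 and not quality_first:
        return "v1.5 Base"
    return "v1.5 XL"
```

For example, a 24GB card generating a 180s track resolves to XL, while a 16GB card resolves to Base regardless of duration.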

ACE-STEP 1.5 XL

State-of-the-Art Music Generation

© 2026 Graydient AI