Contents

Overview & Architecture

Decoupled LM + DiT + VAE system with XL enhancements

Core Generation Parameters

Prompts, lyrics, duration, steps, guidance

Advanced Parameters

XL model variants and developer settings

Developer Advice

XL inference strategy and batch generation

Tips & Best Practices

XL tag crafting, lyrics, multilingual input

Demo Prompts

Electronic, rock, ambient, C-pop, K-pop examples

Quick Reference

Command cheat sheet and resources

What's New in XL

Migration guide from Base to XL

Chapter 01

Overview & Architecture

Decoupled system design for maximum flexibility with XL enhancements

ACE-Step 1.5 XL Architecture

Enhanced components for superior quality and extended duration

Language Model (LM)

Composer Agent based on Qwen3

Handles prompt understanding, structure planning, and lyric formatting with improved comprehension

Diffusion Transformer (DiT)

Acoustic Renderer - XL Enhanced

Generates high-fidelity 48kHz stereo audio with enhanced quality, coherence, and up to 360s duration

1D VAE

Pure Waveform-Domain Autoencoder

Operates directly on waveforms rather than mel-spectrograms, for superior audio quality and faithful waveform reconstruction

Text-to-Music

Up to 6 minutes (360s) - Extended from 240s

Audio-to-Audio

Style transfer, remixing with XL precision

Cover Generation

Recreate existing songs with XL quality

Audio Repainting

Localized editing with better coherence

Track Extraction

Separate stems with improved separation

Vocal-to-BGM

Convert vocals to instrumental

LoRA Fine-tuning

Custom model training

Enhanced Coherence

Better structure retention for long-form

Improved Vocal Quality

Clearer articulation and pitch stability

Rivals Commercial Alternatives

Open-source with professional-grade output quality

ACE-Step 1.5 XL delivers quality comparable to:

SUNO

Udio

HeartMuLa

Open Source + Professional Quality

Full control, no subscriptions, local deployment

Chapter 02

Core Generation Parameters

Essential settings for music generation control

Core Generation Parameters

Essential settings for controlling XL music generation

Parameter	Description	Default	Range/Options
prompt / tags	Style description (comma-separated)	Required	e.g., "electronic, 120bpm, female voice"
lyrics	Structured lyrics with tags	Optional	See Lyric Structure section
duration	Output length in seconds	60	10-360 seconds (XL)
steps / num_steps	Inference steps	35-60	8 (turbo) to 150
seed	Random seed for reproducibility	Random	-1 for random, or specific integer
scheduler	Sampling scheduler	euler	euler, heun, er_sde
guidance_type	CFG type	apg	cfg, apg, cfg_star
guidance_scale / cfg	Main guidance strength	3.5-12	1-15 (lower = more creative)
tag_guidance_scale	Style adherence strength	5	1-15
lyric_guidance_scale	Lyric adherence strength	1.5	0.1-10
guidance_interval	When to apply guidance	0.5	0.0-1.0
granularity_scale	Reduces artifacts	10	Integer
instrumental	Generate without vocals	false	Boolean

Pro Tip: Lower guidance_scale values give more creative freedom; higher values enforce stricter adherence to the prompt. For XL, 3.5-4.5 is the sweet spot (lower than for the base model).
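
The ranges in the table above can be checked before a job is submitted. The following is a minimal sketch, assuming the parameters are collected in a plain dict; check_core_params and CORE_RANGES are hypothetical names for illustration and are not part of ACE-Step.

```python
# Hypothetical helper: the ranges are copied from the parameter table;
# the function and dict names are illustrative, not part of ACE-Step.
CORE_RANGES = {
    "duration": (10, 360),
    "steps": (8, 150),
    "guidance_scale": (1.0, 15.0),
    "tag_guidance_scale": (1.0, 15.0),
    "lyric_guidance_scale": (0.1, 10.0),
    "guidance_interval": (0.0, 1.0),
}
SCHEDULERS = {"euler", "heun", "er_sde"}
GUIDANCE_TYPES = {"cfg", "apg", "cfg_star"}


def check_core_params(params):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if not params.get("prompt"):
        problems.append("prompt is required")
    for name, (low, high) in CORE_RANGES.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name}={params[name]} is outside {low}-{high}")
    if params.get("scheduler", "euler") not in SCHEDULERS:
        problems.append(f"unknown scheduler: {params['scheduler']}")
    if params.get("guidance_type", "apg") not in GUIDANCE_TYPES:
        problems.append(f"unknown guidance_type: {params['guidance_type']}")
    return problems


example = {
    "prompt": "electronic, 120bpm, female voice",
    "duration": 180,
    "steps": 55,
    "guidance_scale": 4.0,
    "scheduler": "er_sde",
    "guidance_type": "apg",
}
print(check_core_params(example))  # []
```

An empty list means every supplied value sits inside the documented range; it does not say the combination is a good one.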

Chapter 03

Advanced Parameters & Model Variants

Developer settings and model selection guide

XL Model Variants

Choose the right model for your use case

acestep-v15-xl-base

50-60 STEPS

Full features including extract, repaint, Lego. Enhanced long-form coherence.

Best For: Maximum flexibility with XL quality | VRAM: 24GB recommended

acestep-v15-xl-sft

50-60 STEPS

Highest quality, no editing features. Superior vocal and instrumental clarity.

Best For: Production-quality music generation | VRAM: 24GB recommended

acestep-v15-xl-turbo

12-16 STEPS

Fast iteration with XL architecture. Reduced CFG needed (3.0-5.0).

Best For: Quick experiments with XL benefits | VRAM: 20GB minimum

acestep-v15-xl-turbo-rl

COMING SOON

RL-optimized XL turbo variant. Best speed/quality ratio for XL.

Best For: Production speed with XL quality
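
The variant cards above reduce to a small decision rule. This is a rough sketch, assuming only two requirements matter (editing features and iteration speed); pick_variant and default_steps are hypothetical helpers, and the unreleased turbo-rl variant is left out.

```python
# Step ranges are taken from the variant cards; the selection rules
# and function names are illustrative assumptions.
VARIANT_STEPS = {
    "acestep-v15-xl-base": (50, 60),
    "acestep-v15-xl-sft": (50, 60),
    "acestep-v15-xl-turbo": (12, 16),
}


def pick_variant(needs_editing=False, fast_iteration=False):
    """Pick a released XL variant from two coarse requirements."""
    if needs_editing:
        # extract, repaint, and Lego are listed only for the base variant
        return "acestep-v15-xl-base"
    if fast_iteration:
        return "acestep-v15-xl-turbo"
    return "acestep-v15-xl-sft"


def default_steps(variant):
    """Use the midpoint of the variant's step range as a starting value."""
    low, high = VARIANT_STEPS[variant]
    return (low + high) // 2


variant = pick_variant(fast_iteration=True)
print(variant, default_steps(variant))  # acestep-v15-xl-turbo 14
```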

XL-Specific Advanced Parameters

Fine-tune XL performance and memory usage

Parameter	Description	XL Default	Range/Options
xl_attention_mode	Attention optimization	flash	flash, standard, memory_efficient
xl_cascade_stages	Multi-resolution generation	3	2-4
xl_coherence_weight	Long-form structure strength	1.2	0.5-2.0
bf16	Use bfloat16 for faster inference	true	Boolean
torch_compile	Optimize with torch.compile()	false	Boolean
cpu_offload	Offload to CPU to save VRAM	false	Boolean
overlapped_decode	Speed up inference	false	Boolean
device_id	GPU device ID	0	Integer
shift	Dynamic shift for distillation	4-5	3 (turbo) to 6
denoise	Audio-to-audio strength	0.5	0.25-1.0
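
One way to keep these settings together is a small dataclass whose defaults mirror the XL Default column. This is a sketch under assumptions: XLAdvancedSettings is a hypothetical container, not a class shipped with ACE-Step, and the shift default of 4.5 is the midpoint of the 4-5 range given in the table.

```python
from dataclasses import dataclass


@dataclass
class XLAdvancedSettings:
    """Defaults mirror the 'XL Default' column; the class itself is hypothetical."""
    xl_attention_mode: str = "flash"
    xl_cascade_stages: int = 3
    xl_coherence_weight: float = 1.2
    bf16: bool = True
    torch_compile: bool = False
    cpu_offload: bool = False
    overlapped_decode: bool = False
    device_id: int = 0
    shift: float = 4.5  # the table gives 4-5; 4.5 is the midpoint
    denoise: float = 0.5

    def __post_init__(self):
        # Reject values outside the Range/Options column.
        if self.xl_attention_mode not in ("flash", "standard", "memory_efficient"):
            raise ValueError(f"unknown attention mode: {self.xl_attention_mode}")
        if not 2 <= self.xl_cascade_stages <= 4:
            raise ValueError("xl_cascade_stages must be 2-4")
        if not 0.5 <= self.xl_coherence_weight <= 2.0:
            raise ValueError("xl_coherence_weight must be 0.5-2.0")
        if not 3 <= self.shift <= 6:
            raise ValueError("shift must be 3-6")
        if not 0.25 <= self.denoise <= 1.0:
            raise ValueError("denoise must be 0.25-1.0")


low_vram = XLAdvancedSettings(cpu_offload=True, overlapped_decode=True)
print(low_vram.xl_attention_mode, low_vram.cpu_offload)  # flash True
```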

Chapter 04

Developer Advice

Optimize XL generation quality with proven techniques

The XL Golden Balance

CFG vs Steps Relationship (XL-Optimized)

Key Principles

CFG 3.5-4.5 is the XL sweet spot (Lower than base)

Going above 5.5 introduces artifacts (More sensitive than base)

Compensate low CFG with higher steps (60-80) for refinement

XL responds better to subtle guidance than base

Shift Parameter

shift: 4.5 # Sweet spot for XL compositional structure
shift: 5.0 # For complex arrangements (orchestral, jazz)
shift: 4.0 # For simpler structures (pop, electronic)

Sampler Recommendations (XL-Tested)

Recommended: Sampler: er_sde, Scheduler: linear_quadratic

Alternative: euler with shift: 5.0

Avoid: DPM++ family, uni_pc - These produce more artifacts in XL than base

Pro Tip: Use shift: 4.5-5.0 for better compositional structure. Shift affects arrangement coherence, not just sound quality.
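
The key principles and shift values above can be folded into one helper. This is an illustrative sketch: the function name, the 4.0 cut-off for "low CFG", and the specific step values of 70 and 55 are assumptions chosen from within the ranges stated on this slide.

```python
# Shift values come from the Shift Parameter list on this slide.
SHIFT_BY_ARRANGEMENT = {
    "simple": 4.0,    # pop, electronic
    "default": 4.5,   # sweet spot
    "complex": 5.0,   # orchestral, jazz
}


def balance(cfg, arrangement="default"):
    """Return (shift, steps, notes) following the key principles."""
    notes = []
    if cfg > 5.5:
        notes.append("CFG above 5.5 tends to introduce artifacts on XL")
    # Low CFG is compensated with more steps (60-80); the 4.0 cut-off
    # and the exact step values are assumptions for this sketch.
    steps = 70 if cfg < 4.0 else 55
    return SHIFT_BY_ARRANGEMENT[arrangement], steps, notes


print(balance(3.5, "complex"))  # (5.0, 70, [])
print(balance(6.0))
```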

XL Duration Recommendations & Batch Strategy

Optimize for XL's extended capabilities

Duration	Quality	Use Case	CFG	Steps
60-90s	Excellent	Short tracks, loops	4.5	45
90-180s	Optimal	Full songs	4.0	50-60
180-300s	Very Good	Extended tracks	3.5-4.0	60-80
300-360s	Good*	Epic/ambient	3.5	70-80

XL-Specific Warnings: Avoid CFG > 6.0 (harsh artifacts appear faster). Avoid Steps < 30 (unless using turbo model). Avoid very short durations < 45s (XL is optimized for longer form).
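
The duration table can be turned into a lookup. In this sketch, recommend is a hypothetical helper; a duration on a shared boundary (for example 90s) is assigned to the shorter band, durations below 60s reuse the first row, and the 3.5-4.0 CFG range is collapsed to its midpoint. All three choices are assumptions, not part of the source table.

```python
# (upper bound in seconds, CFG, (min steps, max steps)) from the duration table.
DURATION_BANDS = [
    (90, 4.5, (45, 45)),
    (180, 4.0, (50, 60)),
    (300, 3.75, (60, 80)),  # table gives CFG 3.5-4.0; midpoint used here
    (360, 3.5, (70, 80)),
]


def recommend(duration_s):
    """Look up CFG and step range for a target duration in seconds."""
    if not 10 <= duration_s <= 360:
        raise ValueError("XL supports durations of 10-360 seconds")
    for upper, cfg, steps in DURATION_BANDS:
        if duration_s <= upper:
            return {
                "guidance_scale": cfg,
                "steps": steps,
                # the warnings advise against durations under 45s
                "below_recommended_minimum": duration_s < 45,
            }


print(recommend(150))
print(recommend(330))
```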

Recommended XL Workflow

batch_size: 6-10 # Lower than base due to VRAM
duration: 150-180s # XL's sweet spot
steps: 50-60
guidance_scale: 4.0

Success Rate (XL): Usually 1 excellent result per 2-3 generations (Improved from base)
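
The success rate above gives a rough way to size a batch. Reading "1 excellent result per 2-3 generations" as a per-generation probability p between 1/3 and 1/2, and assuming generations are independent (an assumption, since seeds and prompts interact), the chance of at least one excellent result in a batch of n is 1 - (1 - p)^n.

```python
def p_at_least_one(batch_size, p_single):
    """Probability that a batch contains at least one excellent result,
    assuming each generation succeeds independently with p_single."""
    return 1 - (1 - p_single) ** batch_size


for p in (1 / 3, 1 / 2):
    for n in (2, 6, 10):
        print(f"p={p:.2f} n={n}: {p_at_least_one(n, p):.3f}")
```

Under these assumptions a batch of 6 at p = 1/3 already yields at least one excellent result about 91% of the time, which fits the recommended batch size of 6-10.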

Chapter 05

Tips & Best Practices

Master XL-optimized prompt engineering

XL Tag Crafting Examples

Optimized prompts for XL's enhanced understanding

Electronic (XL-Optimized)

"progressive trance, female ethereal vocals, lush pads, plucky arpeggios, deep bass, 138bpm, cathedral reverb, emotional breakdown, euphoric buildup"

Cinematic Rock (XL Excels Here)

"cinematic rock, powerful male baritone, orchestral strings, heavy drums, electric guitar, epic, anthemic, 85bpm, wide stereo, dynamic range, emotional crescendo"

Jazz Fusion (Complex - XL Advantage)

"jazz fusion, saxophone lead, electric piano, walking bass, brushed drums, 140bpm, smoky atmosphere, sophisticated, improvisational feel, vintage tone"

XL-Specific Tag Benefits

Better compound understanding: "melancholic yet hopeful" - XL captures both moods

Enhanced spatial tags: "wide stereo field, intimate center vocal" - XL renders more accurately

Improved temporal tags: "slow build, sudden drop" - XL follows structure better

Lyric Structure Tags

Control song structure with bracket notation

[intro]

Opening section

[verse]

Main vocal sections

[verse 1] / [verse 2]

Numbered verses

[pre-chorus]

Build-up to chorus

[chorus]

Main hook/refrain

[bridge]

Contrasting middle

[hook]

Catchy phrase

[refrain]

Recurring line

[interlude]

Instrumental break

[breakdown]

XL handles better

[outro] / [post-outro]

Extended outros

[ad-lib]

Improvised vocals

[inst]

Instrumental section

XL Lyric Structure Best Practices

Always use brackets [ ] for structure tags

Tags are case-insensitive but standardize for readability

Empty lines between sections help the model parse structure

Repetition markers like "(Repeat x2)" are supported
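
The best practices above (bracketed tags, consistent casing, empty lines between sections) can be applied mechanically. The following sketch assumes lyrics are held as (tag, lines) pairs; format_lyrics and KNOWN_TAGS are hypothetical names, and the tag set is just the list from this slide.

```python
# Tag names are the ones listed on this slide.
KNOWN_TAGS = {
    "intro", "verse", "pre-chorus", "chorus", "bridge", "hook", "refrain",
    "interlude", "breakdown", "outro", "post-outro", "ad-lib", "inst",
}


def format_lyrics(sections):
    """sections: list of (tag, [lines]) pairs -> one lyrics string."""
    blocks = []
    for tag, lines in sections:
        # "verse 1" and "verse 2" share the base tag "verse"
        base = tag.lower().split()[0]
        if base not in KNOWN_TAGS:
            raise ValueError(f"unknown structure tag: {tag}")
        # tags are case-insensitive, so lowercase them for consistency
        blocks.append("\n".join([f"[{tag.lower()}]", *lines]))
    # an empty line between sections helps the model parse structure
    return "\n\n".join(blocks)


lyrics = format_lyrics([
    ("Verse 1", ["In the silence of the night", "I hear echoes of your voice"]),
    ("chorus", ["I am stronger now, stronger now!"]),
])
print(lyrics)
```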

Chapter 06

Demo Prompts

XL-optimized examples showcasing capabilities

Cinematic Pop & Progressive Rock

Complete working examples for popular genres

Cinematic Pop (XL Showcase)

Tags

cinematic pop, powerful female vocals, orchestral strings, piano, dramatic drums, emotional, 95bpm, wide stereo field, dynamic range, intimate verses, explosive chorus

Lyrics Sample

[intro - piano only]

[verse 1]

In the silence of the night

I hear echoes of your voice

[pre-chorus]

And I'm rising, slowly rising

[chorus]

I am stronger now, stronger now!

Through the storm I found my way

XL Settings: Duration: 180s, Steps: 55, CFG: 4.0, Shift: 4.5

Progressive Rock (XL Complexity)

Tags

progressive rock, complex time signatures, virtuoso guitar solo, hammond organ, thunderous bass, precise drums, 7/8 time, 145bpm, vintage analog tone, dynamic shifts, epic arrangement

XL Settings: Duration: 300s, Steps: 70, CFG: 3.8, Shift: 5.0

Ambient & Multilingual Examples

Long-form and international music generation

Ambient Soundscape (XL Long-Form)

Tags

ambient, atmospheric, ethereal pads, field recordings, no drums, glacial pace, deep sub frequencies, evolving textures, meditative, 60bpm, binaural, spatial audio, healing frequencies

XL Settings: Duration: 360s, Steps: 80, CFG: 3.0, Shift: 3.5 | Note: XL's extended coherence shines in long ambient pieces

Mandarin C-Pop Ballad (XL Enhanced)

Tags

c-pop ballad, emotional female vocals, piano, strings, cinematic, 72bpm, mandarin chinese, tear-jerking, powerful vocal runs, dramatic builds

Lyrics Sample

[verse 1]

[zh]zou3 guo4 jie1 jiao3

[zh]kan4 guo4 xi1 yang2

[chorus]

[zh]wo3 ai4 ni3 ai4 de shen1

XL Settings: Duration: 195s, Steps: 55, CFG: 4.3 | XL Advantage: Better tonal accuracy for Mandarin
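
In the sample above, each pinyin line carries a [zh] prefix while structure tags do not. A sketch of applying that convention automatically follows; add_language_prefix is a hypothetical helper, and whether the prefix is required for every line is an assumption drawn only from this sample.

```python
def add_language_prefix(lyrics, lang="zh"):
    """Prefix each lyric line with [lang], leaving structure tags,
    empty lines, and already-prefixed lines unchanged."""
    prefix = f"[{lang}]"
    out = []
    for line in lyrics.splitlines():
        stripped = line.strip()
        # a line that is entirely one bracketed tag, e.g. [verse 1]
        is_structure_tag = stripped.startswith("[") and stripped.endswith("]")
        if not stripped or is_structure_tag or stripped.startswith(prefix):
            out.append(line)
        else:
            out.append(prefix + stripped)
    return "\n".join(out)


raw = "[verse 1]\nzou3 guo4 jie1 jiao3\nkan4 guo4 xi1 yang2\n\n[chorus]\n[zh]wo3 ai4 ni3 ai4 de shen1"
print(add_language_prefix(raw))
```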

Chapter 07

Quick Reference

Command cheat sheet and essential links

XL Command Cheat Sheet

Quick settings optimized for XL models

Task	Key Settings (XL-Optimized)
Quick Draft	Model: xl-turbo, Steps: 12, CFG: 4.0, Duration: 60s
High Quality	Model: xl-sft, Steps: 55, CFG: 4.0, Duration: 180s
Maximum Quality	Model: xl-sft, Steps: 70, CFG: 3.8, Duration: 240s
Long Form	Model: xl-sft, Steps: 70, CFG: 3.5, Duration: 300-360s
Instrumental	Lyrics: [inst], instrumental: true, Duration: 180s
Style Transfer	Audio2Audio, Denoise: 0.5, New tags, CFG: 4.0
Lyric Change	Audio2Audio, Same audio, New lyrics, Denoise: 0.4
Batch (24GB)	Batch size: 6, Duration: 150s, Steps: 55
Batch (16GB)	Batch size: 2, cpu_offload: True, Duration: 120s

Memory Profiles

# High VRAM (40GB+)
model = "acestep-v15-xl-sft"
batch_size = 10
cpu_offload = False
bf16 = True
torch_compile = True

# Standard VRAM (24GB)
model = "acestep-v15-xl-sft"
batch_size = 4
cpu_offload = False
bf16 = True

# Low VRAM (16GB)
model = "acestep-v15-xl-turbo"
batch_size = 2
cpu_offload = True
bf16 = True
overlapped_decode = True
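
The three memory profiles can be selected from the available VRAM. This is a minimal sketch: profile_for is a hypothetical helper, the values are copied from the profiles above, and treating each profile's label as a minimum threshold is an assumption.

```python
# (minimum VRAM in GB, settings), ordered from largest to smallest.
MEMORY_PROFILES = [
    (40, {"model": "acestep-v15-xl-sft", "batch_size": 10,
          "cpu_offload": False, "bf16": True, "torch_compile": True}),
    (24, {"model": "acestep-v15-xl-sft", "batch_size": 4,
          "cpu_offload": False, "bf16": True}),
    (16, {"model": "acestep-v15-xl-turbo", "batch_size": 2,
          "cpu_offload": True, "bf16": True, "overlapped_decode": True}),
]


def profile_for(vram_gb):
    """Return a copy of the first profile whose threshold fits vram_gb."""
    for min_gb, settings in MEMORY_PROFILES:
        if vram_gb >= min_gb:
            return dict(settings)
    raise ValueError("no profile listed for less than 16 GB of VRAM")


print(profile_for(24))
```

With PyTorch installed, the VRAM figure could be read from torch.cuda.get_device_properties(0).total_memory (in bytes) and converted to GB before calling the helper.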

Resources

Essential links and documentation

Project Page

ACE-Step 1.5 XL Official - Updated XL documentation and samples

GitHub

ace-step/ACE-Step-1.5-XL - XL model code and notebooks

Hugging Face

graydient/ACE-Step-1.5-XL - Model checkpoints - XL variants

ComfyUI Wiki

Detailed XL workflow guides - XL-specific node configurations

Model Comparison

Base vs XL benchmarks - Quality metrics and VRAM requirements

This manual synthesizes official documentation with community-tested best practices for XL model optimization.

Chapter 08

What's New in ACE-Step 1.5 XL

Major improvements and migration guide

XL Model Improvements

Architecture enhancements and quality improvements

Feature	v1.5 Base	v1.5 XL	Improvement
Max Duration	240s (4 min)	360s (6 min)	+50% longer
Model Size	Standard	XL (Larger DiT)	Better quality
Long-form Coherence	Good	Excellent	Less drift
Vocal Clarity	Good	Superior	Better articulation
Stereo Imaging	Standard	Enhanced	Wider soundstage
Genre Understanding	Strong	Stronger	More nuanced

Additional Quality Improvements

Reduced artifacts in high-frequency content

Better lyric-to-melody alignment for sung vocals

Improved compositional structure for 3+ minute tracks

Enhanced instrument separation in dense mixes

More stable pitch across vocal performances

Better dynamics between quiet and loud sections

Migration Guide: Base to XL

Updated defaults and breaking changes

Parameter	v1.5 Base	v1.5 XL	Notes
steps	27-50	35-60	XL benefits from more steps
guidance_scale	4.0-15	3.5-12	Lower CFG often better
shift	3 (turbo)	4-5	Higher for better structure
duration sweet spot	90-120s	120-180s	XL excels at longer form

Recommended Settings for XL

# Optimal XL Configuration
model = "acestep-v15-xl-sft"
steps = 50
guidance_scale = 4.5
shift = 4.5
duration = 180  # 3 minutes
sampler = "er_sde"
scheduler = "linear_quadratic"

Breaking Changes

Memory Requirements: Base model: 16GB VRAM minimum. XL model: 24GB VRAM recommended. Use cpu_offload=True for lower VRAM (slower).

Inference Time: XL is approximately 1.3x slower than base. Use turbo models for faster iteration.
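
A base-model configuration can be brought into the XL ranges from the migration table by clamping. This is a sketch only: migrate_to_xl is a hypothetical helper, and clamping to the nearest bound is one possible policy rather than an official migration path.

```python
# XL ranges from the migration table.
XL_RANGES = {"steps": (35, 60), "guidance_scale": (3.5, 12.0), "shift": (4.0, 5.0)}


def migrate_to_xl(base_cfg):
    """Clamp base settings into the XL ranges and report what changed."""
    xl_cfg = dict(base_cfg)
    changes = []
    for key, (low, high) in XL_RANGES.items():
        if key in xl_cfg:
            clamped = min(max(xl_cfg[key], low), high)
            if clamped != xl_cfg[key]:
                changes.append(f"{key}: {xl_cfg[key]} -> {clamped}")
                xl_cfg[key] = clamped
    return xl_cfg, changes


xl_cfg, changes = migrate_to_xl({"steps": 27, "guidance_scale": 15, "shift": 3})
print(changes)
# ['steps: 27 -> 35', 'guidance_scale: 15 -> 12.0', 'shift: 3 -> 4.0']
```

Clamping only guarantees values inside the XL range; the sweet spot (CFG 3.5-4.5) still has to be chosen by ear.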

Version Comparison Summary

Choose the right model for your needs

Feature	v1.5 Base	v1.5 XL
Max Duration	240s	360s
Sweet Spot	90-120s	120-180s
VRAM (min)	16GB	20GB
VRAM (recommended)	20GB	24GB
CFG (recommended)	4.0-6.0	3.5-4.5
Steps Range	27-50	35-60
Coherence	Good	Excellent
Vocal Quality	Good	Superior
Artifact Reduction	Standard	Enhanced
Batch Size (24GB)	8	4-6
Inference Speed	1x	0.77x

Use Base (v1.5) when:

Limited VRAM (16-20GB)

Need maximum speed

Shorter tracks (< 120s)

Quick iterations

Use XL (v1.5 XL) when:

24GB+ VRAM available

Quality is priority

Longer tracks (120s+)

Complex arrangements

Final production
