AI Music Generation
ACE-Step 1.5 XL Complete Manual
State-of-the-Art
Open Source
48kHz Stereo
TECHNICAL DOCUMENTATION | 2026 Edition | XL Model Update
Contents
01 Overview & Architecture: Decoupled LM + DiT + VAE system with XL enhancements
02 Core Generation Parameters: Prompts, lyrics, duration, steps, guidance
03 Advanced Parameters: XL model variants and developer settings
04 Developer Advice: XL inference strategy and batch generation
05 Tips & Best Practices: XL tag crafting, lyrics, multilingual input
06 Demo Prompts: Cinematic pop, progressive rock, ambient, C-pop examples
07 Quick Reference: Command cheat sheet and resources
08 What's New in XL: Migration guide from Base to XL
Chapter 01
Overview & Architecture
Decoupled system design for maximum flexibility with XL enhancements
ACE-Step 1.5 XL Architecture
Enhanced components for superior quality and extended duration
Language Model (LM)
Composer Agent based on Qwen3
Handles prompt understanding, structure planning, and lyric formatting with improved comprehension
Diffusion Transformer (DiT)
Acoustic Renderer - XL Enhanced
Generates high-fidelity 48kHz stereo audio with enhanced quality, coherence, and up to 360s duration
1D VAE
Pure Waveform-Domain Autoencoder
Operates directly on the waveform rather than on mel-spectrograms, for higher audio quality and more faithful reconstruction
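To put the 48kHz stereo output format in concrete terms, the sketch below computes how much raw audio a clip of a given length contains. The function name and the 16-bit PCM default are illustrative assumptions and are not part of ACE-Step.

```python
SAMPLE_RATE_HZ = 48_000  # from the manual: 48kHz output
CHANNELS = 2             # stereo


def raw_audio_size(duration_s: float, bytes_per_sample: int = 2) -> dict:
    """Return frame count, sample count and uncompressed size for a clip.

    bytes_per_sample=2 assumes 16-bit PCM, which is an illustrative choice.
    """
    frames = int(duration_s * SAMPLE_RATE_HZ)   # one frame = one instant, all channels
    samples = frames * CHANNELS                 # individual sample values
    return {
        "frames": frames,
        "samples": samples,
        "bytes": samples * bytes_per_sample,
    }


# A maximum-length XL clip (360s) holds 17,280,000 frames per channel.
max_clip = raw_audio_size(360)
```

At the 360s maximum this comes to 34,560,000 samples, or roughly 69 MB as 16-bit PCM, which is the volume of data the VAE has to reconstruct from its latent representation.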
Text-to-Music
Up to 6 minutes (360s) - Extended from 240s
Audio-to-Audio
Style transfer, remixing with XL precision
Cover Generation
Recreate existing songs with XL quality
Audio Repainting
Localized editing with better coherence
Track Extraction
Separate stems with improved separation
Vocal-to-BGM
Generate an instrumental backing track from a vocal recording
LoRA Fine-tuning
Custom model training
Enhanced Coherence
Better structure retention for long-form
Improved Vocal Quality
Clearer articulation and pitch stability
Rivals Commercial Alternatives
Open-source with professional-grade output quality
ACE-Step 1.5 XL delivers quality comparable to: SUNO, Udio, HeartMuLa
Open Source + Professional Quality
Full control, no subscriptions, local deployment
Chapter 02
Core Generation Parameters
Essential settings for music generation control
Parameter | Description | Default | Range/Options |
prompt / tags | Style description (comma-separated) | Required | e.g., "electronic, 120bpm, female voice" |
lyrics | Structured lyrics with tags | Optional | See Lyric Structure section |
duration | Output length in seconds | 60 | 10-360 seconds (XL) |
steps / num_steps | Inference steps | 35-60 | 8 (turbo) to 150 |
seed | Random seed for reproducibility | Random | -1 for random, or specific integer |
scheduler | Sampling scheduler | euler | euler, heun, er_sde |
guidance_type | CFG type | apg | cfg, apg, cfg_star |
guidance_scale / cfg | Main guidance strength | 3.5-12 | 1-15 (lower = more creative) |
tag_guidance_scale | Style adherence strength | 5 | 1-15 |
lyric_guidance_scale | Lyric adherence strength | 1.5 | 0.1-10 |
guidance_interval | When to apply guidance | 0.5 | 0.0-1.0 |
granularity_scale | Reduces artifacts | 10 | Integer |
instrumental | Generate without vocals | false | Boolean |
Pro Tip: Lower guidance_scale values give the model more creative freedom; higher values enforce stricter adherence to the prompt. For XL, 3.5-4.5 is the sweet spot (lower than for the base model).
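The ranges in the table above can be checked before a job is submitted. The sketch below is a minimal, hypothetical container: GenerationParams and validate() are names invented for illustration, not an ACE-Step class, and the concrete defaults for steps (50) and guidance_scale (4.0) are picked from within the ranges the table lists.

```python
from dataclasses import dataclass
from typing import Optional

SCHEDULERS = {"euler", "heun", "er_sde"}
GUIDANCE_TYPES = {"cfg", "apg", "cfg_star"}


@dataclass
class GenerationParams:
    """Hypothetical holder for the core parameters (not part of ACE-Step)."""
    tags: str                       # required, comma-separated style description
    lyrics: Optional[str] = None    # optional structured lyrics
    duration: int = 60              # seconds; 10-360 on XL
    steps: int = 50                 # assumed default inside the 35-60 band
    seed: int = -1                  # -1 means random
    scheduler: str = "euler"
    guidance_type: str = "apg"
    guidance_scale: float = 4.0     # assumed default inside the 3.5-4.5 sweet spot
    instrumental: bool = False

    def validate(self) -> None:
        """Raise ValueError if any value falls outside the documented range."""
        if not self.tags.strip():
            raise ValueError("tags are required")
        if not 10 <= self.duration <= 360:
            raise ValueError("duration must be 10-360 seconds")
        if not 8 <= self.steps <= 150:
            raise ValueError("steps must be 8-150")
        if self.scheduler not in SCHEDULERS:
            raise ValueError(f"unknown scheduler: {self.scheduler}")
        if self.guidance_type not in GUIDANCE_TYPES:
            raise ValueError(f"unknown guidance_type: {self.guidance_type}")
        if not 1 <= self.guidance_scale <= 15:
            raise ValueError("guidance_scale must be 1-15")


params = GenerationParams(tags="electronic, 120bpm, female voice", duration=180)
params.validate()  # passes silently
```

Validating up front is cheaper than discovering an out-of-range value after a multi-minute generation run.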
Chapter 03
Advanced Parameters & Model Variants
Developer settings and model selection guide
XL Model Variants
Choose the right model for your use case
acestep-v15-xl-base
50-60 STEPS
Full features including extract, repaint, Lego. Enhanced long-form coherence.
Best For: Maximum flexibility with XL quality | VRAM: 24GB recommended
acestep-v15-xl-sft
50-60 STEPS
Highest quality, no editing features. Superior vocal and instrumental clarity.
Best For: Production-quality music generation | VRAM: 24GB recommended
acestep-v15-xl-turbo
12-16 STEPS
Fast iteration with XL architecture. Reduced CFG needed (3.0-5.0).
Best For: Quick experiments with XL benefits | VRAM: 20GB minimum
acestep-v15-xl-turbo-rl
COMING SOON
RL-optimized XL turbo variant. Best speed/quality ratio for XL.
Best For: Production speed with XL quality
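The variant descriptions above reduce to a small decision rule. The sketch below encodes it; pick_xl_variant and vram_ok are hypothetical helpers, the unreleased turbo-rl variant is left out, and turbo is treated as having no editing features because the manual does not state otherwise.

```python
# Values copied from the variant cards; "editing" for turbo is an assumption.
XL_VARIANTS = {
    "acestep-v15-xl-base":  {"steps": (50, 60), "vram_gb": 24, "editing": True},
    "acestep-v15-xl-sft":   {"steps": (50, 60), "vram_gb": 24, "editing": False},
    "acestep-v15-xl-turbo": {"steps": (12, 16), "vram_gb": 20, "editing": False},
}


def pick_xl_variant(need_editing: bool = False, fast: bool = False) -> str:
    """Pick a variant name; editing needs take priority over speed."""
    if need_editing:
        return "acestep-v15-xl-base"   # only variant listed with extract/repaint/Lego
    if fast:
        return "acestep-v15-xl-turbo"  # 12-16 steps
    return "acestep-v15-xl-sft"        # highest quality, generation only


def vram_ok(model: str, vram_gb: float) -> bool:
    """True if the GPU meets the VRAM figure listed for the variant."""
    return vram_gb >= XL_VARIANTS[model]["vram_gb"]


model = pick_xl_variant(fast=True)
steps_low, steps_high = XL_VARIANTS[model]["steps"]
```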
XL-Specific Advanced Parameters
Fine-tune XL performance and memory usage
Parameter | Description | XL Default | Range/Options |
xl_attention_mode | Attention optimization | flash | flash, standard, memory_efficient |
xl_cascade_stages | Multi-resolution generation | 3 | 2-4 |
xl_coherence_weight | Long-form structure strength | 1.2 | 0.5-2.0 |
bf16 | Use bfloat16 for faster inference | true | Boolean |
torch_compile | Optimize with torch.compile() | false | Boolean |
cpu_offload | Offload to CPU to save VRAM | false | Boolean |
overlapped_decode | Speed up inference | false | Boolean |
device_id | GPU device ID | 0 | Integer |
shift | Dynamic shift for distillation | 4-5 | 3 (turbo) to 6 |
denoise | Audio-to-audio strength | 0.5 | 0.25-1.0 |
Chapter 04
Developer Advice
Optimize XL generation quality with proven techniques
The XL Golden Balance
CFG vs Steps Relationship (XL-Optimized)
Key Principles
CFG 3.5-4.5 is the XL sweet spot (Lower than base)
Going above 5.5 introduces artifacts (More sensitive than base)
Compensate for low CFG with more steps (60-80) for refinement
XL responds better to subtle guidance than base
Shift Parameter
shift: 4.5  # Sweet spot for XL compositional structure
shift: 5.0  # For complex arrangements (orchestral, jazz)
shift: 4.0  # For simpler structures (pop, electronic)
Sampler Recommendations (XL-Tested)
Recommended: Sampler: er_sde, Scheduler: linear_quadratic
Alternative: euler with shift: 5.0
Avoid: DPM++ family, uni_pc - These produce more artifacts in XL than base
Pro Tip: Use shift: 4.5-5.0 for better compositional structure. The setting affects arrangement coherence, not just sound quality.
XL Duration Recommendations & Batch Strategy
Optimize for XL's extended capabilities
Duration | Quality | Use Case | CFG | Steps |
60-90s | Excellent | Short tracks, loops | 4.5 | 45 |
90-180s | Optimal | Full songs | 4.0 | 50-60 |
180-300s | Very Good | Extended tracks | 3.5-4.0 | 60-80 |
300-360s | Good* | Epic/ambient | 3.5 | 70-80 |
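The duration table can be turned into a lookup so that CFG and step ranges follow the requested length automatically. This is a sketch: recommend_for_duration is a hypothetical helper, and because the table's bands share their boundary values (90s, 180s, 300s), the code assigns each boundary to the shorter band as an assumption.

```python
from typing import Dict, Tuple

# (upper bound in seconds, CFG range, steps range), copied from the table.
_DURATION_ROWS = [
    (90,  (4.5, 4.5), (45, 45)),
    (180, (4.0, 4.0), (50, 60)),
    (300, (3.5, 4.0), (60, 80)),
    (360, (3.5, 3.5), (70, 80)),
]


def recommend_for_duration(duration_s: float) -> Dict[str, Tuple]:
    """Return the CFG and steps ranges the table lists for a duration."""
    if not 60 <= duration_s <= 360:
        raise ValueError("the table only covers 60-360 seconds")
    for upper, cfg, steps in _DURATION_ROWS:
        if duration_s <= upper:
            return {"cfg": cfg, "steps": steps}
    raise AssertionError("unreachable: 360 is the last upper bound")


full_song = recommend_for_duration(180)  # falls in the 90-180s band
```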
XL-Specific Warnings: Avoid CFG > 6.0 (harsh artifacts appear faster). Avoid Steps < 30 (unless using turbo model). Avoid very short durations < 45s (XL is optimized for longer form).
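The three warnings above are simple threshold checks, so they can be run as a pre-flight lint. A minimal sketch, assuming a hypothetical lint_xl_settings helper that is not part of ACE-Step:

```python
from typing import List


def lint_xl_settings(cfg: float, steps: int, duration_s: float,
                     turbo: bool = False) -> List[str]:
    """Return one message per XL-specific warning that the settings trigger."""
    warnings = []
    if cfg > 6.0:
        warnings.append("CFG above 6.0: harsh artifacts are likely")
    if steps < 30 and not turbo:
        warnings.append("fewer than 30 steps on a non-turbo model")
    if duration_s < 45:
        warnings.append("duration under 45s: XL is optimized for longer form")
    return warnings


for message in lint_xl_settings(cfg=7.0, steps=20, duration_s=30):
    print("warning:", message)
```

An empty list means none of the three warnings applies; it does not guarantee a good result.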
Recommended XL Workflow
batch_size: 6-10  # Lower than base due to VRAM
duration: 150-180s  # XL's sweet spot
steps: 50-60
guidance_scale: 4.0
Success Rate (XL): Usually 1 excellent result per 2-3 generations (Improved from base)
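The success rate explains why batching pays off. As a rough worked example, assume each generation is an independent trial with a fixed chance p of being excellent (an assumption; "1 per 2-3 generations" suggests p somewhere between 1/3 and 1/2). The chance that a batch contains at least one excellent result is then 1 - (1 - p)^n.

```python
def p_at_least_one(p_single: float, batch_size: int) -> float:
    """Probability that a batch contains at least one excellent result.

    Assumes independent generations with the same success chance.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    return 1.0 - (1.0 - p_single) ** batch_size


def expected_generations(p_single: float) -> float:
    """Average number of generations until the first excellent result."""
    return 1.0 / p_single


pessimistic = p_at_least_one(1 / 3, 6)   # about 0.91 for a batch of 6
optimistic = p_at_least_one(1 / 2, 6)    # about 0.98 for a batch of 6
```

Under these assumptions a batch of 6 already yields at least one keeper more than 9 times out of 10, so larger batches mostly buy choice rather than safety.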
Chapter 05
Tips & Best Practices
Master XL-optimized prompt engineering
XL Tag Crafting Examples
Optimized prompts for XL's enhanced understanding
Electronic (XL-Optimized)
"progressive trance, female ethereal vocals, lush pads, plucky arpeggios, deep bass, 138bpm, cathedral reverb, emotional breakdown, euphoric buildup"
Cinematic Rock (XL Excels Here)
"cinematic rock, powerful male baritone, orchestral strings, heavy drums, electric guitar, epic, anthemic, 85bpm, wide stereo, dynamic range, emotional crescendo"
Jazz Fusion (Complex - XL Advantage)
"jazz fusion, saxophone lead, electric piano, walking bass, brushed drums, 140bpm, smoky atmosphere, sophisticated, improvisational feel, vintage tone"
XL-Specific Tag Benefits
Better compound understanding: "melancholic yet hopeful" - XL captures both moods
Enhanced spatial tags: "wide stereo field, intimate center vocal" - XL renders more accurately
Improved temporal tags: "slow build, sudden drop" - XL follows structure better
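The prompts above all follow the same shape: a genre, a run of descriptors, and a tempo, joined by commas. A small helper can assemble that string and drop accidental duplicates. build_tags is a hypothetical function, and placing the tempo last is an arbitrary choice; the manual does not say whether tag order matters.

```python
from typing import Optional


def build_tags(genre: str, *descriptors: str, bpm: Optional[int] = None) -> str:
    """Join a genre, descriptors and an optional tempo into a tag string."""
    parts = [genre, *descriptors]
    if bpm is not None:
        parts.append(f"{bpm}bpm")
    cleaned = []
    seen = set()
    for part in parts:
        tag = part.strip()
        key = tag.lower()
        if tag and key not in seen:   # skip blanks and case-insensitive repeats
            seen.add(key)
            cleaned.append(tag)
    return ", ".join(cleaned)


prompt = build_tags(
    "jazz fusion", "saxophone lead", "electric piano", "walking bass", bpm=140
)
# prompt == "jazz fusion, saxophone lead, electric piano, walking bass, 140bpm"
```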
Lyric Structure Tags
Control song structure with bracket notation
[intro]
Opening section
[verse]
Main vocal sections
[verse 1] / [verse 2]
Numbered verses
[pre-chorus]
Build-up to chorus
[chorus]
Main hook/refrain
[bridge]
Contrasting middle
[hook]
Catchy phrase
[refrain]
Recurring line
[interlude]
Instrumental break
[breakdown]
Stripped-back section (XL handles these better)
[outro] / [post-outro]
Extended outros
[ad-lib]
Improvised vocals
[inst]
Instrumental section
XL Lyric Structure Best Practices
Always use brackets [ ] for structure tags
Tags are case-insensitive, but use consistent casing for readability
Empty lines between sections help the model parse structure
Repetition markers like "(Repeat x2)" are supported
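These conventions (bracketed tags, consistent casing, a blank line between sections) are easy to apply mechanically. The sketch below assembles a lyrics string from (tag, lines) pairs; format_lyrics is a hypothetical helper, and lowercasing the tags is just one way to standardize them.

```python
from typing import Iterable, List, Tuple


def format_lyrics(sections: Iterable[Tuple[str, List[str]]]) -> str:
    """Render sections as '[tag]' headers with blank lines in between."""
    blocks = []
    for tag, lines in sections:
        name = tag.strip().strip("[]").strip().lower()  # accept 'Chorus' or '[chorus]'
        if not name:
            raise ValueError("section tag must not be empty")
        blocks.append("\n".join([f"[{name}]", *lines]))
    return "\n\n".join(blocks)  # the empty line helps the model parse structure


lyrics = format_lyrics([
    ("intro", []),
    ("Verse 1", ["In the silence of the night", "I hear echoes of your voice"]),
    ("[chorus]", ["I am stronger now, stronger now!"]),
])
print(lyrics)
```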
Chapter 06
Demo Prompts
XL-optimized examples showcasing capabilities
Cinematic Pop & Progressive Rock
Complete working examples for popular genres
Cinematic Pop (XL Showcase)
Tags
cinematic pop, powerful female vocals, orchestral strings, piano, dramatic drums, emotional, 95bpm, wide stereo field, dynamic range, intimate verses, explosive chorus
Lyrics Sample
[intro - piano only]
[verse 1]
In the silence of the night
I hear echoes of your voice
[pre-chorus]
And I'm rising, slowly rising
[chorus]
I am stronger now, stronger now!
Through the storm I found my way
XL Settings: Duration: 180s, Steps: 55, CFG: 4.0, Shift: 4.5
Progressive Rock (XL Complexity)
Tags
progressive rock, complex time signatures, virtuoso guitar solo, hammond organ, thunderous bass, precise drums, 7/8 time, 145bpm, vintage analog tone, dynamic shifts, epic arrangement
XL Settings: Duration: 300s, Steps: 70, CFG: 3.8, Shift: 5.0
Ambient & Multilingual Examples
Long-form and international music generation
Ambient Soundscape (XL Long-Form)
Tags
ambient, atmospheric, ethereal pads, field recordings, no drums, glacial pace, deep sub frequencies, evolving textures, meditative, 60bpm, binaural, spatial audio, healing frequencies
XL Settings: Duration: 360s, Steps: 80, CFG: 3.0, Shift: 3.5 | Note: XL's extended coherence shines in long ambient pieces
Mandarin C-Pop Ballad (XL Enhanced)
Tags
c-pop ballad, emotional female vocals, piano, strings, cinematic, 72bpm, mandarin chinese, tear-jerking, powerful vocal runs, dramatic builds
Lyrics Sample
[verse 1]
[zh]zou3 guo4 jie1 jiao3
[zh]kan4 guo4 xi1 yang2
[chorus]
[zh]wo3 ai4 ni3 ai4 de shen1
XL Settings: Duration: 195s, Steps: 55, CFG: 4.3 | XL Advantage: Better tonal accuracy for Mandarin
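The lyrics sample above prefixes each sung line with a language marker while leaving structure tags untouched. The sketch below applies that pattern to a block of lyrics. tag_language is a hypothetical helper that only mirrors the formatting shown in the sample; which language codes the model accepts is not covered here.

```python
def tag_language(lyrics: str, lang: str = "zh") -> str:
    """Prefix each sung line with '[lang]', skipping blanks and structure tags."""
    prefix = f"[{lang}]"
    out = []
    for line in lyrics.splitlines():
        text = line.strip()
        is_structure_tag = text.startswith("[") and text.endswith("]")
        if not text or is_structure_tag or text.startswith(prefix):
            out.append(text)            # leave blanks, tags, already-marked lines
        else:
            out.append(prefix + text)
    return "\n".join(out)


marked = tag_language("[verse 1]\nzou3 guo4 jie1 jiao3\nkan4 guo4 xi1 yang2")
# marked == "[verse 1]\n[zh]zou3 guo4 jie1 jiao3\n[zh]kan4 guo4 xi1 yang2"
```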
Chapter 07
Quick Reference
Command cheat sheet and essential links
XL Command Cheat Sheet
Quick settings optimized for XL models
Task | Key Settings (XL-Optimized) |
Quick Draft | Model: xl-turbo, Steps: 12, CFG: 4.0, Duration: 60s |
High Quality | Model: xl-sft, Steps: 55, CFG: 4.0, Duration: 180s |
Maximum Quality | Model: xl-sft, Steps: 70, CFG: 3.8, Duration: 240s |
Long Form | Model: xl-sft, Steps: 70, CFG: 3.5, Duration: 300-360s |
Instrumental | Lyrics: [inst], instrumental flag, Duration: 180s |
Style Transfer | Audio2Audio, Denoise: 0.5, New tags, CFG: 4.0 |
Lyric Change | Audio2Audio, Same audio, New lyrics, Denoise: 0.4 |
Batch (24GB) | Batch size: 6, Duration: 150s, Steps: 55 |
Batch (16GB) | Batch size: 2, cpu_offload: True, Duration: 120s |
Memory Profiles
# High VRAM (40GB+)
model = "acestep-v15-xl-sft"
batch_size = 10
cpu_offload = False
bf16 = True
torch_compile = True

# Standard VRAM (24GB)
model = "acestep-v15-xl-sft"
batch_size = 4
cpu_offload = False
bf16 = True

# Low VRAM (16GB)
model = "acestep-v15-xl-turbo"
batch_size = 2
cpu_offload = True
bf16 = True
overlapped_decode = True
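The three memory profiles can be selected from the amount of available VRAM. A minimal sketch, assuming a hypothetical select_profile helper; the values are copied from the profiles above, and picking the largest profile that fits is an interpretation rather than a documented rule.

```python
# (minimum VRAM in GB, settings), ordered from largest to smallest.
MEMORY_PROFILES = [
    (40, {"model": "acestep-v15-xl-sft", "batch_size": 10,
          "cpu_offload": False, "bf16": True, "torch_compile": True}),
    (24, {"model": "acestep-v15-xl-sft", "batch_size": 4,
          "cpu_offload": False, "bf16": True}),
    (16, {"model": "acestep-v15-xl-turbo", "batch_size": 2,
          "cpu_offload": True, "bf16": True, "overlapped_decode": True}),
]


def select_profile(vram_gb: float) -> dict:
    """Return a copy of the largest profile that fits in the given VRAM."""
    for min_vram, profile in MEMORY_PROFILES:
        if vram_gb >= min_vram:
            return dict(profile)
    raise ValueError("no XL profile is listed for less than 16GB of VRAM")


settings = select_profile(24)   # the Standard VRAM profile
```

If PyTorch is installed, the available VRAM in GB can be read with torch.cuda.get_device_properties(0).total_memory / 1024**3 and passed in directly.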
Resources
Essential links and documentation
ACE-Step 1.5 XL Official - Updated XL documentation and samples
ace-step/ACE-Step-1.5-XL - XL model code and notebooks
graydient/ACE-Step-1.5-XL - Model checkpoints - XL variants
Detailed XL workflow guides - XL-specific node configurations
Base vs XL benchmarks - Quality metrics and VRAM requirements
This manual synthesizes official documentation with community-tested best practices for XL model optimization.
Chapter 08
What's New in ACE-Step 1.5 XL
Major improvements and migration guide
XL Model Improvements
Architecture enhancements and quality improvements
Feature | v1.5 Base | v1.5 XL | Improvement |
Max Duration | 240s (4 min) | 360s (6 min) | +50% longer |
Model Size | Standard | XL (Larger DiT) | Better quality |
Long-form Coherence | Good | Excellent | Less drift |
Vocal Clarity | Good | Superior | Better articulation |
Stereo Imaging | Standard | Enhanced | Wider soundstage |
Genre Understanding | Strong | Stronger | More nuanced |
Additional Quality Improvements
Reduced artifacts in high-frequency content
Better lyric-to-melody alignment for sung vocals
Improved compositional structure for 3+ minute tracks
Enhanced instrument separation in dense mixes
More stable pitch across vocal performances
Better dynamics between quiet and loud sections
Migration Guide: Base to XL
Updated defaults and breaking changes
Parameter | v1.5 Base | v1.5 XL | Notes |
steps | 27-50 | 35-60 | XL benefits from more steps |
guidance_scale | 4.0-15 | 3.5-12 | Lower CFG often better |
shift | 3 (turbo) | 4-5 | Higher for better structure |
duration sweet spot | 90-120s | 120-180s | XL excels at longer form |
Recommended Settings for XL
# Optimal XL Configuration
model = "acestep-v15-xl-sft"
steps = 50
guidance_scale = 4.5
shift = 4.5
duration = 180  # 3 minutes
sampler = "er_sde"
scheduler = "linear_quadratic"
Breaking Changes:
Memory Requirements: Base model: 16GB VRAM minimum. XL model: 24GB VRAM recommended. Use cpu_offload=True for lower VRAM (slower).
Inference Time: XL is approximately 1.3x slower than base. Use turbo models for faster iteration.
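One way to carry existing base-model settings over to XL is to clamp each value into the XL range from the migration table. The sketch below does that and also scales a known base render time by the stated 1.3x factor. migrate_to_xl and estimate_xl_seconds are hypothetical helpers, and clamping is only one possible migration policy.

```python
# XL ranges taken from the migration table.
XL_RANGES = {
    "steps": (35, 60),
    "guidance_scale": (3.5, 12.0),
    "shift": (4.0, 5.0),
}
XL_SLOWDOWN = 1.3  # "approximately 1.3x slower than base"


def migrate_to_xl(base_settings: dict) -> dict:
    """Copy settings, clamping steps, guidance_scale and shift into XL ranges."""
    xl = dict(base_settings)
    for key, (low, high) in XL_RANGES.items():
        if key in xl:
            xl[key] = max(low, min(high, xl[key]))
    return xl


def estimate_xl_seconds(base_seconds: float) -> float:
    """Rough XL render time from a measured base-model render time."""
    return base_seconds * XL_SLOWDOWN


old = {"steps": 27, "guidance_scale": 15, "shift": 3, "seed": 42}
new = migrate_to_xl(old)
# new == {"steps": 35, "guidance_scale": 12.0, "shift": 4.0, "seed": 42}
```

Clamping only brings values inside the allowed ranges. The recommended XL sweet spot (CFG 3.5-4.5) is much narrower, so a migrated guidance_scale of 12.0 is still worth lowering by hand.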
Version Comparison Summary
Choose the right model for your needs
Feature | v1.5 Base | v1.5 XL |
Max Duration | 240s | 360s |
Sweet Spot | 90-120s | 120-180s |
VRAM (min) | 16GB | 20GB |
VRAM (recommended) | 20GB | 24GB |
CFG Range (recommended) | 4.0-6.0 | 3.5-4.5 |
Steps Range | 27-50 | 35-60 |
Coherence | Good | Excellent |
Vocal Quality | Good | Superior |
Artifact Reduction | Standard | Enhanced |
Batch Size (24GB) | 8 | 4-6 |
Inference Speed | 1x | 0.77x |
Use Base (v1.5) when:
Limited VRAM (16-20GB)
Need maximum speed
Shorter tracks (< 120s)
Quick iterations
Use XL (v1.5 XL) when:
24GB+ VRAM available
Quality is priority
Longer tracks (120s+)
Complex arrangements
Final production
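The two checklists above can be condensed into a rough decision function. This is a sketch under stated assumptions: choose_family is a hypothetical helper, the thresholds come from the comparison table, and the way the criteria are combined is one reasonable reading rather than an official rule.

```python
BASE_MAX_DURATION_S = 240   # base model limit from the comparison table
RELATIVE_SPEED = {"base": 1.0, "xl": 0.77}


def choose_family(vram_gb: float, duration_s: float,
                  quality_first: bool = True) -> str:
    """Return 'base' or 'xl' for a planned generation."""
    if duration_s > BASE_MAX_DURATION_S:
        return "xl"     # only XL reaches beyond 240s, whatever the VRAM
    if vram_gb >= 24 and quality_first and duration_s >= 120:
        return "xl"     # 24GB+, quality priority, longer track
    return "base"       # limited VRAM, speed priority, or short track


family = choose_family(vram_gb=24, duration_s=180)
speed = RELATIVE_SPEED[family]
```

The 0.77x speed figure is the 1.3x slowdown expressed the other way around, since 1 / 1.3 is approximately 0.77.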
© 2026 Graydient AI