1 of 9

Invariant Audio Prints for Music Indexing and Alignment

Rémi Mignot¹, Geoffroy Peeters²

¹ STMS Lab – IRCAM, Sorbonne Université, CNRS (UMR-9912), Paris, France

² LTCI - Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

CBMI 2024

21st International Conference on Content-based Multimedia Indexing, September 18-20, Reykjavik, Iceland


2 of 9

Introduction

  • Audio Indexing

Find the “reference song” from a music catalog based on the signal content of a given audio excerpt

+ metadata

  • Robustness: find a solution which still works when the query is transformed / degraded

→ time stretching, pitch shifting, noise addition, distortion, audio effects, and different instruments (for alignment)

[Diagram: a query excerpt is matched against the catalog / database to produce the output reference; example with different instruments & pitch shifting.]

  • Audio-to-audio Alignment

Search for the time mapping between two occurrences of the same music (e.g. covers).

[Audio examples: original vs. modified tempo; original vs. degraded.]

  • Remark: we use the same approach for both tasks.

R. Mignot & G. Peeters, “Invariant Audio Prints for Music Indexing and Alignment,” CBMI 2024

3 of 9

Method overview

  • Process chain

Derivation of codes that are:
✓ robust to transformations / degradations, and
✓ relevant to the musical content (unlike spectrogram peak-pair methods).

[Diagram: x(t) audio signal → fn high-dimensional descriptors → vn reduced descriptors → hn hash codes → hash table search → h*n (found reference), which yields the reference metadata, time position, stretch factor, and time mapping.]

(1) High-dimensional audio keys (1056) → design of audio descriptors robust to some transformations,
(2) Dimension reduction (40) → learning of a linear projection robust to degradations,
(3) Hashing → hash codes tolerant to bit corruption (LSH-based),
(4) Time alignment → DTW-based alignment to estimate the time mapping.
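To illustrate step (3), the sketch below uses random-hyperplane LSH, a standard way to obtain binary codes tolerant to bit corruption: nearby descriptors share most bits, so a degraded query still lands close to its reference in Hamming distance. This is a minimal generic sketch, not the paper's actual hashing scheme; all names and parameters (`n_bits=32`, noise level) are illustrative.

```python
# Minimal sketch of LSH-style binary hashing (random hyperplanes); a generic
# stand-in for the paper's hashing stage, with illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)

def make_hasher(dim=40, n_bits=32, rng=rng):
    """Return a function mapping a reduced descriptor v to an n_bits binary code."""
    planes = rng.standard_normal((n_bits, dim))   # random projection directions
    def hasher(v):
        return (planes @ v > 0).astype(np.uint8)  # one bit per hyperplane side
    return hasher

def hamming(a, b):
    """Bit corruption between two codes = Hamming distance."""
    return int(np.sum(a != b))

hasher = make_hasher()
v = rng.standard_normal(40)
noisy = v + 0.05 * rng.standard_normal(40)  # mild degradation of the descriptor
d = hamming(hasher(v), hasher(noisy))       # expected to flip only a few bits
```

A hash table indexed on such codes can then tolerate a few corrupted bits by probing nearby codes.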

R. Mignot & G. Peeters, “Invariant Audio Prints for Music Indexing and Alignment,” CBMI 2024

4 of 9

High-dimensional Audio Keys

  • Audio descriptors
    • relevant to the musical content → inspired by audio classification (modulation spectrum),
    • robust to transformations by design.

(1)

    • Manipulations of sub-spectrograms:
      • logarithmic scale of frequencies and time (d),
      • frequency band splitting (e),
      • amplitude transformation (f),
      • magnitude of the 2D Discrete Fourier Transform (g).

    • Based on properties of:
      • the logarithmic function,
      • the shift invariance of |DFT|,
      • amplitude change.

→ The descriptors are robust by design to pitch and time changes, noise, and filtering.

→ dimension 1056
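The shift-invariance property underlying these descriptors can be checked numerically: the magnitude of the 2D DFT of a patch is unchanged by circular shifts, so a pitch shift or time offset, which becomes a shift on logarithmic axes, cancels out. The patch below is random data standing in for a log-scaled sub-spectrogram; the property itself is exact.

```python
# Check the |2D-DFT| shift-invariance property used by the descriptors:
# circularly shifting a (log-frequency x log-time) patch leaves the
# magnitude spectrum unchanged.
import numpy as np

rng = np.random.default_rng(1)
patch = rng.random((16, 16))                         # stand-in for a log-scaled sub-spectrogram
shifted = np.roll(patch, shift=(3, 5), axis=(0, 1))  # shift along frequency and time axes

mag = np.abs(np.fft.fft2(patch))
mag_shifted = np.abs(np.fft.fft2(shifted))
print(np.allclose(mag, mag_shifted))                 # True: magnitudes are identical
```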


5 of 9

Robust dimensional reduction

(2)

  • Dimension reduction 1056 → 40, with output variables vn having the properties:
    • centered, normalized, and mutually uncorrelated,
    • robust to transformations/degradations, and
    • discriminant with respect to the original signal.
  • Reduction of the dimensions
    • Chain of linear transformations:
      • discriminant analyses, Independent Component Analyses, and orthogonal projections.

[Diagram: reduction chain applied to the HD-Keys of x(t), with stages ICCR, LDA, ICA, OMPCA, and HT, and dimensions 1056 → 1026 → 80 → 80 → 40 → 40; training uses transformed versions xi(t) produced with audio effects.]

    • Training dataset: many transformed versions of music examples (~data augmentation).
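As a toy illustration of one linear stage of such a chain, the sketch below applies PCA whitening, which yields outputs that are centered, unit-variance, and mutually uncorrelated, exactly the first property required of the vn. This is a generic stand-in: the paper's discriminant stages (ICCR, LDA, ICA, OMPCA, HT), trained on degraded versions, are not reproduced here.

```python
# Generic PCA-whitening sketch: a linear projection whose outputs are
# centered, normalized, and mutually uncorrelated (identity covariance).
# Illustrative stand-in for one stage of the paper's reduction chain.
import numpy as np

rng = np.random.default_rng(2)
F = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 8))  # correlated features

mu = F.mean(axis=0)
cov = np.cov(F - mu, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)  # eigen-decomposition of the covariance
W = eigvec / np.sqrt(eigval)          # scale each eigenvector column by 1/sqrt(lambda)
V = (F - mu) @ W                      # whitened descriptors

# Covariance of V is the identity: components are normalized and uncorrelated.
print(np.allclose(np.cov(V, rowvar=False), np.eye(8), atol=1e-6))
```

In the paper's setting, the projection is instead learned so that a descriptor and its transformed versions map to nearby vn while different songs stay discriminable.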


6 of 9

Two experiments (see the paper)

  • Segmentation and indexing of music medleys → with time and pitch changes + degradations, within a reference catalog of ~40,000 songs.

[Audio examples: original medley, transformed medley, + realigned medley (right channel).]

  • Audio-to-audio time alignment of synthesized MIDI covers → with time-varying tempo, pitch shifting, changed instruments, and removed drums. → uses local distances computed on the derived audio codes.

[Audio examples: original synthesized MIDI song, transformed MIDI song, + realigned MIDI (right channel).]

  • See the paper for all the quantitative results.
  • Overall conclusions:
    • Audio indexing: not as good as other approaches for some degradations (e.g. noise), but still robust, especially to pitch and time changes.
    • Audio-to-audio time alignment: the results show that the derived audio codes are quite robust to transformations and representative of the musical content, even with different instruments.


7 of 9

Bonus experiment

  • Time alignment of an acoustic guitar cover of Little Wing (Jimi Hendrix).

Acoustic guitar + voice cover: “Little Wing” by Corey Heuvel
  • Tempo: ~60 BPM, with accelerations/decelerations,
  • some longer transitions,
  • quite different scores,
  • but the structure is respected (at the beginning).

Original recording: “Little Wing” (Jimi Hendrix)
  • Tempo: ~70 BPM


8 of 9

Bonus experiment

  • Time alignment of an acoustic guitar cover of Little Wing (Jimi Hendrix).

[Video: left channel: unchanged cover sound; right channel: synchronized original recording; then all channels: synchronized original recording.]

  1. Processing: Time alignment between the two recordings based on the derived audio prints and DTW.
  2. Realignment: The original recording of Hendrix is then realigned to the cover, and inserted into the video.

Remark: at the longer transitions, e.g. between the 2nd verse and the solo at 1:40, the original is strongly stretched.
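Step 1 above can be sketched with a textbook DTW: the accumulated cost over local distances between the two sequences of audio codes is minimized, and backtracking recovers the warping path, i.e. the time mapping used to realign the original to the cover. This is a minimal generic DTW, not the paper's exact alignment procedure, and the toy 1-D sequences below merely stand in for code sequences.

```python
# Minimal DTW sketch: align two sequences of audio codes via accumulated
# local distances, then backtrack to recover the time mapping (warping path).
import numpy as np

def dtw_path(A, B):
    """Return the warping path aligning code sequences A (n x d) and B (m x d)."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])  # local distance between codes
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1) along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

A = np.array([[0.0], [1.0], [2.0], [3.0]])
B = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])  # same content, locally slowed
path = dtw_path(A, B)
print(path[0], path[-1])  # path spans (0, 0) to (3, 4)
```

The recovered path maps each frame of one recording to a frame of the other; repeated indices on one side correspond to locally stretched passages, like the strongly stretched transitions noted above.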


9 of 9

THANK YOU

For more details:

    • Questions?
    • Read the paper #107:
      • Rémi Mignot & Geoffroy Peeters,
      • “Invariant Audio Prints for Music Indexing and Alignment”
    • See the poster this afternoon.