1 of 9

Invariant Audio Prints for Music Indexing and Alignment

Rémi Mignot¹, Geoffroy Peeters²

¹ STMS Lab – IRCAM, Sorbonne Université, CNRS (UMR-9912), Paris, France

² LTCI - Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

CBMI 2024

21st International Conference on Content-based Multimedia Indexing, September 18-20, Reykjavik, Iceland


2 of 9

Introduction

  • Audio Indexing

Find the “reference song” from a music catalog based on the signal content of a given audio excerpt

+ metadata

  • Robustness: find a solution which still works when the query is transformed / degraded

→ time stretching, pitch shifting, noise addition, distortion, audio effects, and different instruments (for alignment)

[Diagram: a query excerpt is matched against the catalog / database to produce the output reference; example with different instruments & pitch shifting.]

  • Audio-to-audio Alignment

Search for the time mapping between two occurrences of the same music (e.g. covers).

[Audio examples: original vs. modified tempo; original vs. degraded.]

  • Remark: we use the same approach for both tasks.

R. Mignot & G. Peeters, “Invariant Audio Prints for Music Indexing and Alignment,” CBMI 2024

3 of 9

Method overview

  • Process chain

Derivation of codes that are:
✓ robust to transformations / degradations, and
✓ relevant to the musical content (unlike spectrogram peak-pair methods).

[Diagram: x(t) audio signal → fn high-dimensional descriptors → vn reduced descriptors → hn hash codes → hash table search → h*n (found reference), which yields the reference metadata, time position, stretch factor, and time mapping.]

(1) High-dimensional audio keys (1056) → design of audio descriptors robust to some transformations,
(2) Dimension reduction (40) → learning of a linear projection robust to degradations,
(3) Hashing → hash codes tolerant to bit corruption (LSH-based),
(4) Time alignment → DTW-based alignment to estimate the time mapping.
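To illustrate step (3), the sketch below uses random-hyperplane LSH, a standard way to obtain binary codes tolerant to bit corruption: nearby descriptors share most bits, so a degraded query still lands close to its reference in Hamming distance. This is a minimal generic sketch, not the paper's actual hashing scheme; all names and parameters (`n_bits=32`, noise level) are illustrative.

```python
# Minimal sketch of LSH-style binary hashing (random hyperplanes); a generic
# stand-in for the paper's hashing stage, with illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)

def make_hasher(dim=40, n_bits=32, rng=rng):
    """Return a function mapping a reduced descriptor v to an n_bits binary code."""
    planes = rng.standard_normal((n_bits, dim))   # random projection directions
    def hasher(v):
        return (planes @ v > 0).astype(np.uint8)  # one bit per hyperplane side
    return hasher

def hamming(a, b):
    """Bit corruption between two codes = Hamming distance."""
    return int(np.sum(a != b))

hasher = make_hasher()
v = rng.standard_normal(40)
noisy = v + 0.05 * rng.standard_normal(40)  # mild degradation of the descriptor
d = hamming(hasher(v), hasher(noisy))       # expected to flip only a few bits
```

A hash table indexed on such codes can then tolerate a few corrupted bits by probing nearby codes.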

R. Mignot & G. Peeters, “Invariant Audio Prints for Music Indexing and Alignment,” CBMI 2024

4 of 9

High-dimensional Audio Keys

  • Audio descriptors
    • relevant to the musical content → inspired by audio classification (modulation spectrum),
    • robust to transformations by design.

(1)

    • Manipulations of sub-spectrograms:
      • logarithmic scale of frequencies and time (d),
      • frequency band splitting (e),
      • amplitude transformation (f),
      • magnitude of the 2D Discrete Fourier Transform (g).

    • Based on properties of:
      • the logarithmic function,
      • the shift invariance of |DFT|,
      • amplitude change.

→ The descriptors are robust by design to pitch and time changes, noise, and filtering.

→ dimension 1056
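The shift-invariance property underlying these descriptors can be checked numerically: the magnitude of the 2D DFT of a patch is unchanged by circular shifts, so a pitch shift or time offset, which becomes a shift on logarithmic axes, cancels out. The patch below is random data standing in for a log-scaled sub-spectrogram; the property itself is exact.

```python
# Check the |2D-DFT| shift-invariance property used by the descriptors:
# circularly shifting a (log-frequency x log-time) patch leaves the
# magnitude spectrum unchanged.
import numpy as np

rng = np.random.default_rng(1)
patch = rng.random((16, 16))                         # stand-in for a log-scaled sub-spectrogram
shifted = np.roll(patch, shift=(3, 5), axis=(0, 1))  # shift along frequency and time axes

mag = np.abs(np.fft.fft2(patch))
mag_shifted = np.abs(np.fft.fft2(shifted))
print(np.allclose(mag, mag_shifted))                 # True: magnitudes are identical
```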


5 of 9

Robust dimensional reduction

(2)

  • Dimension reduction 1056 → 40, with output variables vn having the properties:
    • centered, normalized, and mutually uncorrelated,
    • robust to transformations/degradations, and
    • discriminant with respect to the original signal.
  • Reduction of the dimensions
    • Chain of linear transformations:
      • discriminant analyses, Independent Component Analyses, and orthogonal projections.

[Diagram: reduction chain applied to the HD-Keys of x(t), with stages ICCR, LDA, ICA, OMPCA, and HT, and dimensions 1056 → 1026 → 80 → 80 → 40 → 40; training uses transformed versions xi(t) produced with audio effects.]

    • Training dataset: many transformed versions of music examples (~data augmentation).
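As a toy illustration of one linear stage of such a chain, the sketch below applies PCA whitening, which yields outputs that are centered, unit-variance, and mutually uncorrelated, exactly the first property required of the vn. This is a generic stand-in: the paper's discriminant stages (ICCR, LDA, ICA, OMPCA, HT), trained on degraded versions, are not reproduced here.

```python
# Generic PCA-whitening sketch: a linear projection whose outputs are
# centered, normalized, and mutually uncorrelated (identity covariance).
# Illustrative stand-in for one stage of the paper's reduction chain.
import numpy as np

rng = np.random.default_rng(2)
F = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 8))  # correlated features

mu = F.mean(axis=0)
cov = np.cov(F - mu, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)  # eigen-decomposition of the covariance
W = eigvec / np.sqrt(eigval)          # scale each eigenvector column by 1/sqrt(lambda)
V = (F - mu) @ W                      # whitened descriptors

# Covariance of V is the identity: components are normalized and uncorrelated.
print(np.allclose(np.cov(V, rowvar=False), np.eye(8), atol=1e-6))
```

In the paper's setting, the projection is instead learned so that a descriptor and its transformed versions map to nearby vn while different songs stay discriminable.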


6 of 9

Two experiments (see the paper)

  • Segmentation and indexing of music medleys → with time and pitch changes + degradations, within a reference catalog of ~40,000 songs.

[Audio examples: original medley, transformed medley, + realigned medley (right channel).]

  • Audio-to-audio time alignment of synthesized MIDI covers → with time-varying tempo, pitch shifting, changed instruments, and removed drums. → uses local distances computed on the derived audio codes.

[Audio examples: original synthesized MIDI song, transformed MIDI song, + realigned MIDI (right channel).]

  • See the paper for all the quantitative results.
  • Overall conclusions:
    • Audio indexing: not as good as other approaches for some degradations (e.g. noise), but still robust, especially to pitch and time changes.
    • Audio-to-audio time alignment: the results show that the derived audio codes are quite robust to transformations and representative of the musical content, even with different instruments.


7 of 9

Bonus experiment

  • Time alignment of an acoustic guitar cover of Little Wing (Jimi Hendrix).

Acoustic guitar + voice cover: “Little Wing” by Corey Heuvel
  • Tempo: ~60 BPM, with accelerations/decelerations,
  • some longer transitions,
  • quite different scores,
  • but the structure is respected (at the beginning).

Original recording: “Little Wing” (Jimi Hendrix)
  • Tempo: ~70 BPM


8 of 9

Bonus experiment

  • Time alignment of an acoustic guitar cover of Little Wing (Jimi Hendrix).

[Video: left channel: unchanged cover sound; right channel: synchronized original recording; then all channels: synchronized original recording.]

  1. Processing: Time alignment between the two recordings based on the derived audio prints and DTW.
  2. Realignment: The original recording of Hendrix is then realigned to the cover, and inserted into the video.

Remark: at the longer transitions, e.g. between the 2nd verse and the solo at 1:40, the original is strongly stretched.
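Step 1 above can be sketched with a textbook DTW: the accumulated cost over local distances between the two sequences of audio codes is minimized, and backtracking recovers the warping path, i.e. the time mapping used to realign the original to the cover. This is a minimal generic DTW, not the paper's exact alignment procedure, and the toy 1-D sequences below merely stand in for code sequences.

```python
# Minimal DTW sketch: align two sequences of audio codes via accumulated
# local distances, then backtrack to recover the time mapping (warping path).
import numpy as np

def dtw_path(A, B):
    """Return the warping path aligning code sequences A (n x d) and B (m x d)."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])  # local distance between codes
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1) along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

A = np.array([[0.0], [1.0], [2.0], [3.0]])
B = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])  # same content, locally slowed
path = dtw_path(A, B)
print(path[0], path[-1])  # path spans (0, 0) to (3, 4)
```

The recovered path maps each frame of one recording to a frame of the other; repeated indices on one side correspond to locally stretched passages, like the strongly stretched transitions noted above.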


9 of 9

THANK YOU

For more details:

    • Questions?
    • Read the paper #107:
      • Rémi Mignot & Geoffroy Peeters,
      • “Invariant Audio Prints for Music Indexing and Alignment”
    • See the poster this afternoon.