16 of 20

How Wav2Vec2 Understands Time

Given audio: 2000 numbers (representing 2 seconds of audio)

Output: 2 phonemes

How do we know which numbers in the audio represent the phonemes?

Consider saying the sound “ah” for 1 second

1 second = 1,000 ms

0ms   40ms   60ms … 1000ms

ɑ   ɑ   ɑ   ɑ

We represent this in 20ms chunks

And collapse these duplicate chunks into a single ɑ
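The chunk-and-collapse idea can be sketched in a few lines of Python. This is a minimal illustration, assuming CTC-style greedy decoding; the blank symbol `_` and the helper names `frames_for` and `ctc_collapse` are hypothetical and do not come from the slides.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; a real vocabulary defines its own


def frames_for(duration_ms, frame_ms=20):
    """Number of fixed-size chunks that cover the audio."""
    return duration_ms // frame_ms


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge runs of identical labels, then drop blank labels."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# One second of "ah" becomes 50 chunks that all carry the same label.
one_second_of_ah = ["ɑ"] * frames_for(1000)
print(len(one_second_of_ah))
print(ctc_collapse(one_second_of_ah))

# A blank between two runs keeps a genuinely repeated sound from merging.
print(ctc_collapse(["h", "h", "_", "ə", "l", "l", "_", "l", "oʊ"]))
```

The blank label is what lets a decoder tell one long sound apart from the same sound spoken twice in a row.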

Input: [72, 83, 937, 283…]

Output: həloʊ wɝld

17 of 20

Architecture Efficiency & Alignment

Quantization:

  • Shrinks model size and speeds up inference.
  • Critical for running feedback in real time on devices like phones.
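To make the size reduction concrete, here is a rough sketch of symmetric 8-bit quantization on a handful of weights. It is an illustration only, assuming a single scale per tensor; the function names and sample weights are made up, and real toolkits use more elaborate schemes.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]


weights = [0.50, -1.27, 0.03, 0.90]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Storage cost: 4 bytes per float32 weight vs 1 byte per int8 weight.
bytes_fp32 = len(weights) * 4
bytes_int8 = len(weights) * 1
print(quantized, bytes_fp32, bytes_int8)

# The price is a small rounding error, at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(max_error <= scale / 2 + 1e-12)
```

The fourfold drop in bytes per weight is where the smaller model comes from; the rounding error is the "lost nuance" side of the tradeoff.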

Model Size Tradeoffs:

  • Small models = faster and lighter, but may lose nuance.
  • Ideal: balance between phonetic accuracy and deployment speed.

18 of 20

Cool Application

A S K vs A K S

[Figure values: 0.02, 0.01, 0.03, 0.01, 0.02, 0.03]

WhisperX Architecture:

wav2vec2

19 of 20

Comparing Sequences of Phonemes
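One common way to compare two phoneme sequences is edit distance (Levenshtein distance), optionally normalized into a phoneme error rate. The slides do not say which metric is used, so the sketch below is an assumption, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance divided by the length of the reference sequence."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


expected = ["A", "S", "K"]
spoken = ["A", "K", "S"]
print(edit_distance(expected, spoken))
print(phoneme_error_rate(expected, spoken))
```

Plain edit distance treats the swapped pair as two separate errors; a feedback tool might prefer a variant that counts a transposition as one.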

20 of 20

Architectural Differences

Whisper vs Wav2Vec2 vs Conformer

Conformer: models local and global dependencies; moderately fast; efficient when fine-tuned.

Whisper: trained on massive multilingual data; high ASR accuracy, but slow inference.

wav2vec2 (XLSR-60): excels at phonetic detail, fast inference, and works well in low-resource settings.

Pronunciation Feedback Suitability: wav2vec2 is best suited; Conformer is viable with tuning; Whisper is less ideal because of its latency and because it is not primed for phonetic transcription.