1 of 28

Introduction To Natural Language Processing

Sreyan Ghosh

Deep Learning Solutions Architect @ NVIDIA

Researcher @ MIDAS Labs, IIIT Delhi; Speech Lab, IIT Madras

2 of 28

Phases of Innovation in Artificial Intelligence

3 of 28

The Various Domains of AI

NLP lies at the intersection of computational linguistics and machine learning.

4 of 28

Natural Language Processing (NLP)

Human language is special for several reasons. It is specifically constructed to convey the speaker/writer's meaning. It is a complex system, yet young children can learn it remarkably quickly.

1. Ambiguity

2. Scale

3. Sparsity

4. Variation

5. Expressivity

6. Unmodeled Variables

7. Unknown Representations

Ambiguity at multiple levels:

  1. Word senses: bank (finance or river?)
  2. Part of speech: chair (noun or verb?)
  3. Syntactic structure: I can see a man with a telescope

Why is NLP Difficult?

5 of 28

  • Autocorrect and Autocomplete
    • Grammar Check
    • Spelling Check
  • Language Translator
  • Social Media Monitoring
  • Chatbots
  • Survey Analysis
  • Targeted Advertising
  • Hiring and Recruitment
  • Email Filtering
  • Voice Assistants
  • Smart Home Applications
  • Smart Car Applications
  • Subtitle Generation
  • Teaching
  • Minutes of Meeting

Applications of Text and Speech Processing

Level Of Linguistic Knowledge

6 of 28

Question Answering

Text Classification

Text Summarization

Language Modelling

Machine Translation

Sequence Tagging

Data Augmentation

Information Extraction

NER

Abstractive

Extractive

Sentiment Classification

Toxicity Classification

Text Processing

7 of 28

Syntactic analysis, also referred to as syntax analysis or parsing, is the process of analyzing natural language with the rules of a formal grammar. Grammatical rules are applied to categories and groups of words, not individual words. Syntactic analysis assigns a syntactic (grammatical) structure to text.

For example, a sentence includes a subject and a predicate, where the subject is a noun phrase and the predicate is a verb phrase. Take a look at the following sentence: “The dog (noun phrase) went away (verb phrase).” Note how a noun phrase can be combined with a verb phrase to form a sentence. It is important to reiterate that a sentence can be syntactically correct but still not make sense.

Parsing refers to the formal analysis of a sentence by a computer into its constituents. The result is a parse tree that visually shows the syntactic relations between the constituents and can be used for further processing and understanding.

Semantic analysis is the process of understanding the meaning and interpretation of words, signs and sentence structure. This lets computers partly understand natural language the way humans do. I say partly because semantic analysis is one of the toughest parts of NLP and it's not fully solved yet.

Syntactic and Semantic Analysis

A Parse Tree
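As a small illustration (not from the original slides), here is a minimal constituency-parsing sketch using the NLTK library; the toy grammar and sentence are assumptions made purely for demonstration.

    # Minimal constituency-parsing sketch with NLTK and a toy grammar.
    # The grammar covers only the example sentence "the dog went away".
    import nltk

    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N
        VP  -> V Adv
        Det -> 'the'
        N   -> 'dog'
        V   -> 'went'
        Adv -> 'away'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(["the", "dog", "went", "away"]):
        tree.pretty_print()   # prints the parse tree: (S (NP the dog) (VP went away))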

8 of 28

NLP Before Deep Learning

Count Vectorizer

https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html

Random Forests

TF-IDF
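As a rough sketch of this pre-deep-learning pipeline (not from the original slides), the snippet below combines TF-IDF features with a Random Forest classifier using scikit-learn; the tiny dataset is made up for illustration.

    # TF-IDF features + Random Forest: a classical text-classification pipeline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline

    texts  = ["great movie, loved it", "terrible plot, waste of time",
              "wonderful acting", "boring and slow"]
    labels = [1, 0, 1, 0]                            # 1 = positive, 0 = negative

    clf = Pipeline([
        ("tfidf", TfidfVectorizer()),                # text -> sparse TF-IDF vectors
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ])
    clf.fit(texts, labels)
    print(clf.predict(["loved the acting"]))         # expected: [1]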

9 of 28

A Basic Neural Network

LSTM

Neural Networks, Deep Learning and NLP

10 of 28

Word Embeddings

Why are word embeddings important?

Example: In web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.

But:

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal.

There is no natural notion of similarity for one-hot vectors!

Solution: Learn to encode similarity in vectors themselves.

Distributional semantics: A word’s meaning is given by the words that frequently appear close by.

  • “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)

When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window). Use the many contexts of w to build up a representation of w.
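The orthogonality problem and its fix can be made concrete with a small NumPy sketch (not from the original slides); the dense vectors below are made-up stand-ins for learned embeddings.

    # One-hot vectors for "motel" and "hotel" have zero cosine similarity,
    # while (hypothetical) learned dense embeddings can encode their similarity.
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    motel_onehot = np.zeros(15); motel_onehot[10] = 1.0
    hotel_onehot = np.zeros(15); hotel_onehot[7]  = 1.0
    print(cosine(motel_onehot, hotel_onehot))        # 0.0 -- orthogonal

    motel_emb = np.array([0.8, 0.1, 0.5, -0.2])      # made-up dense vectors
    hotel_emb = np.array([0.7, 0.2, 0.6, -0.1])
    print(cosine(motel_emb, hotel_emb))              # close to 1.0 -- similar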

11 of 28

Representing Word Vectors by their Context

https://jalammar.github.io/illustrated-word2vec/

https://machinelearninginterview.com/topics/natural-language-processing/what-is-the-difference-between-word2vec-and-glove/

Continuous Bag of Words (CBOW) and Skip-gram are two ways Word2Vec is trained.

Word2Vec and GloVe: two popular ways of training word embeddings

Word2Vec: Words that are closer in meaning in general-domain English end up closer to each other in the multi-dimensional vector space.
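A minimal training sketch, assuming the gensim library (not part of the slides): CBOW and skip-gram are selected with a single flag; the two-sentence corpus and hyperparameters are illustrative only.

    # Training Word2Vec on a toy corpus with gensim (sg=1 -> skip-gram, sg=0 -> CBOW).
    from gensim.models import Word2Vec

    sentences = [["i", "booked", "a", "hotel", "in", "seattle"],
                 ["the", "motel", "near", "seattle", "was", "cheap"]]

    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["hotel"][:5])                     # first 5 dimensions of the vector
    print(model.wv.similarity("hotel", "motel"))     # cosine similarity of the two words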

12 of 28

Language Models

RNN-based Language Model

Transformer-based Language Model

Language Models are used for various use cases, including speech recognition, sentence scoring, novel sequence generation, and, more recently, contextual word-embedding generation.
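To make “sentence scoring” concrete, here is a minimal count-based bigram language model (a sketch, not from the slides); the two-sentence corpus and add-one smoothing are illustrative simplifications.

    # A tiny bigram LM with add-one smoothing that assigns a log-probability to a sentence.
    from collections import Counter
    import math

    corpus = ["<s> the dog went away </s>", "<s> the cat sat down </s>"]
    tokens = [t for line in corpus for t in line.split()]
    unigrams = Counter(tokens)
    bigrams  = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)

    def sentence_logprob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        lp = 0.0
        for w1, w2 in zip(words, words[1:]):
            p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)   # add-one smoothing
            lp += math.log(p)
        return lp

    print(sentence_logprob("the dog went away"))     # higher (less negative) score
    print(sentence_logprob("away dog the went"))     # lower score for an unlikely word order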

13 of 28

Attention: The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. It sounds abstract, so here is an easy example: when reading this text, you focus on the word you are currently reading, but at the same time your mind holds the important keywords of the text in memory to provide context. The same goes for machine translation.

The Attention Mechanism

14 of 28

Summary of Popular Attention Mechanisms

https://lilianweng.github.io/posts/2018-06-24-attention/

15 of 28

Transformers and Modern Day NLP

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

https://ai.stackexchange.com/questions/20075/why-does-the-transformer-do-better-than-rnn-and-lstm-in-long-range-context-depen

Visual Representation of the Transformer Model

16 of 28

Self Attention Mechanism

The Mantra: you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values’. Each score/weight is, in short, the relevance between the ‘query’ and each ‘key’. You then reweight the ‘values’ with these scores/weights and take the sum of the reweighted ‘values’.

Visual Representation of how Query, Keys and Values Interact

https://data-science-blog.com/blog/2021/04/07/multi-head-attention-mechanism/

17 of 28

Self Attention Mechanism (Continued)

Visual Representation of Scaled Dot-Product Attention

Visual Representation of Matrix Multiplication Operations in Scaled Dot-Product Attention
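The mantra above maps directly onto a few lines of NumPy; this is a sketch of scaled dot-product attention (not taken from the slides), with small made-up matrix sizes.

    # Scaled dot-product attention: weights = softmax(Q K^T / sqrt(d_k)), output = weights V.
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                       # query-key relevance scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
        return weights @ V, weights                           # reweighted sum of the values

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
    K = rng.normal(size=(6, 8))    # 6 key positions
    V = rng.normal(size=(6, 16))   # one value vector per key
    out, attn = scaled_dot_product_attention(Q, K, V)
    print(out.shape, attn.shape)   # (4, 16) (4, 6)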

18 of 28

BERT – Bidirectional Encoder Representations from Transformers

BERT follows a two-step process: first, self-supervised pre-training on unlabeled data, and then fine-tuning on task-specific labelled data.

Masked Language Modelling (MLM)

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

You are currently looking at a slide. You can also slide down the stairs or play on a slide in a park. Word2Vec, however, learns only one “meaning” for a word. In this sense, Word2Vec produces static embeddings (vectors representing the word in question), since each word gets a single representation that does not change with the context in which it is used. BERT’s representations, in contrast, do change with context.
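A short sketch, assuming the Hugging Face transformers and torch libraries (neither is mentioned on the slides): the first part exercises BERT’s MLM head, and the second checks that the word “slide” gets different vectors in different sentences, unlike Word2Vec’s static embeddings.

    # Masked language modelling with BERT, plus a contextual-embedding check.
    import torch
    from transformers import pipeline, AutoTokenizer, AutoModel

    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("The children played on a [MASK] in the park.")[0]["token_str"])

    tok  = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    s1 = tok("You are currently looking at a slide.", return_tensors="pt")
    s2 = tok("The children played on a slide in the park.", return_tensors="pt")
    slide_id = tok.convert_tokens_to_ids("slide")
    with torch.no_grad():
        v1 = bert(**s1).last_hidden_state[0][s1.input_ids[0].tolist().index(slide_id)]
        v2 = bert(**s2).last_hidden_state[0][s2.input_ids[0].tolist().index(slide_id)]
    # The two "slide" vectors differ because their contexts differ (cosine similarity < 1).
    print(torch.cosine_similarity(v1, v2, dim=0).item())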

19 of 28

Question Answering

Speech Synthesis

Speech Enhancement

Speaker Diarization

Speech Recognition

Speaker Verification

Speech Classification

Trigger Word Detection

Speaker Identity Verification

Emotion Recognition

Keyword Recognition

Noise Reduction

Natural TTS

Spoken Language Processing

20 of 28

Traditional Speech Recognition

Visual Representation of HMM-GMM based Speech Recognition

The arrows in the HMM represent phone transitions or links to observables. To model the audio features that we observe, we learn a GMM (Gaussian Mixture Model) for each HMM state from the training data.
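As a rough sketch only (assuming the hmmlearn library, with random features standing in for real MFCCs), the snippet below fits an HMM whose per-state emissions are Gaussian mixtures, mirroring the HMM-GMM setup described above.

    # HMM with GMM emissions on fake "acoustic" features (13-dim frames).
    import numpy as np
    from hmmlearn.hmm import GMMHMM

    rng = np.random.default_rng(0)
    features = rng.normal(size=(200, 13))            # 200 frames of 13-dim pseudo-MFCCs

    model = GMMHMM(n_components=3, n_mix=2,          # 3 hidden states, 2 Gaussians each
                   covariance_type="diag", n_iter=20)
    model.fit(features)
    print(model.score(features))                     # log-likelihood of the sequence
    print(model.predict(features)[:10])              # most likely state sequence (Viterbi)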

21 of 28

End-to-End Speech Recognition

Visual Representation of DeepSpeech2 Model Architecture [1]

Visual Representation of the Wav2Vec 2.0 Model Architecture [2], which is based on self-attention layers. Similar to BERT, Wav2Vec 2.0 follows a regime of pre-training on unlabeled data using SSL followed by fine-tuning on labeled data.

[1] Hannun, Awni, et al. "Deep speech: Scaling up end-to-end speech recognition." arXiv preprint arXiv:1412.5567 (2014).

[2] Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations." Advances in Neural Information Processing Systems 33 (2020): 12449-12460.
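A usage sketch, assuming the Hugging Face transformers and torch libraries and the public facebook/wav2vec2-base-960h checkpoint (none of which are prescribed by the slides); a silent dummy waveform stands in for real audio.

    # Greedy CTC decoding with a pretrained Wav2Vec 2.0 model.
    import numpy as np
    import torch
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    waveform = np.zeros(16000, dtype=np.float32)     # 1 second of silence at 16 kHz
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # (batch, frames, vocab)
    ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(ids))               # predicted transcript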

22 of 28

Text-based Language Models for Speech Recognition?

A Language Model scores the output transcript
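One common way this is done is n-best rescoring: the acoustic model proposes several transcripts, and a language model re-ranks them. The sketch below is hypothetical; the hypotheses, scores, and toy LM are invented for illustration.

    # Rescoring ASR hypotheses: combined score = acoustic log-score + weight * LM log-prob.
    import math

    def rescore(hypotheses, lm_logprob, lm_weight=0.5):
        return max(hypotheses,
                   key=lambda h: h["acoustic"] + lm_weight * lm_logprob(h["text"]))

    toy_lm = {"recognize speech": math.log(0.02), "wreck a nice beach": math.log(0.0001)}
    hypotheses = [
        {"text": "wreck a nice beach", "acoustic": -4.1},   # acoustically slightly better
        {"text": "recognize speech",   "acoustic": -4.3},
    ]
    best = rescore(hypotheses, lm_logprob=lambda t: toy_lm.get(t, math.log(1e-9)))
    print(best["text"])   # "recognize speech" wins once the LM score is added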

23 of 28

24 of 28

Useful Resources

  • coursera.com
  • udemy.com
  • medium.com
  • towardsdatascience.com
  • Google Scholar
  • arXiv
  • ACL Anthology
  • Books:
    • Primary text: Jurafsky and Martin, Speech and Language Processing, 2nd or 3rd Edition (https://web.stanford.edu/~jurafsky/slp3/)
    • Deep Learning for Interviews: (https://arxiv.org/ftp/arxiv/papers/2201/2201.00650.pdf)
    • Also: Eisenstein, Natural Language Processing https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

25 of 28

Mastering NLP

GET YOUR FUNDAMENTALS RIGHT.

LEARN A SINGLE MACHINE LEARNING AND DEEP LEARNING FRAMEWORK.

AI IS VAST, TRY TO MASTER A SINGLE DOMAIN.

READ MORE AND MORE PAPERS. TRY FOCUSING ON QUALITY OVER QUANTITY.

IDENTIFY A PROBLEM, READ A LOT OF LITERATURE AND PAST WORK RELATED TO IT.

https://research.com/conference-rankings/computer-science

26 of 28

Recent Trends in NLP Research

Self-Supervised Learning

ASR and NLU

Disfluency Detection

Emotion Recognition

Hate Speech Detection

Social Media Analysis

27 of 28

Thank You!! Have A Great Semester Ahead

https://sreyan88.github.io/

https://www.linkedin.com/in/sreyan-ghosh/

28 of 28

Word Embeddings