1 of 55

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-02-05

Spring 2026

Lecture 5: Sequence Modeling and Neural Networks

2 of 55

Plans

  • Part-of-speech Tagging
  • Named Entity Recognition
  • Hidden Markov Model (HMM)

3 of 55

Standard NLP Pipelines

4 of 55

Sequence Modeling

5 of 55

What are Part-of-Speech Tags?

  • Word classes or syntactic categories
  • Reveal useful information about a word and its neighbors

6 of 55

Part of Speech

  • Different words have different functions
  • Closed Class: fixed membership, function words
    • E.g., prepositions (in, on, of), determiners (the, a)
  • Open Class: new words get added frequently
    • E.g., nouns, verbs, adjectives, etc.

7 of 55

Part of Speech Tagging

  • Tag each word in a sentence with its part of speech
  • Disambiguation task: each word might have different functions in different contexts

8 of 55

Part of Speech Tagging: A simple Baseline

  • Many words are easy to tag
  • Most frequent class: assign each word to the class it occurred most in the training set (e.g., man/NN)
  • This baseline correctly tags 92.34% of word tokens on the Wall Street Journal (WSJ) corpus!
  • State of the art: 97%
  • Better to make decisions over entire sentences than over individual words
  • POS tagging is still not fully solved!
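The most-frequent-class baseline can be sketched in a few lines. The training pairs below are made up for illustration:

```python
from collections import Counter, defaultdict

# Toy training data, invented for illustration: (word, tag) pairs.
train = [("the", "DT"), ("man", "NN"), ("saw", "VBD"),
         ("the", "DT"), ("saw", "NN"), ("saw", "NN")]

# Count how often each word occurs with each tag in training.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Assign the tag this word received most often in training."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default  # unseen words fall back to a default tag

print(most_frequent_tag("saw"))  # 'NN' (2 of its 3 training occurrences)
```

Note that the baseline ignores context entirely, which is exactly why sequence models improve on it.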

9 of 55

Sequence Modeling

10 of 55

Named Entities

  • A named entity is anything that can be referred to with a proper name
  • NER: 1) finding spans of text that constitute proper names; 2) tagging the type of the entity
  • Most common tags:
    • PER (Person)
    • LOC (Location)
    • ORG (Organization)
    • MISC (Miscellaneous)

11 of 55

Named Entity Recognition

12 of 55

Named Entity Recognition: BIO Tagging

  • B: token that begins a span
  • I: tokens inside a span
  • O: tokens outside of a span
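BIO tags can be produced mechanically from labeled spans. A minimal sketch, with a hypothetical example sentence and made-up gold annotations:

```python
# Hypothetical example: convert labeled entity spans to BIO tags.
tokens = ["Jane", "Smith", "visited", "New", "York", "City", "yesterday"]
# (start, end-exclusive, type) spans -- assumed gold annotations
spans = [(0, 2, "PER"), (3, 6, "LOC")]

tags = ["O"] * len(tokens)
for start, end, etype in spans:
    tags[start] = f"B-{etype}"      # first token of the span
    for i in range(start + 1, end):
        tags[i] = f"I-{etype}"      # remaining tokens inside the span

print(list(zip(tokens, tags)))
```

The B/I distinction lets the scheme separate two adjacent entities of the same type.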

13 of 55

Plans

  • Part-of-speech Tagging
  • Named Entity Recognition
  • Hidden Markov Model (HMM)

14 of 55

Markov Chains

 

15 of 55

Markov Chains

16 of 55

Hidden Markov Model

  • We do not see sequences of POS tags in text
  • However, we do observe the words
  • HMM allows us to jointly reason over both hidden and observed events
    • Assume each position has a tag that generates a word

17 of 55

Hidden Markov Model

 

18 of 55

Assumptions

 

19 of 55

Sequence likelihood

 

Transition probability

Emission probability
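The factorization these two labels belong to is the standard HMM sequence likelihood (assuming a bigram transition model and a special start tag \(t_0\)):

```latex
P(w_1, \ldots, w_n, t_1, \ldots, t_n)
  = \prod_{i=1}^{n}
    \underbrace{P(t_i \mid t_{i-1})}_{\text{transition}}
    \,
    \underbrace{P(w_i \mid t_i)}_{\text{emission}}
```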

20 of 55

Sequence likelihood (Example)

 

21 of 55

Sequence likelihood (Example)

 

 

22 of 55

Learning HMM
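With a tagged corpus, HMM parameters are typically learned by maximum likelihood, i.e., normalized counts. A sketch on a toy corpus invented for illustration:

```python
from collections import Counter

# Toy tagged corpus, invented for illustration: one sentence per list.
corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")]]

trans = Counter()   # counts of (previous tag, tag), with "<s>" as start
emit = Counter()    # counts of (tag, word)
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[(prev, tag)] += 1
        emit[(tag, word)] += 1
        prev = tag

# Normalize counts into conditional probabilities (maximum likelihood).
prev_total = Counter()
for (p, t), c in trans.items():
    prev_total[p] += c
tag_total = Counter()
for (t, w), c in emit.items():
    tag_total[t] += c

def p_trans(tag, prev):
    """Transition probability P(tag | prev)."""
    return trans[(prev, tag)] / prev_total[prev]

def p_emit(word, tag):
    """Emission probability P(word | tag)."""
    return emit[(tag, word)] / tag_total[tag]

print(p_trans("DT", "<s>"), p_emit("dog", "NN"))  # 1.0 0.5
```

In practice the counts are smoothed so unseen transitions and emissions do not get zero probability.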

 

23 of 55

Learning Examples

 

 

24 of 55

Decoding with HMMs

  • Task: Find the most probable states given observations
  • E.g., POS Tagging

25 of 55

Greedy Decoding

  • Decode one state at a time

26 of 55

Greedy Decoding

  • Decode one state at a time
  • Does not guarantee the overall optimal sequence
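A greedy decoder picks the locally best tag at each position, conditioning only on the previous choice. The transition and emission tables below are small made-up values for illustration:

```python
# Hypothetical toy probability tables (not from the lecture).
p_trans = {"<s>": {"DT": 0.8, "NN": 0.2},
           "DT": {"DT": 0.1, "NN": 0.9},
           "NN": {"DT": 0.4, "NN": 0.6}}
p_emit = {"DT": {"the": 0.9, "saw": 0.1},
          "NN": {"the": 0.1, "saw": 0.9}}

def greedy_decode(words):
    tags, prev = [], "<s>"
    for w in words:
        # Pick the locally best tag given only the previous choice.
        tag = max(p_trans[prev],
                  key=lambda t: p_trans[prev][t] * p_emit[t].get(w, 0.0))
        tags.append(tag)
        prev = tag
    return tags

print(greedy_decode(["the", "saw"]))  # ['DT', 'NN']
```

An early locally-best choice can lock the decoder out of the globally best sequence, which is what Viterbi fixes.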

27 of 55

Viterbi Decoding

 

28 of 55

Viterbi Decoding

29 of 55

Viterbi Decoding

30 of 55

Viterbi Decoding
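Viterbi finds the globally most probable tag sequence by dynamic programming: at each position, for every tag, it keeps the best-scoring path ending in that tag. A minimal sketch, reusing the same made-up toy tables as in the greedy example:

```python
import math

# Hypothetical toy probability tables (not from the lecture).
p_trans = {"<s>": {"DT": 0.8, "NN": 0.2},
           "DT": {"DT": 0.1, "NN": 0.9},
           "NN": {"DT": 0.4, "NN": 0.6}}
p_emit = {"DT": {"the": 0.9, "saw": 0.1},
          "NN": {"the": 0.1, "saw": 0.9}}

def viterbi(words, tags, start="<s>"):
    def lp(p):  # log-probability with a floor for unseen events
        return math.log(p) if p > 0 else -1e9
    # best[t]: log-prob of the best path ending in tag t at this position
    best = {t: lp(p_trans[start].get(t, 0)) + lp(p_emit[t].get(words[0], 0))
            for t in tags}
    back = []
    for w in words[1:]:
        new_best, ptrs = {}, {}
        for t in tags:
            # Unlike greedy decoding, consider every possible previous tag.
            prev = max(tags, key=lambda p: best[p] + lp(p_trans[p].get(t, 0)))
            new_best[t] = (best[prev] + lp(p_trans[prev].get(t, 0))
                           + lp(p_emit[t].get(w, 0)))
            ptrs[t] = prev
        best, back = new_best, back + [ptrs]
    # Trace back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

print(viterbi(["the", "saw"], ["DT", "NN"]))  # ['DT', 'NN']
```

The cost is O(n·K²) for n words and K tags, versus O(Kⁿ) for exhaustive search.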

 

31 of 55

Beam Search

  • If K (the number of hidden states) is too large, Viterbi becomes too expensive!

32 of 55

Beam Search

  • If K (the number of hidden states) is too large, Viterbi becomes too expensive!

33 of 55

Beam Search

 

34 of 55

Beam Search

 

35 of 55

Beam Search

 

36 of 55

Beam Search

 

37 of 55

Beam Search

  • Beam search is a very common decoding method for language generation tasks (e.g., machine translation)
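Beam search keeps only the top-k partial hypotheses at each step instead of all K^n paths. A minimal sketch over HMM tags, again with made-up toy tables:

```python
import math

# Hypothetical toy probability tables (not from the lecture).
p_trans = {"<s>": {"DT": 0.8, "NN": 0.2},
           "DT": {"DT": 0.1, "NN": 0.9},
           "NN": {"DT": 0.4, "NN": 0.6}}
p_emit = {"DT": {"the": 0.9, "saw": 0.1},
          "NN": {"the": 0.1, "saw": 0.9}}

def beam_search(words, tags, beam_width=2, start="<s>"):
    beams = [(0.0, [start])]          # (log-prob, partial tag sequence)
    for w in words:
        candidates = []
        for score, seq in beams:
            for t in tags:
                p = p_trans[seq[-1]].get(t, 0) * p_emit[t].get(w, 0)
                if p > 0:
                    candidates.append((score + math.log(p), seq + [t]))
        # Prune: keep only the top-k hypotheses.
        beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
    return beams[0][1][1:]            # drop the start symbol

print(beam_search(["the", "saw"], ["DT", "NN"], beam_width=2))  # ['DT', 'NN']
```

With beam_width=1 this reduces to greedy decoding; with beam_width=K it recovers an exhaustive-style search.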

38 of 55

Today’s Plan

  • Neural Networks
  • Recurrent neural networks (RNNs)

39 of 55

Neural Network

  • A network of small computing units
  • Deep learning: modern neural networks with many layers
  • Can, in principle, approximate any function

40 of 55

Neural Network: Units

 

41 of 55

Neural Network: Units

  • Tanh function
  • Sigmoid function
  • ReLU function
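The three activation functions listed above, in plain Python:

```python
import math

def tanh(x):
    return math.tanh(x)                # squashes to (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # squashes to (0, 1)

def relu(x):
    return max(0.0, x)                 # 0 for negatives, identity otherwise

print(sigmoid(0.0), relu(-2.0), tanh(0.0))  # 0.5 0.0 0.0
```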

42 of 55

Neural Network: Units

43 of 55

Feedforward Neural Networks

  • Sometimes called multi-layer perceptrons (MLPs)
  • Input units, hidden units, output units
  • Fully-connected: each unit in each layer takes input from all units in the previous layer

44 of 55

Feedforward Neural Networks

  • Feedforward NN with a single hidden layer
  • Input vector x, output probability distribution y
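A forward pass for a one-hidden-layer network, h = ReLU(Wx + b), y = softmax(Uh + c), sketched in plain Python with arbitrary toy weights:

```python
import math

# Toy weights, invented for illustration (2 inputs, 2 hidden units, 2 classes).
W = [[1.0, -1.0], [0.5, 0.5]]   # hidden-layer weights
b = [0.0, 0.0]                  # hidden-layer bias
U = [[1.0, 0.0], [0.0, 1.0]]    # output-layer weights
c = [0.0, 0.0]                  # output-layer bias

def matvec(M, v):
    return [sum(m_i * v_i for m_i, v_i in zip(row, v)) for row in M]

def forward(x):
    h = [max(0.0, z + bi) for z, bi in zip(matvec(W, x), b)]  # ReLU
    o = [z + ci for z, ci in zip(matvec(U, h), c)]
    exps = [math.exp(z) for z in o]
    return [e / sum(exps) for e in exps]                      # softmax

y = forward([1.0, 2.0])
print(y, sum(y))  # a probability distribution summing to 1
```

The softmax at the output is what turns raw scores into the probability distribution y mentioned above.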

45 of 55

XOR Problem

  • Cannot be solved by a linear separator (e.g., perceptron)

46 of 55

XOR Problem

  • Input: [0, 0]
  • Linear transformation: [0, -1]; after ReLU: [0, 0]
  • Output y: 0
  • XOR: [0, 1] or [1, 0] -> 1; [0, 0] or [1, 1] -> 0
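The worked example can be checked directly. This sketch uses the standard two-unit ReLU solution to XOR, with weights chosen to match the slide's [0, -1] intermediate values for input [0, 0]:

```python
# Two-layer ReLU network computing XOR (a linear model cannot do this).
W = [[1, 1], [1, 1]]   # hidden-layer weights
b = [0, -1]            # hidden-layer bias
u = [1, -2]            # output weights

def xor_net(x):
    pre = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + bi
           for row, bi in zip(W, b)]
    h = [max(0, z) for z in pre]   # ReLU
    return sum(u_i * h_i for u_i, h_i in zip(u, h))

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(x))  # -> 0, 1, 1, 0
```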

47 of 55

Feedforward NN for text classification

  • Input: n input words / tokens
  • Output: positive, negative, or neutral

48 of 55

Training Neural Network

  • Goal: Learn the parameters w and b from training data
  • Loss function: Cross Entropy
  • Binary Classification:
  • Multi-class Classification:

49 of 55

Training Neural Network: Gradient

 

50 of 55

Training Neural Network: Gradient

  • Computation graph: represents a mathematical expression as a directed graph
  • Use the graph to compute gradients

51 of 55

Training Neural Network: Gradient

  • Computation graph: represents a mathematical expression as a directed graph
  • Use the graph to compute gradients
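The idea can be made concrete on a tiny graph (a made-up three-input example, not a general autograd implementation): the forward pass computes the value, and the backward pass applies the chain rule edge by edge.

```python
# Computation graph for e = (a + b) * c.
a, b, c = 2.0, 1.0, 3.0
d = a + b            # forward: d = 3.0
e = d * c            # forward: e = 9.0

# Backward: local derivatives, combined by the chain rule.
de_dd = c            # d(e)/d(d) for e = d * c
de_dc = d            # d(e)/d(c)
de_da = de_dd * 1.0  # d(d)/d(a) = 1 for d = a + b
de_db = de_dd * 1.0  # d(d)/d(b) = 1
print(e, de_da, de_db, de_dc)  # 9.0 3.0 3.0 3.0
```

Backpropagation is this same backward sweep applied to the (much larger) graph of a neural network.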

52 of 55

Training Neural Network: Gradient

53 of 55

Recap: n-gram language models

  • N-gram models:

  • Issue: unable to model larger contexts

54 of 55

Feedforward neural language models

  • Approximate the probability of the next word based on the previous m (e.g., 5) words
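In symbols, with m the fixed context size:

```latex
P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-m}, \ldots, w_{t-1})
```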

55 of 55

Feedforward neural language models

  • Limitation: the model learns separate patterns for different positions