1 of 55

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-02-05

Spring 2026

Lecture 5: Sequence Modeling and Neural Networks

2 of 55

Plans

  • Part-of-speech Tagging
  • Named Entity Recognition
  • Hidden Markov Model (HMM)

3 of 55

Standard NLP Pipelines

4 of 55

Sequence Modeling

5 of 55

What are Part-of-Speech Tags?

  • Word classes or syntactic categories
  • Reveal useful information about a word and its neighbors

6 of 55

Part of Speech

  • Different words have different functions
  • Closed Class: fixed membership, function words
    • E.g., prepositions (in, on, of), determiners (the, a)
  • Open Class: new words get added frequently
    • E.g., nouns, verbs, adjectives, etc.

7 of 55

Part of Speech Tagging

  • Tag each word in a sentence with its part of speech
  • Disambiguation task: each word might have different functions in different contexts

8 of 55

Part of Speech Tagging: A simple Baseline

  • Many words are easy to tag
  • Most frequent class: assign each word to the class it occurred most in the training set (e.g., man/NN)
  • This baseline correctly tags 92.34% of word tokens on the Wall Street Journal (WSJ) corpus!
  • State of the art: 97%
  • Better to make decisions over entire sentences than over individual words
  • POS tagging is still not fully solved!
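The most-frequent-class baseline can be sketched in a few lines. The training pairs below are made up for illustration:

```python
from collections import Counter, defaultdict

# Toy training data, invented for illustration: (word, tag) pairs.
train = [("the", "DT"), ("man", "NN"), ("saw", "VBD"),
         ("the", "DT"), ("saw", "NN"), ("saw", "NN")]

# Count how often each word occurs with each tag in training.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Assign the tag this word received most often in training."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default  # unseen words fall back to a default tag

print(most_frequent_tag("saw"))  # 'NN' (2 of its 3 training occurrences)
```

Note that the baseline ignores context entirely, which is exactly why sequence models improve on it.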

9 of 55

Sequence Modeling

10 of 55

Named Entities

  • A named entity is anything that can be referred to with a proper name
  • NER: 1) finding spans of text that constitute proper names; 2) tagging the type of the entity
  • Most common tags:
    • PER (Person)
    • LOC (Location)
    • ORG (Organization)
    • MISC (Miscellaneous)

11 of 55

Named Entity Recognition

12 of 55

Named Entity Recognition: BIO Tagging

  • B: token that begins a span
  • I: tokens inside a span
  • O: tokens outside of a span
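BIO tags can be produced mechanically from labeled spans. A minimal sketch, with a hypothetical example sentence and made-up gold annotations:

```python
# Hypothetical example: convert labeled entity spans to BIO tags.
tokens = ["Jane", "Smith", "visited", "New", "York", "City", "yesterday"]
# (start, end-exclusive, type) spans -- assumed gold annotations
spans = [(0, 2, "PER"), (3, 6, "LOC")]

tags = ["O"] * len(tokens)
for start, end, etype in spans:
    tags[start] = f"B-{etype}"      # first token of the span
    for i in range(start + 1, end):
        tags[i] = f"I-{etype}"      # remaining tokens inside the span

print(list(zip(tokens, tags)))
```

The B/I distinction lets the scheme separate two adjacent entities of the same type.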

13 of 55

Plans

  • Part-of-speech Tagging
  • Named Entity Recognition
  • Hidden Markov Model (HMM)

14 of 55

Markov Chains

 

15 of 55

Markov Chains

16 of 55

Hidden Markov Model

  • We do not see sequences of POS tags in text
  • However, we do observe the words
  • HMM allows us to jointly reason over both hidden and observed events
    • Assume each position has a tag that generates a word

17 of 55

Hidden Markov Model

 

18 of 55

Assumptions

 

19 of 55

Sequence likelihood

 

Transition probability

Emission probability
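The factorization these two labels belong to is the standard HMM sequence likelihood (assuming a bigram transition model and a special start tag \(t_0\)):

```latex
P(w_1, \ldots, w_n, t_1, \ldots, t_n)
  = \prod_{i=1}^{n}
    \underbrace{P(t_i \mid t_{i-1})}_{\text{transition}}
    \,
    \underbrace{P(w_i \mid t_i)}_{\text{emission}}
```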

20 of 55

Sequence likelihood (Example)

 

21 of 55

Sequence likelihood (Example)

 

 

22 of 55

Learning HMM
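With a tagged corpus, HMM parameters are typically learned by maximum likelihood, i.e., normalized counts. A sketch on a toy corpus invented for illustration:

```python
from collections import Counter

# Toy tagged corpus, invented for illustration: one sentence per list.
corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")]]

trans = Counter()   # counts of (previous tag, tag), with "<s>" as start
emit = Counter()    # counts of (tag, word)
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[(prev, tag)] += 1
        emit[(tag, word)] += 1
        prev = tag

# Normalize counts into conditional probabilities (maximum likelihood).
prev_total = Counter()
for (p, t), c in trans.items():
    prev_total[p] += c
tag_total = Counter()
for (t, w), c in emit.items():
    tag_total[t] += c

def p_trans(tag, prev):
    """Transition probability P(tag | prev)."""
    return trans[(prev, tag)] / prev_total[prev]

def p_emit(word, tag):
    """Emission probability P(word | tag)."""
    return emit[(tag, word)] / tag_total[tag]

print(p_trans("DT", "<s>"), p_emit("dog", "NN"))  # 1.0 0.5
```

In practice the counts are smoothed so unseen transitions and emissions do not get zero probability.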

 

23 of 55

Learning Examples

 

 

24 of 55

Decoding with HMMs

  • Task: Find the most probable states given observations
  • E.g., POS Tagging

25 of 55

Greedy Decoding

  • Decode one state at a time

26 of 55

Greedy Decoding

  • Decode one state at a time
  • Does not guarantee the overall optimal sequence
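A greedy decoder picks the locally best tag at each position, conditioning only on the previous choice. The transition and emission tables below are small made-up values for illustration:

```python
# Hypothetical toy probability tables (not from the lecture).
p_trans = {"<s>": {"DT": 0.8, "NN": 0.2},
           "DT": {"DT": 0.1, "NN": 0.9},
           "NN": {"DT": 0.4, "NN": 0.6}}
p_emit = {"DT": {"the": 0.9, "saw": 0.1},
          "NN": {"the": 0.1, "saw": 0.9}}

def greedy_decode(words):
    tags, prev = [], "<s>"
    for w in words:
        # Pick the locally best tag given only the previous choice.
        tag = max(p_trans[prev],
                  key=lambda t: p_trans[prev][t] * p_emit[t].get(w, 0.0))
        tags.append(tag)
        prev = tag
    return tags

print(greedy_decode(["the", "saw"]))  # ['DT', 'NN']
```

An early locally-best choice can lock the decoder out of the globally best sequence, which is what Viterbi fixes.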

27 of 55

Viterbi Decoding

 

28 of 55

Viterbi Decoding

29 of 55

Viterbi Decoding

30 of 55

Viterbi Decoding
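Viterbi finds the globally most probable tag sequence by dynamic programming: at each position, for every tag, it keeps the best-scoring path ending in that tag. A minimal sketch, reusing the same made-up toy tables as in the greedy example:

```python
import math

# Hypothetical toy probability tables (not from the lecture).
p_trans = {"<s>": {"DT": 0.8, "NN": 0.2},
           "DT": {"DT": 0.1, "NN": 0.9},
           "NN": {"DT": 0.4, "NN": 0.6}}
p_emit = {"DT": {"the": 0.9, "saw": 0.1},
          "NN": {"the": 0.1, "saw": 0.9}}

def viterbi(words, tags, start="<s>"):
    def lp(p):  # log-probability with a floor for unseen events
        return math.log(p) if p > 0 else -1e9
    # best[t]: log-prob of the best path ending in tag t at this position
    best = {t: lp(p_trans[start].get(t, 0)) + lp(p_emit[t].get(words[0], 0))
            for t in tags}
    back = []
    for w in words[1:]:
        new_best, ptrs = {}, {}
        for t in tags:
            # Unlike greedy decoding, consider every possible previous tag.
            prev = max(tags, key=lambda p: best[p] + lp(p_trans[p].get(t, 0)))
            new_best[t] = (best[prev] + lp(p_trans[prev].get(t, 0))
                           + lp(p_emit[t].get(w, 0)))
            ptrs[t] = prev
        best, back = new_best, back + [ptrs]
    # Trace back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

print(viterbi(["the", "saw"], ["DT", "NN"]))  # ['DT', 'NN']
```

The cost is O(n·K²) for n words and K tags, versus O(Kⁿ) for exhaustive search.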

 

31 of 55

Beam Search

  • If K (the number of hidden states) is too large, Viterbi becomes too expensive!

32 of 55

Beam Search

  • If K (the number of hidden states) is too large, Viterbi becomes too expensive!

33 of 55

Beam Search

 

34 of 55

Beam Search

 

35 of 55

Beam Search

 

36 of 55

Beam Search

 

37 of 55

Beam Search

  • Beam search is a very common decoding method for language generation tasks (e.g., machine translation)
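Beam search keeps only the top-k partial hypotheses at each step instead of all K^n paths. A minimal sketch over HMM tags, again with made-up toy tables:

```python
import math

# Hypothetical toy probability tables (not from the lecture).
p_trans = {"<s>": {"DT": 0.8, "NN": 0.2},
           "DT": {"DT": 0.1, "NN": 0.9},
           "NN": {"DT": 0.4, "NN": 0.6}}
p_emit = {"DT": {"the": 0.9, "saw": 0.1},
          "NN": {"the": 0.1, "saw": 0.9}}

def beam_search(words, tags, beam_width=2, start="<s>"):
    beams = [(0.0, [start])]          # (log-prob, partial tag sequence)
    for w in words:
        candidates = []
        for score, seq in beams:
            for t in tags:
                p = p_trans[seq[-1]].get(t, 0) * p_emit[t].get(w, 0)
                if p > 0:
                    candidates.append((score + math.log(p), seq + [t]))
        # Prune: keep only the top-k hypotheses.
        beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
    return beams[0][1][1:]            # drop the start symbol

print(beam_search(["the", "saw"], ["DT", "NN"], beam_width=2))  # ['DT', 'NN']
```

With beam_width=1 this reduces to greedy decoding; with beam_width=K it recovers an exhaustive-style search.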

38 of 55

Today’s Plan

  • Neural Networks
  • Recurrent neural networks (RNNs)

39 of 55

Neural Network

  • A network of small computing units
  • Deep learning: modern neural networks with many layers
  • Can, in principle, approximate any function

40 of 55

Neural Network: Units

 

41 of 55

Neural Network: Units

  • Tanh function
  • Sigmoid function
  • ReLU function
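The three activation functions listed above, in plain Python:

```python
import math

def tanh(x):
    return math.tanh(x)                # squashes to (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # squashes to (0, 1)

def relu(x):
    return max(0.0, x)                 # 0 for negatives, identity otherwise

print(sigmoid(0.0), relu(-2.0), tanh(0.0))  # 0.5 0.0 0.0
```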

42 of 55

Neural Network: Units

43 of 55

Feedforward Neural Networks

  • Sometimes called multi-layer perceptrons (MLPs)
  • Input units, hidden units, output units
  • Fully-connected: each unit in each layer takes input from all units in the previous layer

44 of 55

Feedforward Neural Networks

  • Feedforward NN with a single hidden layer
  • Input vector x, output probability distribution y
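A forward pass for a one-hidden-layer network, h = ReLU(Wx + b), y = softmax(Uh + c), sketched in plain Python with arbitrary toy weights:

```python
import math

# Toy weights, invented for illustration (2 inputs, 2 hidden units, 2 classes).
W = [[1.0, -1.0], [0.5, 0.5]]   # hidden-layer weights
b = [0.0, 0.0]                  # hidden-layer bias
U = [[1.0, 0.0], [0.0, 1.0]]    # output-layer weights
c = [0.0, 0.0]                  # output-layer bias

def matvec(M, v):
    return [sum(m_i * v_i for m_i, v_i in zip(row, v)) for row in M]

def forward(x):
    h = [max(0.0, z + bi) for z, bi in zip(matvec(W, x), b)]  # ReLU
    o = [z + ci for z, ci in zip(matvec(U, h), c)]
    exps = [math.exp(z) for z in o]
    return [e / sum(exps) for e in exps]                      # softmax

y = forward([1.0, 2.0])
print(y, sum(y))  # a probability distribution summing to 1
```

The softmax at the output is what turns raw scores into the probability distribution y mentioned above.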

45 of 55

XOR Problem

  • Cannot be solved by a linear separator (e.g., perceptron)

46 of 55

XOR Problem

  • Input: [0, 0]
  • Linear transformation: [0, -1]; after ReLU: [0, 0]
  • Output y: 0
  • XOR: [0, 1] or [1, 0] -> 1; [0, 0] or [1, 1] -> 0
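The worked example can be checked directly. This sketch uses the standard two-unit ReLU solution to XOR, with weights chosen to match the slide's [0, -1] intermediate values for input [0, 0]:

```python
# Two-layer ReLU network computing XOR (a linear model cannot do this).
W = [[1, 1], [1, 1]]   # hidden-layer weights
b = [0, -1]            # hidden-layer bias
u = [1, -2]            # output weights

def xor_net(x):
    pre = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + bi
           for row, bi in zip(W, b)]
    h = [max(0, z) for z in pre]   # ReLU
    return sum(u_i * h_i for u_i, h_i in zip(u, h))

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(x))  # -> 0, 1, 1, 0
```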

47 of 55

Feedforward NN for text classification

  • Input: n input words / tokens
  • Output: positive, negative, or neutral

48 of 55

Training Neural Network

  • Goal: Learn the parameters w and b from training data
  • Loss function: Cross Entropy
  • Binary Classification:
  • Multi-class Classification:

49 of 55

Training Neural Network: Gradient

 

50 of 55

Training Neural Network: Gradient

  • Computation graph: represents a mathematical expression as a directed graph
  • Use the graph to compute gradients

51 of 55

Training Neural Network: Gradient

  • Computation graph: represents a mathematical expression as a directed graph
  • Use the graph to compute gradients
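The idea can be made concrete on a tiny graph (a made-up three-input example, not a general autograd implementation): the forward pass computes the value, and the backward pass applies the chain rule edge by edge.

```python
# Computation graph for e = (a + b) * c.
a, b, c = 2.0, 1.0, 3.0
d = a + b            # forward: d = 3.0
e = d * c            # forward: e = 9.0

# Backward: local derivatives, combined by the chain rule.
de_dd = c            # d(e)/d(d) for e = d * c
de_dc = d            # d(e)/d(c)
de_da = de_dd * 1.0  # d(d)/d(a) = 1 for d = a + b
de_db = de_dd * 1.0  # d(d)/d(b) = 1
print(e, de_da, de_db, de_dc)  # 9.0 3.0 3.0 3.0
```

Backpropagation is this same backward sweep applied to the (much larger) graph of a neural network.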

52 of 55

Training Neural Network: Gradient

53 of 55

Recap: n-gram language models

  • N-gram models:

  • Issue: unable to model larger contexts

54 of 55

Feedforward neural language models

  • Approximate the probability of the next word based on the previous m (e.g., 5) words
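In symbols, with m the fixed context size:

```latex
P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-m}, \ldots, w_{t-1})
```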

55 of 55

Feedforward neural language models

  • Limitation: the model learns separate patterns for different positions