Natural Language Processing
Tokenization/Segmentation
Minimum Edit Distance
Outline
2
7/17/2015
Tokenization
What’s a word?
How many words?
Practical Idea
6
7/17/2015
Issues in Tokenization
Finland? Finlands? Finland’s
Slide from Chris Manning
Tokenization: language issues
Slide from Chris Manning
Tokenization: language issues
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana
Hiragana
Kanji
Romaji
End-user can express query entirely in hiragana!
Slide from Chris Manning
Word Segmentation in Chinese
Maximum Matching�Word Segmentation Algorithm
English failure example (Palmer 00)
Normalization
Slide from Chris Manning
Case folding
Slide from Chris Manning
Lemmatization
Slide from Chris Manning
Stemming
for example compressed
and compression are both
accepted as equivalent to
compress.
for exampl compress and
compress ar both accept
as equival to compress
Slide from Chris Manning
Porter’s algorithm
Slide from Chris Manning
More on Morphology
Dealing with complex morphology is sometimes necessary
+ tir `cause’ + ama `not able’
+ dik `past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Sentence Segmentation
Determining if a word is end-of-utterance: a Decision Tree
More sophisticated decision tree features
1/5/07
From Richard Sproat slides
Learning Decision Trees
II. Minimum Edit Distance
Non-word error detection
Isolated word error correction
Edit Distance
Minimum Edit Distance
Minimum Edit Distance
Edit transcript
30
7/17/2015
Defining Min Edit Distance
31
7/17/2015
Defining Min Edit Distance
D(i-1,j) + 1
D(i-1,j-1) + 1; if S1(i) ≠ S2(j)
0; if S1(i) = S2(j)
32
7/17/2015
Dynamic Programming
33
7/17/2015
The Edit Distance Table
N | 9 | | | | | | | | | |
O | 8 | | | | | | | | | |
I | 7 | | | | | | | | | |
T | 6 | | | | | | | | | |
N | 5 | | | | | | | | | |
E | 4 | | | | | | | | | |
T | 3 | | | | | | | | | |
N | 2 | | | | | | | | | |
I | 1 | | | | | | | | | |
# | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| # | E | X | E | C | U | T | I | O | N |
N | 9 | | | | | | | | | |
O | 8 | | | | | | | | | |
I | 7 | | | | | | | | | |
T | 6 | | | | | | | | | |
N | 5 | | | | | | | | | |
E | 4 | | | | | | | | | |
T | 3 | | | | | | | | | |
N | 2 | | | | | | | | | |
I | 1 | | | | | | | | | |
# | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| # | E | X | E | C | U | T | I | O | N |
N | 9 | 8 | 9 | 10 | 11 | 12 | 11 | 10 | 9 | 8 |
O | 8 | 7 | 8 | 9 | 10 | 11 | 10 | 9 | 8 | 9 |
I | 7 | 6 | 7 | 8 | 9 | 10 | 9 | 8 | 9 | 10 |
T | 6 | 5 | 6 | 7 | 8 | 9 | 8 | 9 | 10 | 11 |
N | 5 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 10 |
E | 4 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 9 |
T | 3 | 4 | 5 | 6 | 7 | 8 | 7 | 8 | 9 | 8 |
N | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 7 | 8 | 7 |
I | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 6 | 7 | 8 |
# | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| # | E | X | E | C | U | T | I | O | N |
Suppose we want the alignment too
Backtrace
N | 9 | 8 | 9 | 10 | 11 | 12 | 11 | 10 | 9 | 8 |
O | 8 | 7 | 8 | 9 | 10 | 11 | 10 | 9 | 8 | 9 |
I | 7 | 6 | 7 | 8 | 9 | 10 | 9 | 8 | 9 | 10 |
T | 6 | 5 | 6 | 7 | 8 | 9 | 8 | 9 | 10 | 11 |
N | 5 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 10 |
E | 4 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 9 |
T | 3 | 4 | 5 | 6 | 7 | 8 | 7 | 8 | 9 | 8 |
N | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 7 | 8 | 7 |
I | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 6 | 7 | 8 |
# | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| # | E | X | E | C | U | T | I | O | N |
Adding Backtrace to MinEdit
D(i-1,j) + 1
D(i-1,j-1) + 1; if S1(i) ≠ S2(j)
0; if S1(i) = S2(j)
LEFT
ptr(i,j) DOWN
DIAG
Case 1
Case 2
Case 3
Case 1
Case 2
Case 3
MinEdit with Backtrace
40
7/17/2015
Performance
O(nm)
O(nm)
O(n+m)
Weighted Edit Distance
Confusion matrix
43
7/17/2015
44
7/17/2015
Weighted Minimum Edit Distance
45
7/17/2015
Why “Dynamic Programming”
“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, “programming” I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying I thought, lets kill two birds with one stone. Lets take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is its impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. Its impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”
Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.
Other uses of Edit Distance in text processing
R Spokesman confirms senior government adviser was shot
H Spokesman said the senior adviser was shot dead
S I D I
Edit distance in computationl genomics
48
7/17/2015
The Cell
© 1997-2005 Coriell Institute for Medical Research
Chromosomes
telomere
centromere
nucleosome
chromatin
H1
DNA
H2A, H2B, H3, H4
~146bp
© 1997-2005 Coriell Institute for Medical Research
Nucleotide (base)
O
C
C
C
C
H
H
H
H
H
H
H
C
O
P
O
O-
O
to next nucleotide
to previous nucleotide
to base
3’
5’
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
“AGACC”!
pyrimidines
purines
© 1997-2005 Coriell Institute for Medical Research
“AGACC” (backbone)
© 1997-2005 Coriell Institute for Medical Research
“AGACC” (DNA)
5’
5’
3’
3’
© 1997-2005 Coriell Institute for Medical Research
Genes & Proteins
3’
5’
5’
3’
TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA
ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(transcription)
(translation)
Single-stranded RNA
protein
Double-stranded DNA
Proteins are chains of amino-acids
There are 20 amino-acids
© 1997-2005 Coriell Institute for Medical Research
Sequence Alignment
Slide from Serafim Batzoglou
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions
0,…, N in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gap
in the other sequence
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Why sequence alignment?
DNA sequencing
How we obtain the sequence of nucleotides of a species
Slide from Serafim Batzoglou
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
Whole Genome Shotgun Sequencing
Slide from Serafim Batzoglou
cut many times at random (Shotgun)
genomic segment
Get one or two reads from each segment
~900 bp
~900 bp
Fragment Assembly
Slide from Serafim Batzoglou
Steps to Assemble a Genome
Slide from Serafim Batzoglou
1. Find overlapping reads
4. Derive consensus sequence
..ACGATTACAATAGGTT..
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
Some Terminology
read a 500-900 long word that comes
out of sequencer
mate pair a pair of reads from two ends
of the same insert fragment
contig a contiguous sequence formed
by several overlapping reads
with no gaps
supercontig an ordered and oriented set
(scaffold) of contigs, usually by mate
pairs
consensus sequence derived from the
sequene multiple alignment of reads
in a contig
Find Overlapping Reads
Create local multiple alignments from the pairwise read alignments
Slide from Serafim Batzoglou
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Derive Consensus Sequence
Slide from Serafim Batzoglou
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Can derive each consensus base by weighted voting, etc.
Evolution at the DNA level
Slide from Serafim Batzoglou
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
Mutation
SEQUENCE EDITS
REARRANGEMENTS
Deletion
Inversion
Translocation
Duplication
Evolutionary Rates
Slide from Serafim Batzoglou
OK
OK
OK
X
X
Still OK?
next generation
Sequence conservation implies function
Slide from Serafim Batzoglou
Alignment is the key to
Sequence Alignment
Slide from Serafim Batzoglou
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions
0,…, N in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gap
in the other sequence
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Alignments in two fields
Scoring Alignments
Rough intuition:
What is a good alignment?
AGGCTAGTT, AGCGAAGTTT
AGGCTAGTT- 6 matches, 3 mismatches, 1 gap
AGCGAAGTTT
AGGCTA-GTT- 7 matches, 1 mismatch, 3 gaps
AG-CGAAGTTT
AGGC-TA-GTT- 7 matches, 0 mismatches, 5 gaps
AG-CG-AAGTTT
Slide from Serafim Batzoglou
Scoring Function
AGGCCTC
Scoring Function:
Match: +m
Mismatch: -s
Gap: -d
Score F = (# matches) × m - (# mismatches) × s – (#gaps) × d
Slide from Serafim Batzoglou
Example
x = AGTA m = 1
y = ATA s = -1
d = -1
Slide from Serafim Batzoglou
G
-
| | A | G | T | A |
| 0 | -1 | -2 | -3 | -4 |
A | -1 | 1 | 0 | -1 | -2 |
T | -2 | 0 | 0 | 1 | 0 |
A | -3 | -1 | -1 | 0 | 2 |
F(i,j) i = 0 1 2 3 4
j = 0
1
2
3
F(1, 1) =
max{F(0,0) + s(A, A),
F(0, 1) – d,
F(1, 0) – d} =
max{0 + 1,
-1 – 1,
-1 – 1} = 1
A
A
T
T
A
A
The Needleman-Wunsch Matrix
Slide from Serafim Batzoglou
x1 ……………………………… xM
y1 ……………………………… yN
Every nondecreasing path
from (0,0) to (M, N)
corresponds to
an alignment
of the two sequences
An optimal alignment is composed of optimal subalignments
The Needleman-Wunsch Algorithm
For each j = 1……N
F(i-1,j-1) + s(xi, yj) [case 1]
F(i, j) = max F(i-1, j) – d [case 2]
F(i, j-1) – d [case 3]
DIAG, if [case 1]
Ptr(i,j) = LEFT, if [case 2]
UP, if [case 3]
from Ptr(M, N) can trace back optimal alignment
Slide from Serafim Batzoglou
A variant of the basic algorithm:
Slide from Serafim Batzoglou
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
Different types of overlaps
Slide from Serafim Batzoglou
Example:
2 overlapping“reads” from a
sequencing project
Example:
Search for a mouse gene
within a human chromosome
The Overlap Detection variant
Changes:
For all i, j,
F(i, 0) = 0
F(0, j) = 0
maxi F(i, N)
FOPT = max
maxj F(M, j)
Slide from Serafim Batzoglou
x1 ……………………………… xM
y1 ……………………………… yN
The local alignment problem
Given two strings x = x1……xM,
y = y1……yN
Find substrings x’, y’ whose similarity
(optimal global alignment value)
is maximum
x = aaaacccccggggtta
y = ttcccgggaaccaacc
Slide from Serafim Batzoglou
Why local alignment – examples
Slide from Serafim Batzoglou
Cross-species genome similarity
Slide from Serafim Batzoglou
hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174
hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174
hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174
hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174
“atoh” enhancer in human, mouse, rat, fugu fish
The Smith-Waterman algorithm
Idea: Ignore badly aligning regions
Modifications to Needleman-Wunsch:
Initialization: F(0, j) = F(i, 0) = 0
0
Iteration: F(i, j) = max F(i – 1, j) – d
F(i, j – 1) – d
F(i – 1, j – 1) + s(xi, yj)
Slide from Serafim Batzoglou
The Smith-Waterman algorithm
Termination:
FOPT = maxi,j F(i, j)
Find FOPT and trace back
?? For all i, j find F(i, j) > t, and trace back?
Complicated by overlapping local alignments
Slide from Serafim Batzoglou
Local Alignment Example
Slide from Hasan Oğul
s = TAATA
t = ATCTAA
Local Alignment Example
Slide from Hasan Oğul
s = TAATA
t = TACTAA
Local Alignment Example
s = TAATA
t = TACTAA
Slide from Hasan Oğul
Local Alignment Example
Slide from Hasan Oğul
s = TAATA
t = TACTAA
Summary