1 of 86

Natural Language Processing

Tokenization/Segmentation

Minimum Edit Distance

2 of 86

Outline

  • Tokenization
    • Word Tokenization
    • Normalization
      • Lemmatization and stemming
    • Sentence Tokenization
  • Minimum Edit Distance
    • Levenshtein distance
    • Needleman-Wunsch
    • Smith-Waterman


3 of 86

Tokenization

  • For
    • Information retrieval
    • Information extraction (detecting named entities, etc.)
    • Spell-checking
  • 3 tasks
    • Segmenting/tokenizing words in running text
    • Normalizing word formats
    • Segmenting sentences in running text
  • Why not just periods and white-space?
    • Mr. Sherwood said reaction to Sea Containers’ proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
    • “I said, ‘what’re you? Crazy?’ “ said Sadowsky. “I can’t afford to do that.’’

4 of 86

What’s a word?

  • I do uh main- mainly business data processing
    • Fragments
    • Filled pauses
  • Are cat and cats the same word?
  • Some terminology
    • Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
      • Cat and cats = same lemma
    • Wordform: the full inflected surface form.
      • Cat and cats = different wordforms
    • Token/Type

5 of 86

How many words?

  • they picnicked by the pool then lay back on the grass and looked at the stars
    • 16 tokens
    • 14 types
  • The Switchboard corpus of American telephone conversation:
    • 2.4 million wordform tokens
    • ~20,000 wordform types
  • Brown et al (1992) large corpus of text
    • 583 million wordform tokens
    • 293,181 wordform types
  • Shakespeare:
    • 884,647 wordform tokens
    • 31,534 wordform types
  • Let N = number of tokens, V = vocabulary = number of types
  • General wisdom: V > O(sqrt(N))

6 of 86

Practical Idea

  • less text.txt
  • tr -sc 'A-Za-z' '\n' < text.txt | less
  • tr -sc 'A-Za-z' '\n' < text.txt | sort | less
  • tr -sc 'A-Za-z' '\n' < text.txt | sort | uniq -c | less
  • tr -sc 'A-Za-z' '\n' < text.txt | sort | uniq -c | sort -n -r
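The same token/type counting can be done in a few lines of Python; a rough sketch (assuming the corpus sits in a plain-text file named text.txt, as above):

```python
# Rough Python equivalent of the tr | sort | uniq -c | sort -n -r pipeline above.
# Assumes a plain-text file named text.txt in the working directory.
import re
from collections import Counter

with open("text.txt", encoding="utf-8") as f:
    # Split on anything that is not an ASCII letter, as tr -sc 'A-Za-z' does.
    tokens = [t for t in re.split(r"[^A-Za-z]+", f.read()) if t]

counts = Counter(tokens)
print(len(tokens), "tokens,", len(counts), "types")   # N and V from the previous slide
for word, count in counts.most_common(20):            # most frequent wordforms first
    print(count, word)
```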


7 of 86

Issues in Tokenization

  • Finland’s capital → Finland? Finlands? Finland’s?
  • what’re, I’m, isn’t → what are, I am, is not
  • Hewlett-Packard → Hewlett and Packard as two tokens?
  • state-of-the-art: break up?
  • lowercase, lower-case, lower case?
  • San Francisco, New York: one token or two?
  • Words with punctuation: m.p.h., PhD.

Slide from Chris Manning

8 of 86

Tokenization: language issues

  • French
    • L'ensemble → one token or two?
      • L ? L’ ? Le ?
      • Want l’ensemble to match with un ensemble

  • German noun compounds are not segmented
    • Lebensversicherungsgesellschaftsangestellter
    • ‘life insurance company employee’
    • German retrieval systems benefit greatly from a compound splitter module

Slide from Chris Manning

9 of 86

Tokenization: language issues

  • Chinese and Japanese no spaces between words:
    • 莎拉波娃现在居住在美国东南部的佛罗里达。
    • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
    • Sharapova now lives in US southeastern Florida
  • Further complicated in Japanese, with multiple alphabets intermingled
    • Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(a single sentence mixing Katakana, Hiragana, Kanji, and Romaji)

End-user can express query entirely in hiragana!

Slide from Chris Manning

10 of 86

Word Segmentation in Chinese

  • Words composed of characters
  • Characters are generally 1 syllable and 1 morpheme.
  • Average word is 2.4 characters long.
  • Standard segmentation algorithm:
    • Maximum Matching
      • (also called Greedy)

11 of 86

Maximum Matching Word Segmentation Algorithm

  • Given a wordlist of Chinese, and a string:
    1. Start a pointer at the beginning of the string
    2. Find the longest word in the dictionary that matches the string starting at the pointer
    3. Move the pointer over that word in the string
    4. Go to 2
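A minimal Python sketch of this greedy procedure (the toy English word list and the maximum word length cap are illustrative assumptions, not from the slides):

```python
# Maximum Matching (greedy) segmentation: at each position, take the longest
# dictionary word that matches, then advance the pointer past it.
def max_match(text, wordlist, max_word_len=8):
    words, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate first, down to a single character.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in wordlist:
                match = text[i:j]
                break
        if match is None:          # no dictionary word: emit one character
            match = text[i]
        words.append(match)
        i += len(match)
    return words

# Toy English word list; reproduces the failure case shown on a later slide.
wordlist = {"the", "theta", "table", "tabled", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", wordlist))   # ['theta', 'bled', 'own', 'there']
```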

12 of 86

English failure example (Palmer 00)

  • the table down there
  • thetabledownthere
  • Theta bled own there

  • But works astonishingly well in Chinese
    • 莎拉波娃现在居住在美国东南部的佛罗里达。
    • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Modern algorithms better still:
    • probabilistic segmentation
    • Using “sequence models” like HMMs that we’ll see in 2 weeks

13 of 86

Normalization

  • Need to “normalize” terms
    • For IR, indexed text & query terms must have same form.
      • We want to match U.S.A. and USA
  • We most commonly implicitly define equivalence classes of terms
    • e.g., by deleting periods in a term
  • Alternative is to do asymmetric expansion:
    • Enter: window Search: window, windows
    • Enter: windows Search: Windows, windows, window
    • Enter: Windows Search: Windows
  • Potentially more powerful, but less efficient

Slide from Chris Manning

14 of 86

Case folding

  • For IR: Reduce all letters to lower case
    • exception: upper case in mid-sentence?
      • e.g., General Motors
      • Fed vs. fed
      • SAIL vs. sail
    • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
  • For sentiment analysis, MT, Info extraction
    • Case is helpful (“US” versus “us” is important)

Slide from Chris Manning

15 of 86

Lemmatization

  • Reduce inflectional/variant forms to base form
  • E.g.,
    • am, are, is → be
    • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color
  • Lemmatization implies doing “proper” reduction to dictionary headword form

Slide from Chris Manning

16 of 86

Stemming

  • Reduce terms to their “roots” before indexing
  • “Stemming” is crude chopping of “affixes”
    • language dependent
    • e.g., automate(s), automatic, automation all reduced to automat.

      for example compressed and compression are both accepted as equivalent to compress.

      for exampl compress and compress ar both accept as equival to compress

Slide from Chris Manning

17 of 86

Porter’s algorithm

  • Commonest algorithm for stemming English
  • A sequence of phases
  • each phase consists of a set of rules
    • sses → ss
    • ies → i
    • ational → ate
    • tional → tion
    • Some rules only apply to multi-syllable words
      • (syl > 1) EMENT → ø
        • replacement → replac
        • cement → cement

Slide from Chris Manning
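If NLTK is installed, its off-the-shelf Porter stemmer can be used to see rules like these in action (a usage sketch; exact outputs depend on the NLTK version):

```python
# Porter stemming with NLTK (assumes `pip install nltk`; no corpus download needed).
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "conditional",
             "replacement", "cement", "automate", "automatic", "automation"]:
    print(word, "->", stemmer.stem(word))
```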

18 of 86

More on Morphology

  • Morphology:
    • how words are built up from smaller meaningful units called morphemes:
    • Stems: The core meaning bearing units
    • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

19 of 86

Dealing with complex morphology is sometimes necessary

  • Machine translation
    • Need to know that the Spanish words quiero and quieres are both related to querer ‘want’
  • Other languages require segmenting morphemes
    • Turkish
    • Uygarlastiramadiklarimizdanmissinizcasina
    • `(behaving) as if you are among those whom we could not civilize’
    • Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’

20 of 86

Sentence Segmentation

  • !, ? relatively unambiguous
  • Period “.” is quite ambiguous
    • Sentence boundary
    • Abbreviations like Inc. or Dr.
  • General idea:
    • Build a binary classifier:
      • Looks at a “.”
      • Decides EndOfSentence/NotEOS
      • Could be hand-written rules, sequences of regular expressions, or machine-learning
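As a concrete illustration of the hand-written-rules option, a minimal sketch of such a classifier (the abbreviation list and the rules below are illustrative assumptions, not a real system):

```python
# Tiny rule-based end-of-sentence decision for a "." -- a sketch, not a real system.
# The abbreviation list is an illustrative stand-in.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "st.", "e.g.", "i.e."}

def is_eos(text, i):
    """Decide whether the '.' at position i of text ends a sentence."""
    # Recover the token from the preceding whitespace up to this period.
    start = i
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    word = text[start:i + 1].lower()
    if word in ABBREVIATIONS:
        return False
    # Decimal numbers like 62.625 are not sentence boundaries.
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return False
    # Otherwise: end of sentence if followed by whitespace + capital, or end of text.
    rest = text[i + 1:].lstrip()
    return rest == "" or rest[0].isupper()

text = "Sea Containers closed at $62.625, up 62.5 cents. Mr. Sherwood was pleased."
print([i for i, c in enumerate(text) if c == "." and is_eos(text, i)])
```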

21 of 86

Determining if a word is end-of-utterance: a Decision Tree

22 of 86

More sophisticated decision tree features

  • Prob(word with “.” occurs at end-of-s)
  • Prob(word after “.” occurs at begin-of-s)
  • Length of word with “.”
  • Length of word after “.”
  • Case of word with “.”: Upper, Lower, Cap, Number
  • Case of word after “.”: Upper, Lower, Cap, Number
  • Punctuation after “.” (if any)
  • Abbreviation class of word with “.” (month name, unit-of-measure, title, address name, etc)


From Richard Sproat slides

23 of 86

Learning Decision Trees

  • DTs are rarely built by hand
  • Hand-building only possible for very simple features, domains
  • Lots of algorithms for DT induction

24 of 86

II. Minimum Edit Distance

  • Spell-checking
    • Non-word error detection:
      • detecting “graffe”
    • Non-word error correction:
      • figuring out that “graffe” should be “giraffe”
    • Context-dependent error detection and correction:
      • Figuring out that “war and piece” should be peace

25 of 86

Non-word error detection

  • Any word not in a dictionary
  • Assume it’s a spelling error
  • Need a big dictionary!

26 of 86

Isolated word error correction

  • How do I fix “graffe”?
    • Search through all words:
      • graf
      • craft
      • grail
      • giraffe
    • Pick the one that’s closest to graffe
    • What does “closest” mean?
    • We need a distance metric.
    • The simplest one: edit distance.
      • (More sophisticated probabilistic ones: noisy channel)

27 of 86

Edit Distance

  • The minimum edit distance between two strings
  • Is the minimum number of editing operations
    • Insertion
    • Deletion
    • Substitution
  • Needed to transform one into the other

28 of 86

Minimum Edit Distance

29 of 86

Minimum Edit Distance

  • If each operation has a cost of 1
    • Distance between intention and execution is 5
  • If substitutions cost 2 (Levenshtein)
    • Distance between them is 8

30 of 86

Edit transcript


31 of 86

Defining Min Edit Distance

  • For two strings S1 of len n, S2 of len m
    • distance(i,j) or D(i,j)
      • means the edit distance of S1[1..i] and S2[1..j]
      • i.e., the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2
      • The edit distance of S1, S2 is D(n,m)
  • We compute D(n,m) by computing D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)


32 of 86

Defining Min Edit Distance

  • Base conditions:
    • D(i,0) = i
    • D(0,j) = j
  • Recurrence Relation (Levenshtein, substitution cost 2):
    • D(i,j) = min of:
        D(i-1,j) + 1
        D(i,j-1) + 1
        D(i-1,j-1) + 2  if S1(i) ≠ S2(j)
        D(i-1,j-1) + 0  if S1(i) = S2(j)
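A direct transcription of these base conditions and this recurrence into Python (a sketch; substitution cost 2 as in the Levenshtein version, with the cost exposed as a parameter):

```python
def min_edit_distance(source, target, sub_cost=2):
    """Tabular minimum edit distance; sub_cost=2 gives the Levenshtein variant."""
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                              # base case D(i,0) = i
    for j in range(1, m + 1):
        D[0][j] = j                              # base case D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + sub) # substitution (free on a match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8 (substitutions cost 2)
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5 (all operations cost 1)
```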


33 of 86

Dynamic Programming

  • A tabular computation of D(n,m)
  • Bottom-up
    • We compute D(i,j) for small i,j
    • And compute larger D(i,j) based on previously computed smaller values


34 of 86

The Edit Distance Table

N   9
O   8
I   7
T   6
N   5
E   4
T   3
N   2
I   1
#   0   1   2   3   4   5   6   7   8   9
    #   E   X   E   C   U   T   I   O   N


36 of 86

N   9   8   9  10  11  12  11  10   9   8
O   8   7   8   9  10  11  10   9   8   9
I   7   6   7   8   9  10   9   8   9  10
T   6   5   6   7   8   9   8   9  10  11
N   5   4   5   6   7   8   9  10  11  10
E   4   3   4   5   6   7   8   9  10   9
T   3   4   5   6   7   8   7   8   9   8
N   2   3   4   5   6   7   8   7   8   7
I   1   2   3   4   5   6   7   6   7   8
#   0   1   2   3   4   5   6   7   8   9
    #   E   X   E   C   U   T   I   O   N

37 of 86

Suppose we want the alignment too

  • We can keep a “backtrace”
  • Every time we enter a cell, remember where we came from
  • Then when we reach the end, we can trace back from the upper right corner to get an alignment

38 of 86

Backtrace

[Same filled table as on the previous slide, with a backtrace pointer stored in each cell indicating which neighboring cell it was computed from]

39 of 86

Adding Backtrace to MinEdit

  • Base conditions:
    • D(i,0) = i
    • D(0,j) = j
  • Recurrence Relation (Levenshtein, substitution cost 2):
    • D(i,j) = min of:
        D(i-1,j) + 1                                            [case 1]
        D(i,j-1) + 1                                            [case 2]
        D(i-1,j-1) + 2 if S1(i) ≠ S2(j), + 0 if S1(i) = S2(j)   [case 3]
    • ptr(i,j) = DOWN  if [case 1]
                 LEFT  if [case 2]
                 DIAG  if [case 3]
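The same computation with pointer bookkeeping, as a Python sketch; following the pointers back from D(n,m) yields one optimal alignment (ties are broken arbitrarily, so other optimal alignments may exist):

```python
def min_edit_with_backtrace(source, target, sub_cost=2):
    """Minimum edit distance plus a backtrace that reconstructs one alignment."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            choices = [(D[i - 1][j] + 1, "DOWN"),        # case 1: deletion
                       (D[i][j - 1] + 1, "LEFT"),        # case 2: insertion
                       (D[i - 1][j - 1] + sub, "DIAG")]  # case 3: substitution / match
            D[i][j], ptr[i][j] = min(choices)            # ties broken arbitrarily
    # Trace back from (n, m) to (0, 0), collecting aligned character pairs.
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            pairs.append((source[i - 1], target[j - 1])); i -= 1; j -= 1
        elif ptr[i][j] == "DOWN":
            pairs.append((source[i - 1], "*")); i -= 1    # source char deleted
        else:
            pairs.append(("*", target[j - 1])); j -= 1    # target char inserted
    return D[n][m], list(reversed(pairs))

dist, alignment = min_edit_with_backtrace("intention", "execution")
print(dist)                                   # 8
print(" ".join(s for s, _ in alignment))
print(" ".join(t for _, t in alignment))
```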

40 of 86

MinEdit with Backtrace


41 of 86

Performance

  • Time:

O(nm)

  • Space:

O(nm)

  • Backtrace

O(n+m)

42 of 86

Weighted Edit Distance

  • Why would we add weights to the computation?
  • How?
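One possible answer, as an illustrative sketch (the per-pair costs below are made-up stand-ins, not real confusion-matrix counts): in spell correction some substitutions are far more likely than others, so each operation can carry its own cost rather than a uniform one:

```python
# Weighted edit distance: per-operation costs instead of uniform costs.
def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    n, m = len(source), len(target)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, t = source[i - 1], target[j - 1]
            sub = 0.0 if s == t else sub_cost.get((s, t), 2.0)   # default cost 2
            D[i][j] = min(D[i - 1][j] + del_cost,
                          D[i][j - 1] + ins_cost,
                          D[i - 1][j - 1] + sub)
    return D[n][m]

# Pretend a<->e confusions are cheap (a common typo); everything else keeps the default.
cheap_subs = {("a", "e"): 0.5, ("e", "a"): 0.5}
print(weighted_edit_distance("graffe", "giraffe", cheap_subs))
```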

43 of 86

Confusion matrix


44 of 86


45 of 86

Weighted Minimum Edit Distance


46 of 86

Why “Dynamic Programming”

“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, “programming” I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying I thought, lets kill two birds with one stone. Lets take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is its impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. Its impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”

Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.

47 of 86

Other uses of Edit Distance in text processing

  • Evaluating Machine Translation and speech recognition

    R (reference):  Spokesman confirms senior government adviser was shot
    H (hypothesis): Spokesman said the senior adviser was shot dead

    Errors: S (said/confirms), I (the), D (government), I (dead)

48 of 86

Edit distance in computational genomics


49 of 86

The Cell

© 1997-2005 Coriell Institute for Medical Research

50 of 86

Chromosomes

[Figure: chromosome structure — telomere, centromere, chromatin, nucleosome (histones H1, H2A, H2B, H3, H4), DNA, ~146 bp wrapped per nucleosome]

© 1997-2005 Coriell Institute for Medical Research

51 of 86

Nucleotide (base)

[Figure: chemical structure of a nucleotide — sugar-phosphate backbone with 5’ and 3’ ends linking to the previous/next nucleotide, plus a base]

  • Bases: Adenine (A) and Guanine (G) are purines; Cytosine (C) and Thymine (T) are pyrimidines
  • A sequence of bases is written as a string, e.g. “AGACC”

© 1997-2005 Coriell Institute for Medical Research

52 of 86

“AGACC” (backbone)

© 1997-2005 Coriell Institute for Medical Research

53 of 86

“AGACC” (DNA)


© 1997-2005 Coriell Institute for Medical Research

54 of 86

Genes & Proteins

Double-stranded DNA:
  TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA
  ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT

  ↓ (transcription)

Single-stranded RNA:
  AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA

  ↓ (translation)

Protein:
  • Proteins are chains of amino-acids
  • There are 20 amino-acids

© 1997-2005 Coriell Institute for Medical Research

55 of 86

Sequence Alignment

Slide from Serafim Batzoglou

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM and y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…,M in x and 0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

56 of 86

Why sequence alignment?

  • Comparing genes or regions from different species
    • to find important regions
    • determine function
    • uncover evolutionary forces
  • Assembling fragments to sequence DNA
  • Compare individuals to look for mutations

57 of 86

DNA sequencing

How we obtain the sequence of nucleotides of a species

Slide from Serafim Batzoglou

…ACGTGACTGAGGACCGTG

CGACTGAGACTGACTGGGT

CTAGCTAGACTACGTTTTA

TATATATATACGTCGTCGT

ACTGATGACTAGATTACAG

ACTGATTTAGATACCTGAC

TGATTTTAAAAAAATATT…

58 of 86

Whole Genome Shotgun Sequencing

Slide from Serafim Batzoglou

  • Take a genomic segment and cut it many times at random (“shotgun”)
  • Get one or two reads (~900 bp each) from each resulting fragment

59 of 86

Fragment Assembly

Slide from Serafim Batzoglou

60 of 86

Steps to Assemble a Genome

Slide from Serafim Batzoglou

  1. Find overlapping reads
  2. Merge some “good” pairs of reads into longer contigs
  3. Link contigs to form supercontigs
  4. Derive consensus sequence   ..ACGATTACAATAGGTT..

Some Terminology

  • read: a 500-900 bp long word that comes out of the sequencer
  • mate pair: a pair of reads from the two ends of the same insert fragment
  • contig: a contiguous sequence formed by several overlapping reads with no gaps
  • supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
  • consensus: sequence derived from the multiple alignment of reads in a contig

61 of 86

Find Overlapping Reads

Create local multiple alignments from the pairwise read alignments

Slide from Serafim Batzoglou

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

62 of 86

Derive Consensus Sequence

Slide from Serafim Batzoglou

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAAACTA

TAG TTACACAGATTATTGACTTCATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Can derive each consensus base by weighted voting, etc.

63 of 86

Evolution at the DNA level

Slide from Serafim Batzoglou

…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…

SEQUENCE EDITS: mutation, deletion
REARRANGEMENTS: inversion, translocation, duplication

64 of 86

Evolutionary Rates

Slide from Serafim Batzoglou

[Figure: mutations passed to the next generation — some are OK, some (marked X) are not, some “still OK?”]

65 of 86

Sequence conservation implies function

Slide from Serafim Batzoglou

Alignment is the key to

  • Finding important regions
  • Determining function
  • Uncovering the evolutionary forces

66 of 86

Sequence Alignment

Slide from Serafim Batzoglou

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM and y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…,M in x and 0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

67 of 86

Alignments in two fields

  • In Natural Language Processing
    • We generally talk about distance (minimized)
      • And weights
  • In Computational Biology
    • We generally talk about similarity (maximized)
      • And scores

68 of 86

Scoring Alignments

Rough intuition:

  • Similar sequences evolved from a common ancestor
  • Evolution changed the sequences from this ancestral sequence by mutations:
    • Replacements: one letter replaced by another
    • Deletion: deletion of a letter
    • Insertion: insertion of a letter
  • Scoring of sequence similarity should examine how many operations took place

69 of 86

What is a good alignment?

AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT     6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT    7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT   7 matches, 0 mismatches, 5 gaps

Slide from Serafim Batzoglou

70 of 86

Scoring Function

  • Sequence edits, starting from AGGCCTC:
    • Mutations:  AGGACTC
    • Insertions: AGGGCCTC
    • Deletions:  AGG . CTC

Scoring Function:
  Match: +m
  Mismatch: -s
  Gap: -d

  Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d

Slide from Serafim Batzoglou
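For instance, with m = s = d = 1 (an arbitrary choice, just for illustration), the three alignments on the previous slide score 6 - 3 - 1 = 2, 7 - 1 - 3 = 3, and 7 - 0 - 5 = 2, so the middle alignment would be preferred under this scoring function.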

71 of 86

Example

x = AGTA     m = 1
y = ATA      s = -1
             d = -1

F(i,j)       i = 0    1    2    3    4
                 -    A    G    T    A
  j = 0   -      0   -1   -2   -3   -4
      1   A     -1    1    0   -1   -2
      2   T     -2    0    0    1    0
      3   A     -3   -1   -1    0    2

F(1, 1) = max{ F(0,0) + s(A, A),
               F(0,1) - d,
               F(1,0) - d }
        = max{ 0 + 1, -1 - 1, -1 - 1 } = 1

Optimal alignment:
  A G T A
  A - T A

Slide from Serafim Batzoglou

72 of 86

The Needleman-Wunsch Matrix

Slide from Serafim Batzoglou

[Figure: an (M+1) × (N+1) matrix with x1 … xM along one axis and y1 … yN along the other]

Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences.

An optimal alignment is composed of optimal subalignments.

73 of 86

The Needleman-Wunsch Algorithm

  1. Initialization
     • F(0, 0) = 0
     • F(0, j) = -j × d
     • F(i, 0) = -i × d

  2. Main Iteration (filling in partial alignments)
     For each i = 1…M, for each j = 1…N:

       F(i, j) = max of:
         F(i-1, j-1) + s(xi, yj)   [case 1]
         F(i-1, j) - d             [case 2]
         F(i, j-1) - d             [case 3]

       Ptr(i, j) = DIAG if [case 1], LEFT if [case 2], UP if [case 3]

  3. Termination
     F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.

Slide from Serafim Batzoglou
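A compact Python sketch of the algorithm above, with the backtrace included (parameter defaults follow the earlier example: match m = 1, mismatch s = -1, gap penalty d = 1):

```python
def needleman_wunsch(x, y, m=1, s=-1, d=1):
    """Global alignment: returns the optimal score and one optimal alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"                # x prefix aligned to gaps
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"                  # y prefix aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else s
            choices = [(F[i - 1][j - 1] + score, "DIAG"),  # case 1: xi aligned to yj
                       (F[i - 1][j] - d, "LEFT"),          # case 2: xi aligned to a gap
                       (F[i][j - 1] - d, "UP")]            # case 3: yj aligned to a gap
            F[i][j], ptr[i][j] = max(choices)
    # Trace back from (M, N) to recover one optimal alignment.
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        p = ptr[i][j]
        if p == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif p == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("AGTA", "ATA"))   # (2, 'AGTA', 'A-TA'), as in the earlier example
```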

74 of 86

A variant of the basic algorithm:

  • Maybe it is OK to have an unlimited # of gaps in the beginning and end:

Slide from Serafim Batzoglou

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC

GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

  • If so, we don’t want to penalize gaps at the ends

75 of 86

Different types of overlaps

Slide from Serafim Batzoglou

Example: two overlapping “reads” from a sequencing project

Example: search for a mouse gene within a human chromosome

76 of 86

The Overlap Detection variant

Changes:

  1. Initialization
     For all i, j:
       F(i, 0) = 0
       F(0, j) = 0

  2. Termination
     FOPT = max{ maxi F(i, N), maxj F(M, j) }

Slide from Serafim Batzoglou

77 of 86

The local alignment problem

Given two strings x = x1……xM and y = y1……yN, find substrings x’, y’ whose similarity (optimal global alignment value) is maximum.

x = aaaacccccggggtta
y = ttcccgggaaccaacc

Slide from Serafim Batzoglou

78 of 86

Why local alignment – examples

  • Genes are shuffled between genomes

  • Portions of proteins (domains) are often conserved

Slide from Serafim Batzoglou

79 of 86

Cross-species genome similarity

  • 98% of genes are conserved between any two mammals
  • >70% average similarity in protein sequence

Slide from Serafim Batzoglou

hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001

mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001

rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938

fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001

mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001

rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938

fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001

mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001

rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938

fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001

mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001

rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938

fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

“atoh” enhancer in human, mouse, rat, fugu fish

80 of 86

The Smith-Waterman algorithm

Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch:

Initialization: F(0, j) = F(i, 0) = 0

Iteration:  F(i, j) = max of:
              0
              F(i-1, j) - d
              F(i, j-1) - d
              F(i-1, j-1) + s(xi, yj)

Slide from Serafim Batzoglou
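The same sketch with the two Smith-Waterman changes: every cell is clamped at 0, and the traceback starts from the best-scoring cell and stops when it reaches a 0 (same assumed m, s, d scoring as before):

```python
def smith_waterman(x, y, m=1, s=-1, d=1):
    """Best local alignment: returns its score and the aligned substrings."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]        # first row/column stay 0
    best, best_ij = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(0,                          # never go below zero
                          F[i - 1][j - 1] + score,    # diagonal
                          F[i - 1][j] - d,            # gap in y
                          F[i][j - 1] - d)            # gap in x
            if F[i][j] > best:
                best, best_ij = F[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    ax, ay = [], []
    i, j = best_ij
    while i > 0 and j > 0 and F[i][j] > 0:
        score = m if x[i - 1] == y[j - 1] else s
        if F[i][j] == F[i - 1][j - 1] + score:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("TAATA", "TACTAA"))   # local alignment of the example strings
```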

81 of 86

The Smith-Waterman algorithm

Termination:

  1. If we want the best local alignment:
     FOPT = maxi,j F(i, j); find FOPT and trace back

  2. If we want all local alignments scoring > t:
     For all i, j find F(i, j) > t, and trace back?
     Complicated by overlapping local alignments

Slide from Serafim Batzoglou

82 of 86

Local Alignment Example

Slide from Hasan Oğul

s = TAATA

t = ATCTAA

83 of 86

Local Alignment Example

Slide from Hasan Oğul

s = TAATA

t = TACTAA


86 of 86

Summary

  • Tokenization
    • Word Tokenization
    • Normalization
      • Lemmatization and stemming
    • Sentence Tokenization
  • Minimum Edit Distance
    • Levenshtein distance
    • Needleman-Wunsch (weighted global alignment)
    • Smith-Waterman (local alignment)
    • Applications to:
      • spell correction, machine translation, speech recognition
      • DNA fragment assembly, evolutionary similarity, mutation detection