1 of 86

Natural Language Processing

Tokenization/Segmentation

Minimum Edit Distance

2 of 86

Outline

  • Tokenization
    • Word Tokenization
    • Normalization
      • Lemmatization and stemming
    • Sentence Tokenization
  • Minimum Edit Distance
    • Levenshtein distance
    • Needleman-Wunsch
    • Smith-Waterman


3 of 86

Tokenization

  • For
    • Information retrieval
    • Information extraction (detecting named entities, etc.)
    • Spell-checking
  • 3 tasks
    • Segmenting/tokenizing words in running text
    • Normalizing word formats
    • Segmenting sentences in running text
  • Why not just periods and white-space?
    • Mr. Sherwood said reaction to Sea Containers’ proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
    • “I said, ‘what’re you? Crazy?’ “ said Sadowsky. “I can’t afford to do that.’’

4 of 86

What’s a word?

  • I do uh main- mainly business data processing
    • Fragments
    • Filled pauses
  • Are cat and cats the same word?
  • Some terminology
    • Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
      • Cat and cats = same lemma
    • Wordform: the full inflected surface form.
      • Cat and cats = different wordforms
    • Token/Type

5 of 86

How many words?

  • they picnicked by the pool then lay back on the grass and looked at the stars
    • 16 tokens
    • 14 types
  • The Switchboard corpus of American telephone conversation:
    • 2.4 million wordform tokens
    • ~20,000 wordform types
  • Brown et al (1992) large corpus of text
    • 583 million wordform tokens
    • 293,181 wordform types
  • Shakespeare:
    • 884,647 wordform tokens
    • 31,534 wordform types
  • Let N = number of tokens, V = vocabulary = number of types
  • General wisdom: V > O(sqrt(N))

6 of 86

Practical Idea

  • less text.txt
  • tr -sc 'A-Za-z' '\n' < text.txt | less
  • tr -sc 'A-Za-z' '\n' < text.txt | sort | less
  • tr -sc 'A-Za-z' '\n' < text.txt | sort | uniq -c | less
  • tr -sc 'A-Za-z' '\n' < text.txt | sort | uniq -c | sort -n -r
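The same token/type counting can be done in a few lines of Python; a rough sketch (assuming the corpus sits in a plain-text file named text.txt, as above):

```python
# Rough Python equivalent of the tr | sort | uniq -c | sort -n -r pipeline above.
# Assumes a plain-text file named text.txt in the working directory.
import re
from collections import Counter

with open("text.txt", encoding="utf-8") as f:
    # Split on anything that is not an ASCII letter, as tr -sc 'A-Za-z' does.
    tokens = [t for t in re.split(r"[^A-Za-z]+", f.read()) if t]

counts = Counter(tokens)
print(len(tokens), "tokens,", len(counts), "types")   # N and V from the previous slide
for word, count in counts.most_common(20):            # most frequent wordforms first
    print(count, word)
```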


7 of 86

Issues in Tokenization

  • Finland’s capital → Finland? Finlands? Finland’s?
  • what’re, I’m, isn’t → what are, I am, is not
  • Hewlett-Packard → Hewlett and Packard as two tokens?
  • state-of-the-art: break up?
  • lowercase, lower-case, lower case?
  • San Francisco, New York: one token or two?
  • Words with punctuation: m.p.h., PhD.

Slide from Chris Manning

8 of 86

Tokenization: language issues

  • French
    • L'ensemble → one token or two?
      • L ? L’ ? Le ?
      • Want l’ensemble to match with un ensemble

  • German noun compounds are not segmented
    • Lebensversicherungsgesellschaftsangestellter
    • ‘life insurance company employee’
    • German retrieval systems benefit greatly from a compound splitter module

Slide from Chris Manning

9 of 86

Tokenization: language issues

  • Chinese and Japanese no spaces between words:
    • 莎拉波娃现在居住在美国东南部的佛罗里达。
    • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
    • Sharapova now lives in US southeastern Florida
  • Further complicated in Japanese, with multiple alphabets intermingled
    • Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(a single sentence mixing Katakana, Hiragana, Kanji, and Romaji)

End-user can express query entirely in hiragana!

Slide from Chris Manning

10 of 86

Word Segmentation in Chinese

  • Words composed of characters
  • Characters are generally 1 syllable and 1 morpheme.
  • Average word is 2.4 characters long.
  • Standard segmentation algorithm:
    • Maximum Matching
      • (also called Greedy)

11 of 86

Maximum Matching Word Segmentation Algorithm

  • Given a wordlist of Chinese, and a string:
    1. Start a pointer at the beginning of the string
    2. Find the longest word in the dictionary that matches the string starting at the pointer
    3. Move the pointer over that word in the string
    4. Go to 2
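A minimal Python sketch of this greedy procedure (the toy English word list and the maximum word length cap are illustrative assumptions, not from the slides):

```python
# Maximum Matching (greedy) segmentation: at each position, take the longest
# dictionary word that matches, then advance the pointer past it.
def max_match(text, wordlist, max_word_len=8):
    words, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate first, down to a single character.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in wordlist:
                match = text[i:j]
                break
        if match is None:          # no dictionary word: emit one character
            match = text[i]
        words.append(match)
        i += len(match)
    return words

# Toy English word list; reproduces the failure case shown on a later slide.
wordlist = {"the", "theta", "table", "tabled", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", wordlist))   # ['theta', 'bled', 'own', 'there']
```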

12 of 86

English failure example (Palmer 00)

  • the table down there
  • thetabledownthere
  • Theta bled own there

  • But works astonishingly well in Chinese
    • 莎拉波娃现在居住在美国东南部的佛罗里达。
    • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Modern algorithms better still:
    • probabilistic segmentation
    • Using “sequence models” like HMMs that we’ll see in 2 weeks

13 of 86

Normalization

  • Need to “normalize” terms
    • For IR, indexed text & query terms must have same form.
      • We want to match U.S.A. and USA
  • We most commonly implicitly define equivalence classes of terms
    • e.g., by deleting periods in a term
  • Alternative is to do asymmetric expansion:
    • Enter: window Search: window, windows
    • Enter: windows Search: Windows, windows, window
    • Enter: Windows Search: Windows
  • Potentially more powerful, but less efficient

Slide from Chris Manning

14 of 86

Case folding

  • For IR: Reduce all letters to lower case
    • exception: upper case in mid-sentence?
      • e.g., General Motors
      • Fed vs. fed
      • SAIL vs. sail
    • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
  • For sentiment analysis, MT, Info extraction
    • Case is helpful (“US” versus “us” is important)

Slide from Chris Manning

15 of 86

Lemmatization

  • Reduce inflectional/variant forms to base form
  • E.g.,
    • am, are, is → be
    • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color
  • Lemmatization implies doing “proper” reduction to dictionary headword form

Slide from Chris Manning

16 of 86

Stemming

  • Reduce terms to their “roots” before indexing
  • “Stemming” is crude chopping of “affixes”
    • language dependent
    • e.g., automate(s), automatic, automation all reduced to automat.

      for example compressed and compression are both accepted as equivalent to compress.

      for exampl compress and compress ar both accept as equival to compress

Slide from Chris Manning

17 of 86

Porter’s algorithm

  • Commonest algorithm for stemming English
  • A sequence of phases
  • each phase consists of a set of rules
    • sses → ss
    • ies → i
    • ational → ate
    • tional → tion
    • Some rules only apply to multi-syllable words
      • (syl > 1) EMENT → ø
        • replacement → replac
        • cement → cement

Slide from Chris Manning
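If NLTK is installed, its off-the-shelf Porter stemmer can be used to see rules like these in action (a usage sketch; exact outputs depend on the NLTK version):

```python
# Porter stemming with NLTK (assumes `pip install nltk`; no corpus download needed).
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "conditional",
             "replacement", "cement", "automate", "automatic", "automation"]:
    print(word, "->", stemmer.stem(word))
```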

18 of 86

More on Morphology

  • Morphology:
    • how words are built up from smaller meaningful units called morphemes:
    • Stems: The core meaning bearing units
    • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

19 of 86

Dealing with complex morphology is sometimes necessary

  • Machine translation
    • Need to know that the Spanish words quiero and quieres are both related to querer ‘want’
  • Other languages require segmenting morphemes
    • Turkish
    • Uygarlastiramadiklarimizdanmissinizcasina
    • `(behaving) as if you are among those whom we could not civilize’
    • Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’

20 of 86

Sentence Segmentation

  • !, ? relatively unambiguous
  • Period “.” is quite ambiguous
    • Sentence boundary
    • Abbreviations like Inc. or Dr.
  • General idea:
    • Build a binary classifier:
      • Looks at a “.”
      • Decides EndOfSentence/NotEOS
      • Could be hand-written rules, sequences of regular expressions, or machine-learning
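As a concrete illustration of the hand-written-rules option, a minimal sketch of such a classifier (the abbreviation list and the rules below are illustrative assumptions, not a real system):

```python
# Tiny rule-based end-of-sentence decision for a "." -- a sketch, not a real system.
# The abbreviation list is an illustrative stand-in.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "st.", "e.g.", "i.e."}

def is_eos(text, i):
    """Decide whether the '.' at position i of text ends a sentence."""
    # Recover the token from the preceding whitespace up to this period.
    start = i
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    word = text[start:i + 1].lower()
    if word in ABBREVIATIONS:
        return False
    # Decimal numbers like 62.625 are not sentence boundaries.
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return False
    # Otherwise: end of sentence if followed by whitespace + capital, or end of text.
    rest = text[i + 1:].lstrip()
    return rest == "" or rest[0].isupper()

text = "Sea Containers closed at $62.625, up 62.5 cents. Mr. Sherwood was pleased."
print([i for i, c in enumerate(text) if c == "." and is_eos(text, i)])
```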

21 of 86

Determining if a word is end-of-utterance: a Decision Tree

22 of 86

More sophisticated decision tree features

  • Prob(word with “.” occurs at end-of-s)
  • Prob(word after “.” occurs at begin-of-s)
  • Length of word with “.”
  • Length of word after “.”
  • Case of word with “.”: Upper, Lower, Cap, Number
  • Case of word after “.”: Upper, Lower, Cap, Number
  • Punctuation after “.” (if any)
  • Abbreviation class of word with “.” (month name, unit-of-measure, title, address name, etc)


From Richard Sproat slides

23 of 86

Learning Decision Trees

  • DTs are rarely built by hand
  • Hand-building only possible for very simple features, domains
  • Lots of algorithms for DT induction

24 of 86

II. Minimum Edit Distance

  • Spell-checking
    • Non-word error detection:
      • detecting “graffe”
    • Non-word error correction:
      • figuring out that “graffe” should be “giraffe”
    • Context-dependent error detection and correction:
      • Figuring out that “war and piece” should be peace

25 of 86

Non-word error detection

  • Any word not in a dictionary
  • Assume it’s a spelling error
  • Need a big dictionary!

26 of 86

Isolated word error correction

  • How do I fix “graffe”?
    • Search through all words:
      • graf
      • craft
      • grail
      • giraffe
    • Pick the one that’s closest to graffe
    • What does “closest” mean?
    • We need a distance metric.
    • The simplest one: edit distance.
      • (More sophisticated probabilistic ones: noisy channel)

27 of 86

Edit Distance

  • The minimum edit distance between two strings
  • Is the minimum number of editing operations
    • Insertion
    • Deletion
    • Substitution
  • Needed to transform one into the other

28 of 86

Minimum Edit Distance

29 of 86

Minimum Edit Distance

  • If each operation has a cost of 1
    • Distance between intention and execution is 5
  • If substitutions cost 2 (Levenshtein)
    • Distance between them is 8

30 of 86

Edit transcript


31 of 86

Defining Min Edit Distance

  • For two strings S1 of len n, S2 of len m
    • distance(i,j) or D(i,j)
      • means the edit distance of S1[1..i] and S2[1..j]
      • i.e., the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2
      • The edit distance of S1, S2 is D(n,m)
  • We compute D(n,m) by computing D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)


32 of 86

Defining Min Edit Distance

  • Base conditions:
    • D(i,0) = i
    • D(0,j) = j
  • Recurrence Relation (Levenshtein, substitution cost 2):
    • D(i,j) = min of:
        D(i-1,j) + 1
        D(i,j-1) + 1
        D(i-1,j-1) + 2  if S1(i) ≠ S2(j)
        D(i-1,j-1) + 0  if S1(i) = S2(j)
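A direct transcription of these base conditions and this recurrence into Python (a sketch; substitution cost 2 as in the Levenshtein version, with the cost exposed as a parameter):

```python
def min_edit_distance(source, target, sub_cost=2):
    """Tabular minimum edit distance; sub_cost=2 gives the Levenshtein variant."""
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                              # base case D(i,0) = i
    for j in range(1, m + 1):
        D[0][j] = j                              # base case D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + sub) # substitution (free on a match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8 (substitutions cost 2)
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5 (all operations cost 1)
```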


33 of 86

Dynamic Programming

  • A tabular computation of D(n,m)
  • Bottom-up
    • We compute D(i,j) for small i,j
    • And compute larger D(i,j) based on previously computed smaller values


34 of 86

The Edit Distance Table

N   9
O   8
I   7
T   6
N   5
E   4
T   3
N   2
I   1
#   0   1   2   3   4   5   6   7   8   9
    #   E   X   E   C   U   T   I   O   N


36 of 86

N   9   8   9  10  11  12  11  10   9   8
O   8   7   8   9  10  11  10   9   8   9
I   7   6   7   8   9  10   9   8   9  10
T   6   5   6   7   8   9   8   9  10  11
N   5   4   5   6   7   8   9  10  11  10
E   4   3   4   5   6   7   8   9  10   9
T   3   4   5   6   7   8   7   8   9   8
N   2   3   4   5   6   7   8   7   8   7
I   1   2   3   4   5   6   7   6   7   8
#   0   1   2   3   4   5   6   7   8   9
    #   E   X   E   C   U   T   I   O   N

37 of 86

Suppose we want the alignment too

  • We can keep a “backtrace”
  • Every time we enter a cell, remember where we came from
  • Then when we reach the end, we can trace back from the upper right corner to get an alignment

38 of 86

Backtrace

[Same filled table as on the previous slide, with a backtrace pointer stored in each cell indicating which neighboring cell it was computed from]

39 of 86

Adding Backtrace to MinEdit

  • Base conditions:
    • D(i,0) = i
    • D(0,j) = j
  • Recurrence Relation (Levenshtein, substitution cost 2):
    • D(i,j) = min of:
        D(i-1,j) + 1                                            [case 1]
        D(i,j-1) + 1                                            [case 2]
        D(i-1,j-1) + 2 if S1(i) ≠ S2(j), + 0 if S1(i) = S2(j)   [case 3]
    • ptr(i,j) = DOWN  if [case 1]
                 LEFT  if [case 2]
                 DIAG  if [case 3]
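The same computation with pointer bookkeeping, as a Python sketch; following the pointers back from D(n,m) yields one optimal alignment (ties are broken arbitrarily, so other optimal alignments may exist):

```python
def min_edit_with_backtrace(source, target, sub_cost=2):
    """Minimum edit distance plus a backtrace that reconstructs one alignment."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            choices = [(D[i - 1][j] + 1, "DOWN"),        # case 1: deletion
                       (D[i][j - 1] + 1, "LEFT"),        # case 2: insertion
                       (D[i - 1][j - 1] + sub, "DIAG")]  # case 3: substitution / match
            D[i][j], ptr[i][j] = min(choices)            # ties broken arbitrarily
    # Trace back from (n, m) to (0, 0), collecting aligned character pairs.
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            pairs.append((source[i - 1], target[j - 1])); i -= 1; j -= 1
        elif ptr[i][j] == "DOWN":
            pairs.append((source[i - 1], "*")); i -= 1    # source char deleted
        else:
            pairs.append(("*", target[j - 1])); j -= 1    # target char inserted
    return D[n][m], list(reversed(pairs))

dist, alignment = min_edit_with_backtrace("intention", "execution")
print(dist)                                   # 8
print(" ".join(s for s, _ in alignment))
print(" ".join(t for _, t in alignment))
```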

40 of 86

MinEdit with Backtrace


41 of 86

Performance

  • Time:

O(nm)

  • Space:

O(nm)

  • Backtrace

O(n+m)

42 of 86

Weighted Edit Distance

  • Why would we add weights to the computation?
  • How?
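One possible answer, as an illustrative sketch (the per-pair costs below are made-up stand-ins, not real confusion-matrix counts): in spell correction some substitutions are far more likely than others, so each operation can carry its own cost rather than a uniform one:

```python
# Weighted edit distance: per-operation costs instead of uniform costs.
def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    n, m = len(source), len(target)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, t = source[i - 1], target[j - 1]
            sub = 0.0 if s == t else sub_cost.get((s, t), 2.0)   # default cost 2
            D[i][j] = min(D[i - 1][j] + del_cost,
                          D[i][j - 1] + ins_cost,
                          D[i - 1][j - 1] + sub)
    return D[n][m]

# Pretend a<->e confusions are cheap (a common typo); everything else keeps the default.
cheap_subs = {("a", "e"): 0.5, ("e", "a"): 0.5}
print(weighted_edit_distance("graffe", "giraffe", cheap_subs))
```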

43 of 86

Confusion matrix


44 of 86


45 of 86

Weighted Minimum Edit Distance


46 of 86

Why “Dynamic Programming”

“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, “programming” I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying I thought, lets kill two birds with one stone. Lets take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is its impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. Its impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”

Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.

47 of 86

Other uses of Edit Distance in text processing

  • Evaluating Machine Translation and speech recognition

    R (reference):  Spokesman confirms senior government adviser was shot
    H (hypothesis): Spokesman said the senior adviser was shot dead

    Errors: S (said/confirms), I (the), D (government), I (dead)

48 of 86

Edit distance in computational genomics


49 of 86

The Cell

© 1997-2005 Coriell Institute for Medical Research

50 of 86

Chromosomes

[Figure: chromosome structure — telomere, centromere, chromatin, nucleosome (histones H1, H2A, H2B, H3, H4), DNA, ~146 bp wrapped per nucleosome]

© 1997-2005 Coriell Institute for Medical Research

51 of 86

Nucleotide (base)

[Figure: chemical structure of a nucleotide — sugar-phosphate backbone with 5’ and 3’ ends linking to the previous/next nucleotide, plus a base]

  • Bases: Adenine (A) and Guanine (G) are purines; Cytosine (C) and Thymine (T) are pyrimidines
  • A sequence of bases is written as a string, e.g. “AGACC”

© 1997-2005 Coriell Institute for Medical Research

52 of 86

“AGACC” (backbone)

© 1997-2005 Coriell Institute for Medical Research

53 of 86

“AGACC” (DNA)


© 1997-2005 Coriell Institute for Medical Research

54 of 86

Genes & Proteins

Double-stranded DNA:
  TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA
  ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT

  ↓ (transcription)

Single-stranded RNA:
  AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA

  ↓ (translation)

Protein:
  • Proteins are chains of amino-acids
  • There are 20 amino-acids

© 1997-2005 Coriell Institute for Medical Research

55 of 86

Sequence Alignment

Slide from Serafim Batzoglou

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM and y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…,M in x and 0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

56 of 86

Why sequence alignment?

  • Comparing genes or regions from different species
    • to find important regions
    • determine function
    • uncover evolutionary forces
  • Assembling fragments to sequence DNA
  • Compare individuals to look for mutations

57 of 86

DNA sequencing

How we obtain the sequence of nucleotides of a species

Slide from Serafim Batzoglou

…ACGTGACTGAGGACCGTG

CGACTGAGACTGACTGGGT

CTAGCTAGACTACGTTTTA

TATATATATACGTCGTCGT

ACTGATGACTAGATTACAG

ACTGATTTAGATACCTGAC

TGATTTTAAAAAAATATT…

58 of 86

Whole Genome Shotgun Sequencing

Slide from Serafim Batzoglou

  • Take a genomic segment and cut it many times at random (“shotgun”)
  • Get one or two reads (~900 bp each) from each resulting fragment

59 of 86

Fragment Assembly

Slide from Serafim Batzoglou

60 of 86

Steps to Assemble a Genome

Slide from Serafim Batzoglou

  1. Find overlapping reads
  2. Merge some “good” pairs of reads into longer contigs
  3. Link contigs to form supercontigs
  4. Derive consensus sequence   ..ACGATTACAATAGGTT..

Some Terminology

  • read: a 500-900 bp long word that comes out of the sequencer
  • mate pair: a pair of reads from the two ends of the same insert fragment
  • contig: a contiguous sequence formed by several overlapping reads with no gaps
  • supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
  • consensus: sequence derived from the multiple alignment of reads in a contig

61 of 86

Find Overlapping Reads

Create local multiple alignments from the pairwise read alignments

Slide from Serafim Batzoglou

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

62 of 86

Derive Consensus Sequence

Slide from Serafim Batzoglou

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAAACTA

TAG TTACACAGATTATTGACTTCATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Can derive each consensus base by weighted voting, etc.

63 of 86

Evolution at the DNA level

Slide from Serafim Batzoglou

…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…

SEQUENCE EDITS: mutation, deletion
REARRANGEMENTS: inversion, translocation, duplication

64 of 86

Evolutionary Rates

Slide from Serafim Batzoglou

[Figure: mutations passed to the next generation — some are OK, some (marked X) are not, some “still OK?”]

65 of 86

Sequence conservation implies function

Slide from Serafim Batzoglou

Alignment is the key to

  • Finding important regions
  • Determining function
  • Uncovering the evolutionary forces

66 of 86

Sequence Alignment

Slide from Serafim Batzoglou

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM and y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…,M in x and 0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

67 of 86

Alignments in two fields

  • In Natural Language Processing
    • We generally talk about distance (minimized)
      • And weights
  • In Computational Biology
    • We generally talk about similarity (maximized)
      • And scores

68 of 86

Scoring Alignments

Rough intuition:

  • Similar sequences evolved from a common ancestor
  • Evolution changed the sequences from this ancestral sequence by mutations:
    • Replacements: one letter replaced by another
    • Deletion: deletion of a letter
    • Insertion: insertion of a letter
  • Scoring of sequence similarity should examine how many operations took place

69 of 86

What is a good alignment?

AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT     6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT    7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT   7 matches, 0 mismatches, 5 gaps

Slide from Serafim Batzoglou

70 of 86

Scoring Function

  • Sequence edits, starting from AGGCCTC:
    • Mutations:  AGGACTC
    • Insertions: AGGGCCTC
    • Deletions:  AGG . CTC

Scoring Function:
  Match: +m
  Mismatch: -s
  Gap: -d

  Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d

Slide from Serafim Batzoglou
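For instance, with m = s = d = 1 (an arbitrary choice, just for illustration), the three alignments on the previous slide score 6 - 3 - 1 = 2, 7 - 1 - 3 = 3, and 7 - 0 - 5 = 2, so the middle alignment would be preferred under this scoring function.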

71 of 86

Example

x = AGTA     m = 1
y = ATA      s = -1
             d = -1

F(i,j)       i = 0    1    2    3    4
                 -    A    G    T    A
  j = 0   -      0   -1   -2   -3   -4
      1   A     -1    1    0   -1   -2
      2   T     -2    0    0    1    0
      3   A     -3   -1   -1    0    2

F(1, 1) = max{ F(0,0) + s(A, A),
               F(0,1) - d,
               F(1,0) - d }
        = max{ 0 + 1, -1 - 1, -1 - 1 } = 1

Optimal alignment:
  A G T A
  A - T A

Slide from Serafim Batzoglou

72 of 86

The Needleman-Wunsch Matrix

Slide from Serafim Batzoglou

[Figure: an (M+1) × (N+1) matrix with x1 … xM along one axis and y1 … yN along the other]

Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences.

An optimal alignment is composed of optimal subalignments.

73 of 86

The Needleman-Wunsch Algorithm

  1. Initialization
     • F(0, 0) = 0
     • F(0, j) = -j × d
     • F(i, 0) = -i × d

  2. Main Iteration (filling in partial alignments)
     For each i = 1…M, for each j = 1…N:

       F(i, j) = max of:
         F(i-1, j-1) + s(xi, yj)   [case 1]
         F(i-1, j) - d             [case 2]
         F(i, j-1) - d             [case 3]

       Ptr(i, j) = DIAG if [case 1], LEFT if [case 2], UP if [case 3]

  3. Termination
     F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.

Slide from Serafim Batzoglou
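A compact Python sketch of the algorithm above, with the backtrace included (parameter defaults follow the earlier example: match m = 1, mismatch s = -1, gap penalty d = 1):

```python
def needleman_wunsch(x, y, m=1, s=-1, d=1):
    """Global alignment: returns the optimal score and one optimal alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"                # x prefix aligned to gaps
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"                  # y prefix aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else s
            choices = [(F[i - 1][j - 1] + score, "DIAG"),  # case 1: xi aligned to yj
                       (F[i - 1][j] - d, "LEFT"),          # case 2: xi aligned to a gap
                       (F[i][j - 1] - d, "UP")]            # case 3: yj aligned to a gap
            F[i][j], ptr[i][j] = max(choices)
    # Trace back from (M, N) to recover one optimal alignment.
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        p = ptr[i][j]
        if p == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif p == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("AGTA", "ATA"))   # (2, 'AGTA', 'A-TA'), as in the earlier example
```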

74 of 86

A variant of the basic algorithm:

  • Maybe it is OK to have an unlimited # of gaps in the beginning and end:

Slide from Serafim Batzoglou

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC

GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

  • If so, we don’t want to penalize gaps at the ends

75 of 86

Different types of overlaps

Slide from Serafim Batzoglou

Example: two overlapping “reads” from a sequencing project

Example: search for a mouse gene within a human chromosome

76 of 86

The Overlap Detection variant

Changes:

  1. Initialization
     For all i, j:
       F(i, 0) = 0
       F(0, j) = 0

  2. Termination
     FOPT = max{ maxi F(i, N), maxj F(M, j) }

Slide from Serafim Batzoglou

77 of 86

The local alignment problem

Given two strings x = x1……xM and y = y1……yN, find substrings x’, y’ whose similarity (optimal global alignment value) is maximum.

x = aaaacccccggggtta
y = ttcccgggaaccaacc

Slide from Serafim Batzoglou

78 of 86

Why local alignment – examples

  • Genes are shuffled between genomes

  • Portions of proteins (domains) are often conserved

Slide from Serafim Batzoglou

79 of 86

Cross-species genome similarity

  • 98% of genes are conserved between any two mammals
  • >70% average similarity in protein sequence

Slide from Serafim Batzoglou

hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001

mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001

rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938

fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001

mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001

rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938

fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001

mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001

rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938

fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001

mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001

rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938

fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

“atoh” enhancer in human, mouse, rat, fugu fish

80 of 86

The Smith-Waterman algorithm

Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch:

Initialization: F(0, j) = F(i, 0) = 0

Iteration:  F(i, j) = max of:
              0
              F(i-1, j) - d
              F(i, j-1) - d
              F(i-1, j-1) + s(xi, yj)

Slide from Serafim Batzoglou
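The same sketch with the two Smith-Waterman changes: every cell is clamped at 0, and the traceback starts from the best-scoring cell and stops when it reaches a 0 (same assumed m, s, d scoring as before):

```python
def smith_waterman(x, y, m=1, s=-1, d=1):
    """Best local alignment: returns its score and the aligned substrings."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]        # first row/column stay 0
    best, best_ij = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(0,                          # never go below zero
                          F[i - 1][j - 1] + score,    # diagonal
                          F[i - 1][j] - d,            # gap in y
                          F[i][j - 1] - d)            # gap in x
            if F[i][j] > best:
                best, best_ij = F[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    ax, ay = [], []
    i, j = best_ij
    while i > 0 and j > 0 and F[i][j] > 0:
        score = m if x[i - 1] == y[j - 1] else s
        if F[i][j] == F[i - 1][j - 1] + score:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("TAATA", "TACTAA"))   # local alignment of the example strings
```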

81 of 86

The Smith-Waterman algorithm

Termination:

  1. If we want the best local alignment:
     FOPT = maxi,j F(i, j); find FOPT and trace back

  2. If we want all local alignments scoring > t:
     For all i, j find F(i, j) > t, and trace back?
     Complicated by overlapping local alignments

Slide from Serafim Batzoglou

82 of 86

Local Alignment Example

Slide from Hasan Oğul

s = TAATA

t = ATCTAA

83 of 86

Local Alignment Example

Slide from Hasan Oğul

s = TAATA

t = TACTAA


86 of 86

Summary

  • Tokenization
    • Word Tokenization
    • Normalization
      • Lemmatization and stemming
    • Sentence Tokenization
  • Minimum Edit Distance
    • Levenshtein distance
    • Needleman-Wunsch (weighted global alignment)
    • Smith-Waterman (local alignment)
    • Applications to:
      • spell correction, machine translation, speech recognition
      • DNA fragment assembly, evolutionary similarity, mutation detection