1 of 29

Special Topics

CSE 373 A Su 25

Poll Link: https://pollev.com/aelysha

Slide Credit: Dr. Alison Feder, GENOME 373

2 of 29

Announcements

  • Final:
    • Rescheduling must be finalized today!
    • Scheduling for August 22nd due tomorrow!
  • Resubs:
    • Shortest Paths Resubs due August 22nd
  • Exercises:
    • EX 5 due August 18th (today)
    • EX 6 due August 19th (tomorrow), 24th at latest
  • Misc:
    • Evaluations due August 22nd
    • Canvas Quizzes due August 22nd

3 of 29

Counting Sorts

4 of 29

Comparison-Based Sorts: Runtime

  • Currently, we have not seen a sort that is faster than worst-case Θ(N log N)
    • Is it possible to sort N elements in N time in the worst case?

5 of 29

New Approach: Counting Sort

  • What if we didn’t use inequalities or compareTo?
    • Switch to enumeration, listing elements out in a sequence from first to last
      • EX: Monday - Tuesday - Wednesday etc.
  • Counting Sorts sort by enumeration

6 of 29

From Counting Sorts to Radix Sorts

  • Works well for small integers or single characters (N + R)
    • What about long or complex strings?
  • New Approach:
    • Divide strings or numbers into smaller pieces (like Tries)
    • Apply a counting sort for each letter or number

  1. MSD Radix Sort from Left to Right recursively
  2. LSD Radix Sort from Right to Left iteratively

7 of 29

MSD Radix Sort - L*(N + R) or N + R

8 of 29

LSD Radix Sort - L*(N + R)

9 of 29

Bucket Sort

10 of 29

Bucket Sort (N2 + K/R)

  • Continuing with the idea of groupings based on the range of values…
  • Establish some number of buckets to store data into (to represent ranges, starting digits, etc.)
  • Insert into the buckets based on match, creating chains
  • Sort items within each bucket with some comparison sort as a subsort (insertion sort, quicksort, etc.)
  • Iterate across structure in order to combine results

11 of 29

Bucket Sort (N2 + K/R)

12 of 29

Genome Informatics: Global Sequence Alignment

All credit to Dr. Alison Feder for the following slides!

13 of 29

Brief Terminology

  • Sequence = “letters” of DNA or RNA
    • We will use A, C, T, G and refer to them as base pairs
  • Alignment = process of looking at sequences next to each other to see what letters match or mismatch
    • Matching is based on direct characters being the same at the same positions
  • Variant = a difference in a DNA sequence compared to a reference sequence (baseline)
  • Gap = a placeholder dash to indicate a missing letter to allow us to align two sequences
    • Gaps do not pair to gaps!

14 of 29

What is Sequence Alignment?

  • A way to compare similar sequences of DNA, RNA, or proteins to help identify relationships
    • May be similar by conservation, due to evolutionary pressure
    • May be similar to indicate ancestral history (homologous regions)
  • We might come across a patient with symptoms we can’t pinpoint, and want to see if it has a genetic cause.
    • What if we sequence their DNA and compare it to a reference point to know if there are any worrying differences that point to disease-related variation?

15 of 29

Our Patient Scenario

  • What is the scale of this problem?
    • Reading through an entire patient’s DNA on paper manually would be close to 1.6 million pages, just for aligning a small sequence that might be 8 base pairs long
    • We will also see that several possible alignments can be found to a reference sequence, but testing every option in a brute-force manner is incredibly inefficient
  • We’ll see how DSA helps us address this problem.

16 of 29

Understanding Alignment

Reference:

To be aligned:

TGGAAGGGCTAATTCACTCCCAACGAAGACAAGA

CACCCCAA

17 of 29

Understanding Alignment

Some Possibilities:

TGGAAGGGCTAATTCACTCCCAACGAAGACAAGA

CACTCCCA

CAC-CCCAA

CTCCCAAC

To Align: CACCCCAA

How do we know what works best? We need a cost function!

18 of 29

Cost Matrix!

A

C

G

T

A

10

-5

0

-5

C

-5

10

-5

0

G

0

-5

10

-5

T

-5

0

-5

10

Gap penalty = -4

Based on the pairing you use, indicates if it is ‘good’ or ‘bad’.

Sequence 1

Sequence 2

19 of 29

Practice: Cost Matrix

A

C

G

T

A

10

-5

0

-5

C

-5

10

-5

0

G

0

-5

10

-5

T

-5

0

-5

10

Gap penalty = -4

C

A

C

T

C

C

C

A

C

A

C

C

C

C

A

A

10

10

10

0

10

10

-5

10

C

A

C

T

C

C

C

A

C

A

C

-

C

C

C

A

10

10

10

-4

10

10

10

10

By these rules, this is the better alignment

20 of 29

Issues with Scale

  • For the previous example, we didn’t consider every possibility
    • But we would need to do so to determine the absolute BEST option, right?
  • Let’s bring this back to DSA land…

Duplicates!

C

A

C

T

C

C

C

A

C

A

C

C

C

C

A

A

10

10

10

0

10

10

-5

10

C

A

C

T

C

C

C

A

C

A

C

-

C

C

C

A

10

10

10

-4

10

10

10

10

21 of 29

Issues with Scale

  • For the previous example, we didn’t consider every possibility
    • But we would need to do so to determine the absolute BEST option, right?
  • Let’s bring this back to DSA land…

Duplicates!

C

A

C

T

C

C

C

A

C

A

C

C

C

C

A

A

10

10

10

0

10

10

-5

10

C

A

C

T

C

C

C

A

C

A

C

-

C

C

C

A

10

10

10

-4

10

10

10

10

What tool can we use if we have a repeated subproblem that we reference to find the final solution to some problem?

22 of 29

Dynamic Programming

We can move in 3 different directions.

  • Right (+1 column) = Gap for Sequence 2, choose Sequence 1 letter
  • Down (+1 row) = Gap for Sequence 1, choose Sequence 2 letter

0

1

G

0

0

1

C

-4

-8

-G

C-

0

1

G

0

0

-4

1

C

-8

G-

-C

A

C

G

T

A

10

-5

0

-5

C

-5

10

-5

0

G

0

-5

10

-5

T

-5

0

-5

10

Gap penalty = -4

23 of 29

Dynamic Programming

We can move in 3 different directions.

  • Diagonal = Alignment, which can be a match or a mismatch

A

C

G

T

A

10

-5

0

-5

C

-5

10

-5

0

G

0

-5

10

-5

T

-5

0

-5

10

Gap penalty = -4

0

1

G

0

0

-4

1

C

-4

-5

G

C

For a given cell in our matrix, we choose the option with the lowest cost (largest gain). We then track that cost and the direction we took.

24 of 29

Dynamic Programming

0

1

2

3

4

5

G

A

A

T

C

0

1

C

2

A

3

T

4

A

5

C

A

C

G

T

A

10

-5

0

-5

C

-5

10

-5

0

G

0

-5

10

-5

T

-5

0

-5

10

Gap penalty = -4

We can move in 3 different directions.

  • Right = Gap for Sequence 2
  • Down = Gap for Sequence 1
  • Diagonal = Match or Mismatch

25 of 29

Dynamic Programming

0

1

2

3

4

5

G

A

A

T

C

0

0

-4

-8

-12

-16

-20

1

C

-4

-5

-9

-13

-12

-6

2

A

-8

-4

5

1

-3

-7

3

T

-12

-8

1

0

11

7

4

A

-16

-12

2

11

7

6

5

C

-20

-16

-2

7

11

17

A

C

G

T

A

10

-5

0

-5

C

-5

10

-5

0

G

0

-5

10

-5

T

-5

0

-5

10

GAAT-C

-CATAC

GA-ATC

CATA-C

26 of 29

Dynamic Programming

27 of 29

Needleman-Wunsch Algorithm

  1. Base Cases: Row 0 and Column 0 to be filled with gaps * i
  2. Tabulation: Iterate over 2d array and compute 3 different options (right, down, diagonal)
    1. Choose max value to populate cell with
  3. Trace Back: Start at bottom right and trace the path back to the top left, aligning each movement/direction to the global sequence alignment

n

Exhaustive enumeration

Dynamic programming

5

1.7 * 103

7.5 * 101

10

8.1 * 106

3.0 * 102

20

2.6 * 1014

1.2 * 103

30

9.6 * 1021

2.7 * 103

40

3.8 * 1029

4.8 * 103

28 of 29

Local Alignment: Smith-Waterman

  • There may be cases where we don’t want to match to the length of full strings
    • Full alignment is not expected in reality
  • What if we just want to find the best-matching subsequences?
    • Local alignment! Also uses DP.
  • Trick: Any negative values are instead replaced with 0’s.
    • When tracing back, we choose the largest value in the matrix and follow back until we hit a 0 and return this

29 of 29

Thanks! See you all Wednesday! ~

Today’s Song Rec:

“Cities”

Throttle