1 of 29

Special Topics

CSE 373 A Su 25

Poll Link: https://pollev.com/aelysha

Slide Credit: Dr. Alison Feder, GENOME 373

2 of 29

Announcements

Final:

Rescheduling must be finalized today!
Scheduling for August 22nd due tomorrow!

Resubs:

Shortest Paths Resubs due August 22nd

Exercises:

EX 5 due August 18th (today)
EX 6 due August 19th (tomorrow), 24th at latest

Misc:

Evaluations due August 22nd
Canvas Quizzes due August 22nd

3 of 29

Counting Sorts

4 of 29

Comparison-Based Sorts: Runtime

Currently, we have not seen a sort that is faster than worst-case Θ(N log N)

Is it possible to sort N elements in N time in the worst case?

5 of 29

New Approach: Counting Sort

What if we didn’t use inequalities or compareTo?

Switch to enumeration, listing elements out in a sequence from first to last

EX: Monday - Tuesday - Wednesday etc.

Counting Sorts sort by enumeration

6 of 29

From Counting Sorts to Radix Sorts

Works well for small integers or single characters (N + R)

What about long or complex strings?

New Approach:

Divide strings or numbers into smaller pieces (like Tries)
Apply a counting sort for each letter or number

MSD Radix Sort from Left to Right recursively
LSD Radix Sort from Right to Left iteratively

7 of 29

MSD Radix Sort - L*(N + R) or N + R

8 of 29

LSD Radix Sort - L*(N + R)

9 of 29

Bucket Sort

10 of 29

Bucket Sort (N² + K/R)

Continuing with the idea of groupings based on the range of values…
Establish some number of buckets to store data into (to represent ranges, starting digits, etc.)
Insert into the buckets based on match, creating chains
Sort items within each bucket with some comparison sort as a subsort (insertion sort, quicksort, etc.)
Iterate across structure in order to combine results

11 of 29

Bucket Sort (N² + K/R)

12 of 29

Genome Informatics: Global Sequence Alignment

All credit to Dr. Alison Feder for the following slides!

13 of 29

Brief Terminology

Sequence = “letters” of DNA or RNA

We will use A, C, T, G and refer to them as base pairs

Alignment = process of looking at sequences next to each other to see what letters match or mismatch

Matching is based on direct characters being the same at the same positions

Variant = a difference in a DNA sequence compared to a reference sequence (baseline)
Gap = a placeholder dash to indicate a missing letter to allow us to align two sequences

Gaps do not pair to gaps!

14 of 29

What is Sequence Alignment?

A way to compare similar sequences of DNA, RNA, or proteins to help identify relationships

May be similar by conservation, due to evolutionary pressure
May be similar to indicate ancestral history (homologous regions)

We might come across a patient with symptoms we can’t pinpoint, and want to see if it has a genetic cause.

What if we sequence their DNA and compare it to a reference point to know if there are any worrying differences that point to disease-related variation?

15 of 29

Our Patient Scenario

What is the scale of this problem?

Reading through an entire patient’s DNA on paper manually would be close to 1.6 million pages, just for aligning a small sequence that might be 8 base pairs long
We will also see that several possible alignments can be found to a reference sequence, but testing every option in a brute-force manner is incredibly inefficient

We’ll see how DSA helps us address this problem.

16 of 29

Understanding Alignment

Reference:

To be aligned:

TGGAAGGGCTAATTCACTCCCAACGAAGACAAGA

CACCCCAA

17 of 29

Understanding Alignment

Some Possibilities:

TGGAAGGGCTAATTCACTCCCAACGAAGACAAGA

CACTCCCA

CAC-CCCAA

CTCCCAAC

To Align: CACCCCAA

How do we know what works best? We need a cost function!

18 of 29

Cost Matrix!

	A	C	G	T
A	10	-5	0	-5
C	-5	10	-5	0
G	0	-5	10	-5
T	-5	0	-5	10

Gap penalty = -4

Based on the pairing you use, indicates if it is ‘good’ or ‘bad’.

Sequence 1

Sequence 2

19 of 29

Practice: Cost Matrix

	A	C	G	T
A	10	-5	0	-5
C	-5	10	-5	0
G	0	-5	10	-5
T	-5	0	-5	10

Gap penalty = -4

C	A	C	T	C	C	C	A
C	A	C	C	C	C	A	A
10	10	10	0	10	10	-5	10

C	A	C	T	C	C	C	A
C	A	C	-	C	C	C	A
10	10	10	-4	10	10	10	10

By these rules, this is the better alignment

20 of 29

Issues with Scale

For the previous example, we didn’t consider every possibility

But we would need to do so to determine the absolute BEST option, right?

Let’s bring this back to DSA land…

Duplicates!

C	A	C	T	C	C	C	A
C	A	C	C	C	C	A	A
10	10	10	0	10	10	-5	10

C	A	C	T	C	C	C	A
C	A	C	-	C	C	C	A
10	10	10	-4	10	10	10	10

21 of 29

Issues with Scale

For the previous example, we didn’t consider every possibility

But we would need to do so to determine the absolute BEST option, right?

Let’s bring this back to DSA land…

Duplicates!

C	A	C	T	C	C	C	A
C	A	C	C	C	C	A	A
10	10	10	0	10	10	-5	10

C	A	C	T	C	C	C	A
C	A	C	-	C	C	C	A
10	10	10	-4	10	10	10	10

What tool can we use if we have a repeated subproblem that we reference to find the final solution to some problem?

22 of 29

Dynamic Programming

We can move in 3 different directions.

Right (+1 column) = Gap for Sequence 2, choose Sequence 1 letter
Down (+1 row) = Gap for Sequence 1, choose Sequence 2 letter

		0	1
			G
0		0
1	C	-4	-8

-G

C-

		0	1
			G
0		0	-4
1	C		-8

G-

-C

	A	C	G	T
A	10	-5	0	-5
C	-5	10	-5	0
G	0	-5	10	-5
T	-5	0	-5	10

Gap penalty = -4

23 of 29

Dynamic Programming

We can move in 3 different directions.

Diagonal = Alignment, which can be a match or a mismatch

	A	C	G	T
A	10	-5	0	-5
C	-5	10	-5	0
G	0	-5	10	-5
T	-5	0	-5	10

Gap penalty = -4

		0	1
			G
0		0	-4
1	C	-4	-5

G

C

For a given cell in our matrix, we choose the option with the lowest cost (largest gain). We then track that cost and the direction we took.

24 of 29

Dynamic Programming

		0	1	2	3	4	5
			G	A	A	T	C
0
1	C
2	A
3	T
4	A
5	C

	A	C	G	T
A	10	-5	0	-5
C	-5	10	-5	0
G	0	-5	10	-5
T	-5	0	-5	10

Gap penalty = -4

We can move in 3 different directions.

Right = Gap for Sequence 2
Down = Gap for Sequence 1
Diagonal = Match or Mismatch

25 of 29

Dynamic Programming

		0	1	2	3	4	5
			G	A	A	T	C
0		0	-4	-8	-12	-16	-20
1	C	-4	-5	-9	-13	-12	-6
2	A	-8	-4	5	1	-3	-7
3	T	-12	-8	1	0	11	7
4	A	-16	-12	2	11	7	6
5	C	-20	-16	-2	7	11	17

	A	C	G	T
A	10	-5	0	-5
C	-5	10	-5	0
G	0	-5	10	-5
T	-5	0	-5	10

GAAT-C

-CATAC

GA-ATC

CATA-C

26 of 29

Dynamic Programming

27 of 29

Needleman-Wunsch Algorithm

Base Cases: Row 0 and Column 0 to be filled with gaps * i
Tabulation: Iterate over 2d array and compute 3 different options (right, down, diagonal)

Choose max value to populate cell with

Trace Back: Start at bottom right and trace the path back to the top left, aligning each movement/direction to the global sequence alignment

n	Exhaustive enumeration	Dynamic programming
5	1.7 * 10³	7.5 * 10¹
10	8.1 * 10⁶	3.0 * 10²
20	2.6 * 10¹⁴	1.2 * 10³
30	9.6 * 10²¹	2.7 * 10³
40	3.8 * 10²⁹	4.8 * 10³

28 of 29

Local Alignment: Smith-Waterman

There may be cases where we don’t want to match to the length of full strings

Full alignment is not expected in reality

What if we just want to find the best-matching subsequences?

Local alignment! Also uses DP.

Trick: Any negative values are instead replaced with 0’s.

When tracing back, we choose the largest value in the matrix and follow back until we hit a 0 and return this

29 of 29

Thanks! See you all Wednesday! ~

Today’s Song Rec:

“Cities”

Throttle

		0	1	2	3	4	5
			G	A	A	T	C
0		0	-4	-8	-12	-16	-20
1	C	-4	-5	-9	-13	-12	-6
2	A	-8	-4	5	1	-3	-7
3	T	-12	-8	1	0	11	7
4	A	-16	-12	2	11	7	6
5	C	-20	-16	-2	7	11	17

		0	1	2	3	4	5
			G	A	A	T	C
0		0	-4	-8	-12	-16	-20
1	C	-4	-5	-9	-13	-12	-6
2	A	-8	-4	5	1	-3	-7
3	T	-12	-8	1	0	11	7
4	A	-16	-12	2	11	7	6
5	C	-20	-16	-2	7	11	17

		0	1	2	3	4	5
			G	A	A	T	C
0		0	-4	-8	-12	-16	-20
1	C	-4	-5	-9	-13	-12	-6
2	A	-8	-4	5	1	-3	-7
3	T	-12	-8	1	0	11	7
4	A	-16	-12	2	11	7	6
5	C	-20	-16	-2	7	11	17