1 of 38

CS481/CS583: Bioinformatics Algorithms

Can Alkan

EA509

calkan@cs.bilkent.edu.tr

http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/

2 of 38

CS481/CS583

  • Class hours:
      • Tue 13:30-15:20 - Fri 09:30-10:20
  • Classroom: B-Z04
  • Office hour: by appointment
  • TA: Akmuhammet Ashyralyyev (akmuhammet@bilkent), Zülal Bingöl (zulal.bingol@bilkent)
  • Grading:
    • 1 midterm: 25% + 1 final exam: 35%
    • Homeworks (programming): 30% (n=5 or 6)
    • Quizzes: 10% (n=5, announced)
  • Due to the YÖK (Higher Education Council) regulations, we are taking attendance and will report it to the Department at the end of the semester.
    • But attendance has no direct effect on grades
  • Grading Policy: Note that I do not discuss with students about grades. Therefore, I will not answer any questions about the passing grades and/or students' requests for passing the course. Any emails sent to this effect will be omitted.

3 of 38

CS481/CS583: Resources

  • All slides are available on the web page
    • http://cs.bilkent.edu.tr/~calkan/teaching/cs481/
    • Google Slides links -- read only
  • Recommended Material
    • Genome-Scale Algorithm Design, Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, Alexandru I. Tomescu, 2015, Cambridge University Press
    • Other recommended books are listed on the web page
    • Problem sets on Rosalind: http://rosalind.info/problems/locations/
    • Bioinformatics Algorithms online book & videos:
  • DO NOT TRY TO MEMORIZE!

4 of 38

CS481/CS583

  • Recommended Textbooks
    • Genome Scale Algorithm Design, Veli Makinen, et al., Cambridge University Press, 2015
    • An Introduction to Bioinformatics Algorithms (Computational Molecular Biology), Neil Jones and Pavel Pevzner, MIT Press, 2004
    • https://www.bioinformaticsalgorithms.org/
  • Additional:
    • Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Dan Gusfield, Cambridge University Press
    • Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison, Cambridge University Press
    • Bioinformatics: The Machine Learning Approach, Second Edition, Pierre Baldi, Soren Brunak, MIT Press
    • ROSALIND problem sets: http://rosalind.info/problems/locations/

5 of 38

Planned Lecture Cancellations

The following lectures are most likely to be canceled due to conference travels.

  • April 25, 2025
  • April 29, 2025

6 of 38

7 of 38

CS481/CS583

  • This course is about algorithms in the field of bioinformatics:
    • What are the problems?
    • What algorithms are developed for what problem?
    • Algorithm design techniques
  • This course is not about how to analyze biological data using available tools:
    • Recommended course: MBG 326: Introduction to Bioinformatics

8 of 38

CS481/CS583 and other courses

  • Includes elements from:
    • CS201/202: data structures -- implicit prerequisite
    • CS473: algorithms, dynamic programming, greedy algorithms, branch-and-bound, etc.
    • CS476: complexity, context-free grammars, DFA/NFA
    • IE400: branch-and-bound algorithms

9 of 38

CS481/CS583: Assumptions

  • You are assumed to know/understand
    • Computer science basics (CS101/102 or CS111/112)
      • CS201/202 would be better
      • CS473 would be even better
    • Data structures (trees, linked lists, queues, etc.)
    • Elementary algorithms (sorting, hashing, etc.)
    • Programming: C, C++ (preferred); Python, Java, Rust
      • Note: we will give bonus points for the “fastest” code in some homeworks
  • You don’t have to be a “biology expert” and we will not teach any biology in this course: MBG 110 would be sufficient

10 of 38

Bioinformatics Algorithms

  • Development of methods based on computer science for problems in biology and medicine
    • Sequence analysis (combinatorial and statistical/probabilistic methods)
    • Graph theory
    • Data mining
    • Database
    • Statistics
    • Image processing
    • Visualization
    • …..

CS 481/CS583

11 of 38

Bioinformatics: Applications

  • Human disease
    • Personalized Medicine
  • Genomics: Genome analysis, gene discovery, regulatory elements, etc.
  • Population genomics
  • Evolutionary biology
  • Proteomics: analysis of proteins, protein pathways, interactions
  • Transcriptomics: analysis of the transcriptome (RNA sequences)

12 of 38

Why would you learn these algorithms?

  • Most developed for research within other fields that include string processing, clustering, text-pattern search, etc.
  • Bioinformatics (non-academic) jobs on the rise:
    • Genomics England, Genome Asia, etc.: 100,000 genome projects
    • DNAnexus, SevenBridges, LifeBit: genome analysis on the cloud.

13 of 38

Genomics and healthcare

Stark et al., AJHG 2019

14 of 38

Tracking pathogens

http://www.nextstrain.org

15 of 38

(VERY) BRIEF INTRODUCTION TO COMPLEXITY

16 of 38

Tractable vs intractable

  • Tractable problems: there exists a solution with O(f(n)) run time, where f(n) is polynomial
  • P is the set of problems that are known to be solvable in polynomial time
  • NP is the set of problems that are verifiable in polynomial time (or, solvable by a non-deterministic Turing Machine in polynomial time)
    • NP: “non-deterministically polynomial”

17 of 38

NP-hard

  • NP-hard: non-deterministically polynomial - hard
    • Set of problems that are “at least as hard as the hardest problems in NP
    • There are no known polynomial time optimal solutions
    • There may be polynomial-time approximate solutions

18 of 38

NP-Complete

  • A decision problem C is in NPC if :
    • C is in NP
    • Every problem in NP is reducible to C in polynomial time

That means: if you could solve any NPC problem in polynomial time, then you can solve all of them in polynomial time

Decision problems: outputs “yes” or “no”

19 of 38

NP-intermediate

  • Problems that are in NP; but not in either NPC or NP-hard (as far as we know)

20 of 38

P vs. NP

  • We do not know whether P=NP or P≠NP
    • Principal unsolved problem in computer science
    • Most likely P≠NP

21 of 38

P vs. NP vs. NPC vs. NP-hard

22 of 38

Examples

  • P:
    • Sorting numbers, searching numbers, pairwise sequence alignment, etc.
  • NP-complete:
    • Subset-sum, traveling salesman, etc.
  • NP-intermediate:
    • Factorization, graph isomorphism, etc.

23 of 38

Historical reference

  • The notion of NP-Completeness: Stephen Cook and Leonid Levin independently in 1971
    • First NP-Complete problem to be identified: Boolean satisfiability problem (SAT)
      • Cook-Levin theorem
  • More NPC problems: Richard Karp, 1972
    • “21 NPC Problems”
  • Now there are thousands….

24 of 38

ALGORITHM DESIGN TECHNIQUES

25 of 38

Sample problem: Change

  • Input: An amount of money M, in cents
  • Output: Smallest number of coins that adds up to M
    • Quarters (25c): q
    • Dimes (10c): d
    • Nickels (5c): n
    • Pennies (1c): p
    • Or, in general, c1, c2, …, cd (d possible denominations)

26 of 38

Algorithm design techniques

  • Exhaustive search / brute force
    • Examine every possible alternative to find a solution

27 of 38

Algorithm design techniques

  • Greedy algorithms:
    • Choose the “most attractive” alternative at each iteration

28 of 38

Algorithm design techniques

  • Dynamic Programming:
    • Break problems into subproblems; solve subproblems; merge solutions of subproblems to solve the real problem
    • Keep track of computations to avoid recomputing values that you already solved
      • Dynamic programming table
    • Essentially, recursive algorithms that remember previous solutions

29 of 38

DP example: Rocks game

  • Two players
  • Two piles of rocks with p1 rocks in pile 1, and p2 rocks in pile 2
  • In turn, each player picks:
    • One rock from either pile 1 or pile 2; OR
    • One rock from pile 1 and one rock from pile2
  • The player that picks the last rock wins

30 of 38

DP algorithm for Player 1

  • Problem: p1 = p2 = 10
  • Solve more general problem of p1 = n and p2 = m
  • It’s hard to directly calculate for n=5 and m=6; we need to solve smaller problems

31 of 38

DP algorithm for Player 1

Initialize; obvious win for Player 1 for 1,0; 0,1 and 1,1

pile2

pile1

32 of 38

DP algorithm for Player 1

Player 1 cannot win for 2,0 and 0,2

pile2

pile1

33 of 38

DP algorithm for Player 1

Player 1 can win for 2,1 if he picks one from pile2

Player 1 can win for 1,2 if he picks one from pile1

pile2

pile1

34 of 38

DP algorithm for Player 1

Player 1 cannot win for 2,2

Any move causes his opponent to go to W state

pile2

pile1

35 of 38

DP “moves”

When you are at position (i,j)

Go to:

Pick from pile 1:

Pick from pile 2:

Pick from both piles 1 and 2:

(i-1, j)

(i, j-1)

(i-1, j-1)

36 of 38

DP final table

Also keep track of the choices you need to make to achieve W

and L states: traceback table

Player 1 LOSES the game

37 of 38

Note on dynamic programming

Dynamic programming methods solve smaller subproblems first to solve the bigger problem.

INDUCTION

Therefore, when computing the solution:

DO NOT EVER START FROM THE SINK!

SOURCE -> SINK: computing the answer

SINK -> SOURCE: backtracking after the answer is calculated

38 of 38

Other algorithm design techniques

  • Branch and bound:
    • Omit a large number of alternatives when performing brute force
  • Divide and conquer:
    • Split, solve, merge
      • Mergesort
  • Machine learning (CS 464):
    • Analyze previously available solutions, calculate statistics, apply most likely solution
  • Randomized algorithms:
    • Pick a solution randomly, test if it works. If not, pick another random solution