1 of 39

Alignment tools, part II

CSE 180

Yana Safonova

2 of 39

SAM format

Alignment section:

  • Tab separated fields
  • Query name, FLAG, reference name, position in the reference, mapping quality, CIGAR string, information about the mate read, read sequence, quality sequence, optional fields

3 of 39

CIGAR string and sequence fields

CIGAR = L1C1L2C2L3C3..., where

Li is the length of operation

Ci is the code of operation

4 of 39

CIGAR string and sequence fields

  • 293M = sequence with matches and mismatches of length 293

5 of 39

CIGAR string and sequence fields

  • 90M1I3M1D193M6S - ?

6 of 39

CIGAR string and sequence fields

  • 90M1I3M1D193M6S - 90 matches / mismatches, 1 nt insertion, 3 matches / mismatches, 1 nt deletion, 193 matches, 6 nt clipped (but present in the sequence!)
  • Length of aligned sequence = 90 + 1 + 3 + 0 + 193 + 6 = 293

7 of 39

CIGAR string and sequence fields

  • 171M1I30M22H - ?

8 of 39

CIGAR string and sequence fields

  • 171M1I30M22H - 171 matches / mismatches, 1 insertion, 30 matches / mismatches, 22 nt clipped (not present in sequence)
  • Length of aligned sequence = 171 + 1 + 30 = 202
  • Length of original sequence = 171 + 1 + 30 + 22 = 224

9 of 39

Alignment to CIGAR string

AC-CAGCGTACG Reference

|. | .|||...

AGGC-CCGTTTT Query

10 of 39

Alignment to CIGAR string

AC-CAGCGTACG Reference

|. | .|||...

AGGC-CCGTTTT Query

2M1I1M1D4M3S

11 of 39

CIGAR string to alignment

Read1: GGGGTGTAACCGACTAGGGGG

Query ID

FLAG

Reference name

Start position

Mapping quality

CIGAR string

Read1

0

REF

11

14

3S8M1D6M4S

12 of 39

CIGAR to alignment

Read1: GGGGTGTAACCGACTAGGGGG

AGCTAGCATCGTGTCGCCCGTCTAGCATACG…

||||..|| |.||||

gggGTGTAACC-GACTAGgggg

Query ID

FLAG

Reference name

Start position

Mapping quality

CIGAR string

Read1

0

REF

11

14

3S8M1D6M4S

13 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]

14 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]

if parsed_cigar[0][0] == ‘S’:

read = read[parsed_cigar[0][1] : ]

15 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]

if parsed_cigar[0][0] == ‘S’:

read = read[parsed_cigar[0][1] : ]

start_cigar_pos = 0

if parsed_cigar[0][0] in [‘S’, ‘H’]:

start_cigar_pos = 1

16 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]

if parsed_cigar[0][0] == ‘S’:

read = read[parsed_cigar[0][1] : ]

start_cigar_pos = 0

if parsed_cigar[0][0] in [‘S’, ‘H’]:

start_cigar_pos = 1

ref_pos = start_pos - 1 # SAM positions are 1-based

read_pos = 0

read_alignment = ‘’

ref_alignment = ‘’

17 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]

if parsed_cigar[0][0] == ‘S’:

read = read[parsed_cigar[0][1] : ]

start_cigar_pos = 0

if parsed_cigar[0][0] in [‘S’, ‘H’]:

start_cigar_pos = 1

ref_pos = start_pos - 1 # SAM positions are 1-based

read_pos = 0

read_alignment = ‘’

ref_alignment = ‘’

for i in range(start_cigar_pos, len(parsed_cigar)):

cur_oper = parsed_cigar[i][0]

cur_len = parsed_cigar[i][1]

...

18 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

...

for i in range(start_cigar_pos, len(parsed_cigar)):

cur_oper = parsed_cigar[i][0]

cur_len = parsed_cigar[i][1]

if cur_oper == ‘M’: # matches and mismatches

read_alignment = read[read_pos : read_pos + cur_len]

ref_alignment = ref[ref_pos : ref_pos + cur_len]

ref_pos += cur_len

read_pos += cur_len

19 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

...

for i in range(start_cigar_pos, len(parsed_cigar)):

cur_oper = parsed_cigar[i][0]

cur_len = parsed_cigar[i][1]

if cur_oper == ‘M’:

...

elif cur_oper == ‘D’: # deletion from the reference

read_alignment += ‘-’ * cur_len

ref_alignment += reference[ref_pos : ref_pos + cur_len]

ref_pos += cur_len

20 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

...

for i in range(start_cigar_pos, len(parsed_cigar)):

cur_oper = parsed_cigar[i][0]

cur_len = parsed_cigar[i][1]

if cur_oper == ‘M’:

...

elif cur_oper == ‘D’:

...

elif cur_oper == ‘I’: # insertion to the reference

ref_alignment += ‘-’ * cur_len

read_alignment += read[read_pos : read_pos + cur_len]

read_pos += cur_len

21 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

...

for i in range(start_cigar_pos, len(parsed_cigar)):

cur_oper = parsed_cigar[i][0]

cur_len = parsed_cigar[i][1]

if cur_oper == ‘M’:

...

elif cur_oper == ‘D’:

...

elif cur_oper == ‘I’:

...

else: # cur_oper == ‘H’ or cur_oper == ‘S’

break

22 of 39

CIGAR string to alignment

Input: parsed_cigar, reference, read, start_pos

...

alignment_str = ‘’

for i in range(len(read_alignment)):

if read_alignment[i] == ref_alignment[i]:

alignment_str += ‘|’

elif read_alignment[i] == ‘-’ or ref_alignment[i] == ‘-’:

alignment_str += ‘ ’

else:

alignment_str += ‘.’

print ref_alignment

print alignment_str

print read_alignment

23 of 39

SAM tools

  • BAM and CRAM are compressed binary versions of SAM file
  • Binary format enables fast manipulations of alignments

sort

index

view region

merge

phase

fixmate

24 of 39

Reference indexing in local alignment tools

  • Standard alignment of sequences of length M and N takes O(MN)
  • Alignment of K reads of length N takes k ⋅ O(MN)
  • Preprocessed reference enables computation of alignment in O(kN)
  • Reference preprocessing takes O(M)
  • The overall alignment takes O(M) + O(kN)
  • Alignment of new set of reads to the same library takes only O(kN)!

25 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

  • We want to align sequences ACGTA and AGGT
  • Let’s visualize all possible alignments as a table of size len(ACGTA) * len(AGGT)
  • Our goal is get from the green cell to the violet cell
  • We can move in this directions: →, ↓, and ↘

26 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

→ - insertion to AGGT

↓ - insertion to ACGTA

↘ - match / mismatch

27 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

→ - insertion to AGGT

↓ - insertion to ACGTA

↘ - match / mismatch

AC

A-

28 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

→ - insertion to AGGT

↓ - insertion to ACGTA

↘ - match / mismatch

ACG

A--

29 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

→ - insertion to AGGT

↓ - insertion to ACGTA

↘ - match / mismatch

ACGTA

A----

30 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

→ - insertion to AGGT

↓ - insertion to ACGTA

↘ - match / mismatch

ACGTA-

A----G

31 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

→ - insertion to AGGT

↓ - insertion to ACGTA

↘ - match / mismatch

ACGTA---

A----GGT

32 of 39

Alignment strategies: standard approach

A

C

G

T

A

A

G

G

T

→ - insertion to AGGT

↓ - insertion to ACGTA

↘ - match / mismatch

ACGTA

AGGT-

33 of 39

From alignment matrix to dot plots

Block1 Block2 Block1 Block3 Block1

Block1 Block4 Block3 Block1

34 of 39

From alignment matrix to dot plots

Block1 Block2 Block1 Block3 Block1

Block1 Block4 Block3 Block1

35 of 39

From alignment matrix to dot plots

Block1 Block2 Block1 Block3 Block1

Block1 Block4 Block3 Block1

36 of 39

From alignment matrix to dot plots

Block1 Block2 Block1 Block3 Block1

Block1 Block4 Block3 Block1

37 of 39

From alignment matrix to dot plots

Block1 Block2 Block1 Block3 Block1

Block1 Block4 Block3 Block1

38 of 39

From alignment matrix to dot plots

Block1 Block2 Block1 Block3 Block1

Block1 Block4 Block3 Block1

39 of 39

Dot plot examples

Drosophila melanogaster SLIT protein aligned against itself

Two substrains of E. coli K12: MG1655 (x) and W3110 (y)