Alignment tools, part II
CSE 180
Yana Safonova
SAM format
Alignment section:
CIGAR string and sequence fields
CIGAR = L1C1L2C2L3C3..., where
Li is the length of operation
Ci is the code of operation
CIGAR string and sequence fields
CIGAR string and sequence fields
CIGAR string and sequence fields
CIGAR string and sequence fields
CIGAR string and sequence fields
Alignment to CIGAR string
AC-CAGCGTACG Reference
|. | .|||...
AGGC-CCGTTTT Query
Alignment to CIGAR string
AC-CAGCGTACG Reference
|. | .|||...
AGGC-CCGTTTT Query
2M1I1M1D4M3S
CIGAR string to alignment
Read1: GGGGTGTAACCGACTAGGGGG
Query ID | FLAG | Reference name | Start position | Mapping quality | CIGAR string |
Read1 | 0 | REF | 11 | 14 | 3S8M1D6M4S |
CIGAR to alignment
Read1: GGGGTGTAACCGACTAGGGGG
AGCTAGCATCGTGTCGCCCGTCTAGCATACG…
||||..|| |.||||
gggGTGTAACC-GACTAGgggg
Query ID | FLAG | Reference name | Start position | Mapping quality | CIGAR string |
Read1 | 0 | REF | 11 | 14 | 3S8M1D6M4S |
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]
if parsed_cigar[0][0] == ‘S’:
read = read[parsed_cigar[0][1] : ]
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]
if parsed_cigar[0][0] == ‘S’:
read = read[parsed_cigar[0][1] : ]
start_cigar_pos = 0
if parsed_cigar[0][0] in [‘S’, ‘H’]:
start_cigar_pos = 1
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]
if parsed_cigar[0][0] == ‘S’:
read = read[parsed_cigar[0][1] : ]
start_cigar_pos = 0
if parsed_cigar[0][0] in [‘S’, ‘H’]:
start_cigar_pos = 1
ref_pos = start_pos - 1 # SAM positions are 1-based
read_pos = 0
read_alignment = ‘’
ref_alignment = ‘’
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
# format of parsed cigar = [(‘S’, 1), (‘M’, 10), ...]
if parsed_cigar[0][0] == ‘S’:
read = read[parsed_cigar[0][1] : ]
start_cigar_pos = 0
if parsed_cigar[0][0] in [‘S’, ‘H’]:
start_cigar_pos = 1
ref_pos = start_pos - 1 # SAM positions are 1-based
read_pos = 0
read_alignment = ‘’
ref_alignment = ‘’
for i in range(start_cigar_pos, len(parsed_cigar)):
cur_oper = parsed_cigar[i][0]
cur_len = parsed_cigar[i][1]
...
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
...
for i in range(start_cigar_pos, len(parsed_cigar)):
cur_oper = parsed_cigar[i][0]
cur_len = parsed_cigar[i][1]
if cur_oper == ‘M’: # matches and mismatches
read_alignment = read[read_pos : read_pos + cur_len]
ref_alignment = ref[ref_pos : ref_pos + cur_len]
ref_pos += cur_len
read_pos += cur_len
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
...
for i in range(start_cigar_pos, len(parsed_cigar)):
cur_oper = parsed_cigar[i][0]
cur_len = parsed_cigar[i][1]
if cur_oper == ‘M’:
...
elif cur_oper == ‘D’: # deletion from the reference
read_alignment += ‘-’ * cur_len
ref_alignment += reference[ref_pos : ref_pos + cur_len]
ref_pos += cur_len
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
...
for i in range(start_cigar_pos, len(parsed_cigar)):
cur_oper = parsed_cigar[i][0]
cur_len = parsed_cigar[i][1]
if cur_oper == ‘M’:
...
elif cur_oper == ‘D’:
...
elif cur_oper == ‘I’: # insertion to the reference
ref_alignment += ‘-’ * cur_len
read_alignment += read[read_pos : read_pos + cur_len]
read_pos += cur_len
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
...
for i in range(start_cigar_pos, len(parsed_cigar)):
cur_oper = parsed_cigar[i][0]
cur_len = parsed_cigar[i][1]
if cur_oper == ‘M’:
...
elif cur_oper == ‘D’:
...
elif cur_oper == ‘I’:
...
else: # cur_oper == ‘H’ or cur_oper == ‘S’
break
CIGAR string to alignment
Input: parsed_cigar, reference, read, start_pos
...
alignment_str = ‘’
for i in range(len(read_alignment)):
if read_alignment[i] == ref_alignment[i]:
alignment_str += ‘|’
elif read_alignment[i] == ‘-’ or ref_alignment[i] == ‘-’:
alignment_str += ‘ ’
else:
alignment_str += ‘.’
print ref_alignment
print alignment_str
print read_alignment
SAM tools
sort
index
view region
merge
phase
fixmate
Reference indexing in local alignment tools
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
→ - insertion to AGGT
↓ - insertion to ACGTA
↘ - match / mismatch
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
→ - insertion to AGGT
↓ - insertion to ACGTA
↘ - match / mismatch
AC
A-
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
→ - insertion to AGGT
↓ - insertion to ACGTA
↘ - match / mismatch
ACG
A--
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
→ - insertion to AGGT
↓ - insertion to ACGTA
↘ - match / mismatch
ACGTA
A----
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
→ - insertion to AGGT
↓ - insertion to ACGTA
↘ - match / mismatch
ACGTA-
A----G
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
→ - insertion to AGGT
↓ - insertion to ACGTA
↘ - match / mismatch
ACGTA---
A----GGT
Alignment strategies: standard approach
| A | C | G | T | A |
A | | | | | |
G | | | | | |
G | | | | | |
T | | | | | |
→ - insertion to AGGT
↓ - insertion to ACGTA
↘ - match / mismatch
ACGTA
AGGT-
From alignment matrix to dot plots
Block1 Block2 Block1 Block3 Block1
Block1 Block4 Block3 Block1
From alignment matrix to dot plots
Block1 Block2 Block1 Block3 Block1
Block1 Block4 Block3 Block1
From alignment matrix to dot plots
Block1 Block2 Block1 Block3 Block1
Block1 Block4 Block3 Block1
From alignment matrix to dot plots
Block1 Block2 Block1 Block3 Block1
Block1 Block4 Block3 Block1
From alignment matrix to dot plots
Block1 Block2 Block1 Block3 Block1
Block1 Block4 Block3 Block1
From alignment matrix to dot plots
Block1 Block2 Block1 Block3 Block1
Block1 Block4 Block3 Block1
Dot plot examples
Drosophila melanogaster SLIT protein aligned against itself
Two substrains of E. coli K12: MG1655 (x) and W3110 (y)