SAM/BAM format, samtools, and IGV-ish
Applied Computational Genomics, Lecture 08
https://github.com/quinlan-lab/applied-computational-genomics
Aaron Quinlan
Departments of Human Genetics and Biomedical Informatics
USTAR Center for Genetic Discovery
University of Utah
quinlanlab.org
But first, two new UNIX tricks.
paste 1.fq 2.fq | sort | uniq -c &
nohup paste 1.fq 2.fq | sort | uniq -c > out.txt &
Sequence alignment software
Aligner | Approach | Applications | Availability |
BWA-mem | Burrows-Wheeler | DNA, SE, PE, SV | open-source |
Bowtie2 | Burrows-Wheeler | DNA, SE, PE, SV | open-source |
Novoalign | hash-based | DNA, SE, PE | free for academic use |
TopHat | Burrows-Wheeler | RNA-seq | open-source |
STAR | hash-based (reads) | RNA-seq | open-source |
GSNAP | hash-based (reads) | RNA-seq | open-source |
BWA-MEM: never "published" ; widely used.
https://arxiv.org/pdf/1303.3997v2.pdf
BWA-MEM
@seq1
ATTCGAAACA...
+
DDED88(999...
@seq2
CCCCGTTTCA...
+
AAC887BBAC...
BWA MEM
Unaligned
Sample Data
In FASTQ (SE or PE)
>chr1
TACCTCCAGGGGGCATCCTCCCCCCCAATTCGAAACACAATCGTAGCCCCTGGCACTACCTATGTGTGTCAATTCGGAGAGAGAGAGATTCACGAAAAAAAAGTCTGGACTCAACTAGGATACACACATTCGGCTACAGATACCAAAAAAAAAAAAAAAAAAATTTTCACCATTGAGGCACCACCTTCTCGTCGCTGCGTCGCTCTGCTCGCTTCGGCTAAAAATTCGCGCAATACATTCGGCTACAGATACCAAA
Reference genome (FASTA)
seq1 99 1 3666901 60 149M = 3666935 185 ATTCGAAACA... DDED88(999 MC:Z:151M MD:Z:149 RG:Z:15-0017315_1 NM:i:0 MQ:i:60 AS:i:149 XS:i:44
seq2 147 1 3666935 60 151M = 3666901 -185 CCCCGTTTCA... AAC887BBAC... MC:Z:149M MD:Z:151 RG:Z:15-0017315_1 NM:i:0 MQ:i:60 AS:i:151 XS:i:59
Aligned
Sample Data in
SAM format
BWA-MEM workflow
Create BWT of reference genome.
$ bwa index grch38.fa
Align paired-end FASTQ
to BWT index.
$ bwa mem -t 16 grch38.fa 1.fq 2.fq > sample.sam
Output is in SAM format.
Use multiple threads if you have a computer with multiple CPUs.
This takes a long time, but you do it once
Where do you find FASTA reference genomes?
SAM format: a text-based standard(!) for representing sequence alignments
SAM format overview
What critical information do we need for sequence alignments?
SAM format overview
Col # | Name | Meaning | Example |
1 | QNAME | Read or Pair name | HWI:ST156_1:278:1:1058:4544:0 |
2 | FLAG | Bitwise FLAG | Much more soon! |
3 | RNAME | Reference sequence name | chr1 |
4 | POS | 1-based alignment start coordinate | 8,724,005 |
5 | MAPQ | Mapping quality | 60 |
6 | CIGAR | Extended CIGAR string | Much more soon! |
7 | MRNM | If paired, the mate’s reference seq. | chr1 |
8 | MPOS | If paired, the mate’s alignment start | 8,724,505 |
9 | ISIZE | If paired, the insert size | 562 |
10 | SEQ | The sequence of the query/mate | ACAAATTCAG... |
11 | QUAL | The quality string for the query/mate | HHH$^^%$$$... |
12 | OPT | Optional Tags | XA:i:2, MD:Z:0T34G15 |
Col # | Name | Meaning | Example |
1 | QNAME | Read or Pair name | HWI:ST156_1:278:1:1058:4544:0 |
2 | FLAG | Bitwise FLAG | Much more soon! |
3 | RNAME | Reference sequence name | chr1 |
4 | POS | 1-based alignment start coordinate | 8,724,005 |
5 | MAPQ | Mapping quality | 60 |
6 | CIGAR | Extended CIGAR string | Much more soon! |
7 | MRNM | If paired, the mate’s reference seq. | chr1 |
8 | MPOS | If paired, the mate’s alignment start | 8,724,505 |
9 | ISIZE | If paired, the insert size | 562 |
10 | SEQ | The sequence of the query/mate | ACAAATTCAG... |
11 | QUAL | The quality string for the query/mate | HHH$^^%$$$... |
12 | OPT | Optional Tags | XA:i:2, MD:Z:0T34G15 |
1
2
3
4
ST-E00223:32:H5J57CCXX:6:2123:15189:52872 97 1 10001 0 4S15M1I54M2I50M25S 4 699063 0 ACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCCCCCCCACCCAACCCCACCCCCCAC AAFFFKKKKKKKKKKKKKKKKKKKKFKKFKKKKKKFKKAKKKKFFKKKKKFKF<FF7FFFFK7FK,AA,FKFFKF,FKKK7<KKK,,7FFF7F,AAFFK,AKA,,FFF<,A7FKK,,,7A,,,7,,,,,((,(,,(,,,,,,,(,,A(,,( MC:Z:80S11M1D58M MD:Z:119 RG:Z:15-0017315_1 NM:i:3 MQ:i:47 AS:i:104 XS:i:103
ST-E00223:46:HG7V5CCXX:2:1116:12601:22862 1123 1 10006 0 81M70S = 10106 143 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTCACCCTAACCCTACCCCTAACCCTAACCCTATCATTCACTCGAACCCTAACACTACCGCTAGCGCTAACTCCAGCCCGCACACTATCGCTAACCCTCACGC AAFFFKKKKKKKKKKKKKKKFKKA,,A,A,,77<,,A,,7FFK7,,,,AF,,,A7,,,77,,,7AA,,7,FFK,<A,,,,7<<,,,,AA,7,AA,,7,,,7,,,,,,,,,,,A7,,,,,,7,7,7(,(,((,,,7,A,F,7,,,<,,AA,, MC:Z:108S43M MD:Z:47C2A12A17 RG:Z:15-0017315_1 NM:i:3 MQ:i:2 AS:i:66 XS:i:71
ST-E00223:32:H5J57CCXX:5:2208:10074:43308 99 1 10008 36 101M1I41M7S = 10107 137 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTCACCCTCACCCTCACCCTAGCCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCG AAFFFKFKKKKKKKKKKKKKKKKFKKKKKKKFKKK7<KA7<F,AF7F,7,A,FA7<KKKKKFKKKK<,,FKAKKAFFA,A7,7,,,7,7,,,,,7F<,,7,,,,7,,A,,,<F7,,,7FA,7,,<F,,,<A,7<AFKK<,<,,7,,,,,( MC:Z:112S38M MD:Z:49A28A5A5A6A44 RG:Z:15-0017315_1 NM:i:6 MQ:i:36 AS:i:110 XS:i:113
ST-E00223:46:HG7V5CCXX:5:2119:12936:64896 99 1 10013 0 90M61S = 10176 211 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCATAACCCAAACCCTAACCCTAACCCTAACCCGAACCGTAAGCCAAAACATAACCACAACCATAACAATAACCAAAACCTTAACGTTAAACAT A<AFFKKKK,AAAFKK<FKKKKAK,<A,AFKKKKAFFFKKKKAFKKFAFFFKKA7,FFA,F,F,7F,AFF77AFFAFFKAFKKA,,FFKF,AA,F,,7F,A,,7A,,,7A,,,,,,AFK,,,AA,,,7FKA,A,AFF,<,,,,,,,77<AA MC:Z:99S48M4S MD:Z:9C49C6T23 RG:Z:15-0017315_1 NM:i:3 MQ:i:0 AS:i:75 XS:i:75
ST-E00223:32:H5J57CCXX:1:1205:17290:54577 99 1 10019 1 92M59S = 10354 414 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCCCACCCCAAACCCAAGCCCAACCCTACCCCCTAACCCCTAACCCCAACCCTGACC AAFFFKKKKFKKKKAKKKKKKKKKKKKKKKF7KKKKKKKKKKKKKFKKKKFKKKKKKKAKKKFFKKFFKKKK,<,7<FA,FFFK,,7FFF,,,,,A,((,(((,7,7A,,,,,A,,,777A,AAKK<<,7FFA7,A,,AA,<,,AFA,7AF MC:Z:72S79M MD:Z:92 RG:Z:15-0017315_1 NM:i:0 MQ:i:20 AS:i:92 XS:i:97
Recall: Edit distance
How many edits (changes) must be made to a word or kmer to make it match (align) to another word or kmer?
CURLED
HURLED
Edit distance = 1. Substitute C for H
SHORT
SHO-T
Edit distance = 1. Delete R
TGTTACGG
GGTTGACTA
?
TG-TT-ACGG
-GGTTGACTA
Edit distance = 5
TGTT-ACGG
GGTTGACTA
Edit distance = 4
The CIGAR string: encode the details of the alignment
Operation | Meaning |
M | Match* |
D | Deletion w.r.t. reference |
I | Insertion w.r.t. reference |
N | Split or spliced alignment |
S | Soft-clipping |
H | Hard-clipping |
P | Padding |
Reference:
Experimental:
ACCTGTC--TACCTTACG
ACCT-TCCATACTTTATC
4M1D2M2I7M2S
CIGAR string:
4M
1D
2M
2I
7M
2S
LENGTH/OPERATION
The extended CIGAR string: M become = and X
Operation | Meaning |
= | Exact match |
X | Mismatch |
D | Deletion w.r.t. reference |
I | Insertion w.r.t. reference |
N | Split or spliced alignment |
S | Soft-clipping |
H | Hard-clipping |
P | Padding |
Reference:
Experimental:
ACCTGTC--TACCTTACG
ACCT-TCCATACTTTATC
4=1D2=2I3=1X3=2S
CIGAR string:
4=
1D
2=
2I
3=
2S
1X
3=
The FLAG column
ST-E00223:32:H5J57CCXX:6:2123:15189:52872 97 1 10001
ST-E00223:46:HG7V5CCXX:2:1116:12601:22862 1123 1 10006
ST-E00223:32:H5J57CCXX:5:2208:10074:43308 99 1 10008
ST-E00223:46:HG7V5CCXX:5:2119:12936:64896 99 1 10013
ST-E00223:32:H5J57CCXX:1:1205:17290:54577 99 1 10019
ST-E00223:32:H5J57CCXX:6:1115:16844:11013 81 1 10026
ST-E00223:32:H5J57CCXX:7:2113:18935:32356 99 1 10032
ST-E00223:46:HG7V5CCXX:6:2117:3082:44239 99 1 10040
ST-E00223:46:HG7V5CCXX:5:2213:10744:58813 163 1 10074
ST-E00223:32:H5J57CCXX:4:1220:14651:8868 99 1 10086
Sequence ID
FLAG
CHROM
POS
The FLAG score
base2 | base10 | base16 | Meaning | Applies to: |
00000000001 | 1 | 0x0001 | The read originated from a paired sequencing molecule | Both |
00000000010 | 2 | 0x0002 | The read is mapped in a proper pair | Pairs only |
00000000100 | 4 | 0x0004 | The query sequence itself is unmapped | Both |
00000001000 | 8 | 0x0008 | The query’s mate is unmapped | Pairs only |
00000010000 | 16 | 0x0010 | Strand of the query (0 for forward; 1 for reverse strand) | Both |
00000100000 | 32 | 0x0020 | Strand of the query’s mate | Pairs only |
00001000000 | 64 | 0x0040 | The query is the first read in the pair | Pairs only |
00010000000 | 128 | 0x0080 | The read is the second read in the pair | Pairs only |
00100000000 | 256 | 0x0100 | The alignment is not primary | Both |
01000000000 | 512 | 0x0200 | The read fails platform/vendor quality checks | Both |
10000000000 | 1024 | 0x0400 | The read is either a PCR duplicate or an optical duplicate | Both |
base2 | base10 | base16 | Meaning | Applies to: |
00000000001 | 1 | 0x0001 | The read originated from a paired sequencing molecule | Both |
00000000010 | 2 | 0x0002 | The read is mapped in a proper pair | Pairs only |
00000000100 | 4 | 0x0004 | The query sequence itself is unmapped | Both |
00000001000 | 8 | 0x0008 | The query’s mate is unmapped | Pairs only |
00000010000 | 16 | 0x0010 | Strand of the query (0 for forward; 1 for reverse strand) | Both |
00000100000 | 32 | 0x0020 | Strand of the query’s mate | Pairs only |
00001000000 | 64 | 0x0040 | The query is the first read in the pair | Pairs only |
00010000000 | 128 | 0x0080 | The read is the second read in the pair | Pairs only |
00100000000 | 256 | 0x0100 | The alignment is not primary | Both |
01000000000 | 512 | 0x0200 | The read fails platform/vendor quality checks | Both |
10000000000 | 1024 | 0x0400 | The read is either a PCR duplicate or an optical duplicate | Both |
00001100011
26+25+21+20 = 64+32+2+1 = 99
ST-E00223:32:H5J57CCXX:4:1220:14651:8868 99 1 10086
Use samtools to convert SAM to BAM.
Create BWT of reference genome.
$ bwa index grch38.fa
Align paired-end FASTQ
to BWT index.
$ bwa mem -t 16 grch38.fa 1.fq 2.fq > sample.sam
Output is in SAM format.
Use multiple threads if you have a computer with multiple CPUs.
Convert SAM to BAM
$ samtools view -Sb sample.sam > sample.bam
Output is in BAM format.
However, it is unsorted - that is, random genomic order as reads are randomly placed in FASTQ by sequencer.
This takes a long time, but you do it once
What is the best way to use tons of disk space and have very inefficient analyses?
Text files of billions of alignments
Let's download a SAM file to play with
curl https://home.chpc.utah.edu/~u1007787/HG00096.sam.gz > HG00096.sam.gz
# uncompress the file
gzip -d HG00096.sam.gz
Peek indide. What is all that?
head HG00096.sam
Skip the header. Look at the alignment records
grep -v ^@ HG00096.sam | head
SRR062634.5360 181 20 45834704 0 35M65S = 45834704 0 CACCAATACAAAAACCCCCCACCCAAAAAAGAAAACTCCCTATCCCCTGTTCCTGTTCCATCCCCCACTGGTGGATGAGAGAGGGTCCCGGGAGGGGGGG #################################################################################################A8) XC:i:35 RG:Z:SRR062634
SRR062634.15082 181 20 37991695 0 35M65S = 37991695 0 AAAACTATTAAAAAGGTGTCCAAAAAAGAAAATAAAAAGCAAAGTTGGCTTATTTCCCAAAACCCACAGCCCAAAAAAAAGAAAAAAACCATTAAACAAA #################################################################################################### XC:i:35 RG:Z:SRR062634
SAMTOOLS: Converting and manipulating SAM/BAM
http://www.htslib.org/doc/samtools.html
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
Convert a SAM file to BAM for analysis. Essential.
samtools view
samtools view -b HG00096.sam > HG00096.bam
# compare the file sizes of the SAM and BAM files!!
# why is the BAM file so much smaller?
Peek in the BAM file with head
head HG00096.bam
# what is this garbage?!
BAM is binary, not text. Use samtools
samtools view HG00096.bam | head
Look at the chrom and position columns (3,4)
samtools view HG00096.bam | cut -f 3,4 | head
20 39038334
20 39038227
20 8140216
20 8140318
20 26129711
20 26129787
20 38810978
20 38811045
20 32585093
20 32585176
# the alignment records are not sorted by position. Bad! Why?
Sort the BAM file by position to do any analysis.
Need to do this is batches by classroom row!
samtools sort
samtools sort HG00096.bam > HG00096.sorted.bam
samtools view HG00096.sorted.bam | cut -f 3,4 | head -n 5
20 59993
20 59993
20 59999
20 60015
20 60026
Now we must index the sorted BAM file.
This allows tools to rapidly find all alignments associated with any arbitrary genomic locus (e.g., a gene of interest)
samtools index
samtools index HG00096.sorted.bam
ls -ltr
WE ARE READY FOR IGV. BUT WHAT FIRST?
Need to download IGV to our laptops, not malibu!
https://igv.org/doc/desktop/#DownloadPage/
Launch IGV
What are we missing?
We need to transfer the sorted BAM and index from malibu to our laptop
# leave malibu
exit
# use scp (secure copy) to transfer the BAM file from malibu to laptop
scp u1007787@malibu.genetics.utah.edu:~/HG00096.sorted.bam .
# use scp (secure copy) to transfer the BAM file from malibu to laptop
scp u1007787@malibu.genetics.utah.edu:~/HG00096.sorted.bam.bai .
Now we can use File->Load From File in IGV
IGV tutorial
https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial
Reading assignment before Tuesday
Pages 30-35 of section 1.3 of Jonathan Protchard’s book before class on Tuesday. 20 minutes of reading.
https://web.stanford.edu/group/pritchardlab/HGbook/Release-2023-09/HGBook-2023-09-chapters/HGBook-2023-09-23-ch1.3.pdf