1 of 39

SAM/BAM format, samtools, and IGV-ish

Applied Computational Genomics, Lecture 08

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

2 of 39

But first, two new UNIX tricks.

Run a command in the "background". That is, run the command, but give you control of your prompt.

paste 1.fq 2.fq | sort | uniq -c &

Run in the background and keep running even if you terminate your session.

nohup paste 1.fq 2.fq | sort | uniq -c > out.txt &

3 of 39

Sequence alignment software

Aligner	Approach	Applications	Availability
BWA-mem	Burrows-Wheeler	DNA, SE, PE, SV	open-source
Bowtie2	Burrows-Wheeler	DNA, SE, PE, SV	open-source
Novoalign	hash-based	DNA, SE, PE	free for academic use
TopHat	Burrows-Wheeler	RNA-seq	open-source
STAR	hash-based (reads)	RNA-seq	open-source
GSNAP	hash-based (reads)	RNA-seq	open-source

4 of 39

BWA-MEM: never "published" ; widely used.

https://arxiv.org/pdf/1303.3997v2.pdf

5 of 39

BWA-MEM

@seq1

ATTCGAAACA...

+

DDED88(999...

@seq2

CCCCGTTTCA...

+

AAC887BBAC...

BWA MEM

Unaligned

Sample Data

In FASTQ (SE or PE)

>chr1

TACCTCCAGGGGGCATCCTCCCCCCCAATTCGAAACACAATCGTAGCCCCTGGCACTACCTATGTGTGTCAATTCGGAGAGAGAGAGATTCACGAAAAAAAAGTCTGGACTCAACTAGGATACACACATTCGGCTACAGATACCAAAAAAAAAAAAAAAAAAATTTTCACCATTGAGGCACCACCTTCTCGTCGCTGCGTCGCTCTGCTCGCTTCGGCTAAAAATTCGCGCAATACATTCGGCTACAGATACCAAA

Reference genome (FASTA)

seq1 99 1 3666901 60 149M = 3666935 185 ATTCGAAACA... DDED88(999 MC:Z:151M MD:Z:149 RG:Z:15-0017315_1 NM:i:0 MQ:i:60 AS:i:149 XS:i:44

seq2 147 1 3666935 60 151M = 3666901 -185 CCCCGTTTCA... AAC887BBAC... MC:Z:149M MD:Z:151 RG:Z:15-0017315_1 NM:i:0 MQ:i:60 AS:i:151 XS:i:59

Aligned

Sample Data in

SAM format

6 of 39

BWA-MEM workflow

Create BWT of reference genome.

$ bwa index grch38.fa

Align paired-end FASTQ

to BWT index.

$ bwa mem -t 16 grch38.fa 1.fq 2.fq > sample.sam

Output is in SAM format.

Use multiple threads if you have a computer with multiple CPUs.

This takes a long time, but you do it once

7 of 39

Where do you find FASTA reference genomes?

https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/

https://hgdownload.soe.ucsc.edu/downloads.html

8 of 39

SAM format: a text-based standard(!) for representing sequence alignments

9 of 39

SAM format overview

In the dark ages, sequence aligners used disparate output formats. Pain.
1000 Genomes Project sought to standardize. Standards are good.
The result is imperfect, but it’s a huge improvement.
Strengths of the SAM and BAM formats

Compressed: less disk hungry
Indexed: fast viewing, slicing, etc.
Single-end and paired-end
Relatively simple to produce
Good toolkits available

10 of 39

What critical information do we need for sequence alignments?

11 of 39

SAM format overview

http://samtools.sourceforge.net/samtools.shtml

Col #	Name	Meaning	Example
1	QNAME	Read or Pair name	HWI:ST156_1:278:1:1058:4544:0
2	FLAG	Bitwise FLAG	Much more soon!
3	RNAME	Reference sequence name	chr1
4	POS	1-based alignment start coordinate	8,724,005
5	MAPQ	Mapping quality	60
6	CIGAR	Extended CIGAR string	Much more soon!
7	MRNM	If paired, the mate’s reference seq.	chr1
8	MPOS	If paired, the mate’s alignment start	8,724,505
9	ISIZE	If paired, the insert size	562
10	SEQ	The sequence of the query/mate	ACAAATTCAG...
11	QUAL	The quality string for the query/mate	HHH$^^%$$$...
12	OPT	Optional Tags	XA:i:2, MD:Z:0T34G15

12 of 39

Col #	Name	Meaning	Example
1	QNAME	Read or Pair name	HWI:ST156_1:278:1:1058:4544:0
2	FLAG	Bitwise FLAG	Much more soon!
3	RNAME	Reference sequence name	chr1
4	POS	1-based alignment start coordinate	8,724,005
5	MAPQ	Mapping quality	60
6	CIGAR	Extended CIGAR string	Much more soon!
7	MRNM	If paired, the mate’s reference seq.	chr1
8	MPOS	If paired, the mate’s alignment start	8,724,505
9	ISIZE	If paired, the insert size	562
10	SEQ	The sequence of the query/mate	ACAAATTCAG...
11	QUAL	The quality string for the query/mate	HHH$^^%$$$...
12	OPT	Optional Tags	XA:i:2, MD:Z:0T34G15

1

2

3

4

13 of 39

ST-E00223:32:H5J57CCXX:6:2123:15189:52872 97 1 10001 0 4S15M1I54M2I50M25S 4 699063 0 ACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCCCCCCCACCCAACCCCACCCCCCAC AAFFFKKKKKKKKKKKKKKKKKKKKFKKFKKKKKKFKKAKKKKFFKKKKKFKF<FF7FFFFK7FK,AA,FKFFKF,FKKK7<KKK,,7FFF7F,AAFFK,AKA,,FFF<,A7FKK,,,7A,,,7,,,,,((,(,,(,,,,,,,(,,A(,,( MC:Z:80S11M1D58M MD:Z:119 RG:Z:15-0017315_1 NM:i:3 MQ:i:47 AS:i:104 XS:i:103

ST-E00223:46:HG7V5CCXX:2:1116:12601:22862 1123 1 10006 0 81M70S = 10106 143 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTCACCCTAACCCTACCCCTAACCCTAACCCTATCATTCACTCGAACCCTAACACTACCGCTAGCGCTAACTCCAGCCCGCACACTATCGCTAACCCTCACGC AAFFFKKKKKKKKKKKKKKKFKKA,,A,A,,77<,,A,,7FFK7,,,,AF,,,A7,,,77,,,7AA,,7,FFK,<A,,,,7<<,,,,AA,7,AA,,7,,,7,,,,,,,,,,,A7,,,,,,7,7,7(,(,((,,,7,A,F,7,,,<,,AA,, MC:Z:108S43M MD:Z:47C2A12A17 RG:Z:15-0017315_1 NM:i:3 MQ:i:2 AS:i:66 XS:i:71

ST-E00223:32:H5J57CCXX:5:2208:10074:43308 99 1 10008 36 101M1I41M7S = 10107 137 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTCACCCTCACCCTCACCCTAGCCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCG AAFFFKFKKKKKKKKKKKKKKKKFKKKKKKKFKKK7<KA7<F,AF7F,7,A,FA7<KKKKKFKKKK<,,FKAKKAFFA,A7,7,,,7,7,,,,,7F<,,7,,,,7,,A,,,<F7,,,7FA,7,,<F,,,<A,7<AFKK<,<,,7,,,,,( MC:Z:112S38M MD:Z:49A28A5A5A6A44 RG:Z:15-0017315_1 NM:i:6 MQ:i:36 AS:i:110 XS:i:113

ST-E00223:46:HG7V5CCXX:5:2119:12936:64896 99 1 10013 0 90M61S = 10176 211 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCATAACCCAAACCCTAACCCTAACCCTAACCCGAACCGTAAGCCAAAACATAACCACAACCATAACAATAACCAAAACCTTAACGTTAAACAT A<AFFKKKK,AAAFKK<FKKKKAK,<A,AFKKKKAFFFKKKKAFKKFAFFFKKA7,FFA,F,F,7F,AFF77AFFAFFKAFKKA,,FFKF,AA,F,,7F,A,,7A,,,7A,,,,,,AFK,,,AA,,,7FKA,A,AFF,<,,,,,,,77<AA MC:Z:99S48M4S MD:Z:9C49C6T23 RG:Z:15-0017315_1 NM:i:3 MQ:i:0 AS:i:75 XS:i:75

ST-E00223:32:H5J57CCXX:1:1205:17290:54577 99 1 10019 1 92M59S = 10354 414 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCCCACCCCAAACCCAAGCCCAACCCTACCCCCTAACCCCTAACCCCAACCCTGACC AAFFFKKKKFKKKKAKKKKKKKKKKKKKKKF7KKKKKKKKKKKKKFKKKKFKKKKKKKAKKKFFKKFFKKKK,<,7<FA,FFFK,,7FFF,,,,,A,((,(((,7,7A,,,,,A,,,777A,AAKK<<,7FFA7,A,,AA,<,,AFA,7AF MC:Z:72S79M MD:Z:92 RG:Z:15-0017315_1 NM:i:0 MQ:i:20 AS:i:92 XS:i:97

14 of 39

Recall: Edit distance

How many edits (changes) must be made to a word or kmer to make it match (align) to another word or kmer?

CURLED

HURLED

Edit distance = 1. Substitute C for H

SHORT

SHO-T

Edit distance = 1. Delete R

TGTTACGG

GGTTGACTA

?

TG-TT-ACGG

-GGTTGACTA

Edit distance = 5

TGTT-ACGG

GGTTGACTA

Edit distance = 4

15 of 39

The CIGAR string: encode the details of the alignment

Operation	Meaning
M	Match*
D	Deletion w.r.t. reference
I	Insertion w.r.t. reference
N	Split or spliced alignment
S	Soft-clipping
H	Hard-clipping
P	Padding

Reference:

Experimental:

ACCTGTC--TACCTTACG

ACCT-TCCATACTTTATC

4M1D2M2I7M2S

CIGAR string:

4M

1D

2M

2I

7M

2S

LENGTH/OPERATION

16 of 39

The extended CIGAR string: M become = and X

Operation	Meaning
=	Exact match
X	Mismatch
D	Deletion w.r.t. reference
I	Insertion w.r.t. reference
N	Split or spliced alignment
S	Soft-clipping
H	Hard-clipping
P	Padding

Reference:

Experimental:

ACCTGTC--TACCTTACG

ACCT-TCCATACTTTATC

4=1D2=2I3=1X3=2S

CIGAR string:

4=

1D

2=

2I

3=

2S

1X

3=

17 of 39

The FLAG column

ST-E00223:32:H5J57CCXX:6:2123:15189:52872 97 1 10001

ST-E00223:46:HG7V5CCXX:2:1116:12601:22862 1123 1 10006

ST-E00223:32:H5J57CCXX:5:2208:10074:43308 99 1 10008

ST-E00223:46:HG7V5CCXX:5:2119:12936:64896 99 1 10013

ST-E00223:32:H5J57CCXX:1:1205:17290:54577 99 1 10019

ST-E00223:32:H5J57CCXX:6:1115:16844:11013 81 1 10026

ST-E00223:32:H5J57CCXX:7:2113:18935:32356 99 1 10032

ST-E00223:46:HG7V5CCXX:6:2117:3082:44239 99 1 10040

ST-E00223:46:HG7V5CCXX:5:2213:10744:58813 163 1 10074

ST-E00223:32:H5J57CCXX:4:1220:14651:8868 99 1 10086

Sequence ID

FLAG

CHROM

POS

18 of 39

The FLAG score

base2	base10	base16	Meaning	Applies to:
00000000001	1	0x0001	The read originated from a paired sequencing molecule	Both
00000000010	2	0x0002	The read is mapped in a proper pair	Pairs only
00000000100	4	0x0004	The query sequence itself is unmapped	Both
00000001000	8	0x0008	The query’s mate is unmapped	Pairs only
00000010000	16	0x0010	Strand of the query (0 for forward; 1 for reverse strand)	Both
00000100000	32	0x0020	Strand of the query’s mate	Pairs only
00001000000	64	0x0040	The query is the first read in the pair	Pairs only
00010000000	128	0x0080	The read is the second read in the pair	Pairs only
00100000000	256	0x0100	The alignment is not primary	Both
01000000000	512	0x0200	The read fails platform/vendor quality checks	Both
10000000000	1024	0x0400	The read is either a PCR duplicate or an optical duplicate	Both

19 of 39

base2	base10	base16	Meaning	Applies to:
00000000001	1	0x0001	The read originated from a paired sequencing molecule	Both
00000000010	2	0x0002	The read is mapped in a proper pair	Pairs only
00000000100	4	0x0004	The query sequence itself is unmapped	Both
00000001000	8	0x0008	The query’s mate is unmapped	Pairs only
00000010000	16	0x0010	Strand of the query (0 for forward; 1 for reverse strand)	Both
00000100000	32	0x0020	Strand of the query’s mate	Pairs only
00001000000	64	0x0040	The query is the first read in the pair	Pairs only
00010000000	128	0x0080	The read is the second read in the pair	Pairs only
00100000000	256	0x0100	The alignment is not primary	Both
01000000000	512	0x0200	The read fails platform/vendor quality checks	Both
10000000000	1024	0x0400	The read is either a PCR duplicate or an optical duplicate	Both

00001100011

2⁶+2⁵+2¹+2⁰ = 64+32+2+1 = 99

ST-E00223:32:H5J57CCXX:4:1220:14651:8868 99 1 10086

20 of 39

Use samtools to convert SAM to BAM.

Create BWT of reference genome.

$ bwa index grch38.fa

Align paired-end FASTQ

to BWT index.

$ bwa mem -t 16 grch38.fa 1.fq 2.fq > sample.sam

Output is in SAM format.

Use multiple threads if you have a computer with multiple CPUs.

Convert SAM to BAM

$ samtools view -Sb sample.sam > sample.bam

Output is in BAM format.

However, it is unsorted - that is, random genomic order as reads are randomly placed in FASTQ by sequencer.

This takes a long time, but you do it once

21 of 39

What is the best way to use tons of disk space and have very inefficient analyses?

Text files of billions of alignments

22 of 39

Let's download a SAM file to play with

curl https://home.chpc.utah.edu/~u1007787/HG00096.sam.gz > HG00096.sam.gz

# uncompress the file

gzip -d HG00096.sam.gz

23 of 39

Peek indide. What is all that?

head HG00096.sam

24 of 39

Skip the header. Look at the alignment records

grep -v ^@ HG00096.sam | head

SRR062634.5360 181 20 45834704 0 35M65S = 45834704 0 CACCAATACAAAAACCCCCCACCCAAAAAAGAAAACTCCCTATCCCCTGTTCCTGTTCCATCCCCCACTGGTGGATGAGAGAGGGTCCCGGGAGGGGGGG #################################################################################################A8) XC:i:35 RG:Z:SRR062634

SRR062634.15082 181 20 37991695 0 35M65S = 37991695 0 AAAACTATTAAAAAGGTGTCCAAAAAAGAAAATAAAAAGCAAAGTTGGCTTATTTCCCAAAACCCACAGCCCAAAAAAAAGAAAAAAACCATTAAACAAA #################################################################################################### XC:i:35 RG:Z:SRR062634

25 of 39

SAMTOOLS: Converting and manipulating SAM/BAM

http://www.htslib.org/doc/samtools.html

Commands:

-- Indexing

dict create a sequence dictionary file

faidx index/extract FASTA

index index alignment

-- Editing

calmd recalculate MD/NM tags and '=' bases

fixmate fix mate information

reheader replace BAM header

rmdup remove PCR duplicates

targetcut cut fosmid regions (for fosmid pool only)

addreplacerg adds or replaces RG tags

-- Viewing

flags explain BAM flags

tview text alignment viewer

view SAM<->BAM<->CRAM conversion

depad convert padded BAM to unpadded BAM

26 of 39

Convert a SAM file to BAM for analysis. Essential.

samtools view

samtools view -b HG00096.sam > HG00096.bam

# compare the file sizes of the SAM and BAM files!!

# why is the BAM file so much smaller?

27 of 39

Peek in the BAM file with head

head HG00096.bam

# what is this garbage?!

28 of 39

BAM is binary, not text. Use samtools

samtools view HG00096.bam | head

29 of 39

Look at the chrom and position columns (3,4)

samtools view HG00096.bam | cut -f 3,4 | head

20 39038334

20 39038227

20 8140216

20 8140318

20 26129711

20 26129787

20 38810978

20 38811045

20 32585093

20 32585176

# the alignment records are not sorted by position. Bad! Why?

30 of 39

Sort the BAM file by position to do any analysis.

Need to do this is batches by classroom row!

samtools sort

samtools sort HG00096.bam > HG00096.sorted.bam

samtools view HG00096.sorted.bam | cut -f 3,4 | head -n 5

20 59993

20 59999

20 60015

20 60026

31 of 39

Now we must index the sorted BAM file.

This allows tools to rapidly find all alignments associated with any arbitrary genomic locus (e.g., a gene of interest)

samtools index

samtools index HG00096.sorted.bam

ls -ltr

32 of 39

WE ARE READY FOR IGV. BUT WHAT FIRST?

33 of 39

Need to download IGV to our laptops, not malibu!

https://igv.org/doc/desktop/#DownloadPage/

34 of 39

Launch IGV

35 of 39

What are we missing?

36 of 39

We need to transfer the sorted BAM and index from malibu to our laptop

# leave malibu

exit

# use scp (secure copy) to transfer the BAM file from malibu to laptop

scp u1007787@malibu.genetics.utah.edu:~/HG00096.sorted.bam .

# use scp (secure copy) to transfer the BAM file from malibu to laptop

scp u1007787@malibu.genetics.utah.edu:~/HG00096.sorted.bam.bai .

37 of 39

Now we can use File->Load From File in IGV

38 of 39

IGV tutorial

https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial

39 of 39

Reading assignment before Tuesday

Pages 30-35 of section 1.3 of Jonathan Protchard’s book before class on Tuesday. 20 minutes of reading.

https://web.stanford.edu/group/pritchardlab/HGbook/Release-2023-09/HGBook-2023-09-chapters/HGBook-2023-09-23-ch1.3.pdf