1 of 39

SAM/BAM format, samtools, and IGV-ish

Applied Computational Genomics, Lecture 08

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

2 of 39

But first, two new UNIX tricks.

  • Run a command in the "background". That is, run the command, but give you control of your prompt.

paste 1.fq 2.fq | sort | uniq -c &

  • Run in the background and keep running even if you terminate your session.

nohup paste 1.fq 2.fq | sort | uniq -c > out.txt &

3 of 39

Sequence alignment software

Aligner

Approach

Applications

Availability

BWA-mem

Burrows-Wheeler

DNA, SE, PE, SV

open-source

Bowtie2

Burrows-Wheeler

DNA, SE, PE, SV

open-source

Novoalign

hash-based

DNA, SE, PE

free for academic use

TopHat

Burrows-Wheeler

RNA-seq

open-source

STAR

hash-based (reads)

RNA-seq

open-source

GSNAP

hash-based (reads)

RNA-seq

open-source

4 of 39

BWA-MEM: never "published" ; widely used.

https://arxiv.org/pdf/1303.3997v2.pdf

5 of 39

BWA-MEM

@seq1

ATTCGAAACA...

+

DDED88(999...

@seq2

CCCCGTTTCA...

+

AAC887BBAC...

BWA MEM

Unaligned

Sample Data

In FASTQ (SE or PE)

>chr1

TACCTCCAGGGGGCATCCTCCCCCCCAATTCGAAACACAATCGTAGCCCCTGGCACTACCTATGTGTGTCAATTCGGAGAGAGAGAGATTCACGAAAAAAAAGTCTGGACTCAACTAGGATACACACATTCGGCTACAGATACCAAAAAAAAAAAAAAAAAAATTTTCACCATTGAGGCACCACCTTCTCGTCGCTGCGTCGCTCTGCTCGCTTCGGCTAAAAATTCGCGCAATACATTCGGCTACAGATACCAAA

Reference genome (FASTA)

seq1 99 1 3666901 60 149M = 3666935 185 ATTCGAAACA... DDED88(999 MC:Z:151M MD:Z:149 RG:Z:15-0017315_1 NM:i:0 MQ:i:60 AS:i:149 XS:i:44

seq2 147 1 3666935 60 151M = 3666901 -185 CCCCGTTTCA... AAC887BBAC... MC:Z:149M MD:Z:151 RG:Z:15-0017315_1 NM:i:0 MQ:i:60 AS:i:151 XS:i:59

Aligned

Sample Data in

SAM format

6 of 39

BWA-MEM workflow

Create BWT of reference genome.

$ bwa index grch38.fa

Align paired-end FASTQ

to BWT index.

$ bwa mem -t 16 grch38.fa 1.fq 2.fq > sample.sam

Output is in SAM format.

Use multiple threads if you have a computer with multiple CPUs.

This takes a long time, but you do it once

7 of 39

Where do you find FASTA reference genomes?

8 of 39

SAM format: a text-based standard(!) for representing sequence alignments

9 of 39

SAM format overview

  • In the dark ages, sequence aligners used disparate output formats. Pain.
  • 1000 Genomes Project sought to standardize. Standards are good.
  • The result is imperfect, but it’s a huge improvement.
  • Strengths of the SAM and BAM formats
      • Compressed: less disk hungry
      • Indexed: fast viewing, slicing, etc.
      • Single-end and paired-end
      • Relatively simple to produce
      • Good toolkits available

10 of 39

What critical information do we need for sequence alignments?

11 of 39

SAM format overview

Col #

Name

Meaning

Example

1

QNAME

Read or Pair name

HWI:ST156_1:278:1:1058:4544:0

2

FLAG

Bitwise FLAG

Much more soon!

3

RNAME

Reference sequence name

chr1

4

POS

1-based alignment start coordinate

8,724,005

5

MAPQ

Mapping quality

60

6

CIGAR

Extended CIGAR string

Much more soon!

7

MRNM

If paired, the mate’s reference seq.

chr1

8

MPOS

If paired, the mate’s alignment start

8,724,505

9

ISIZE

If paired, the insert size

562

10

SEQ

The sequence of the query/mate

ACAAATTCAG...

11

QUAL

The quality string for the query/mate

HHH$^^%$$$...

12

OPT

Optional Tags

XA:i:2, MD:Z:0T34G15

12 of 39

Col #

Name

Meaning

Example

1

QNAME

Read or Pair name

HWI:ST156_1:278:1:1058:4544:0

2

FLAG

Bitwise FLAG

Much more soon!

3

RNAME

Reference sequence name

chr1

4

POS

1-based alignment start coordinate

8,724,005

5

MAPQ

Mapping quality

60

6

CIGAR

Extended CIGAR string

Much more soon!

7

MRNM

If paired, the mate’s reference seq.

chr1

8

MPOS

If paired, the mate’s alignment start

8,724,505

9

ISIZE

If paired, the insert size

562

10

SEQ

The sequence of the query/mate

ACAAATTCAG...

11

QUAL

The quality string for the query/mate

HHH$^^%$$$...

12

OPT

Optional Tags

XA:i:2, MD:Z:0T34G15

1

2

3

4

13 of 39

ST-E00223:32:H5J57CCXX:6:2123:15189:52872 97 1 10001 0 4S15M1I54M2I50M25S 4 699063 0 ACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCCCCCCCACCCAACCCCACCCCCCAC AAFFFKKKKKKKKKKKKKKKKKKKKFKKFKKKKKKFKKAKKKKFFKKKKKFKF<FF7FFFFK7FK,AA,FKFFKF,FKKK7<KKK,,7FFF7F,AAFFK,AKA,,FFF<,A7FKK,,,7A,,,7,,,,,((,(,,(,,,,,,,(,,A(,,( MC:Z:80S11M1D58M MD:Z:119 RG:Z:15-0017315_1 NM:i:3 MQ:i:47 AS:i:104 XS:i:103

ST-E00223:46:HG7V5CCXX:2:1116:12601:22862 1123 1 10006 0 81M70S = 10106 143 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTCACCCTAACCCTACCCCTAACCCTAACCCTATCATTCACTCGAACCCTAACACTACCGCTAGCGCTAACTCCAGCCCGCACACTATCGCTAACCCTCACGC AAFFFKKKKKKKKKKKKKKKFKKA,,A,A,,77<,,A,,7FFK7,,,,AF,,,A7,,,77,,,7AA,,7,FFK,<A,,,,7<<,,,,AA,7,AA,,7,,,7,,,,,,,,,,,A7,,,,,,7,7,7(,(,((,,,7,A,F,7,,,<,,AA,, MC:Z:108S43M MD:Z:47C2A12A17 RG:Z:15-0017315_1 NM:i:3 MQ:i:2 AS:i:66 XS:i:71

ST-E00223:32:H5J57CCXX:5:2208:10074:43308 99 1 10008 36 101M1I41M7S = 10107 137 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTCACCCTCACCCTCACCCTAGCCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCG AAFFFKFKKKKKKKKKKKKKKKKFKKKKKKKFKKK7<KA7<F,AF7F,7,A,FA7<KKKKKFKKKK<,,FKAKKAFFA,A7,7,,,7,7,,,,,7F<,,7,,,,7,,A,,,<F7,,,7FA,7,,<F,,,<A,7<AFKK<,<,,7,,,,,( MC:Z:112S38M MD:Z:49A28A5A5A6A44 RG:Z:15-0017315_1 NM:i:6 MQ:i:36 AS:i:110 XS:i:113

ST-E00223:46:HG7V5CCXX:5:2119:12936:64896 99 1 10013 0 90M61S = 10176 211 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCATAACCCAAACCCTAACCCTAACCCTAACCCGAACCGTAAGCCAAAACATAACCACAACCATAACAATAACCAAAACCTTAACGTTAAACAT A<AFFKKKK,AAAFKK<FKKKKAK,<A,AFKKKKAFFFKKKKAFKKFAFFFKKA7,FFA,F,F,7F,AFF77AFFAFFKAFKKA,,FFKF,AA,F,,7F,A,,7A,,,7A,,,,,,AFK,,,AA,,,7FKA,A,AFF,<,,,,,,,77<AA MC:Z:99S48M4S MD:Z:9C49C6T23 RG:Z:15-0017315_1 NM:i:3 MQ:i:0 AS:i:75 XS:i:75

ST-E00223:32:H5J57CCXX:1:1205:17290:54577 99 1 10019 1 92M59S = 10354 414 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCCCACCCCAAACCCAAGCCCAACCCTACCCCCTAACCCCTAACCCCAACCCTGACC AAFFFKKKKFKKKKAKKKKKKKKKKKKKKKF7KKKKKKKKKKKKKFKKKKFKKKKKKKAKKKFFKKFFKKKK,<,7<FA,FFFK,,7FFF,,,,,A,((,(((,7,7A,,,,,A,,,777A,AAKK<<,7FFA7,A,,AA,<,,AFA,7AF MC:Z:72S79M MD:Z:92 RG:Z:15-0017315_1 NM:i:0 MQ:i:20 AS:i:92 XS:i:97

14 of 39

Recall: Edit distance

How many edits (changes) must be made to a word or kmer to make it match (align) to another word or kmer?

CURLED

HURLED

Edit distance = 1. Substitute C for H

SHORT

SHO-T

Edit distance = 1. Delete R

TGTTACGG

GGTTGACTA

?

TG-TT-ACGG

-GGTTGACTA

Edit distance = 5

TGTT-ACGG

GGTTGACTA

Edit distance = 4

15 of 39

The CIGAR string: encode the details of the alignment

Operation

Meaning

M

Match*

D

Deletion w.r.t. reference

I

Insertion w.r.t. reference

N

Split or spliced alignment

S

Soft-clipping

H

Hard-clipping

P

Padding

Reference:

Experimental:

ACCTGTC--TACCTTACG

ACCT-TCCATACTTTATC

4M1D2M2I7M2S

CIGAR string:

4M

1D

2M

2I

7M

2S

LENGTH/OPERATION

16 of 39

The extended CIGAR string: M become = and X

Operation

Meaning

=

Exact match

X

Mismatch

D

Deletion w.r.t. reference

I

Insertion w.r.t. reference

N

Split or spliced alignment

S

Soft-clipping

H

Hard-clipping

P

Padding

Reference:

Experimental:

ACCTGTC--TACCTTACG

ACCT-TCCATACTTTATC

4=1D2=2I3=1X3=2S

CIGAR string:

4=

1D

2=

2I

3=

2S

1X

3=

17 of 39

The FLAG column

ST-E00223:32:H5J57CCXX:6:2123:15189:52872 97 1 10001

ST-E00223:46:HG7V5CCXX:2:1116:12601:22862 1123 1 10006

ST-E00223:32:H5J57CCXX:5:2208:10074:43308 99 1 10008

ST-E00223:46:HG7V5CCXX:5:2119:12936:64896 99 1 10013

ST-E00223:32:H5J57CCXX:1:1205:17290:54577 99 1 10019

ST-E00223:32:H5J57CCXX:6:1115:16844:11013 81 1 10026

ST-E00223:32:H5J57CCXX:7:2113:18935:32356 99 1 10032

ST-E00223:46:HG7V5CCXX:6:2117:3082:44239 99 1 10040

ST-E00223:46:HG7V5CCXX:5:2213:10744:58813 163 1 10074

ST-E00223:32:H5J57CCXX:4:1220:14651:8868 99 1 10086

Sequence ID

FLAG

CHROM

POS

18 of 39

The FLAG score

base2

base10

base16

Meaning

Applies to:

00000000001

1

0x0001

The read originated from a paired sequencing molecule

Both

00000000010

2

0x0002

The read is mapped in a proper pair

Pairs only

00000000100

4

0x0004

The query sequence itself is unmapped

Both

00000001000

8

0x0008

The query’s mate is unmapped

Pairs only

00000010000

16

0x0010

Strand of the query (0 for forward; 1 for reverse strand)

Both

00000100000

32

0x0020

Strand of the query’s mate

Pairs only

00001000000

64

0x0040

The query is the first read in the pair

Pairs only

00010000000

128

0x0080

The read is the second read in the pair

Pairs only

00100000000

256

0x0100

The alignment is not primary

Both

01000000000

512

0x0200

The read fails platform/vendor quality checks

Both

10000000000

1024

0x0400

The read is either a PCR duplicate or an optical duplicate

Both

19 of 39

base2

base10

base16

Meaning

Applies to:

00000000001

1

0x0001

The read originated from a paired sequencing molecule

Both

00000000010

2

0x0002

The read is mapped in a proper pair

Pairs only

00000000100

4

0x0004

The query sequence itself is unmapped

Both

00000001000

8

0x0008

The query’s mate is unmapped

Pairs only

00000010000

16

0x0010

Strand of the query (0 for forward; 1 for reverse strand)

Both

00000100000

32

0x0020

Strand of the query’s mate

Pairs only

00001000000

64

0x0040

The query is the first read in the pair

Pairs only

00010000000

128

0x0080

The read is the second read in the pair

Pairs only

00100000000

256

0x0100

The alignment is not primary

Both

01000000000

512

0x0200

The read fails platform/vendor quality checks

Both

10000000000

1024

0x0400

The read is either a PCR duplicate or an optical duplicate

Both

00001100011

26+25+21+20 = 64+32+2+1 = 99

ST-E00223:32:H5J57CCXX:4:1220:14651:8868 99 1 10086

20 of 39

Use samtools to convert SAM to BAM.

Create BWT of reference genome.

$ bwa index grch38.fa

Align paired-end FASTQ

to BWT index.

$ bwa mem -t 16 grch38.fa 1.fq 2.fq > sample.sam

Output is in SAM format.

Use multiple threads if you have a computer with multiple CPUs.

Convert SAM to BAM

$ samtools view -Sb sample.sam > sample.bam

Output is in BAM format.

However, it is unsorted - that is, random genomic order as reads are randomly placed in FASTQ by sequencer.

This takes a long time, but you do it once

21 of 39

What is the best way to use tons of disk space and have very inefficient analyses?

Text files of billions of alignments

22 of 39

Let's download a SAM file to play with

curl https://home.chpc.utah.edu/~u1007787/HG00096.sam.gz > HG00096.sam.gz

# uncompress the file

gzip -d HG00096.sam.gz

23 of 39

Peek indide. What is all that?

head HG00096.sam

24 of 39

Skip the header. Look at the alignment records

grep -v ^@ HG00096.sam | head

SRR062634.5360 181 20 45834704 0 35M65S = 45834704 0 CACCAATACAAAAACCCCCCACCCAAAAAAGAAAACTCCCTATCCCCTGTTCCTGTTCCATCCCCCACTGGTGGATGAGAGAGGGTCCCGGGAGGGGGGG #################################################################################################A8) XC:i:35 RG:Z:SRR062634

SRR062634.15082 181 20 37991695 0 35M65S = 37991695 0 AAAACTATTAAAAAGGTGTCCAAAAAAGAAAATAAAAAGCAAAGTTGGCTTATTTCCCAAAACCCACAGCCCAAAAAAAAGAAAAAAACCATTAAACAAA #################################################################################################### XC:i:35 RG:Z:SRR062634

25 of 39

SAMTOOLS: Converting and manipulating SAM/BAM

http://www.htslib.org/doc/samtools.html

Commands:

-- Indexing

dict create a sequence dictionary file

faidx index/extract FASTA

index index alignment

-- Editing

calmd recalculate MD/NM tags and '=' bases

fixmate fix mate information

reheader replace BAM header

rmdup remove PCR duplicates

targetcut cut fosmid regions (for fosmid pool only)

addreplacerg adds or replaces RG tags

-- Viewing

flags explain BAM flags

tview text alignment viewer

view SAM<->BAM<->CRAM conversion

depad convert padded BAM to unpadded BAM

26 of 39

Convert a SAM file to BAM for analysis. Essential.

samtools view

samtools view -b HG00096.sam > HG00096.bam

# compare the file sizes of the SAM and BAM files!!

# why is the BAM file so much smaller?

27 of 39

Peek in the BAM file with head

head HG00096.bam

# what is this garbage?!

28 of 39

BAM is binary, not text. Use samtools

samtools view HG00096.bam | head

29 of 39

Look at the chrom and position columns (3,4)

samtools view HG00096.bam | cut -f 3,4 | head

20 39038334

20 39038227

20 8140216

20 8140318

20 26129711

20 26129787

20 38810978

20 38811045

20 32585093

20 32585176

# the alignment records are not sorted by position. Bad! Why?

30 of 39

Sort the BAM file by position to do any analysis.

Need to do this is batches by classroom row!

samtools sort

samtools sort HG00096.bam > HG00096.sorted.bam

samtools view HG00096.sorted.bam | cut -f 3,4 | head -n 5

20 59993

20 59993

20 59999

20 60015

20 60026

31 of 39

Now we must index the sorted BAM file.

This allows tools to rapidly find all alignments associated with any arbitrary genomic locus (e.g., a gene of interest)

samtools index

samtools index HG00096.sorted.bam

ls -ltr

32 of 39

WE ARE READY FOR IGV. BUT WHAT FIRST?

33 of 39

Need to download IGV to our laptops, not malibu!

https://igv.org/doc/desktop/#DownloadPage/

34 of 39

Launch IGV

35 of 39

What are we missing?

36 of 39

We need to transfer the sorted BAM and index from malibu to our laptop

# leave malibu

exit

# use scp (secure copy) to transfer the BAM file from malibu to laptop

scp u1007787@malibu.genetics.utah.edu:~/HG00096.sorted.bam .

# use scp (secure copy) to transfer the BAM file from malibu to laptop

scp u1007787@malibu.genetics.utah.edu:~/HG00096.sorted.bam.bai .

37 of 39

Now we can use File->Load From File in IGV

38 of 39

IGV tutorial

https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial

39 of 39

Reading assignment before Tuesday

Pages 30-35 of section 1.3 of Jonathan Protchard’s book before class on Tuesday. 20 minutes of reading.

https://web.stanford.edu/group/pritchardlab/HGbook/Release-2023-09/HGBook-2023-09-chapters/HGBook-2023-09-23-ch1.3.pdf