
Variant Calling with GATK

Mohammed Khalfan

NGS Bioinformatics Workshop

NYU

January 2018


What is Variant Calling?

Identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next-generation sequencing data.

Plays an important role in scientific discovery.

Conceptually simple:

But in practice, things usually look more like this:


The key challenge with NGS data is distinguishing which mismatches represent real mutations and which are just noise.


Genome Analysis Toolkit (GATK)

  • Developed by the Broad Institute
  • Industry standard for identifying SNPs and indels in germline DNA and RNA-seq data
  • In addition to the variant callers themselves, GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data.


Modules we will use

  • GATK
  • BWA
  • Picard-tools
  • Samtools
  • SnpEff
  • Tabix (part of HTSlib)
  • IGV


Pre Processing

Raw data (typically FASTQ files) are not immediately usable for variant discovery analysis. The first phase of the workflow includes the pre-processing steps that are necessary to get your data from raw FASTQ files to an analysis-ready BAM file.

  • Align reads to reference
  • Sort SAM file (output from alignment) and convert to BAM
  • Collect alignment metrics
  • Mark duplicates
  • Prepare reference dictionary, fasta index, and bam index
  • Recalibrate base quality scores


1) Alignment

  • BWA-MEM
  • Recommended by Broad
  • Better for variant calling
  • Recommended for longer reads, i.e. 70 bp and up
  • Need to use read groups


Read Groups

  • Used to identify reads and where they came from
  • GATK will not work without read groups
  • We apply read group data during the alignment step

@RG ID:sample_1 LB:sample_1 PL:ILLUMINA PM:HISEQ SM:sample_1
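In practice the read group is supplied to BWA-MEM at alignment time via `-R`, with the fields separated by literal `\t`. A sketch of the alignment step (the sample name and filenames are placeholders):

```shell
# Align paired-end reads with BWA-MEM, attaching the read group shown above.
# -M marks shorter split hits as secondary, for compatibility with Picard.
bwa mem -M \
    -R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' \
    ref.fa sample_1_R1.fastq sample_1_R2.fastq > aligned_reads.sam
```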


2) Sort and Convert to BAM

  • Downstream steps require data to be sorted by coordinate and in BAM format
  • We use Picard Tools
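A sketch of this step using Picard's SortSam tool (the `java -jar picard.jar` form is one common invocation; filenames are placeholders):

```shell
# Sort the SAM by coordinate and write BAM output in one step.
java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate
```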


3) Alignment Metrics

  • Samtools flagstat
  • Gives us some basic alignment stats such as:
    • Total # of reads
    • # of reads aligned
    • # of reads properly paired
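The metrics step is a single command (filename is a placeholder):

```shell
# Prints basic alignment statistics: total reads, reads mapped,
# reads properly paired, duplicates, etc.
samtools flagstat sorted_reads.bam
```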


4) Mark Duplicates

  • A single DNA fragment may be sequenced several times
  • Example: duplicates can arise during library construction using PCR
  • These duplicate reads are not informative and cannot be considered as evidence for or against a putative variant
  • Without this step, you risk having over-representation in your sequence of areas preferentially amplified during PCR
  • Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument (Optical Duplicate)

  • We use Picard Tools to mark duplicates
  • Note: Picard tools doesn’t remove duplicates, it simply marks duplicate reads as such in the SAM record
  • Downstream GATK tools will ignore reads flagged as duplicates by default
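A sketch of the duplicate-marking step with Picard (filenames are placeholders; the metrics file is a required output summarizing duplication rates):

```shell
# Flag duplicate reads in the BAM; reads are marked, not removed.
java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=dup_metrics.txt
```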


5) Prepare reference dictionary, fasta index, and bam index

  • In order to run GATK, we need to build a reference dictionary, a fasta index, and a bam index
  • We use Picard Tools to build the reference dictionary for GATK. The input for this is the reference fasta sequence, the output is a .dict file
  • We use samtools to build the fasta index (.fai)
  • We also use samtools to build the bam index (.bai)
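The three index-building commands, sketched (filenames are placeholders):

```shell
# Reference dictionary (produces ref.dict) with Picard:
java -jar picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict

# FASTA index (produces ref.fa.fai) with samtools:
samtools faidx ref.fa

# BAM index (produces dedup_reads.bam.bai) with samtools:
samtools index dedup_reads.bam
```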


6) Base Quality Score Recalibration (BQSR)

What are quality scores?


  • Per-base estimates of error emitted by the sequencer
  • Expresses the level of confidence for each base called
  • Use standard Phred scores: Q20 is a common cutoff for high quality and represents 99% certainty that a base was called correctly
  • 99% certainty means 1 out of 100 bases is expected to be wrong.
  • In a small dataset of 1M reads with a read length of 50, this means 50M bases total. At 99% confidence, that is 500,000 potentially erroneous bases.
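The arithmetic can be checked directly: a Phred score Q corresponds to an error probability of 10^(-Q/10), so 50M bases at Q20 carry 50M × 0.01 expected errors.

```shell
# Error probability for Q20: 10^(-20/10) = 0.01, i.e. 1 in 100.
awk 'BEGIN { printf "P(error at Q20) = %g\n", 10^(-20/10) }'

# 1M reads x 50 bp = 50M bases; at Q20 that is 50M * 0.01 expected errors.
awk 'BEGIN { printf "expected errors = %d\n", 1000000 * 50 * 10^(-20/10) }'
```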


Example of average quality score at each position in the read, for all reads in a library (output from FastQC)


Individual quality scores (blue bars) for each position in a single read. The horizontal blue line represents the Q20 Phred score value


Why do we care about quality scores so much?

  • Variant calling algorithms rely on the quality score assigned to the individual base calls
  • Tells us how much we can trust that particular observation to inform us about the biological truth of the site
  • If a base call has a low quality score, that means we're not sure we actually read that base correctly, and it could actually be something else
  • So we won't trust it as much as other base calls that have higher qualities
  • We use that score to weigh the evidence that we have for or against a variant existing at a particular site


Why Recalibrate?

  • Scores produced by the machines are subject to various sources of systematic technical error
  • Leads to over- or under-estimated base quality scores in the data.
  • Errors can arise from the physics or chemistry of the sequencing reaction, or from manufacturing flaws in the equipment.


Why Recalibrate?

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly.


This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.


How does BQSR work?

  1. You provide GATK BaseRecalibrator with a set of known variants.
  2. GATK BaseRecalibrator analyzes all reads looking for mismatches between the read and reference, skipping positions included in the set of known variants (from step 1).
  3. GATK BaseRecalibrator computes statistics on the mismatches (identified in step 2) based on the reported quality score, the position in the read, and the sequencing context (e.g. preceding and current nucleotide).
  4. Based on the statistics computed in step 3, an empirical quality score is assigned to each mismatch, overwriting the original reported quality score.
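These steps map onto two commands; a sketch assuming GATK4's command-line syntax (GATK3 invoked the same tools as `-T BaseRecalibrator` and `-T PrintReads`; filenames are placeholders):

```shell
# Model systematic errors, skipping positions in the known-variants VCF.
gatk BaseRecalibrator \
    -R ref.fa -I dedup_reads.bam \
    --known-sites known_variants.vcf \
    -O recal_data.table

# Apply the model, writing a BAM with recalibrated base quality scores.
gatk ApplyBQSR \
    -R ref.fa -I dedup_reads.bam \
    --bqsr-recal-file recal_data.table \
    -O recal_reads.bam
```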


What if you don’t have a set of known variants?


Variant Discovery

Once we have a set of recalibrated reads, we can move on to the variant discovery phase:

  • Call variants
  • Filter variants
  • Annotation
  • Visualization


1) Call Variants

  • We use the GATK HaplotypeCaller tool
  • This step is designed to maximize sensitivity in order to minimize false negatives, i.e. failing to identify real variants
  • Creates a single file with both SNPs and indels
  • We extract each type of variant into its own file so we can process them individually
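A sketch of the calling and extraction steps, assuming GATK4 syntax (filenames are placeholders):

```shell
# Call SNPs and indels together into a single raw VCF.
gatk HaplotypeCaller -R ref.fa -I recal_reads.bam -O raw_variants.vcf

# Extract SNPs and indels into separate files for independent filtering.
gatk SelectVariants -R ref.fa -V raw_variants.vcf \
    --select-type-to-include SNP -O raw_snps.vcf
gatk SelectVariants -R ref.fa -V raw_variants.vcf \
    --select-type-to-include INDEL -O raw_indels.vcf
```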


2) Filter Variants

  • The first step is designed to maximize sensitivity and is thus very lenient in calling variants
  • Good because it minimizes the chance of missing real variants
  • But means that we need to filter the raw call set in order to reduce the amount of false positives
  • Important in order to obtain the highest-quality call set possible
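A sketch of hard-filtering the SNP call set, assuming GATK4 syntax; the thresholds shown are illustrative, not universal recommendations, and filenames are placeholders:

```shell
# Flag (not remove) SNPs that fail basic hard-filter thresholds.
# Failing records get "basic_snp_filter" in the FILTER column; passing ones get PASS.
gatk VariantFiltration -R ref.fa -V raw_snps.vcf \
    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filter-name "basic_snp_filter" \
    -O filtered_snps.vcf
```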


3) Annotation

  • We use SnpEff
  • Annotates and predicts the effects of variants on genes
    • Codon changes
    • Amino acid changes
    • Genomic region
    • Functional effect (silent, missense)
  • SnpEff has pre-built databases for thousands of genomes
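A sketch of the annotation step; `genome_db` is a placeholder for the name of a pre-built SnpEff database, and filenames are placeholders:

```shell
# Annotate variants against a pre-built SnpEff database; the annotated
# VCF is written to stdout.
java -jar snpEff.jar -v genome_db filtered_snps.vcf > filtered_snps_ann.vcf

# Compress and index the annotated VCF with bgzip/tabix (HTSlib),
# e.g. for loading into IGV or other downstream tools.
bgzip filtered_snps_ann.vcf
tabix -p vcf filtered_snps_ann.vcf.gz
```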


4) Visualization - IGV