
Variant Calling with GATK

Mohammed Khalfan

NGS Bioinformatics Workshop

NYU

January 2018


What is Variant Calling?

Identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next-generation sequencing data.

Plays an important role in scientific discovery.

Conceptually simple:

But in practice, things usually look more like this:


The key challenge with NGS data is distinguishing which mismatches represent real mutations and which are just noise.


Genome Analysis Toolkit (GATK)

  • Developed by the Broad Institute
  • Industry standard for identifying SNPs and indels in germline DNA and RNA-seq data
  • In addition to the variant callers themselves, GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data.


Modules we will use

  • GATK
  • BWA
  • Picard-tools
  • Samtools
  • SnpEff
  • Tabix (part of HTSlib)
  • IGV


Pre Processing

Raw data (typically FASTQ files) are not immediately usable for variant discovery analysis. The first phase of the workflow includes the pre-processing steps that are necessary to get your data from raw FASTQ files to an analysis-ready BAM file.

  • Align reads to reference
  • Sort SAM file (output from alignment) and convert to BAM
  • Collect alignment metrics
  • Mark duplicates
  • Prepare reference dictionary, fasta index, and bam index
  • Recalibrate base quality scores


1) Alignment

  • BWA-MEM
  • Recommended by Broad
  • Better for variant calling
  • Recommended for longer reads, i.e. 70 bp and up
  • Need to use read groups


Read Groups

  • Used to identify reads and where they came from
  • GATK will not work without read groups
  • We apply read group data during the alignment step

@RG ID:sample_1 LB:sample_1 PL:ILLUMINA PM:HISEQ SM:sample_1
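In practice the read group is supplied to BWA-MEM at alignment time via `-R`, with the fields separated by literal `\t`. A sketch of the alignment step (the sample name and filenames are placeholders):

```shell
# Align paired-end reads with BWA-MEM, attaching the read group shown above.
# -M marks shorter split hits as secondary, for compatibility with Picard.
bwa mem -M \
    -R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' \
    ref.fa sample_1_R1.fastq sample_1_R2.fastq > aligned_reads.sam
```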


2) Sort and Convert to BAM

  • Downstream steps require data to be sorted by coordinate and in BAM format
  • We use Picard Tools
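A sketch of this step using Picard's SortSam tool (the `java -jar picard.jar` form is one common invocation; filenames are placeholders):

```shell
# Sort the SAM by coordinate and write BAM output in one step.
java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate
```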


3) Alignment Metrics

  • Samtools flagstat
  • Gives us some basic alignment stats such as:
    • Total # of reads
    • # of reads aligned
    • # of reads properly paired
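The metrics step is a single command (filename is a placeholder):

```shell
# Prints basic alignment statistics: total reads, reads mapped,
# reads properly paired, duplicates, etc.
samtools flagstat sorted_reads.bam
```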


4) Mark Duplicates

  • A single DNA fragment may be sequenced several times
  • Example: duplicates can arise during library construction using PCR
  • These duplicate reads are not informative and cannot be considered as evidence for or against a putative variant
  • Without this step, you risk having over-representation in your sequence of areas preferentially amplified during PCR
  • Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument (Optical Duplicate)

  • We use Picard Tools to mark duplicates
  • Note: Picard tools doesn’t remove duplicates, it simply marks duplicate reads as such in the SAM record
  • Downstream GATK tools will ignore reads flagged as duplicates by default
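A sketch of the duplicate-marking step with Picard (filenames are placeholders; the metrics file is a required output summarizing duplication rates):

```shell
# Flag duplicate reads in the BAM; reads are marked, not removed.
java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=dup_metrics.txt
```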


5) Prepare reference dictionary, fasta index, and bam index

  • In order to run GATK, we need to build a reference dictionary, a fasta index, and a bam index
  • We use Picard Tools to build the reference dictionary for GATK. The input for this is the reference fasta sequence, the output is a .dict file
  • We use samtools to build the fasta index (.fai)
  • We also use samtools to build the bam index (.bai)
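The three index-building commands, sketched (filenames are placeholders):

```shell
# Reference dictionary (produces ref.dict) with Picard:
java -jar picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict

# FASTA index (produces ref.fa.fai) with samtools:
samtools faidx ref.fa

# BAM index (produces dedup_reads.bam.bai) with samtools:
samtools index dedup_reads.bam
```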


6) Base Quality Score Recalibration (BQSR)

What are quality scores?


  • Per-base estimates of error emitted by the sequencer
  • Expresses the level of confidence for each base called
  • Use standard Phred scores: Q20 is a common cutoff for high quality and represents 99% certainty that a base was called correctly
  • 99% certainty means 1 out of 100 bases is expected to be wrong.
  • In a small dataset of 1M reads with a read length of 50, this means 50M bases total. At 99% confidence, that is 500,000 potentially erroneous bases.
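The arithmetic can be checked directly: a Phred score Q corresponds to an error probability of 10^(-Q/10), so 50M bases at Q20 carry 50M × 0.01 expected errors.

```shell
# Error probability for Q20: 10^(-20/10) = 0.01, i.e. 1 in 100.
awk 'BEGIN { printf "P(error at Q20) = %g\n", 10^(-20/10) }'

# 1M reads x 50 bp = 50M bases; at Q20 that is 50M * 0.01 expected errors.
awk 'BEGIN { printf "expected errors = %d\n", 1000000 * 50 * 10^(-20/10) }'
```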


Example of average quality score at each position in the read, for all reads in a library (output from FastQC)


Individual quality scores (blue bars) for each position in a single read. The horizontal blue line represents the Q20 Phred score value


Why do we care about quality scores so much?

  • Variant calling algorithms rely on the quality score assigned to the individual base calls
  • Tells us how much we can trust that particular observation to inform us about the biological truth of the site
  • If a base call has a low quality score, that means we're not sure we actually read that base correctly, and it could actually be something else
  • So we won't trust it as much as other base calls that have higher qualities
  • We use that score to weigh the evidence that we have for or against a variant existing at a particular site


Why Recalibrate?

  • Scores produced by the machines are subject to various sources of systematic technical error
  • Leads to over- or under-estimated base quality scores in the data.
  • Errors can arise from the physics or chemistry of the sequencing reaction, or from manufacturing flaws in the equipment.


Why Recalibrate?

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly.


This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.


How does BQSR work?

  1. You provide GATK BaseRecalibrator with a set of known variants.
  2. GATK BaseRecalibrator analyzes all reads looking for mismatches between the read and reference, skipping positions included in the set of known variants (from step 1).
  3. GATK BaseRecalibrator computes statistics on the mismatches (identified in step 2) based on the reported quality score, the position in the read, and the sequencing context (e.g. preceding and current nucleotide).
  4. Based on the statistics computed in step 3, an empirical quality score is assigned to each mismatch, overwriting the original reported quality score.
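These steps map onto two commands; a sketch assuming GATK4's command-line syntax (GATK3 invoked the same tools as `-T BaseRecalibrator` and `-T PrintReads`; filenames are placeholders):

```shell
# Model systematic errors, skipping positions in the known-variants VCF.
gatk BaseRecalibrator \
    -R ref.fa -I dedup_reads.bam \
    --known-sites known_variants.vcf \
    -O recal_data.table

# Apply the model, writing a BAM with recalibrated base quality scores.
gatk ApplyBQSR \
    -R ref.fa -I dedup_reads.bam \
    --bqsr-recal-file recal_data.table \
    -O recal_reads.bam
```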


What if you don’t have a set of known variants?


Variant Discovery

Once we have a set of recalibrated reads, we can move on to the variant discovery phase:

  • Call variants
  • Filter variants
  • Annotation
  • Visualization


1) Call Variants

  • We use the GATK HaplotypeCaller tool
  • This step is designed to maximize sensitivity in order to minimize false negatives, i.e. failing to identify real variants
  • Creates a single file with both SNPs and indels
  • We extract each type of variant into its own file so we can process them individually
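A sketch of the calling and extraction steps, assuming GATK4 syntax (filenames are placeholders):

```shell
# Call SNPs and indels together into a single raw VCF.
gatk HaplotypeCaller -R ref.fa -I recal_reads.bam -O raw_variants.vcf

# Extract SNPs and indels into separate files for independent filtering.
gatk SelectVariants -R ref.fa -V raw_variants.vcf \
    --select-type-to-include SNP -O raw_snps.vcf
gatk SelectVariants -R ref.fa -V raw_variants.vcf \
    --select-type-to-include INDEL -O raw_indels.vcf
```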


2) Filter Variants

  • The first step is designed to maximize sensitivity and is thus very lenient in calling variants
  • Good because it minimizes the chance of missing real variants
  • But means that we need to filter the raw call set in order to reduce the amount of false positives
  • Important in order to obtain the highest-quality call set possible
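A sketch of hard-filtering the SNP call set, assuming GATK4 syntax; the thresholds shown are illustrative, not universal recommendations, and filenames are placeholders:

```shell
# Flag (not remove) SNPs that fail basic hard-filter thresholds.
# Failing records get "basic_snp_filter" in the FILTER column; passing ones get PASS.
gatk VariantFiltration -R ref.fa -V raw_snps.vcf \
    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filter-name "basic_snp_filter" \
    -O filtered_snps.vcf
```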


3) Annotation

  • We use SnpEff
  • Annotates and predicts the effects of variants on genes
    • Codon changes
    • Amino acid changes
    • Genomic region
    • Functional effect (silent, missense)
  • SnpEff has pre-built databases for thousands of genomes
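A sketch of the annotation step; `genome_db` is a placeholder for the name of a pre-built SnpEff database, and filenames are placeholders:

```shell
# Annotate variants against a pre-built SnpEff database; the annotated
# VCF is written to stdout.
java -jar snpEff.jar -v genome_db filtered_snps.vcf > filtered_snps_ann.vcf

# Compress and index the annotated VCF with bgzip/tabix (HTSlib),
# e.g. for loading into IGV or other downstream tools.
bgzip filtered_snps_ann.vcf
tabix -p vcf filtered_snps_ann.vcf.gz
```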


4) Visualization - IGV