Variant Calling with GATK
Mohammed Khalfan
NGS Bioinformatics Workshop
NYU
January 2018
What is Variant Calling?
Identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next-generation sequencing data.
Plays an important role in scientific discovery.
Conceptually simple:
But in practice, things usually look more like this:
The key challenge with NGS data is distinguishing which mismatches represent real mutations and which are just noise.
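One way to make the "real mutation or noise" question concrete is to compare how likely the observed mismatches are under different genotypes. The sketch below is a toy binomial model, not GATK's actual caller: it assumes a diploid sample, independent reads, and a flat per-base error rate of 1%. The function names and the error rate are illustrative assumptions.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def site_likelihoods(alt_reads: int, depth: int, error_rate: float = 0.01) -> dict:
    """Likelihood of the observed mismatch count under three diploid genotypes."""
    return {
        # Every mismatch is a sequencing error.
        "hom_ref": binom_pmf(alt_reads, depth, error_rate),
        # About half of the reads carry the alternate allele.
        "het": binom_pmf(alt_reads, depth, 0.5),
        # Nearly all reads carry the alternate allele.
        "hom_alt": binom_pmf(alt_reads, depth, 1 - error_rate),
    }


def most_likely_genotype(alt_reads: int, depth: int, error_rate: float = 0.01) -> str:
    """Pick the genotype with the highest likelihood (no prior applied)."""
    likelihoods = site_likelihoods(alt_reads, depth, error_rate)
    return max(likelihoods, key=likelihoods.get)
```

Under these assumptions, a single mismatch among 30 reads is best explained as noise, while 14 mismatches among 30 reads is best explained as a heterozygous variant.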
Genome Analysis Toolkit (GATK)
Modules we will use
Pre Processing
Raw data (typically FASTQ files) are not immediately usable for variant discovery analysis. The first phase of the workflow includes the pre-processing steps that are necessary to get your data from raw FASTQ files to an analysis-ready BAM file.
1) Alignment
Read Groups
@RG ID:sample_1 LB:sample_1 PL:ILLUMINA PM:HISEQ SM:sample_1
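The read group line above is shown with spaces, but in a SAM header the fields are separated by tabs. A minimal sketch of building and parsing such a line follows. The helper names are hypothetical, and reusing the sample name for ID, LB, and SM simply mirrors the example above.

```python
def read_group_header(sample: str, platform: str = "ILLUMINA", model: str = "HISEQ") -> str:
    """Build a tab-separated SAM @RG header line for one sample."""
    fields = {
        "ID": sample,    # read group identifier
        "LB": sample,    # library
        "PL": platform,  # sequencing platform
        "PM": model,     # platform model
        "SM": sample,    # sample name
    }
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


def parse_read_group(line: str) -> dict:
    """Split an @RG header line back into a tag -> value mapping."""
    tag_values = line.rstrip("\n").split("\t")[1:]  # drop the leading "@RG"
    return dict(item.split(":", 1) for item in tag_values)


header = read_group_header("sample_1")
print(header.replace("\t", " "))
```

The print statement replaces tabs with spaces only for display, which reproduces the line shown on the slide.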
2) Sort and Convert to BAM
3) Alignment Metrics
4) Mark Duplicates
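The idea behind duplicate marking can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular tool: it assumes single-end reads, treats reads with the same chromosome, 5' start position, and strand as copies of one fragment, and keeps the copy with the highest summed base quality. The Read type and its fields are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int         # 5' alignment position
    strand: str        # "+" or "-"
    base_quals: tuple  # Phred scores, one per base


def mark_duplicates(reads) -> set:
    """Return the names of reads flagged as duplicates."""
    # Group reads that appear to come from the same original fragment.
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest total base quality; flag the rest.
        best = max(group, key=lambda r: sum(r.base_quals))
        duplicates.update(r.name for r in group if r is not best)
    return duplicates
```

Duplicates are flagged rather than deleted, so downstream tools can choose to ignore them while the original data stays intact.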
5) Prepare reference dictionary, FASTA index, and BAM index
6) Base Quality Score Recalibration (BQSR)
What are quality scores?
Example of average quality score at each position in the read, for all reads in a library (output from FastQC).
Individual quality scores (blue bars) for each position in a single read. The horizontal blue line represents the Q20 Phred score value.
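A Phred quality score Q encodes the estimated probability P that a base call is wrong, with Q = -10 * log10(P). The sketch below shows the conversion in both directions and how to decode a FASTQ quality string. It assumes the Phred+33 ASCII encoding, and the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]
```

With this scale, Q20 corresponds to an estimated 1 in 100 chance of error and Q30 to 1 in 1,000.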
Why do we care about quality scores so much?
Why Recalibrate?
Base quality score recalibration (BQSR) is a process in which we apply machine learning to empirically model the systematic errors the sequencer makes when assigning base quality scores, and adjust the scores accordingly.
This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.
How does BQSR work?
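The core idea can be sketched as follows. This is a heavily simplified illustration, not GATK's model: it bins bases by reported quality score only (the real tool uses more covariates), skips positions in a set of known variant sites, and treats every remaining mismatch as a sequencing error. The function name, input format, and add-one smoothing are assumptions made for this example.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites) -> dict:
    """Map each reported quality score to an empirically observed one.

    observations: iterable of (position, reported_q, is_mismatch) tuples.
    known_sites: set of positions where real variation is expected.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [bases seen, mismatches]
    for position, reported_q, is_mismatch in observations:
        if position in known_sites:
            continue  # a mismatch here may be a real variant, so do not count it
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)

    table = {}
    for reported_q, (n_bases, n_errors) in counts.items():
        # Add-one smoothing keeps the error rate above zero, so log10 is defined.
        error_rate = (n_errors + 1) / (n_bases + 2)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table
```

If bases reported as Q30 mismatch the reference about 1% of the time outside known sites, this table maps Q30 to Q20, meaning the sequencer was overconfident for that bin.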
What if you don’t have a set of known variants?
Variant Discovery
Once we have a set of recalibrated reads, we can move on to the variant discovery phase.
2) Filter Variants
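Hard filtering can be illustrated with a small sketch that checks a variant's annotations against fixed thresholds. The annotation names (QD, FS, MQ) and cutoffs below are assumptions chosen for illustration and are not taken from these slides. Thresholds should be tuned to the data set.

```python
# Illustrative SNP thresholds (assumed for this example, not from the slides).
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,   # low variant quality relative to depth
    "FS": lambda v: v > 60.0,  # strong strand bias
    "MQ": lambda v: v < 40.0,  # poor mapping quality
}


def failed_filters(info: dict) -> list:
    """Names of the filters that a variant's annotations fail."""
    return [
        name
        for name, fails in SNP_FILTERS.items()
        if name in info and fails(info[name])
    ]


def filter_status(info: dict) -> str:
    """Value for the VCF FILTER column: PASS or the failed filter names."""
    failed = failed_filters(info)
    return "PASS" if not failed else ";".join(failed)
```

Variants that fail are labeled rather than removed, so the reason for each exclusion stays visible in the output.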
3) Annotation
4) Visualization - IGV