Variant Detection (Model organism)
Mahtab Mirmomeni & Andrew Lonie
In this tutorial we compare the performance of three statistically-based variant detection tools:
Each of these tools takes as its input a BAM file of aligned reads and generates a list of likely variants in VCF format
Background [15 min]
Read the background to the workshop here
Where is the data in this tutorial from?
The data has been produced from human whole genomic DNA. Only reads that have mapped to a part of chromosome 20 have been used, to make the data suitable for an interactive tutorial. There are about one million 100bp reads in the dataset, produced on an Illumina HiSeq2000. This data was generated as part of the 1000 Genomes project: http://www.1000genomes.org/
Preparation [10 min]
You can do this in a few ways, of which by far the easiest is:
NOTE: If you log out of Galaxy and log back at a later time your data and results from previous experiments will be available in the right panel of your screen called the ‘History’
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_ADVNCD_Prep
Section 1: Quality Control [5 mins]
The aim here is to evaluate the quality of the short data. If the quality is poor, then adjustments can be made - eg trimming the short reads, or adjusting your expectations of the final outcome!
From the left hand tool panel in Galaxy, select NGS: QC and manipulation -> FASTQC
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec1
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec1.tar.gz
Section 2: Alignment [15 mins]
The basic process here is to map each of the individual reads in the sample FASTQ readsets to a reference genome, so that we can then identify the sequence changes with respect to the reference genome.
Some of the variant callers need extra information regarding the source of reads in order to identify the correct error profiles to use in their statistical variant detection model, so we add more information into the alignment step so that that generated BAM file contains the metadata the variant caller expects.
NGS: Mapping>Map with BWA for Illumina [2-3mins]
NGS: Sam Tools>SAM-to-BAM [1 min]
Scroll around and zoom in and out in the IGV genome viewer to get a feel for genomic data. Note that coverage is variable, with some regions getting almost no coverage (e.g. try chr20:1,870,686-1,880,895 - if you zoom right in to base resolution you’ll see that this region is very GC rich, meaning it’s hard to sequence. Unfortunately it also contains the first few exons of a gene...)
The next steps can be computationally intensive if performed on the entire genome. We can limit the analysis to the region that we know our data comes from, the first part of chromosome 20
This will generate an interval file specifying the first 2mb of chromosome 20, which we’ll use in the following steps
NGS: GATK2 Tools>Depth of Coverage [1-2 min]
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec2
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec2.tar.gz
Section 3. Calling single nucleotide variations with mpileup [15 min]
Mpileup is a Bayesian variant genotyper which assesses the likelihood of each possible genotype for each position in the reference genome, given the observed reads at that position, and reports back the list of variants (positions where a genotype different to homozygous reference was called).
NGS:SAM Tools>Mpileup [1-2 min]
Mpileup creates a BCF (binary VCF) file which contains the list of variations in a compressed format
NGS:SAM Tools> bcftools view
IGV displays VCF variants as coloured bars; homozygotes are solid bars, heterozygotes are two-colour representing the two alleles (A=green, C=blue, G=yellow, T=red). If you hover over a variant bar, IGV displays info about that variant including number of bases supporting the variant call, quality of variant call, etc. Generally, homozygotes are easier to call correctly than heterozygotes.
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec3
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec3.tar.gz
Section 4. Local realignment [15 min]
Alignment to large genomes is a compromise between speed and accuracy. This is often a problem where indels are present - reads with indels are computationally difficult to align correctly and often are misaligned as a compromise to speed. This is one of the major sources of false positive variant errors. We can improve the performance of variant callers by doing a further step of ‘realignment’ after the initial alignment, focussing on just those areas with potential indels, and spending more time trying to find the right alignment that works for all the reads in that area using a slower but more accurate alignment algorithm. Because we now know the genomic area that the reads come from, we can use information from a group of reads to decide about a difference to the reference genome; this cannot be done at the initial alignment step, when reads are considered individually.
Both GATK and FreeBayes will benefit from local realignment around indels. Mpileup does its own adjustments for indels.
NGS: GATK2 Tools> Realigner Target Creator
NGS: GATK2 Tools> Indel Realigner
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec4
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec4.tar.gz
Section 5. FreeBayes [10 min]
FreeBayes is a Bayesian variant caller which aggressively calls all possible variants, leaving filtering to the user. FreeBayes generates a variant Quality score (as do all variant callers) which can be used for filtering.
NGS: Variation Detection -> Free Bayes
Note that FreeBayes does not do any filtering, but rather leaves the user to filter based on e.g. variant quality scores. So it returns many more variants, many of which will be low quality. In other words, FreeBayes will call a potential variant based on much less evidence than MPileup.
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec5
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec5.tar.gz
Section 6. GATK Unified Genotyper [10 min]
The GATK Unified Genotyper is a Bayesian variant caller and genotyper from the Broad Institute. Many users consider the GATK to be best practice in human variant calling.
NGS: GATK2 Tools ->Unified Genotyper
We now have three different variant call sets from the same data. Browse around in IGV looking at places where they are different: eg look at: chr20:1,127,767-1,127,906
Here each of the variant callers has given a different SNP call. This is a confusing region for SNP callers due to the 15bp deletion, which may or may not be on both alleles.
We could immediately improve the FreeBayes callset by filtering on variant quality scores. If you want, you can do this with the NGS: GATK>Select Variants tool, choose the FreeBayes vcf file, click ‘Add new criteria to use when selecting data.’, use ‘QUAL>50’, Execute)
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec6
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec6.tar.gz
Section 7. Evaluate known variations [15 min]
How can we evaluate our variants?
We know a lot about variation in humans from many empirical studies, including the 1000Genomes project, so we have some expectations on what we should see when we detect variants in a new sample:
We can immediately see that each of the variant callers has more variants than we would have expected - we would have expected around 2000 in our 2 megabase region but we see between 3000 and 5000. This is normal for variant calling, where most callers err on the side of sensitivity to reduce false negatives (missed SNVs), expecting the user to do further filtering to remove false positives (false SNVs).
NGS: GATK2 Tools -> Eval Variants
Note that the Ti/Tv for all callsets is around 2.3 for the dbSNP concordant variants, but varies widely for the novel variants (Mpileup:2.32; FreeBayes: 2.25; GATK: 2.26). This is as expected, as these are almost certainly true variant sites.
The Ti/Tv ratio for the Mpileup callset (input_0) looks pretty convincing: it’s 2.32 for all variants, 2.32 for the dbSNP (what we consider very likely to be true variants) and 2.32 for the novel variants. This is a very convincing set of numbers, but it’s worth noting that Mpileup called the fewest number of variants. FreeBayes, which called the most variants, has a Ti/Tv ratio for novel variants that looks very much like random noise - a Ti/Tv of 1 is what you would expect from variants with overall random base changes, which is not what we expect to see.
It is of course easy to call the highly confident variants, the trick is in calling ALL the variants, including ones on the margin of false positive/false negative.
So although we have a lot of confidence in the set of variants called by Mpileup, we might want to assure ourselves that this callset is exhaustive and we don’t have a high false negative rate. THIS is very hard to do!
One way to work out how many of the variants are in common between the call sets is to produce a Venn diagram to visualise the overlap. We can do this in Galaxy by aligning our three variant call sets using the chromosomal coordinate of each SNP, in the second column of each VCF file.
First though, we need to remove the headers from each file, leaving us just those lines with a valid position value.
Remove the header lines of each of the three VCF files
This will remove any lines starting with a ‘#’ character – in other words, all the headers.
Unfortunately the Venn diagram plotting function of Galaxy does not overlay values onto the chart but instead provides them underneath. To interpret these, you need to understand the symbols used:
For example, the line “Mpileup ∩ GATK \ FreeBayes - 356” means that 356 SNPs were identified by Mpileup and GATK but not by FreeBayes. For a more detailed explanation of set terminology, see here. Also note that it is a ‘proportional Venn’ – in other words, the relative sizes of the circles and area of overlap is proportional to the number of values in each.
The diagram below sums up the output in a more conventional Venn diagram format. The three callers agree on 2677 variants, and GATK and Mpileup broadly agree on most variants, differing only in ~240 total (i.e. 10%).
FreeBayes has called many more SNPs with most of its calls probably being false positives. If we filter the FreeBayes variants to remove all those with a quality score of less than 50, we get a more convincing intersection agreement, with the filtered FreeBayes set only containing seven SNPs not identified by GATK or Mpileup.
The way we have done this is to eliminate most of the false positives. Unfortunately it looks like we might have also removed a number of true positives, with 200 fewer SNPs in the three-way intersection.
This is the fundamental issue with variant calling: it’s a probabilistic process and the ‘truth’ set of variants is very hard to know!
Generally we would do some more filtering of the call sets using various approaches including tailored filtering on known characteristics of poor variant calls such as PCR duplicates and strand bias. If we have lots of variants we can train a classifier on known good variants (e.g. those that we find in dbsnp) to recognise the characteristics of those variants and use the classifier to try to distinguish good variant calls from bad.
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec7
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec7.tar.gz
Section 8. Annotation [15 min]
Detected variants can be annotated against all registered variants in a pregenerated variant annotation database such as ENSEMBL. The resulting annotation will show the detected variant’s genomic location and whether the variant detected has happened in a gene, transcript or other parts of the genome. The ENSEMBL Variant Effect Predictor hosted on EBI has been used in this tutorial as the reference database.
NGS: Annotation>Ensembl variant effect predictor
View the output of the annotation step
At this stage the most interesting columns are ‘Consequence’ and ‘Extra’. ‘Consequence’ is a simple prediction of the type of variant by position. In most cases the variants are in intergenic regions, and the effect is of course very hard to predict.
In the first few variants we can see a missense mutation though: chr20:76771
View this variant in IGV and confirm that it is in a gene, and causes an amino acid change. Is it a heterozygotic or homozygotic variant?
The ‘Extra’ column, for potentially deletarious mutations, contains a prediction of how damaging that mutation would be to the protein product. The predictive algorithms are ‘SIFT’ and ‘Polyphen’.
Filter and Sort>Filter
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec8
Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec8.tar.gz
Section 9. Wrapup [30 min]
To finish, we’ll compare our variants against a ‘truth’ set and get a feel for how good our variant calling is and what we’ve missed.
The truth set we will use has been generated using Mpileup on a very deep sequencing of the same NA12878 sample by the Broad Institute.
The deeply sequenced dataset identifies 3220 variants. Note that Mpileup has a bias against variants in areas of indels, so may miss some variants there. Intersecting the callsets, we see that GATK has the most in common with our deeply sequenced callset (3018=7+91=3108 in common), but also the most unique (and therefore potentially false positive) calls. But, of course, our ‘truth’ set is not necessarily true (or, at least, exhaustive)! It was itself called by a particular genotyper, which has it’s own biases...
Sites where calling variants is unclear & differs between callers:
chr20:1,941,090-1,941,234 - indel next to variant, therefore ignored by Mpileup
chr20:1,898,458-1,899,623 - looks to be homozygous variant, not called by GATK for some reason
chr20:1,895,655-1,895,945 - hard to sequence area in coding region of gene, therefore low coverage, so 80x picks up many more variants. GATK misses ~10 variants in this region...
chr20:1,895,907-1,896,051 - GATK found variants, Mpileup missed them, Mpileup_80x found them
chr20:1,874,687-1,874,758 - low coverage, clear variant, FreeBayes called it, GATK/Mpileup_30x missed it, Mpileup_80x found it
chr20:1,874,438-1,875,020 - same as above, in or close to promoter of gene
chr20:1,875,196-1,875,778 - looks impossible to sequence or align, no variants can be called. Potential false negatives.
Note that low coverage areas are the main problem.
30x depth means 30x average, not 30x minimum
Go back and look at the depth_of_coverage analysis summary of your BAM dataset. The final column says that 89.8% of bases have >15 reads. This means that 10.2% of bases have <=15 reads. What does this mean for the false negative rate?
If you want, do a depth_of_coverage analysis on the 80x BAM file. In this case we go up to 96.1% over 15 reads.
For each variant caller, how many false positive variants are there?
How many false negative variants are there?
Is it important if a variant is heterozygotic or homozygotic?
How could we test the performance of the callers against each other for each case?
How can we use the QUAL score?
If we were preparing a list of variants for the next phase of the research, what would we say about them to the rest of the research group?
Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Complete