Variant Detection (Model organism)

Advanced tutorial

Mahtab Mirmomeni & Andrew Lonie

http://genome.edu.au

http://vlsci.org.au

Contents

Tutorial Overview

Background [15 min]

Preparation [10 min]

Section 1: Quality Control [5 mins]

Section 2: Alignment [15 mins]

Section 3. Calling single nucleotide variations with mpileup [15 min]

Section 4. Local realignment [15 min]

Section 5. FreeBayes [10 min]

Section 6. GATK Unified Genotyper [10 min]

Section 7. Evaluate known variations [15 min]

Section 8. Annotation [15 min]

Section 9. Wrapup [30 min]


Tutorial Overview

In this tutorial we compare the performance of three statistically-based variant detection tools:

  • SAMtools: Mpileup
  • GATK: Unified Genotyper
  • FreeBayes

Each of these tools takes as its input a BAM file of aligned reads and generates a list of likely variants in VCF format.

Background [15 min]

Read the background to the workshop here

Where is the data in this tutorial from?

The data has been produced from human whole genomic DNA. Only reads that have mapped to a part of chromosome 20 have been used, to make the data suitable for an interactive tutorial. There are about one million 100bp reads in the dataset, produced on an Illumina HiSeq2000. This data was generated as part of the 1000 Genomes project: http://www.1000genomes.org/

Preparation [10 min]

  1. Open a browser and go to the Galaxy server: http://galaxy-tut.genome.edu.au
     NOTE: Firefox/Safari/Chrome all work well, Internet Explorer not so well
  2. Register as a new user via User>Register if you don’t already have an account (what is Galaxy?), or log in if you do

  3. Import the data for the workshop

You can do this in a few ways, of which by far the easiest is:

  1. Go to Shared Data -> Published Histories and click on VariantDetection_Advanced_Prep
  2. Click 'Import History' at top right, wait for the history to be imported to your account, and then ‘start using this history’
  3. This will create a new Galaxy history in your account with all of the required data files
  4. Proceed to Section 1

Alternatively:

  1. Download the data directly to your computer using these URLs:
     • https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/variantCalling_ADVNCD/NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
     • https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/variantCalling_ADVNCD/NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
  2. Once the files are downloaded, in the Galaxy tools panel click on Get Data>Upload File
  3. Click 'Choose File', locate the local copy of a fastq file and upload
  4. Make sure you select 'fastqsanger' under File Format or the alignment section won't work!
  5. Click Execute and wait for the files to upload to Galaxy. This can take a couple of minutes.
  6. Do the same for the file we’ll need for evaluation:
     • https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/variantCalling_ADVNCD/dbSNP135_excludingsitesafter129_chr20.vcf

Or:

  1. Upload the data directly to Galaxy using the file URLs
  1. Follow the steps above, but instead of downloading the files locally and then uploading, copy the URLs from the previous step into the URL/Text box and Galaxy will download and import them directly. Remember to select the ‘fastqsanger’ File Format for the sequence files and ‘vcf’ for the vcf file.
  2. When the files are finished uploading, rename them to ‘NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1’, ‘NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2’ and ‘dbSNP135_excludingsitesafter129.chr20.vcf’ by clicking on the pencil icon next to the file name in the right hand Galaxy panel (the history panel)

NOTE: If you log out of Galaxy and log back in at a later time, your data and results from previous experiments will be available in the right panel of your screen, called the ‘History’.

Or:

  1. Import this history into your Galaxy instance using ‘Import from File’ under the Histories menu: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Prep.tar.gz

You should now have these files in your history:

  • NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
  • NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
  • dbSNP135_excludingsitesafter129.chr20.vcf

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_ADVNCD_Prep

Section 1: Quality Control [5 mins]

The aim here is to evaluate the quality of the short read data. If the quality is poor, then adjustments can be made - eg trimming the short reads, or adjusting your expectations of the final outcome!

  1. Analyse the quality of the reads in the FASTQ files

From the left hand tool panel in Galaxy, select NGS: QC and manipulation -> FastQC

  1. Click on the eye icon to view the various quality metrics
  2. Look at the generated FastQC metrics. This data looks pretty good - high per-base quality scores (most above 30).

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec1

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec1.tar.gz

Section 2: Alignment [15 mins]

The basic process here is to map each of the individual reads in the sample FASTQ readsets to a reference genome, so that we can then identify the sequence changes with respect to the reference genome.

Some of the variant callers need extra information regarding the source of reads in order to identify the correct error profiles to use in their statistical variant detection model, so we add more information at the alignment step so that the generated BAM file contains the metadata the variant caller expects (an example read group line is shown after the following steps).

  1. Map/align the reads with BWA to Human reference genome 19 (hg19)

NGS: Mapping>Map with BWA for Illumina [2-3mins]

  1. Reference genome: Human (hg19)
  2. Is this library mate-paired? Paired-end
  3. Forward FASTQ File:  NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
  4. Reverse FASTQ File:  NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
  5. BWA settings to use: Full Parameter List
  6. Specify the read group for this file? (samse/sampe -r): Yes
  7. Read Group Identifier (ID): Tutorial_readgroup
  8. Library name (LB): Tutorial_library
  9. Platform/technology used to produce the reads (PL): ILLUMINA
  10. Sample (SM): NA12878
  11. Use defaults for other fields
  12. Execute
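
With the settings above, the header of the generated BAM file should contain a read group line along these lines (shown for orientation only; fields are tab-separated in the real file and tag order may vary between BWA versions):

    @RG    ID:Tutorial_readgroup    LB:Tutorial_library    PL:ILLUMINA    SM:NA12878

The SM (sample) tag is the one the genotypers use to group reads by sample; LB and PL feed into duplicate handling and error modelling respectively.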

  1. Compress the SAM file to BAM format

NGS: Sam Tools>SAM-to-BAM [1 min]

  1. SAM File to Convert: The SAM file you just generated
  2. Execute

  1. Open the aligned BAM file in the Integrated Genome Viewer (IGV)
  1. First, give your BAM dataset a more meaningful name - eg 'NA12878.chr20_2mb.30xPE.BWA_mapped'
  2. In the history panel for the newly generated BAM file, click on ‘display with IGV web current’ (if you already have IGV running, click ‘display with IGV local’ instead)
  3. Select chr20 in the IGV chromosomal region drop down box (top of IGV, on the left next to the organism drop down box)
  4. Zoom in to the left hand end of chromosome 20 to see the read alignments - remember our reads only cover the first 2mb of the chromosome.

Scroll around and zoom in and out in the IGV genome viewer to get a feel for genomic data. Note that coverage is variable, with some regions getting almost no coverage (e.g. try chr20:1,870,686-1,880,895 - if you zoom right in to base resolution you’ll see that this region is very GC rich, meaning it’s hard to sequence. Unfortunately it also contains the first few exons of a gene...)

  1. Generate a genomic interval (BED) file that we will use to restrict further analyses to the first 2mb of chromosome 20

The next steps can be computationally intensive if performed on the entire genome. We can limit the analysis to the region that we know our data comes from, the first part of chromosome 20

Text Manipulation>Create single interval

  1. Chromosome: chr20
  2. Start position: 0
  3. End position: 2000000
  4. Name: chr20_2mb
  5. Strand: plus
  6. Execute
  7. When the file is created, rename it to: ‘chr20_2mb.bed’

This will generate an interval file specifying the first 2mb of chromosome 20, which we’ll use in the following steps
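
For orientation, the resulting interval file is a single tab-separated BED-style line, something like this (the exact set of columns can vary slightly with the tool version):

    chr20    0    2000000    chr20_2mb    0    +

Note that BED coordinates are 0-based and the end position is exclusive, so this covers the first 2,000,000 bases of chr20.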

  1. Evaluate the depth of coverage of the aligned region

NGS: GATK2 Tools>Depth of Coverage [1-2 min]

  1. BAM files: Select the BAM file you just generated
  2. Using reference genome: Human (hg19)
  3. Output format: table
  4. Basic or Advanced GATK options: Advanced
  5. Add new Operate on Genomic intervals: select the chr20_2mb.bed file from your history
  6. Execute
  7. When completed, view ‘Depth of Coverage on data.... (output summary sample)’
     • First, you will need to tell Galaxy that this is a text file: click on the pencil icon and change the Datatype of any ‘Depth of Coverage’ output you want to see to ‘txt’
     • The (output summary sample) file reports the total depth of coverage across your sample (for the first 2mb, as specified in the ‘Operate on Genomic intervals’ parameter). The columns in this report are: sample, total bases, mean depth
     • The mean depth should be ~24x. Note that ~89% of reference bases are covered at 15x or more, an informally agreed minimum depth for reasonable variant calling
  8. The other tables give you more detailed statistics on the level of coverage, broken down by regions etc. We don’t really need them, so to keep our Galaxy history clean, delete all the outputs of this step except for the ‘(output summary sample)’ file - use the ‘X’ next to a history file to delete it.

Completed Galaxy history for this section (in SharedData>Published Histories):  VariantDet_ADVNCD_Sec2

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec2.tar.gz

Section 3. Calling single nucleotide variations with mpileup [15 min]

Mpileup is a Bayesian variant genotyper which assesses the likelihood of each possible genotype for each position in the reference genome, given the observed reads at that position, and reports back the list of variants (positions where a genotype different to homozygous reference was called).

Read more about how to use mpileup here
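
To make the idea concrete, here is a minimal Python sketch of the genotype-likelihood calculation described above, under a deliberately simplified error model. Real callers such as mpileup also use per-base quality scores, mapping qualities and population priors, so treat this as an illustration only:

    from math import log

    BASES = "ACGT"

    def genotype_log_likelihood(pileup, genotype, error_rate=0.01):
        """log P(observed bases | genotype) under a simple error model."""
        a1, a2 = genotype
        ll = 0.0
        for base in pileup:
            # each read samples one of the two alleles at random and reports
            # it correctly with prob 1 - error_rate, otherwise one of the
            # other three bases uniformly
            p = 0.0
            for allele in (a1, a2):
                p += 0.5 * ((1 - error_rate) if base == allele else error_rate / 3)
            ll += log(p)
        return ll

    def call_genotype(pileup):
        """Return the most likely diploid genotype for a column of bases."""
        genotypes = [a + b for i, a in enumerate(BASES) for b in BASES[i:]]
        scored = {g: genotype_log_likelihood(pileup, g) for g in genotypes}
        return max(scored, key=scored.get)

    # a 30x pileup column: 14 reads support the reference A, 16 support G
    print(call_genotype("A" * 14 + "G" * 16))   # -> AG (heterozygous call)

Running this prints ‘AG’: even though only about half the reads carry the G allele, the heterozygous genotype is overwhelmingly more likely than either homozygote.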

  1. Pileup the mapped reads in the BAM file from the previous step

NGS:SAM Tools>Mpileup [1-2 min]

  1. BAM file: The BAM file you generated in Section 2
  2. Using reference genome: hg19
  3. Genotype Likelihood Computation: Perform a genotype likelihood computation
  4. Perform INDEL calling: Do not perform indel calling
  5. Set advanced options: Advanced
     • List of regions or sites on which to operate: chr20_2mb.bed
  6. Execute

Mpileup creates a BCF (binary VCF) file which contains the list of variations in a compressed format.

  1. Convert the BCF file to a VCF file

NGS:SAM Tools> bcftools view

  1. Choose a bcf file to view: Mpileup file just generated ‘MPileup on data... and data …’
  2. Keep other defaults
  3. Execute

  1. Rename the file and change the output format from tabular to vcf
  1. Change the properties of the file using the pencil icon:
  2. Rename the file to something useful eg ‘NA12878.Mpileup.chr20_2mb.vcf’
  3. Save
  4. Click on the Datatype link
  5. New Type: vcf
  6. Save

  1. Check the generated list of variants
  1. Note there are ~3000 variants in this list
  2. Open the VCF file in IGV, using the ‘display with IGV local’ link in the history panel (don’t use IGV web_current or you’ll open another IGV instance)
  3. After a minute or so you’ll see a new track appear in IGV, on top of the BAM file tracks you have open
  4. Zoom in or out until you see some variant calls appear in the track
  5. Choose a region with some variants and match the alignments to the variant calls (try chr20:1,123,714-1,126,378)

IGV displays VCF variants as coloured bars; homozygotes are solid bars, heterozygotes are two-colour representing the two alleles (A=green, C=blue, G=yellow, T=red). If you hover over a variant bar, IGV displays info about that variant including number of bases supporting the variant call, quality of variant call, etc. Generally, homozygotes are easier to call correctly than heterozygotes.
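
If you want to quantify this, the genotype of each call is in the GT subfield of the sample column of the VCF. A minimal sketch, assuming the single-sample Mpileup VCF renamed above:

    # Count heterozygous vs homozygous calls using the GT field of the
    # single sample in the VCF.
    het = hom = 0
    with open("NA12878.Mpileup.chr20_2mb.vcf") as vcf:
        for line in vcf:
            if line.startswith("#"):              # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            fmt = fields[8].split(":")            # FORMAT column, eg GT:PL:DP
            sample = fields[9].split(":")
            gt = sample[fmt.index("GT")].replace("|", "/")
            alleles = gt.split("/")
            if "." in alleles:                    # skip missing genotypes
                continue
            if alleles[0] == alleles[1]:
                hom += 1
            else:
                het += 1
    print(het, "heterozygous,", hom, "homozygous")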

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec3

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec3.tar.gz

Section 4. Local realignment [15 min]

Alignment to large genomes is a compromise between speed and accuracy. This is often a problem where indels are present - reads with indels are computationally difficult to align correctly and often are misaligned as a compromise to speed. This is one of the major sources of false positive variant errors. We can improve the performance of variant callers by doing a further step of ‘realignment’ after the initial alignment, focussing on just those areas with potential indels, and spending more time trying to find the right alignment that works for all the reads in that area using a slower but more accurate alignment algorithm. Because we now know the genomic area that the reads come from, we can use information from a group of reads to decide about a difference to the reference genome; this cannot be done at the initial alignment step, when reads are considered individually.

Both GATK and FreeBayes will benefit from local realignment around indels. Mpileup does its own adjustments for indels.

 

  1. Generate the list of indel regions to be realigned

NGS: GATK2 Tools> Realigner Target Creator

  1. BAM file: Select the BAM file from previous section
  2. Using reference genome: Human (hg19)
  3. Basic or Advanced GATK options: Advanced
  4. Add new Operate on Genomic intervals: select the chr20_2mb.bed file from your history
  5. Execute
  6. When finished, you should see that ~2800 regions have been identified as candidates for local realignment

  1. Realign the subset of reads around the target indel areas and generate a new BAM file

NGS: GATK2 Tools> Indel Realigner

  1. BAM file: Select the BAM file from previous section
  2. Using reference genome: Human (hg19)
  3. Restrict realignment to provided intervals: use the intervals file generated in the previous step
  4. Execute
  5. Rename the file to something useful eg 'NA12878.chr20_2mb.30xPE.realigned'
  6. To keep your history clean, delete the log files generated by GATK in the last two steps

  1. Compare the realigned BAM file to the original BAM around an indel
  1. Open the realigned BAM file in IGV
  2. It should appear as a new frame below the previously opened BAM
  3. Generally, the new BAM should appear identical to the old BAM except in the realigned regions
  4. Find some regions with indels that have been realigned (use the ‘Realigner Target Creator on data... (GATK intervals)’ file from the first step of realignment - it lists the realigned regions)
  5. If you can’t find anything obvious, check region chr20:1163914-1163954 - you should see that realignment has caused reads that originally provided evidence of a ‘G/C’ variant at chr20:1163937 to be realigned with a 10bp insertion at chr20:1163835, leaving no evidence of the variant

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec4

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec4.tar.gz

Section 5. FreeBayes [10 min]

FreeBayes is a Bayesian variant caller which aggressively calls all possible variants, leaving filtering to the user. FreeBayes generates a variant Quality score (as do all variant callers) which can be used for filtering.

Read more about FreeBayes here

  1. Detect variations in the alignment with the reference genome using FreeBayes

NGS: Variation Detection -> Free Bayes

  1. BAM file: 'NA12878.chr20_2mb.30xPE.realigned'
  2. Using reference genome: hg19
  3. Basic or Advanced options: Advanced
  4. Limit analysis to listed targets: Limit By target File
  5. Limit analysis to targets listed in the BED-format FILE: chr20_2mb.bed
  6. Set allele scope options: Set
  7. Ignore insertion and deletion alleles: check
  8. Set input filters options: Set
  9. Exclude reads with more than N base mismatches, ignoring gaps with quality >= mismatch-base-quality-threshold: 1    
  10. Keep all other defaults
  11. Execute
  12. When finished, rename to ‘NA12878.FreeBayes.chr20_2mb.vcf’

  1. Check the generated list of variants
  1. Note there are ~6100 variants in this list
  2. Open the VCF file in IGV, using the link in the history panel
  3. Find a region where FreeBayes has called a variant but Mpileup hasn’t (try chr20:1,123,714-1,128,378)

Note that FreeBayes does not do any filtering, but rather leaves the user to filter based on e.g. variant quality scores. So it returns many more variants, many of which will be low quality. In other words, FreeBayes will call a potential variant based on much less evidence than Mpileup.

  • FreeBayes gives you more false positives
  • Mpileup will give you more false negatives

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec5

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec5.tar.gz

Section 6. GATK Unified Genotyper [10 min]

The GATK Unified Genotyper is a Bayesian variant caller and genotyper from the Broad Institute. Many users consider the GATK to be best practice in human variant calling.

Read more about the GATK here

  1. Generate VCF file of variations using Unified Genotyper

NGS: GATK2 Tools ->Unified Genotyper

  1. BAM file: Realigned BAM file ‘NA12878.chr20_2mb.30xPE.realigned’
  2. Using reference genome: Human (hg19)
  3. dbSNP ROD file: dbSNP135_excludingsitesafter129.chr20.vcf
  4. Genotype likelihoods calculation model to employ: SNP
  5. Basic or Advanced GATK options: Advanced
  6. Add new Operate on Genomic intervals:
     • List of regions or sites on which to operate: chr20_2mb.bed
  7. Execute
  8. Rename the file to something useful eg ‘NA12878.GATK.chr20_2mb.vcf’
  9. Delete the (log) and (metrics) files

  1. Check the generated list of variants
  1. Note there are ~3200 variants in this list
  2. Open the VCF file in IGV, using the link in the history panel
  3. Find regions where GATK has called a variant but FreeBayes/Mpileup hasn’t, and vice versa (try chr20:1,123,714-1,128,378)

We now have three different variant call sets from the same data. Browse around in IGV looking at places where they are different: eg look at: chr20:1,127,767-1,127,906

Here each of the variant callers has given a different SNP call. This is a confusing region for SNP callers due to the 15bp deletion, which may or may not be on both alleles.

We could immediately improve the FreeBayes callset by filtering on variant quality scores. If you want, you can do this with the NGS: GATK2 Tools>Select Variants tool: choose the FreeBayes vcf file, click ‘Add new criteria to use when selecting data’, use ‘QUAL > 50’, and Execute.
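
For reference, the same filter is a few lines over the VCF itself - a minimal sketch (QUAL is column 6 of a VCF; this assumes QUAL is numeric, as it is in FreeBayes output):

    # Keep header lines plus any call with QUAL > 50.
    with open("NA12878.FreeBayes.chr20_2mb.vcf") as src, \
         open("NA12878.FreeBayes.chr20_2mb.q50.vcf", "w") as dst:
        for line in src:
            if line.startswith("#") or float(line.split("\t")[5]) > 50:
                dst.write(line)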

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec6

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec6.tar.gz

Section 7. Evaluate known variations [15 min]

How can we evaluate our variants?

We know a lot about variation in humans from many empirical studies, including the 1000Genomes project, so we have some expectations on what we should see when we detect variants in a new sample:

  • Expect to see true variations at the rate of about 1 per 1000bp against the reference genome
  • 85% of variations ‘rediscovered’ - that is, 85% already known and recorded in dbSNP (% dependent on the version of dbSNP)
  • A transition/transversion (Ti/Tv) rate of >2 if the variants are high quality, even higher if the variants are in coding regions.

Read more about SNP call set properties here

We can immediately see that each of the variant callers has found more variants than we would have expected - around 2000 in our 2 megabase region - with callsets ranging from roughly 3000 (Mpileup, GATK) to over 6000 (FreeBayes). This is normal for variant calling, where most callers err on the side of sensitivity to reduce false negatives (missed SNVs), expecting the user to do further filtering to remove false positives (false SNVs).

  1. Evaluate dbSNP concordance and Ti/Tv ratio using the GATK VariantEval tool

NGS: GATK2 Tools -> Eval Variants

  1. Variant 1: NA12878.Mpileup.chr20_2mb.vcf
  2. Add new Variant; Variant 2: NA12878.FreeBayes.chr20_2mb.vcf
  3. Add new Variant; Variant 3: NA12878.GATK.chr20_2mb.vcf
  4. Using reference genome: Human (hg19)
  5. Provide a dbsnp reference-ordered data file: set dbSNP
     • dbSNP ROD file: dbSNP135_excludingsitesafter129.chr20.vcf
  6. Basic or Advanced Analysis options: Advanced
  7. Eval modules to apply on the eval track(s): CompOverlap and TiTvVariantEvaluator
  8. Do not use the standard eval modules by default: check
  9. Execute
  10. When finished, change the Datatype of the generated Eval Variants (report) to ‘txt’

  1. Interpret the dbSNP concordance section of the evaluation report
  1. View the ‘Eval Variants on data... (report)’
  2. The first section of the report lists the overlap in variants between the generated VCF files and known human variation sites from dbSNP
  3. The report needs a bit of explaining:
     • The EvalRod column specifies which of the input VCFs is analysed (input_0 = first VCF, input_1 = second, etc). If you followed the steps above exactly, then input_0 = Mpileup, input_1 = FreeBayes and input_2 = GATK
     • The CompRod column specifies the set of variants against which the input VCFs are being compared
     • Novelty = whether the variants have been found in the supplied dbSNP file or not (known = in dbSNP, novel = not in dbSNP)
     • nEvalVariants = # variant sites in the EvalRod (input) VCF
     • novelSites = # variant sites found in the EvalRod VCF but not in CompRod (ie not in dbSNP)
     • CompOverlap = # variant sites found in both EvalRod and CompRod
     • compRate = % of variant sites from EvalRod found in CompRod (= CompOverlap/nEvalVariants). This is the important one, and it’s what people generally refer to as dbSNP concordance - a sketch for computing it yourself follows below
     • concordantRate = % of overlapping variants with the same genotype
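
As a cross-check, dbSNP concordance can be recomputed directly from the files - a minimal sketch using the filenames from this tutorial (position-only matching, so slight differences from GATK’s accounting are possible):

    def sites(path):
        """Set of (chrom, pos) pairs for all records in a VCF."""
        with open(path) as vcf:
            return {tuple(line.split("\t")[:2])
                    for line in vcf if not line.startswith("#")}

    dbsnp = sites("dbSNP135_excludingsitesafter129.chr20.vcf")
    calls = sites("NA12878.Mpileup.chr20_2mb.vcf")
    print("compRate = %.1f%%" % (100.0 * len(calls & dbsnp) / len(calls)))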

  1. Interpret the TiTv section of the evaluation report
  1. The second section of the report lists the Ti/Tv ratio for different groups of variants from the callsets
  2. It generally follows the format above, the interesting column being ‘tiTvRatio’ for each of the callsets
  3. The thing to look for is a Ti/Tv close to the TiTvRatioStandard of 2.38. Generally, the higher the better, as a high Ti/Tv ratio indicates that most variants are likely to reflect our expectations from what we know about human variation.

Note that the Ti/Tv for all callsets is around 2.3 for the dbSNP concordant variants (Mpileup: 2.32; FreeBayes: 2.25; GATK: 2.26). This is as expected, as these are almost certainly true variant sites. For the novel variants, however, the ratio varies widely between callsets.

The Ti/Tv ratio for the Mpileup callset (input_0) looks pretty convincing: it’s 2.32 for all variants, 2.32 for the dbSNP-known variants (which we consider very likely to be true) and 2.32 for the novel variants. This is a very convincing set of numbers, but it’s worth noting that Mpileup called the fewest variants. FreeBayes, which called the most variants, has a Ti/Tv ratio for its novel variants close to 1 - much nearer what you would expect from random base changes than from real variation - so most of those novel calls look like noise.
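
You can recompute Ti/Tv for any of the callsets yourself - a minimal sketch (SNVs only; transitions are A<->G and C<->T, every other single-base change is a transversion; substitute whichever VCF you want to check):

    # Tally transitions vs transversions over single-base REF/ALT pairs.
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    ti = tv = 0
    with open("NA12878.Mpileup.chr20_2mb.vcf") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            ref, alt = line.split("\t")[3:5]
            if len(ref) == 1 and len(alt) == 1:   # skip indels/multi-allelic
                if (ref, alt) in TRANSITIONS:
                    ti += 1
                else:
                    tv += 1
    print("Ti/Tv = %.2f" % (ti / float(tv)))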

It is of course easy to call the highly confident variants, the trick is in calling ALL the variants, including ones on the margin of false positive/false negative.

So although we have a lot of confidence in the set of variants called by Mpileup, we might want to assure ourselves that this callset is exhaustive and we don’t have a high false negative rate. THIS is very hard to do!

  1. How much overlap is there in the call sets?

One way to work out how many of the variants are in common between the call sets is to produce a Venn diagram to visualise the overlap. We can do this in Galaxy by aligning our three variant call sets using the chromosomal coordinate of each SNP, in the second column of each VCF file.

First though, we need to remove the headers from each file, leaving us just those lines with a valid position value.

Remove the header lines of each of the three VCF files

Filter and Sort>Select lines that match an expression

  1. Select lines from: NA12878.Mpileup.chr20_2mb.vcf
  2. that: NOT matching
  3. the pattern: ^#
  4. Execute
  5. Repeat for NA12878.FreeBayes.chr20_2mb.vcf and NA12878.GATK.chr20_2mb.vcf

This will remove any lines starting with a ‘#’ character – in other words, all the headers.

Graph/Display Data>proportional venn

  1. Input file 1: The Mpileup VCF file with the header removed, from above
  2. Column index: 1
  3. As name: Mpileup
  4. Input file 2: The FreeBayes VCF file with the header removed, from above
  5. Column index: 1
  6. As name file 2: FreeBayes
  7. Two or three: three
  8. Input file 3: The GATK VCF file with the header removed, from above
  9. Column index: 1
  10. As name file 3: GATK
  11. Execute
  12. View the resulting Venn diagram to see the overlap for the three callers.

Unfortunately the Venn diagram plotting function of Galaxy does not overlay values onto the chart but instead provides them underneath. To interpret these, you need to understand the symbols used:

  • ∪ = union – the total number of values in one, other or both of two sets
  • ∩ = intersection – the total number of values that are members of two sets
  • \ = difference – the total number of values that are in one set but not another

For example, the line “Mpileup ∩ GATK \ FreeBayes - 356” means that 356 SNPs were identified by Mpileup and GATK but not by FreeBayes. For a more detailed explanation of set terminology, see here. Also note that it is a ‘proportional Venn’ – in other words, the relative sizes of the circles and area of overlap is proportional to the number of values in each.
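
The same overlap counts can be computed directly from the variant positions - a minimal sketch (position-only matching on column 2, as the Venn tool does here; fine for this single-chromosome dataset):

    def positions(path):
        """Set of variant positions (column 2) in a VCF."""
        with open(path) as vcf:
            return {line.split("\t")[1]
                    for line in vcf if not line.startswith("#")}

    mp = positions("NA12878.Mpileup.chr20_2mb.vcf")
    fb = positions("NA12878.FreeBayes.chr20_2mb.vcf")
    gatk = positions("NA12878.GATK.chr20_2mb.vcf")

    print("Mpileup ∩ GATK ∩ FreeBayes:", len(mp & gatk & fb))
    print("Mpileup ∩ GATK \\ FreeBayes:", len((mp & gatk) - fb))
    print("FreeBayes \\ (Mpileup ∪ GATK):", len(fb - mp - gatk))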

The diagram below sums up the output in a more conventional Venn diagram format. The three callers agree on 2677 variants, and GATK and Mpileup broadly agree on most variants, differing only in ~240 total (i.e. 10%).

FreeBayes has called many more SNPs with most of its calls probably being false positives. If we filter the FreeBayes variants to remove all those with a quality score of less than 50, we get a more convincing intersection agreement, with the filtered FreeBayes set only containing seven SNPs not identified by GATK or Mpileup.

The way we have done this is to eliminate most of the false positives. Unfortunately it looks like we might have also removed a number of true positives, with 200 fewer SNPs in the three-way intersection.

This is the fundamental issue with variant calling: it’s a probabilistic process and the ‘truth’ set of variants is very hard to know!

Generally we would do some more filtering of the call sets using various approaches including tailored filtering on known characteristics of poor variant calls such as PCR duplicates and strand bias. If we have lots of variants we can train a classifier on known good variants (e.g. those that we find in dbsnp) to recognise the characteristics of those variants and use the classifier to try to distinguish good variant calls from bad.

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec7

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec7.tar.gz

Section 8. Annotation [15 min]

Detected variants can be annotated against all registered variants in a pregenerated variant annotation database such as Ensembl. The resulting annotation shows each detected variant’s genomic location and whether it falls in a gene, transcript or other part of the genome. The Ensembl Variant Effect Predictor hosted at the EBI is used as the reference database in this tutorial.

  1. Annotate the detected variants against the Ensembl database

NGS: Annotation>Ensembl variant effect predictor

  1. Input file: Any of the vcf files you have generated, eg NA12878.GATK.chr20_2mb.vcf
  2. Name of the Species being annotated: Human
  3. Execute

  1. Interpret the annotation output

View the output of the annotation step

At this stage the most interesting columns are ‘Consequence’ and ‘Extra’. ‘Consequence’ is a simple prediction of the type of variant by position. In most cases the variants are in intergenic regions, and the effect is of course very hard to predict.

In the first few variants we can see a missense mutation though: chr20:76771

View this variant in IGV and confirm that it is in a gene and causes an amino acid change. Is it a heterozygous or homozygous variant?

The ‘Extra’ column, for potentially deleterious mutations, contains a prediction of how damaging that mutation would be to the protein product. The predictive algorithms are ‘SIFT’ and ‘PolyPhen’.

  1. How many deleterious mutations are there in our callset?

Filter and Sort>Filter

  1. Filter criteria: c7=='missense_variant'
  2. How many missense variants are there? Is this a surprising number?
  3. Are there any nonsense mutations? (a sketch for tallying all the consequence types follows below)
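
To get an overview of every consequence type at once, you can tally column 7 of the VEP table directly - a minimal sketch (the filename is hypothetical; download the VEP output from your history first, and note that the Consequence column is the 7th, matching the c7 used in the filter above):

    from collections import Counter

    counts = Counter()
    with open("vep_output.txt") as vep:          # hypothetical local filename
        for line in vep:
            if line.startswith("#"):
                continue
            counts[line.split("\t")[6]] += 1     # c7 = 7th column, index 6
    for consequence, n in counts.most_common():
        print(consequence, n)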

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Sec8

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Sec8.tar.gz

Section 9. Wrapup [30 min]

To finish, we’ll compare our variants against a ‘truth’ set and get a feel for how good our variant calling is and what we’ve missed.

The truth set we will use has been generated using Mpileup on a very deep sequencing of the same NA12878 sample by the Broad Institute.

  1. Upload the truth set to your history
  1. Go to Shared Data>Published Histories, then ‘VariantDetection_ADVANCED_Section9’
  2. Select the two datasets ‘NA12878.80x.Mpileup.wgs_chr20_2mb.vcf’ and ‘NA12878.hiseq_80x.wgs_chr20_2mb.bwa.hg19.bam’
  3. Click ‘Import to your current history’, then ‘Go’
  4. Return to the analysis view with ‘Analyse Data’
  5. The two datasets should now be in your history
  6. View both of the datasets in IGV alongside your generated VCF files

The deeply sequenced dataset identifies 3220 variants. Note that Mpileup has a bias against variants in areas of indels, so may miss some variants there. Intersecting the callsets, we see that GATK has the most in common with our deeply sequenced callset (3108 variants in common), but also the most unique (and therefore potentially false positive) calls. But, of course, our ‘truth’ set is not necessarily true (or, at least, exhaustive)! It was itself called by a particular genotyper, which has its own biases...

  1. Contrast NA12878.80x.Mpileup vs NA12878.hiseq_30x.Mpileup vs NA12878.hiseq_30x.GATK

Sites where calling variants is unclear & differs between callers:

chr20:1,941,090-1,941,234 - indel next to variant, therefore ignored by Mpileup

chr20:1,898,458-1,899,623 - looks to be a homozygous variant, not called by GATK for some reason

chr20:1,895,655-1,895,945 - hard-to-sequence area in the coding region of a gene, therefore low coverage, so 80x picks up many more variants. GATK misses ~10 variants in this region...

chr20:1,895,907-1,896,051 - GATK found variants, Mpileup missed them, Mpileup_80x found them

chr20:1,874,687-1,874,758 - low coverage, clear variant, FreeBayes called it, GATK/Mpileup_30x missed it, Mpileup_80x found it

chr20:1,874,438-1,875,020 - same as above, in or close to promoter of gene

chr20:1,875,196-1,875,778 - looks impossible to sequence or align, no variants can be called. Potential false negatives.

Note that low coverage areas are the main problem.

30x depth means 30x average, not 30x minimum

Go back and look at the depth_of_coverage analysis summary of your BAM dataset. The final column says that 89.8% of bases have >15 reads. This means that 10.2% of bases have <=15 reads. What does this mean for the false negative rate?

If you want, do a depth_of_coverage analysis on the 80x BAM file. In this case we go up to 96.1% over 15 reads.
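
If you want to reproduce these percentages outside Galaxy, the fraction of bases at or above a threshold can be computed from any per-base depth table - a minimal sketch (assumes a three-column chrom/pos/depth table such as ‘samtools depth’ produces for the region; the filename is hypothetical, and note that some versions of samtools depth omit zero-coverage positions unless asked to include them):

    threshold = 15
    total = covered = 0
    with open("depth_chr20_2mb.txt") as table:   # hypothetical local filename
        for line in table:
            total += 1
            if int(line.split()[2]) >= threshold:
                covered += 1
    print("%.1f%% of bases have >= %dx coverage"
          % (100.0 * covered / total, threshold))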

  1. Discussion points

For each variant caller, how many false positive variants are there?

How many false negative variants are there?

Is it important if a variant is heterozygous or homozygous?

How could we test the performance of the callers against each other for each case?

How can we use the QUAL score?

If we were preparing a list of variants for the next phase of the research, what would we say about them to the rest of the research group?

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDet_ADVNCD_Complete

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_ADVNCD/Galaxy-History-VariantDet_ADVNCD_Complete.tar.gz
