Variant Detection (Human)

Introductory genomics tutorial

Andrew Lonie


Tutorial Overview

Background [15 min]

Preparation [15 min]

Section 1: Quality Control [10 min]

Section 2: Alignment [20 min]

Section 3: Calling Variants using the GATK Unified Genotyper [15 min]

Section 4: Evaluate known variations [15 min]

Section 5: Annotation [20 min]

Tutorial Overview

In this tutorial we cover the concepts of detecting small variants (SNVs and indels) in human genomic DNA, using a small set of sequencing reads from chromosome 20.

The fundamentals of short-read sequence analysis

Next-generation sequencing is a technology rather than an experimental approach; it generates data that can be used in a variety of ways. A common use of NGS is to characterise the sequence variation between samples. A typical experiment might involve:

  • Generate a DNA sample (from genomic DNA, targeted DNA regions such as exomes, or mRNA)
  • Sequence it on a next-generation machine. The output is many millions of short (35-200bp) reads, with quality information on each base.
  • Analyse quality metrics for the read data and filter the data if required
  • Align the reads against a reference sequence. This is more complex than it sounds, due to errors and variation between samples.
  • Interpret the aggregated evidence from the reads - identify differences between the observations and the reference, including sequence variants and expression levels. This normally involves statistics!
  • Biological validation: validate variations independently
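The interpretation step above can be illustrated with a deliberately naive sketch: aggregate the bases observed at one reference position and test whether a non-reference base is well supported. Real callers such as GATK and mpileup replace the simple fraction test with statistical models of sequencing error; the function name and thresholds here are invented for illustration.

```python
# Naive illustration of the core idea of variant detection: compare the
# aligned bases observed at one reference position against the reference
# base. Thresholds are arbitrary; real callers use statistical models.
from collections import Counter

def call_site(ref_base, observed_bases, min_alt_fraction=0.2, min_depth=10):
    """Return the best-supported non-reference base, or None."""
    depth = len(observed_bases)
    if depth < min_depth:
        return None  # too little coverage to say anything
    counts = Counter(observed_bases)
    alt, alt_count = max(
        ((b, n) for b, n in counts.items() if b != ref_base),
        key=lambda x: x[1], default=(None, 0))
    if alt and alt_count / depth >= min_alt_fraction:
        return alt
    return None

# A heterozygous-looking site: half the reads support 'T'
print(call_site('C', list('CCCCCTTTTT')))  # 'T'
# A single mismatching read is more likely a sequencing error
print(call_site('C', list('CCCCCCCCCT')))  # None
```

The point of the sketch is why statistics are needed: a single non-reference read could be either a rare variant or an error, and only a model of base qualities and error rates can weigh the two.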

What’s not covered

The tutorial is designed to introduce the tools, datatypes and workflow of variation detection. We filter the variations manually to understand what is actually happening in variant calling. In practice the datasets would be much larger and you would use more sophisticated tools to detect and call variants, based on statistical models of variant representation in a pool of reads.

Background [15 min]

Read a presentation on variant calling here

Where is the data in this tutorial from?

The data has been produced from human whole-genomic DNA. Only reads that map to a part of chromosome 20 have been used, to keep the dataset small enough for an interactive tutorial. There are about one million 100bp reads in the dataset, produced on an Illumina HiSeq2000. The data was generated as part of the 1000 Genomes project.

Preparation [15 min]

  1. Open a browser and go to the Galaxy server designated by your tutor, or use the public Galaxy tutorial server. (Firefox, Safari and Chrome all work well; Internet Explorer not so well.)
  2. Register as a new user (User>Register), or log in if you already have an account (what is Galaxy?)
  3. Import the data for the workshop

You can do this in a few ways*, of which by far the easiest is:

Go to Shared Data -> Published Histories and click on CoPJun2015_Prep

  1. Click 'Import History' at top right, wait for the history to be imported to your account, and then ‘start using this history’.
  2. This will create a new Galaxy history in your account with all of the required data files
  3. Proceed to Section 1        

*There are other ways to get data into Galaxy. See the alternative methods at the end of this tutorial.

  1. You should now have these files in your history:
  1. NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
  2. NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
  3. dbSNP135_excludingsitesafter129.chr20.vcf

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Prep

Section 1: Quality Control [10 min]

The aim here is to evaluate the quality of the short-read data. If the quality is poor, then adjustments can be made - eg trimming the short reads, or adjusting your expectations of the final outcome!

  1. Analyse the quality of the reads in the FASTQ file

From the left hand tool panel in Galaxy, select NGS: QC and manipulation>FASTQC: Read QC

  1. Click on the eye icon to view the various quality metrics
  2. Look at the generated FastQC metrics. This data looks pretty good - high per-base quality scores (most above 30)
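The per-base quality plot in FastQC is built from the Phred scores encoded on the fourth line of each FASTQ record (Phred+33 ASCII encoding for Illumina HiSeq data like ours). A minimal sketch of that decoding, using a made-up quality string rather than one from the real file:

```python
# FASTQ quality strings encode Phred scores as ASCII characters
# offset by 33 (the 'Phred+33' or 'fastqsanger' convention).
def phred_scores(quality_line):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(c) - 33 for c in quality_line]

qual = "IIIIHHGG?"               # invented example, not from the real file
scores = phred_scores(qual)
print(scores)                    # [40, 40, 40, 40, 39, 39, 38, 38, 30]
print(sum(scores) / len(scores)) # mean quality ~38, well above 30
```

A Phred score of 30 corresponds to a 1-in-1000 chance the base call is wrong, which is why "most bases above 30" is a good sign.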

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec1

Section 2: Alignment [20 min]

The basic process here is to map each of the individual reads in the sample FASTQ readsets to a reference genome, so that we can then identify the sequence changes with respect to the reference genome.

Some of the variant callers need extra information about the source of the reads in order to select the correct error profiles for their statistical variant detection models, so we add this information at the alignment step so that the generated BAM file contains the metadata the variant caller expects.

  1. Map/align the reads with BWA to Human reference genome 19 (hg19)

NGS: Mapping>Map with BWA for Illumina [2-3mins]

  1. Reference genome: Human (hg19)
  2. Is this library mate-paired? Paired-end
  3. Forward FASTQ File:  NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
  4. Reverse FASTQ File:  NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
  5. BWA settings to use: Full Parameter List
  6. Specify the read group for this file? (samse/sampe -r): Yes
  7. Read Group Identifier (ID): Tutorial_readgroup
  8. Library name (LB): Tutorial_library
  9. Platform/technology used to produce the reads (PL): ILLUMINA
  10. Sample (SM): NA12878
  11. Use defaults for other fields
  12. Execute
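The read-group values entered above end up as an @RG line in the BAM header (this is the string BWA's samse/sampe -r option carries through), and each read is tagged with the read-group ID. A sketch of how those fields assemble into the tab-separated header line:

```python
# The @RG header line is a tab-separated record of TAG:value pairs,
# built here from the same values entered in the Galaxy BWA form.
fields = {"ID": "Tutorial_readgroup", "LB": "Tutorial_library",
          "PL": "ILLUMINA", "SM": "NA12878"}
rg_line = "@RG\t" + "\t".join(f"{k}:{v}" for k, v in fields.items())
print(rg_line)
# @RG	ID:Tutorial_readgroup	LB:Tutorial_library	PL:ILLUMINA	SM:NA12878
```

The SM (sample) tag is the one the GATK cares most about: it is how variants get attributed to sample NA12878 in the output VCF.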

  1. Compress the SAM file to BAM format

NGS: Sam Tools>SAM-to-BAM [1 min]

  1. SAM File to Convert: The SAM file you just generated with BWA
  2. Execute

  1. Open the aligned BAM file in the Integrated Genome Viewer (IGV)
  1. First, give your BAM dataset a more meaningful name - eg 'NA12878.chr20_2mb.30xPE.BWA_mapped'
  2. In the history panel for the newly generated BAM file, click on the ‘display with IGV web current’ link (if you have IGV already running, click ‘display with IGV local’)
  3. Select chr20 in the IGV chromosomal region drop down box (top of IGV, on the left next to the organism drop down box)
  4. Zoom in to the left hand end of chromosome 20 to see the read alignments - remember our reads only cover the first 2mb of the chromosome.

Scroll around and zoom in and out in the IGV genome viewer to get a feel for genomic data. Note that coverage is variable, with some regions getting almost no coverage (e.g. try chr20:1,870,686-1,880,895 - if you zoom right in to base resolution you’ll see that this region is very GC rich, meaning it’s hard to sequence. Unfortunately it also contains the first few exons of a gene...)
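GC-rich regions like the one above are under-represented because extreme GC content hampers library amplification. The GC content of a window is just the fraction of G and C bases; a quick sketch (the sequence below is invented):

```python
# GC fraction of a sequence window. Windows well above ~0.6 GC tend to
# get poor Illumina coverage, as seen in the region viewed above.
def gc_fraction(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_fraction("ATGCGCGGCC"))  # 0.8 - a window this GC-rich gets few reads
```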

  1. Generate a genomic interval (BED) file that we will use to restrict further analyses to the first 2mb of chromosome 20

The next steps can be computationally intensive if performed on the entire genome. We can limit the analysis to the region that we know our data comes from, the first part of chromosome 20

  1. Text manipulation>Create single interval
  1. Chromosome: chr20
  2. Start position: 0
  3. End position: 2000000
  4. Name: chr20_2mb
  5. Strand: plus
  6. Execute
  7. When the file is created, rename it to: ‘chr20_2mb.bed’

This will generate an interval file specifying the first 2mb of chromosome 20, which we’ll use in the following steps
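For reference, BED uses 0-based, half-open coordinates, so a start of 0 and an end of 2000000 covers exactly the first 2mb of chr20. A sketch of the single tab-separated line the tool writes (the score column is an assumption here, included only to keep the strand in BED's sixth column):

```python
# One BED record: chrom, start (0-based), end (exclusive), name, score, strand.
chrom, start, end, name, score, strand = "chr20", 0, 2_000_000, "chr20_2mb", 0, "+"
bed_line = f"{chrom}\t{start}\t{end}\t{name}\t{score}\t{strand}"
print(bed_line)     # chr20	0	2000000	chr20_2mb	0	+
print(end - start)  # 2000000 bases covered by the interval
```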

  1. Evaluate the depth of coverage of the aligned region

NGS: GATK2 Tools>Depth of Coverage [2-5 min]

  1. BAM files: Select the BAM file you just generated
  2. Using reference genome: Human (hg19)
  3. Output format: table
  4. Basic or Advanced GATK options: Advanced
  5. Add new Operate on Genomic intervals: select the chr20_2mb.bed file from your history
  6. Execute
  7. When completed, view ‘Depth of Coverage on data.... (output summary sample)’. You will first need to tell Galaxy that this is a text file: click on the pencil icon, and change the Datatype of any ‘Depth of Coverage’ output you want to see to ‘txt’.
  8. ‘Depth of Coverage on data.... (output summary sample)’ reports the total depth of coverage across your sample (for the first 2mb, as specified in the ‘Add new Operate on Genomic intervals’ parameter). The columns in this report are: sample, total bases, mean depth. Mean depth should be ~24x. Note that ~89% of reference bases are covered at 15x or greater, an informal agreed minimum for reasonable variant calling.
  9. The other tables give more detailed statistics on the level of coverage, broken down by region etc. We don’t really need them, so to keep our Galaxy history clean, delete all the outputs of this step except the ‘(output summary sample)’ file. Use the ‘X’ next to a history file to delete it.
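The two headline numbers in the coverage summary - mean depth, and the fraction of bases covered at 15x or more - are simple summaries over per-base depths. A sketch with invented depths (the real report shows ~24x mean and ~89% at >=15x):

```python
# Summarise a list of per-base read depths the same way the coverage
# report does. The depths below are made up for illustration.
depths = [30, 25, 22, 18, 15, 14, 40, 28, 3, 45]

mean_depth = sum(depths) / len(depths)
frac_15x = sum(d >= 15 for d in depths) / len(depths)
print(round(mean_depth, 1))  # 24.0
print(frac_15x)              # 0.8 - 80% of these toy positions are >=15x
```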

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec2

Section 3: Calling Variants using the GATK Unified Genotyper [15 min]

The GATK Unified Genotyper is a Bayesian variant caller and genotyper from the Broad Institute. Many users consider the GATK to be best practice in human variant calling.

Read more about the GATK here

  1. Generate VCF file of variations using Unified Genotyper

NGS: GATK2 Tools ->Unified Genotyper

  1. BAM file: the mapped BAM file from Section 2, ‘NA12878.chr20_2mb.30xPE.BWA_mapped’
  2. Using reference genome: Human (hg19)
  3. dbSNP ROD file: dbSNP135_excludingsitesafter129.chr20.vcf
  4. Genotype likelihoods calculation model to employ: SNP
  5. Basic or Advanced GATK options: Advanced
  6. Add new Operate on Genomic intervals: List of regions or sites on which to operate: chr20_2mb.bed
  7. Execute
  8. Rename the output file to something useful eg ‘NA12878.GATK.chr20_2mb.vcf’
  9. Delete the (log) and (metrics) files

  1. Check the generated list of variants
  1. Note there are ~3700 variants in this list
  2. Open the VCF file in IGV, using the link in the history panel (try chr20:1,123,714-1,128,378)

Browse around in IGV looking at places where variant calling is complex: eg look at: chr20:1,127,767-1,127,906

This is a confusing region for SNP callers due to the 15bp deletion, which may or may not be on both alleles.

Section 3a: Try a different variant caller and compare


  1. Try the alternative caller Mpileup
  1. NGS: SAM Tools > Mpileup
  2. Using reference genome: Human (hg19)
  3. Genotype Likelihood Computation: Perform genotype likelihood computation
  4. Set Advanced Options: Advanced
  5. List of regions or sites on which to operate: chr20_2mb.bed
  6. Execute
  7. When finished, change datatype to VCF
  8. There are ~3500 variants in this list - different to the 3700 from GATK in step 1

Browse around in IGV looking at places where variant calling differs between the two callers.
Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec3

Section 4: Evaluate known variations [15 min]

How can we evaluate our variants?

We know a lot about variation in humans from many empirical studies, including the 1000 Genomes project, so we have some expectations about what we should see when we detect variants in a new sample:

  • Expect to see true variations at the rate of about 1 per 1000bp against the reference genome
  • 85% of variations ‘rediscovered’ - that is, 85% already known and recorded in dbSNP (% dependent on the version of dbSNP)
  • A transition/transversion (Ti/Tv) rate of >2 if the variants are high quality, even higher if the variants are in coding regions.

Read more about SNP call set properties here

We can immediately see that each of the variant callers reports more variants than expected - we would have expected around 2000 in our 2 megabase region, but we see ~3500-3700. This is normal for variant calling: most callers err on the side of sensitivity to reduce false negatives (missed SNVs), expecting the user to do further filtering to remove false positives (false SNVs).
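The Ti/Tv ratio mentioned above is easy to compute by hand: transitions are A<->G and C<->T; every other substitution is a transversion. Random errors would give Ti/Tv near 0.5 (there are twice as many possible transversions), so a ratio near 2 or above suggests real biology. A sketch with invented calls:

```python
# Classify SNVs as transitions or transversions and compute the ratio.
# The (ref, alt) pairs below are made up for illustration.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snvs):
    """snvs: iterable of (ref, alt) single-base pairs."""
    snvs = list(snvs)
    ti = sum((r, a) in TRANSITIONS for r, a in snvs)
    tv = len(snvs) - ti
    return ti / tv

calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(titv_ratio(calls))  # 2.0 - four transitions, two transversions
```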

  1. Evaluate dbSNP concordance and Ti/Tv ratio using the GATK VariantEval tool

NGS: GATK2 Tools -> Eval Variants

  1. Variant 1: NA12878.GATK.chr20_2mb.vcf
  2. Using reference genome: Human (hg19)
  3. Provide a dbsnp reference-ordered data file: set dbSNP
  4. dbSNP ROD file: dbSNP135_excludingsitesafter129.chr20.vcf
  5. Basic or Advanced Analysis options: Advanced
  6. Eval modules to apply on the eval track(s): CompOverlap and TiTvVariantEvaluator
  7. Do not use the standard eval modules by default: check
  8. Execute
  9. When finished, change the Datatype of the generated Eval Variants (report) to ‘txt’

  1. Interpret the dbSNP concordance section of the evaluation report
  1. View the ‘Eval Variants on data... (report)’
  2. The first section of the report lists the overlap in variants between the generated VCF files and known human variation sites from dbSNP
  3. The report needs a bit of explaining:
  • The EvalRod column specifies which of the input VCFs is analysed (input_0 = first VCF, input_1 = second, etc). If you followed the steps above exactly, then input_0 = GATK
  • The CompRod column specifies the set of variants against which the input VCFs are being compared
  • Novelty = whether the variants have been found in the supplied dbSNP file or not (known = in dbSNP, novel = not in dbSNP)
  • nEvalVariants = # variant sites found in the EvalRod (input) VCF
  • novelSites = # variant sites found in the EvalRod but not the CompRod (ie not in dbSNP)
  • CompOverlap = # variant sites found in both the EvalRod and the CompRod
  • compRate = % of variant sites from the EvalRod found in the CompRod (= CompOverlap/nEvalVariants). This is the important one, and it’s what people generally refer to as dbSNP concordance
  • concordantRate = % of overlapping variants with the same genotype
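The compRate arithmetic is just set overlap between the eval callset and dbSNP, expressed as a percentage. A sketch with invented variant positions:

```python
# dbSNP concordance as set arithmetic. Positions are invented.
eval_sites = {76771, 126529, 1127800, 1870700, 1999950}   # our callset
dbsnp_sites = {76771, 126529, 1127800, 1999950, 555555}   # known sites

comp_overlap = len(eval_sites & dbsnp_sites)   # sites also in dbSNP
novel_sites = len(eval_sites - dbsnp_sites)    # sites not in dbSNP
comp_rate = 100 * comp_overlap / len(eval_sites)
print(comp_overlap, novel_sites)  # 4 1
print(comp_rate)                  # 80.0 - ie 80% dbSNP concordance
```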

  1. Interpret the TiTv section of the evaluation report
  1. The second section of the report lists the Ti/Tv ratio for different groups of variants from the callsets
  2. It generally follows the format above, with the interesting column being ‘tiTvRatio’ for each of the callsets
  3. The thing to look for is a Ti/Tv close to the TiTvRatioStandard of 2.38. Generally, the higher the better: a high Ti/Tv ratio indicates that most variants are likely to reflect our expectations from what we know about human variation.

Note that the Ti/Tv for all callsets is around 2.25 for the dbSNP concordant variants, but slightly less for the novel variants (GATK: 1.93). This is as expected, as the dbSNP concordant sites are almost certainly true variant sites.

It is of course easy to call the highly confident variants, the trick is in calling ALL the variants, including ones on the margin of false positive/false negative.

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec4

Section 5: Annotation [20 min]

Detected variants can be annotated against all registered variants in a pregenerated variant annotation database such as Ensembl. The resulting annotation shows each detected variant’s genomic location and whether the variant falls in a gene, a transcript or another part of the genome. The Ensembl Variant Effect Predictor (VEP), hosted at the EBI, is used as the reference database in this tutorial.

  1. Annotate the detected variants against the Ensembl database.

            NGS: Annotation>Ensembl variant effect predictor

  1. Input file: NA12878.GATK.chr20_2mb.vcf
  2. Name of the Species being annotated: Human
  3. Execute

  1. Interpret the annotation output

           View the output of the annotation step

At this stage the most interesting columns are ‘Consequence’ and ‘Extra’. ‘Consequence’ is a simple prediction of the type of variant by position. In most cases the variants are in intergenic regions, and the effect is of course very hard to predict.

In the first few variants we can see a missense mutation though: chr20:76771

View this variant in IGV and confirm that it is in a gene and causes an amino acid change. Is it a heterozygous or homozygous variant?

The ‘Extra’ column, for potentially deleterious mutations, contains a prediction of how damaging that mutation would be to the protein product. The predictive algorithms are ‘SIFT’ and ‘PolyPhen’.

  1. How many deleterious mutations are there in our callset?

 Filter and Sort>Filter

  1. Filter criteria: c7=='missense_variant'
  2. How many missense variants are there? Is this a surprising number?
  3. Are there any nonsense mutations?
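The Filter tool applies a Python-style condition to each line of the file, with c1, c2, ... referring to columns, so c7=='missense_variant' keeps rows whose seventh column (VEP's Consequence) is exactly 'missense_variant'. The same logic in plain Python, over invented VEP-style rows:

```python
# Mimic Galaxy's Filter tool: keep rows where column 7 (1-based, so
# index 6) equals 'missense_variant'. Rows are invented for the sketch.
rows = [
    ["var1", "chr20:76771", "A", "G", "ENSG0", "ENST0", "missense_variant"],
    ["var2", "chr20:80000", "C", "T", "-", "-", "intergenic_variant"],
    ["var3", "chr20:90001", "G", "A", "ENSG1", "ENST1", "missense_variant"],
]

missense = [r for r in rows if r[6] == "missense_variant"]
print(len(missense))  # 2
```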

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec5

*Download the data directly to your computer using these URLs, then upload to Galaxy:

  3. Once the files are downloaded, in the Galaxy tools panel click on Get data>Upload File
  4. Click 'Choose File', locate the local copy of the fastq file and upload
  5. Make sure you select 'fastqsanger' under File Format or the alignment section won't work!
  6. Click Execute and wait for the files to upload to Galaxy. It can take a couple of minutes.
  7. Do the same for files that we’ll need for evaluation:

*Or, upload the data directly to Galaxy using the file URLs

  1. Follow the steps above, but instead of downloading the files locally and then uploading them, copy the URLs from the previous step into the URL/Text box and Galaxy will download and import them directly. Remember to select the ‘fastqsanger’ File Format for the sequence files and ‘vcf’ for the vcf file.
  2. When the files are finished uploading, rename them to ‘NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1’, ‘NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2’, ‘dbSNP135_excludingsitesafter129.chr20.vcf’ by clicking on the pencil icon to the top right of the file name in the right hand Galaxy panel (the history panel)