CollegeOfPathologists2015_Tutorial_Main

Variant Detection (Human)

Introductory genomics tutorial

Andrew Lonie

Contents

Section 1: Quality Control [10 mins]

Section 2: Alignment [20 mins]

Section 3. Calling Variants using the GATK Unified Genotyper [15 min]

Section 4. Evaluate known variations [15 min]

Section 5. Annotation [20 min]

Tutorial Overview

In this tutorial we cover the concepts of detecting small variants (SNVs and indels) in human genomic DNA, using a small set of sequencing reads from chromosome 20.

The fundamentals of short-read sequence analysis

Next-generation sequencing is a technology rather than an experimental approach; it generates data that can be used in a variety of ways. A common use of NGS is to characterise the sequence variation between samples. A typical experiment might involve:

Generate DNA sample (from genomic DNA, targeted DNA regions like exomes, mRNA)
Sequence on next-generation machine. Output is many millions of short (35-200bp) reads with quality information on each base.
Analyse quality metrics for read data and filter data if required
Align reads against a reference sequence. This is more complex than it sounds due to errors and variation between samples.
Interpret the aggregated evidence from the reads - identify differences between observations and the reference, including variations, expression levels. Normally involves statistics!
Biological validation: validate variations independently

What’s not covered

The tutorial is designed to introduce the tools, datatypes and workflow of variation detection. We filter the variations manually to understand what is actually happening in variant calling. In practice the datasets would be much larger and you would use more sophisticated tools to detect and call variants, based on statistical models of variant representation in a pool of reads.

Background [15 min]

Read a presentation on variant calling here

Where is the data in this tutorial from?

The data has been produced from human whole genomic DNA. Only reads that have mapped to a part of chromosome 20 have been used, to make the data suitable for an interactive tutorial. There are about one million 100bp reads in the dataset, produced on an Illumina HiSeq2000. This data was generated as part of the 1000 Genomes project: http://www.1000genomes.org/

Preparation [15 min]

Open a browser and go to the Galaxy server designated by your tutor, or use the public Galaxy tutorial server: http://galaxy-tut.genome.edu.au

NOTE: Firefox/Safari/Chrome all work well, Internet Explorer not so well

Import the data for the workshop

You can do this in a few ways*, of which by far the easiest is:

Go to Shared Data -> Published Histories and click on CoPJun2015_Prep

Click 'Import History' at top right, wait for the history to be imported to your account, and then ‘start using this history’.
This will create a new Galaxy history in your account with all of the required data files
Proceed to Section 1

*there are other ways to get data into Galaxy. See alternative methods at the end of this tutorial.

You should now have these files in your history:

NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
dbSNP135_excludingsitesafter129.chr20.vcf

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Prep

Section 1: Quality Control [10 mins]

The aim here is to evaluate the quality of the short-read data. If the quality is poor, then adjustments can be made - e.g. trimming the short reads, or adjusting your expectations of the final outcome!

Analyse the quality of the reads in the FASTQ file

From the left hand tool panel in Galaxy, select NGS: QC and manipulation>FASTQC: Read QC

Click on the eye icon to view the various quality metrics
Look at the generated FastQC metrics. This data looks pretty good - high per-base quality scores (most above 30)
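
To make the 'per-base quality score' idea concrete, here is a minimal Python sketch of how a FASTQ quality string maps to Phred scores. It assumes the 'fastqsanger' encoding (ASCII offset 33) used for the tutorial files; the quality string below is made up for illustration and is not taken from the tutorial data.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a 10-base read (illustration only)
example_quality = "IIIIHHGF?5"

scores = phred_scores(example_quality)
mean_quality = sum(scores) / len(scores)
fraction_q30 = sum(1 for q in scores if q >= 30) / len(scores)

print("Phred scores:", scores)
print("Mean quality:", mean_quality)
print("Fraction of bases at Q30 or above:", fraction_q30)
print("Error probability at Q30:", error_probability(30))
```

A score of 30 means roughly a 1 in 1000 chance that the base call is wrong, which is why 'most above 30' counts as good data.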

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec1

Section 2: Alignment [20 mins]

The basic process here is to map each of the individual reads in the sample FASTQ readsets to a reference genome, so that we can then identify the sequence changes with respect to the reference genome.

Some of the variant callers need extra information about the source of the reads in order to choose the correct error profiles for their statistical variant detection model. We therefore add read group information in the alignment step, so that the generated BAM file contains the metadata the variant caller expects.

Map/align the reads with BWA to Human reference genome 19 (hg19)

NGS: Mapping>Map with BWA for Illumina [2-3mins]

Reference genome: Human (hg19)
Is this library mate-paired? Paired-end
Forward FASTQ File: NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
Reverse FASTQ File: NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
BWA settings to use: Full Parameter List
Specify the read group for this file? (samse/sampe -r): Yes
Read Group Identifier (ID): Tutorial_readgroup
Library name (LB): Tutorial_library
Platform/technology used to produce the reads (PL): ILLUMINA
Sample (SM): NA12878
Use defaults for other fields
Execute
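
The read group fields entered above end up as a single @RG record in the header of the alignment file. The Python sketch below shows roughly what that record looks like, based on the SAM format's tab-separated TAG:VALUE layout. The function name is hypothetical, and the exact header written by the Galaxy BWA tool may contain additional fields.

```python
def read_group_header(rg_id, library, platform, sample):
    """Build a SAM @RG header record from the read group fields.

    Illustrative helper only; Galaxy and BWA construct this line themselves.
    """
    fields = [
        "@RG",
        f"ID:{rg_id}",     # Read Group Identifier
        f"LB:{library}",   # Library name
        f"PL:{platform}",  # Platform/technology
        f"SM:{sample}",    # Sample
    ]
    return "\t".join(fields)


header = read_group_header(
    "Tutorial_readgroup", "Tutorial_library", "ILLUMINA", "NA12878"
)
print(header)
```

Each aligned read then carries an RG tag pointing back to this identifier, which is how a downstream variant caller can tell which sample and platform a read came from.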

Compress the SAM file to BAM format

NGS: Sam Tools>SAM-to-BAM [1 min]

SAM File to Convert: The SAM file you just generated with BWA
Execute

Open the aligned BAM file in the Integrative Genomics Viewer (IGV)

First, give your BAM dataset a more meaningful name - eg 'NA12878.chr20_2mb.30xPE.BWA_mapped'
In the history panel for the newly generated BAM file, click on ‘display with IGV web current’

If you have IGV already running, click ‘display with IGV local’ instead

Select chr20 in the IGV chromosomal region drop down box (top of IGV, on the left next to the organism drop down box).

Zoom in to the left hand end of chromosome 20 to see the read alignments - remember our reads only cover the first 2mb of the chromosome.

Scroll around and zoom in and out in the IGV genome viewer to get a feel for genomic data. Note that coverage is variable, with some regions getting almost no coverage (e.g. try chr20:1,870,686-1,880,895 - if you zoom right in to base resolution you’ll see that this region is very GC rich, meaning it’s hard to sequence. Unfortunately it also contains the first few exons of a gene...)

Generate a genomic interval (BED) file that we will use to restrict further analyses to the first 2mb of chromosome 20

The next steps can be computationally intensive if performed on the entire genome. We can limit the analysis to the region that we know our data comes from, the first part of chromosome 20

Text manipulation>Create single interval

Chromosome: chr20
Start position: 0
End position: 2000000
Name: chr20_2mb
Strand: plus
Execute
When the file is created, rename it to: ‘chr20_2mb.bed’

This will generate an interval file specifying the first 2mb of chromosome 20, which we’ll use in the following steps
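
For reference, the interval file is a one-line tab-separated BED record. The Python sketch below builds an equivalent line and shows why a start of 0 and an end of 2000000 cover exactly 2,000,000 bases: BED coordinates are 0-based and the end position is exclusive. The six-column layout and the score value of 0 are assumptions; the file Galaxy writes may differ in detail.

```python
def bed_line(chrom, start, end, name, strand="+"):
    """Build a single BED record (0-based start, exclusive end)."""
    if not 0 <= start < end:
        raise ValueError("BED intervals need 0 <= start < end")
    # Assumed columns: chrom, start, end, name, score, strand
    return "\t".join([chrom, str(start), str(end), name, "0", strand])


start, end = 0, 2000000
line = bed_line("chr20", start, end, "chr20_2mb")

# Because the end is exclusive, the length is simply end - start
interval_length = end - start

print(line)
print("Interval length (bp):", interval_length)
```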

Evaluate the depth of coverage of the aligned region

NGS: GATK2 Tools>Depth of Coverage [2-5 min]

BAM files: Select the BAM file you just generated
Using reference genome: Human (hg19)
Output format: table
Basic or Advanced GATK options: Advanced
Add new Operate on Genomic intervals

Select the chr20_2mb.bed file from your history

Execute
When completed, view ‘Depth of Coverage on data.... (output summary sample)’

First, you will need to tell Galaxy that this is a text file. Click on the pencil icon, and change the Datatype of any ‘Depth of Coverage’ output you want to see to ‘txt’

Depth of Coverage on data.... (output summary sample) will tell you the total depth of coverage across your sample (for the first 2mb, as specified in the ‘Add new Operate on Genomic intervals’ parameter)

The columns in this report are: sample, total bases, mean depth
The mean depth should be ~24x. Note that ~89% of reference bases are covered at 15x or more, which is an informally agreed minimum for reasonable variant calling
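
The summary figures are simple to derive from per-base depths. The Python sketch below computes total bases, mean depth and the percentage of positions at or above 15x for a toy list of depths. The numbers are invented for illustration; the real values come from the GATK report.

```python
def coverage_summary(depths, min_depth=15):
    """Return (total bases, mean depth, % of positions at or above min_depth)."""
    total = sum(depths)
    mean = total / len(depths)
    covered = sum(1 for d in depths if d >= min_depth)
    percent_covered = 100 * covered / len(depths)
    return total, mean, percent_covered


# Toy per-base depths for a 10 bp region (not real data)
toy_depths = [0, 10, 20, 30, 40, 25, 15, 5, 35, 20]

total, mean, percent_covered = coverage_summary(toy_depths)
print("Total bases:", total)
print("Mean depth:", mean)
print("Percent of positions at >= 15x:", percent_covered)
```

Note that a reasonable mean depth can hide positions with little or no coverage, which is why the percentage above a threshold is reported alongside the mean.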

The other tables give you more detailed statistics on the level of coverage, broken down by regions etc. We don’t really need them so to keep our Galaxy history clean we will delete all the outputs of this step except for the ‘Depth of Coverage on data.... (output summary sample)’ file.

Use the ‘X’ next to a history file to delete it.

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec2

Section 3. Calling Variants using the GATK Unified Genotyper [15 min]

The GATK Unified Genotyper is a Bayesian variant caller and genotyper from the Broad Institute. Many users consider the GATK to be best practice in human variant calling.

Section 3a: Try a different variant caller and compare

Try the alternative caller Mpileup

NGS: SAM Tools > Mpileup
Using reference genome: Human (hg19)
Genotype Likelihood Computation: Perform genotype likelihood computation
Set Advanced Options: Advanced
List of regions or sites on which to operate: chr20_2mb.bed
Execute
When finished, change datatype to VCF
There are ~3500 variants in this list, compared with the ~3700 called by the GATK Unified Genotyper

Browse around in IGV looking at places where variant calling is different between the callers: eg look at:

chr20:1,127,767-1,127,906

chr20:1,113,178-1,114,652

chr20:594,143-594,879
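
One way to find such differences systematically is to compare the two callsets by site. The Python sketch below keys each VCF record on chromosome, position, reference allele and alternate allele, then uses set operations to find shared and caller-specific variants. The records are invented for illustration, and the helper names are hypothetical.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for every data line of a VCF."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def toy_record(pos, ref, alt):
    """Build a minimal tab-separated VCF data line on chr20 (made-up data)."""
    return "\t".join(["chr20", str(pos), ".", ref, alt, "50", ".", "."])


header = "##fileformat=VCFv4.1"
gatk_calls = variant_keys(
    [header, toy_record(1000, "A", "G"), toy_record(2000, "C", "T"),
     toy_record(3000, "G", "A")]
)
mpileup_calls = variant_keys(
    [header, toy_record(1000, "A", "G"), toy_record(3000, "G", "A"),
     toy_record(4000, "T", "C")]
)

shared = gatk_calls & mpileup_calls
gatk_only = gatk_calls - mpileup_calls
mpileup_only = mpileup_calls - gatk_calls

print("Called by both:", len(shared))
print("GATK only:", sorted(gatk_only))
print("Mpileup only:", sorted(mpileup_only))
```

The caller-specific sites are the ones worth inspecting in IGV. This simple comparison ignores differences in how callers represent indels, so it is only a rough guide.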

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec3

Section 4. Evaluate known variations [15 min]

How can we evaluate our variants?

We know a lot about variation in humans from many empirical studies, including the 1000Genomes project, so we have some expectations on what we should see when we detect variants in a new sample:

Expect to see true variations at the rate of about 1 per 1000bp against the reference genome
Expect about 85% of variations to be ‘rediscovered’ - that is, already known and recorded in dbSNP (the exact percentage depends on the version of dbSNP)
A transition/transversion (Ti/Tv) rate of >2 if the variants are high quality, even higher if the variants are in coding regions.

We can immediately see that each of the variant callers has more variants than we would have expected - we would have expected around 2000 in our 2 megabase region but we see between 3000 and 5000. This is normal for variant calling, where most callers err on the side of sensitivity to reduce false negatives (missed SNVs), expecting the user to do further filtering to remove false positives (false SNVs).
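
The arithmetic behind that expectation is short enough to write out. The Python sketch below derives the expected count from the region length and the rate of about 1 variant per 1000bp, then estimates what fraction of a callset is surplus to that expectation. It treats the expected count as exact, which it is not, so read the fractions as rough indications only.

```python
region_length_bp = 2_000_000
bp_per_expected_variant = 1000  # about 1 true variant per 1000bp

expected_variants = region_length_bp // bp_per_expected_variant


def surplus_fraction(called, expected):
    """Fraction of called variants beyond the expected number."""
    return max(called - expected, 0) / called


print("Expected variants:", expected_variants)
for called in (3000, 5000):
    fraction = surplus_fraction(called, expected_variants)
    print(called, "calls -> surplus fraction", round(fraction, 2))
```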

Evaluate dbSNP concordance and Ti/Tv ratio using the GATK VariantEval tool

NGS: GATK2 Tools -> Eval Variants

Variant 1: NA12878.UnifiedGeno.chr20_2mb.vcf
Using reference genome: Human (hg19)
Provide a dbsnp reference-ordered data file: set dbSNP

dbSNP ROD file: dbSNP135_excludingsitesafter129.chr20.vcf

Basic or Advanced Analysis options: Advanced

Eval modules to apply on the eval track(s):

CompOverlap
TiTvVariantEvaluator

Do not use the standard eval modules by default: check

Execute
When finished, change the Datatype of the generated Eval Variant (report) to ‘txt’

Interpret the dbSNP concordance section of the evaluation report

View the ‘Eval Variants on data... (report)’
The first section of the report lists the overlap in variants between the generated VCF files and known human variation sites from dbSNP
The report needs a bit of explaining:

The EvalRod column specifies which of the input VCFs is analysed (input_0 = first VCF, input_1 = second etc). If you followed the steps above exactly, then input_0 = GATK
The CompRod column specifies the set of variants against which the input VCFs are being compared
Novelty = whether the variants have been found in the supplied dbSNP file or not (known = in dbSNP, novel = not in dbSNP)
nEvalVariants = # variant sites found in the EvalRod (input) VCF
novelSites = # variant sites found in EvalRod but not in CompRod (ie dbSNP)
CompOverlap = # variant sites found in both EvalRod and CompRod
compRate = % of variant sites from EvalRod found in CompRod (=CompOverlap/nEvalVariants).

This is the important one, and it’s what people generally refer to as dbSNP concordance

concordantRate = % overlapping variants with same genotype
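
The relationship between these columns can be sketched in a few lines of Python. The example below represents each callset as a set of positions and computes the overlap, the novel sites and compRate using the formula given above (CompOverlap/nEvalVariants). The positions are invented, and this is only an illustration of the arithmetic, not a reimplementation of the GATK tool.

```python
def comp_overlap(eval_sites, comp_sites):
    """Summarise how many evaluated sites are also present in the comparison set."""
    n_eval = len(eval_sites)
    overlap = len(eval_sites & comp_sites)
    novel = n_eval - overlap
    comp_rate = 100 * overlap / n_eval if n_eval else 0.0
    return {
        "nEvalVariants": n_eval,
        "novelSites": novel,
        "CompOverlap": overlap,
        "compRate": comp_rate,
    }


# Toy positions (not real data): called variants and known dbSNP sites
called_positions = {100, 200, 300, 400}
dbsnp_positions = {100, 200, 300, 999}

report = comp_overlap(called_positions, dbsnp_positions)
print(report)
```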

Interpret the TiTv section of the evaluation report

The second section of the report lists the TiTv ratio for different groups of variants from the callsets
It generally follows the format above; the interesting column is ‘tiTvRatio’ for each of the callsets

The thing to look for is a Ti/Tv close to the TiTvRatioStandard of 2.38. Generally, the higher the better, as a high Ti/Tv ratio indicates that most variants are likely to reflect our expectations from what we know about human variation.

Note that the Ti/Tv for all callsets is around 2.25 for the dbSNP concordant variants, but slightly less for the novel variants (GATK: 1.93). This is as expected, as the dbSNP concordant sites are almost certainly true variant sites.
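
To see where the ratio comes from: a transition swaps a purine for a purine (A<->G) or a pyrimidine for a pyrimidine (C<->T), and every other single-base change is a transversion. The Python sketch below classifies a toy list of SNVs and computes Ti/Tv. The SNVs are invented for illustration.

```python
def is_transition(ref, alt):
    """True for A<->G or C<->T changes; all other SNVs are transversions."""
    pair = {ref, alt}
    return pair == {"A", "G"} or pair == {"C", "T"}


def titv_ratio(snvs):
    """Compute transitions / transversions for a list of (ref, alt) SNVs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")


# Toy SNVs (not real data): four transitions and two transversions
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]

ratio = titv_ratio(toy_snvs)
print("Ti/Tv:", ratio)
```

If base changes were random, the ratio would be about 0.5, because there are twice as many possible transversions as transitions. A callset contaminated with random errors is therefore pulled down from the expected value, which is why a low Ti/Tv in the novel variants hints at false positives.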

It is of course easy to call the highly confident variants; the trick is in calling ALL the variants, including ones on the margin of false positive/false negative.

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec4

Section 5. Annotation [20 min]

Detected variants can be annotated against all registered variants in a pregenerated variant annotation database such as ENSEMBL. The resulting annotation shows each detected variant’s genomic location and whether it falls within a gene, a transcript or another part of the genome. This tutorial uses the ENSEMBL Variant Effect Predictor, hosted at EBI, as the reference database.

Annotate the detected variants against the Ensembl database.

NGS: Annotation>Ensembl variant effect predictor

Input file: NA12878.GATK.chr20_2mb.vcf
Name of the Species being annotated: Human
Execute

Interpret the annotation output

View the output of the annotation step

At this stage the most interesting columns are ‘Consequence’ and ‘Extra’. ‘Consequence’ is a simple prediction of the type of variant by position. In most cases the variants are in intergenic regions, and the effect is of course very hard to predict.

In the first few variants we can see a missense mutation though: chr20:76771

View this variant in IGV and confirm that it is in a gene, and causes an amino acid change. Is it a heterozygous or homozygous variant?

The ‘Extra’ column, for potentially deleterious mutations, contains a prediction of how damaging that mutation would be to the protein product. The predictive algorithms are ‘SIFT’ and ‘PolyPhen’.

How many deleterious mutations are there in our callset?

Filter and Sort>Filter

Filter criteria: c7=='missense_variant'
How many missense variants are there? Is this a surprising number?
Are there any nonsense mutations?
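
The Galaxy filter above keeps rows whose seventh column equals 'missense_variant' (Galaxy counts columns from 1, so c7 is the seventh). The same logic can be sketched in Python as below. The rows are invented, with hypothetical gene and feature identifiers, and only the first seven columns of the annotation output are shown.

```python
def filter_by_consequence(lines, consequence="missense_variant", column=7):
    """Keep tab-separated rows whose given column (counted from 1) matches exactly."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and fields[column - 1] == consequence:
            kept.append(line)
    return kept


def toy_row(location, consequence):
    """Build a made-up annotation row: the Consequence field is column 7."""
    return "\t".join(
        ["var_" + location, location, "A", "GENE_X", "TRANSCRIPT_X",
         "Transcript", consequence]
    )


toy_rows = [
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\tFeature_type\tConsequence",
    toy_row("20:1000", "intergenic_variant"),
    toy_row("20:2000", "missense_variant"),
    toy_row("20:3000", "synonymous_variant"),
    toy_row("20:4000", "missense_variant"),
]

missense_rows = filter_by_consequence(toy_rows)
print("Missense variants:", len(missense_rows))
```

An exact match will miss rows where several consequences are listed together in one field, so the count may be slightly low. Nonsense mutations are reported by Ensembl under the term 'stop_gained', so the same filter with that value is one way to look for them.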

Completed Galaxy history for this section (in SharedData>Published Histories): CoPJun2015_Sec5

*Download the data directly to your computer using these URLs, then upload to Galaxy:

https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/variantCalling_ADVNCD/NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1
https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/variantCalling_ADVNCD/NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2
Once each file is downloaded, in the Galaxy tools panel click on Get data>Upload File
Click 'Choose File', locate the local copy of the fastq file and upload
Make sure you select 'fastqsanger' under File Format or the alignment section won't work!
Click Execute and wait for the files to upload to Galaxy. It can take a couple of minutes.
Do the same for files that we’ll need for evaluation:

https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/variantCalling_ADVNCD/dbSNP135_excludingsitesafter129_chr20.vcf

*Or, upload the data directly to Galaxy using the file URLs

Follow the steps above but instead of downloading the files locally and then uploading, copy the URLs from the previous step into the URL/Text box and Galaxy will download and import them directly. Remember to select the ‘fastqsanger’ File Format for the sequence files (but not the annotated gene list file) and ‘vcf’ for the vcf file.
When the files are finished uploading, rename them to ‘NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1’, ‘NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2’, ‘dbSNP135_excludingsitesafter129.chr20.vcf’ by clicking on the pencil icon to the top right of the file name in the right hand Galaxy panel (the history panel)