Variant Detection

Introductory tutorial

Andrew Lonie

http://genome.edu.au

http://vlsci.org.au

Contents

Tutorial Overview

Background [15 min]

Preparation [10 min]

Section 1: Quality Control [10 mins]

Section 2: Alignment [20 mins]

Section 3. Calling single nucleotide variations (SNVs) [20 min]

Section 4. Calling small insertions and deletions (indels) [20 min]

Section 5. Evaluate known variations [15 min]

References

Tutorial Overview

In this tutorial we cover the concepts of detecting small variants (SNVs and indels) in human genomic DNA, using a small set of reads from chromosome 22.

What’s not covered

The tutorial is designed to introduce the tools, datatypes and workflow of variation detection. We filter the variations manually to understand what is actually happening in variant calling. In practice the datasets would be much larger and you would use more sophisticated tools to detect and call variants, based on statistical models of variant representation in a pool of reads.

Background [15 min]

Read the background to the workshop here

Where is the data in this tutorial from?

The workshop is based on analysis of short read data from the exome of chromosome 22 of a single human individual. There are one million 76bp reads in the dataset, produced on an Illumina GAIIx from exome-enriched DNA. This data was generated as part of the 1000 Genomes project: http://www.1000genomes.org/

Preparation [10 min]

  1. Register as a new user in Galaxy if you don’t already have an account (what is Galaxy?)
    1. Open a browser and go to the Galaxy server: http://galaxy-tut.genome.edu.au
    2. NOTE: Firefox/Safari/Chrome all work well, Internet Explorer not so well
    3. Register as a new user (User>Register), or log in if you already have an account

  1. Import the data for the workshop 

You can do this in a few ways, of which by far the easiest is (assuming you’re on galaxy-tut.genome.edu.au):

  1. Go to Shared Data -> Published Histories and click on VariantDetection_BASIC_Prep
  2. Click 'Import History' at top right, wait for the history to be imported to your account, and then ‘start using this history’
  3. This will create a new Galaxy history in your account with all of the required data files
  4. Proceed to Section 1

Alternatively:

  1. Download the data directly to your computer using this URL, then upload it to Galaxy:
     https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq
  2. Once the file is downloaded, in the Galaxy tools panel click on Get Data>Upload File
  3. Click 'Choose File', locate the local copy of the fastq file and upload it
  4. Make sure you select 'fastqsanger' under File Format or the alignment section won't work!
  5. Click Execute and wait for the file to upload to Galaxy. This can take a couple of minutes.

Or:

  1. Upload the data directly to Galaxy using the file URL
    1. Follow the steps above but, instead of downloading the file locally and then uploading it, copy the URL from the previous step into the URL/Text box and Galaxy will download and import it directly. Remember to select the ‘fastqsanger’ File Format for the sequence file
    2. When the file has finished uploading, rename it to ‘NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq’ by clicking on the pencil icon at the top right of the file name in the right hand Galaxy panel (the history panel)

NOTE: If you log out of Galaxy and log back in at a later time, your data and results from previous experiments will still be available in the right panel of your screen, called the ‘History’

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_BASIC_Prep

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/Galaxy-History-VariantDet_BASIC_Prep.tar.gz

Section 1: Quality Control [10 mins]

The aim here is to evaluate the quality of the short read data. If the quality is poor, then adjustments can be made - e.g. trimming the short reads, or adjusting your expectations of the final outcome!

  1. View the short read FASTQ file
    1. Click on the eye icon at the top right of the fastq file to view the first part of the file
    2. Note that each read is represented by 4 lines:
      1. read identifier
      2. short read sequence
      3. separator
      4. short read sequence quality scores
    3. If you’re not familiar with the FASTQ format, click here for an overview (a minimal parsing sketch is shown after this list)
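
To see the 4-line record structure programmatically, here is a minimal Python sketch that reads a FASTQ file record by record and prints the first few entries. The filename matches the dataset imported in the Preparation section, but any FASTQ file will do.

```python
# A minimal sketch: read a FASTQ file four lines at a time and print the
# first few records to show the structure described above.
from itertools import islice

def fastq_records(path):
    """Yield (identifier, sequence, separator, quality) tuples from a FASTQ file."""
    with open(path) as handle:
        while True:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:   # empty identifier line means end of file
                return
            yield tuple(record)

fastq = "NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq"
for identifier, sequence, separator, quality in islice(fastq_records(fastq), 3):
    print(identifier)   # read identifier, starts with '@'
    print(sequence)     # 76 bp read sequence
    print(separator)    # separator line, starts with '+'
    print(quality)      # per-base quality string, same length as the sequence
```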

  1. Analyse the quality of the reads in the FASTQ file

From the left hand tool panel in Galaxy, select NGS: QC and manipulation>FASTQC: Read QC

  1. The fastq file will be selected by default. Keep the other defaults and click 'Execute'
  2. Note the batch processing interface of Galaxy:
    1. grey = waiting in queue
    2. yellow = running
    3. green = finished
    4. red = tried to run and failed
  3. Wait for the job to run
  4. Click on the eye icon to view the newly generated data (in this case a set of quality metric graphs for the fastq data)
  5. Look at the various quality scores. The data looks pretty good - high per-base quality scores (most above 30); a sketch of the per-base quality calculation is shown after this list
  6. Note that the 'Sequence Duplication Levels' are marked as high. Normally we would run another tool to remove duplicates (technical PCR artifacts), but for the sake of brevity we will omit this step
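
To make the 'per base sequence quality' plot less of a black box, here is a rough Python sketch of the underlying idea: average the Phred-scaled quality at each read position across all reads. It assumes fastqsanger (Phred+33) encoding and the 76 bp reads used in this tutorial; it is an illustration of the calculation, not a replacement for FastQC.

```python
# Rough sketch of FastQC's 'per base sequence quality' idea: average the
# Phred-scaled quality at each read position across all reads.
# Assumes fastqsanger (Phred+33) encoding, 76 bp reads and the tutorial filename.
def mean_quality_per_position(path, read_length=76):
    totals = [0] * read_length
    counts = [0] * read_length
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 3:                     # 4th line of each record = qualities
                for pos, char in enumerate(line.rstrip("\n")[:read_length]):
                    totals[pos] += ord(char) - 33        # Phred+33 decoding
                    counts[pos] += 1
    return [t / c for t, c in zip(totals, counts) if c]

means = mean_quality_per_position("NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq")
print("positions with mean quality >= 30:", sum(1 for q in means if q >= 30), "of", len(means))
```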

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_BASIC_Sec1

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/Galaxy-History-VariantDet_BASIC_Sec1.tar.gz

Section 2: Alignment [20 mins]

The basic process here is to map each of the individual reads in the sample FASTQ readset to a reference genome, so that we can then identify the sequence changes with respect to the reference.

  1. Map/align the reads with BWA to the human reference genome (hg19)

NGS: Mapping>Map with BWA for Illumina [3-5mins]

  1. Use the default FASTQ file
  2. Select the 'hg19' reference genome
  3. Keep all other defaults
  4. Execute
  5. NOTE: This is the longest step in the workshop and will take a few minutes, possibly more depending on how many people are also scheduling mappings

  1. Examine the generated Sequence Alignment Map (SAM) file
    1. Click the eye icon next to the newly generated file
    2. Familiarise yourself with the SAM format
    3. Note that some reads have mapped to non-chr22 chromosomes (see column 3)

This is the essence of alignment - the aligner does the best it can, but because of compromises between accuracy and performance, and because of repetitive sequences in the genome, not all reads will necessarily align to the ‘correct’ location.

  1. Filter non-chromosome 22 alignments out of the SAM file
    1. Column 3 records the chromosome each read mapped to, and as we're only interested in chr22, we'll remove the reads mapped to other chromosomes (a sketch of this filter is shown after this list)
    2. Filter and Sort>Filter
    3. Use 'c3=="chr22"' [this will generate a new file containing all lines where column 3 = chr22]
    4. Note that 93.4% of mapped reads are kept, so ~7% of reads weren't from chr22 or mapped incorrectly
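
For reference, the Galaxy Filter tool above is doing something equivalent to this Python sketch, which keeps only alignments whose third column (the reference name) is chr22; the input and output filenames are placeholders.

```python
# Sketch of the chr22 filter: keep SAM header lines plus alignments whose third
# column (RNAME) is 'chr22'. Input and output filenames are placeholders.
kept = total = 0
with open("NA12878_bwa_mapped.sam") as sam_in, \
     open("NA12878_chr22_only.sam", "w") as sam_out:
    for line in sam_in:
        if line.startswith("@"):              # keep header lines untouched
            sam_out.write(line)
            continue
        total += 1
        if line.split("\t")[2] == "chr22":    # column 3 = reference the read mapped to
            sam_out.write(line)
            kept += 1
print(f"kept {kept} of {total} alignments ({100 * kept / total:.1f}%)")
```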

  1. Rename your datasets to something more meaningful than 'Filter on data 3'
    1. Click the pencil icon at the top right of a dataset entry
    2. Note that Galaxy keeps track of exactly what happened in this step, so all steps are repeatable
    3. Change 'Filter on data 3' to something meaningful like 'NA12878.chr22_exome.BWA_mapped.chr22_filtered'

  1. Convert the SAM file to a BAM file

NGS: Sam Tools>SAM-to-BAM [1 min]

  1. Note that you have 2 SAM files in your history: the original SAM file you generated with the mapping step, and the filtered SAM file. We want the second one, which should be selected by default (a sketch of the same conversion outside Galaxy is shown after this list)
  2. Click 'Execute'
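
Outside Galaxy, the same conversion is commonly done with samtools or the pysam library. The sketch below uses pysam (an assumption; the Galaxy tool also coordinate-sorts the output) and placeholder filenames for the filtered SAM from the previous step.

```python
# A minimal sketch of SAM-to-BAM conversion using pysam, plus sorting and
# indexing so genome browsers such as IGV can load the result.
import pysam

with pysam.AlignmentFile("NA12878_chr22_only.sam", "r") as sam_in, \
     pysam.AlignmentFile("NA12878_chr22_only.unsorted.bam", "wb", template=sam_in) as bam_out:
    for read in sam_in:
        bam_out.write(read)

# Coordinate-sort and index the BAM
pysam.sort("-o", "NA12878_chr22_only.bam", "NA12878_chr22_only.unsorted.bam")
pysam.index("NA12878_chr22_only.bam")
```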

  1. Check the number of aligned and unaligned reads + other quality metrics

NGS: Sam Tools>Flagstat

  1. Note that in this case the statistics are not very informative. This is because the dataset was generated for this workshop and much of the noise has been removed (in fact we removed a lot more noise in the previous filtering step), and because we are using single-end rather than paired-end reads, so some of the metrics are not relevant. A sketch of what Flagstat counts is shown below
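
Flagstat itself is largely a summary of the SAM FLAG field (column 2), which is a bitmask; for example, bit 0x4 means the read is unmapped. A minimal Python sketch of that idea, using a placeholder filename:

```python
# Sketch of what Flagstat summarises: count mapped vs unmapped reads using the
# SAM FLAG bitmask. Filename is a placeholder.
mapped = unmapped = 0
with open("NA12878_chr22_only.sam") as sam_in:
    for line in sam_in:
        if line.startswith("@"):            # skip header lines
            continue
        flag = int(line.split("\t")[1])     # column 2 = FLAG
        if flag & 0x4:                      # 0x4 = read unmapped
            unmapped += 1
        else:
            mapped += 1
print(f"{mapped} mapped, {unmapped} unmapped")
```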

  1. Open the aligned BAM file in the Integrative Genomics Viewer (IGV)
    1. First, give your BAM dataset a more meaningful name, e.g. 'NA12878.chr22_exome.BWA_mapped.bam'
    2. In the history panel, click on the ‘display with IGV web current’ link for the newly generated BAM file
      1. If you already have IGV running on your local computer you can click ‘display with IGV local’ instead

  1. Visualise the aligned reads in IGV
    1. Select chr22 in the IGV chromosomal region drop down box (top of IGV, on the left next to the organism drop down box)
      1. You won't see any reads yet because we are looking at the entire chromosome, at far too broad a scale to see 76bp reads; but you should see all of the RefSeq Genes in chr22 in the lower panel of IGV
    2. Zoom in using the + button until you see the mapped reads appear. As these reads are from the exome, you would expect to see lots of reads over gene exons and very few between exons
      1. If you can't find a region with reads, copy and paste these coordinates into IGV: chr22:35,783,964-35,821,038
      2. Observe the coverage over exons for chr22
    3. Zoom in further and look for variations between reads - these will appear in colour when you get to the level of individual reads
    4. Zoom in further and look at a region with SNPs in the reads
      1. Try chr22:35,947,405-35,947,867

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_BASIC_Sec2

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/Galaxy-History-VariantDet_BASIC_Sec2.tar.gz

Section 3. Calling single nucleotide variations (SNVs) [20 min]

  1. Generate a pileup file from the aligned reads

NGS: SAMtools>Generate Pileup [2-3mins]

  1. Keep all defaults except: Call consensus according to MAQ model = Yes
    1. This generates a called 'consensus base' for each chromosomal position
    2. Selecting this will make other options available. Keep all the defaults and Execute
  2. Tell Galaxy that you have just generated a pileup file by explicitly declaring the generated file to be ‘pileup’ format:
    1. Click the pencil icon, select the ‘Datatype’ link and select ‘pileup’ from the ‘New Type:’ drop-down box, then ‘Save’

  1. Examine the pileup file and become familiar with the format
    1. The pileup file summarises all data from the reads at each genomic position that is covered by at least one read (a parsing sketch is shown below)
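
As an illustration, the sketch below reads the 10-column consensus pileup in Python and prints positions where the called consensus base differs from the reference. The column order shown follows the classic samtools/MAQ consensus pileup and, like the filename, is an assumption to check against your own file.

```python
# Sketch of reading the 10-column consensus pileup and printing candidate
# variant positions (consensus base differs from the reference base).
columns = ["chrom", "pos", "ref_base", "consensus_base", "consensus_qual",
           "snp_qual", "max_map_qual", "coverage", "read_bases", "base_quals"]

with open("NA12878_chr22.pileup") as pileup:
    for line in pileup:
        fields = dict(zip(columns, line.rstrip("\n").split("\t")))
        if fields["consensus_base"] != fields["ref_base"]:   # candidate variant position
            print(fields["chrom"], fields["pos"], fields["ref_base"], ">",
                  fields["consensus_base"], "coverage:", fields["coverage"])
```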

  1. Do some filtering on the pileup to get a SNP list that we have confidence in
    1. First, filter the pileup file you just generated by removing poor quality base calls, removing consensus base calls that have fewer than 10 reads supporting that position, and removing all bases that are the same as the reference sequence (i.e. no evidence of them being potential SNPs)

NGS: SAM Tools>Filter Pileup [3-5mins]

  1. Select 'Pileup with ten columns (with consensus)'
  2. Do not report positions with coverage lower than = 10
  3. Convert coordinates to intervals = Yes

We still end up with ~16,000 SNPs, which is far too many. Our filtering criteria are not stringent enough.

  • How many SNPs would you expect? Generally you can expect about 1 SNP per 1,000 bases, and the exome of chromosome 22 is about 600-700 kb (~2% of the ~33,500 kb length of chr22), so probably ~600-700 SNPs
  1. Second, filter the dataset you just generated by SNP quality

Filter and Sort>Filter

  1. Filter on SNP quality (column 7) by entering 'c7>50'. This is a score generated from a combination of inputs including the number of reads, base qualities, and others (a sketch of this filter is shown after this list)
  2. Note that the number of SNPs has gone down by ~95%, to 891, which is not too far off expectations
  3. If you examine the new list, you'll see that each SNP looks very convincing. A pileup file is not the best way to look at variation evidence though; a genome browser is far better
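
The quality filter above amounts to the following Python sketch, applied to the interval-converted pileup. The filenames, and the assumption that SNP quality sits in column 7 (1-based), should be checked against your own dataset; the comment also repeats the back-of-envelope estimate of how many SNPs to expect.

```python
# Sketch of the SNP quality filter on the interval-converted pileup.
#
# Back-of-envelope expectation: ~1 SNP per 1,000 bases over a ~600-700 kb
# chr22 exome gives roughly 600-700 SNPs, so ~16,000 raw calls is far too many.
kept = 0
with open("NA12878_filtered_pileup.interval") as rows_in, \
     open("NA12878_high_confidence_SNPs.interval", "w") as rows_out:
    for line in rows_in:
        fields = line.rstrip("\n").split("\t")
        if float(fields[6]) > 50:        # column 7 (1-based) = SNP quality
            rows_out.write(line)
            kept += 1
print(f"{kept} candidate SNPs passed the quality filter")
```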

  1. Convert the pileup file to BED format and view it in IGV
    1. First, give your SNP dataset a more meaningful name, e.g. 'NA12878_high_confidence_SNPs'
    2. Select the pencil icon, and under ‘Datatype’, select 'bed'
      1. A BED file is only slightly different to an interval file in that it has more restrictions, but the format is very similar - the first 3 columns specify the genomic location, and the columns after that contain information about the locus/variation itself (see the sketch after this list)
    3. Download the BED file by clicking on the disk icon and saving it to local disk
    4. Open the file in IGV (File>Load From File, then select the file you just downloaded)
    5. Zoom out to get a view of all the SNPs identified in the sample
      • Try chr22:32,154,733-39,745,059
    6. Use IGV to browse through the SNPs and look at the reads associated with a particular SNP
      • Try chr22:36,121,873-36,124,616
    7. Find a homozygous and a heterozygous SNP and examine the reads covering each
      • Het: chr22:36,006,155-36,008,007
      • Hom: chr22:35,947,405-35,947,867
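
Declaring the dataset as BED works because the interval file already starts with the three columns BED requires: chrom, start and end (0-based, half-open). The sketch below shows an explicit conversion in Python, keeping those three columns and adding a simple name column so the track is easy to browse in IGV; the filenames are placeholders.

```python
# Sketch of an explicit interval-to-BED conversion: keep chrom/start/end and
# add a name column for easier browsing. Filenames are placeholders.
with open("NA12878_high_confidence_SNPs.interval") as rows_in, \
     open("NA12878_high_confidence_SNPs.bed", "w") as bed_out:
    for number, line in enumerate(rows_in, start=1):
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        bed_out.write(f"{chrom}\t{start}\t{end}\tSNP_{number}\n")
```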

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_Basic_Sec3

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/Galaxy-History-VariantDet_BASIC_Sec3.tar.gz

Section 4. Calling small insertions and deletions (indels) [20 min]

  1. Extract indels from SAM

NGS: Indel Analysis>Indel Analysis

  1. The filtered SAM file you generated in Section 2 (dataset #4) should be automatically selected in the first drop down box
  2. Keep all other defaults
  3. Execute. You should get two files, one for deletions and one for insertions. The insertions file has a little more information in it, with a column detailing the bases that have been inserted

  1. Look at the generated interval files
    1. First, rename the file with 600 regions to ‘NA12878_deletions’ and change its datatype to BED
    2. Then, rename the file with 575 regions to ‘NA12878_insertions’ and change its datatype to BED
    3. For each file, the first 3 columns are genomic coordinates; the remaining columns depend on whether it’s a deletion or insertion file:
      1. Deletions: c4 = number of reads supporting the call, c5 = a statistical measure of confidence
      2. Insertions: c4 = bases inserted, c5 = number of reads supporting the call, c6/c7 = statistical measures of confidence
    4. Note that lots of entries in both files have only one supporting read - so not very convincing

  1. Filter by #reads

Filter and Sort>Filter

  1. We'll require more than 10 reads to be confident in a call, so set 'c5>10' for the ‘NA12878_insertions’ file (a sketch of this filter is shown after this list)
  2. After filtering there should be 15 regions. Examine them using the eye icon
  3. Give your filtered dataset a more meaningful name (e.g. 'NA12878_high_confidence_Insertions')
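
The filter in this step is equivalent to the following Python sketch, which keeps insertions supported by more than 10 reads (column 5, 1-based, as described above); the filenames are placeholders.

```python
# Sketch of the read-support filter on the insertions file: column 5 (1-based)
# holds the number of supporting reads, so keep rows where it is above 10.
kept = 0
with open("NA12878_insertions.bed") as rows_in, \
     open("NA12878_high_confidence_Insertions.bed", "w") as rows_out:
    for line in rows_in:
        fields = line.rstrip("\n").split("\t")
        supporting_reads = int(fields[4])    # c5 = reads supporting the insertion
        if supporting_reads > 10:
            rows_out.write(line)
            kept += 1
print(f"{kept} insertions supported by more than 10 reads")
```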

  1. View in IGV
    1. As you did for the SNP list, download the file and load it into IGV
    2. Examine some of the insertions and deletions at the level of aligned reads. Are they convincing?
    3. Find a homozygous and a heterozygous indel and examine the reads covering it
      1. Hom or het? Try chr22:31,854,318-31,854,553

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_BASIC_Sec4

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/Galaxy-History-VariantDet_BASIC_Sec4.tar.gz

Section 5. Evaluate known variations [15 min]

  1. Upload a list of known and accepted chr22 exome SNPs for this sample by going to 'Get Data>Upload File' in Galaxy and pasting in this URL: https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/variantCalling_BASIC/1000genomes.NA12878.chr22_exome.SNPs.bed
  1. Make sure you set the Datatype to ‘bed’ and Genome to ‘Human Feb. 2009 (GRCh37/hg19) (hg19)’

  1. Use IGV to look at some of the positions of the known variations, and visually compare them to the variants you have generated. Find some positions where they differ and see if you can identify why (a sketch of a simple numerical comparison is shown below)
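
If you want a quick numerical comparison alongside the visual one in IGV, the following Python sketch compares the positions in your called SNP BED file with the known 1000 Genomes list. The filenames are placeholders for the datasets used in this tutorial, and exact coordinate matching is a simplification of what a real comparison would do.

```python
# Sketch of a simple position-by-position comparison between called SNPs and
# the known 1000 Genomes list, using set operations on BED coordinates.
def bed_positions(path):
    """Return a set of (chrom, start, end) tuples from a BED file."""
    positions = set()
    with open(path) as bed:
        for line in bed:
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            positions.add((chrom, int(start), int(end)))
    return positions

called = bed_positions("NA12878_high_confidence_SNPs.bed")
known = bed_positions("1000genomes.NA12878.chr22_exome.SNPs.bed")

print("called and known:", len(called & known))
print("called but not in the known list:", len(called - known))
print("known but not called:", len(known - called))
```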

Completed Galaxy history for this section (in SharedData>Published Histories): VariantDetection_BASIC_Sec5

OR

Import this History: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/Galaxy-History-VariantDet_BASIC_Complete.tar.gz

References