Somatic Variant Calling
Protocol Overview / Introduction
In this protocol we discuss and outline the process of identifying somatic variants/mutations.
What is a variant?
A variant is something that differs from a standard or reference type. These variations can be simple nucleotide variations (SNVs), including single-nucleotide polymorphisms (SNPs) and small insertions or deletions (INDELs), or even gross changes in the genome sequence (structural variants).
These variations occur all the time in all living things. Variation is the mechanism of evolution and drives species delineation and adaptation. The measurement and analysis of genome variation can be used as a tool to understand disease and aid in the design and application of new treatments. It can answer questions such as why a bacterium suddenly became drug resistant or why a virus is suddenly able to re-infect.
Somatic mutations are genetic alterations acquired by a cell that can be passed to the progeny of the mutated cell in the course of cell division. They are frequently caused by environmental factors, such as exposure to ultraviolet radiation or to certain chemicals. They differ from germ line mutations, which are inherited from the germ cells (i.e., sperm and oocytes).
Somatic mutations may occur in any cell division from the first cleavage of the fertilized egg to the cell divisions that replace cells in a senile individual. The mutation affects all cells descended from the mutated cell. A major part of an organism, such as the branch of a tree or a complete tissue layer of an animal, may carry the mutation; it may or may not be expressed visibly. Somatic mutations can give rise to various diseases, such as cancer.
Variant annotation is performed to combine the raw putative variant calls with auxiliary data to add meaning ("annotation") to the variants.
In many cases, the variant detection tool itself will add certain elements of annotation, such as a definition of the variant, a genotype call, a measure of likelihood, a haplotype score, and other measures of the raw data that help reduce false positives. In other cases, the annotator will only require a VCF file combined with other auxiliary data.
This protocol aims to identify mutations that could potentially be correlated with cancer. For that purpose, normal and tumor samples are analyzed. As not all mutations will be cancer-related, a statistical analysis is needed to determine the relevance of results.
The protocol in a nutshell:
Flow chart describing a variant calling pipeline (subject to changes in the tools used).
Raw read sequence file formats
Raw read sequences can be stored in a variety of formats. The reads can be stored as text in a Fasta file or, together with their qualities, in a FastQ file. They can also be stored as alignments to references in other formats such as SAM or its binary compressed implementation BAM. All of the file formats (with the exception of the binary BAM format) can be compressed easily and are often stored that way (.gz for gzipped files).
The most common read file format is FastQ as this is what is produced by the Illumina sequencing pipeline. This will be the focus of our discussion henceforth.
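As a rough illustration of what a FastQ file contains, the sketch below reads records from a plain or gzipped file. It assumes the common layout of four lines per record (header starting with "@", sequence, separator starting with "+", quality string); the names `FastqRecord`, `open_reads` and `parse_fastq` are made up for this example and are not part of any tool in this protocol.

```python
import gzip
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # read identifier, without the leading "@"
    sequence: str   # base calls
    quality: str    # one quality character per base


def open_reads(path: str) -> TextIO:
    """Open a plain or gzip-compressed FastQ file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a stream that uses four lines per record (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (a blank line is also treated as the end)
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)
```

A file would typically be read with something like `for record in parse_fastq(open_reads("reads.fastq.gz")): ...`, where the file name is only a placeholder.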
Bioinformatics tools for this protocol
There are a number of tools available for each protocol step. These tools all have strengths and weaknesses and have their own application space. Suggestions rather than prescriptions for tools will be made for each of the steps. Other tools could be substituted in each case depending on user preference, experience or problem type.
Genomics Virtual Laboratory resources for this protocol
Depending on your requirements and skill base there are two options for running this protocol using GVL computing resources. All of the suggested tools for this protocol are either installed or available from the Galaxy toolshed.
You can also use your own computing resources.
Section 1: Read Quality Control
The purpose of this section of the protocol is to show you how to understand your raw data, make informed decisions on how to handle it and maximise your chances of getting a good quality alignment. Knowledge of the read types, the number of reads, their GC content, possible contamination and other issues is important. This information will give you an idea of any quality issues with the data and guide your choice of data trimming/cleanup methods. Cleaning up the raw data before alignment can lead to much better results, as contamination and low quality, error-prone reads will have been removed. It will also help you set appropriate input parameters for the alignment software. It is a good idea to perform these steps on all of your read files, as they could have very different qualities.
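To make the kind of information described above concrete, here is a minimal sketch that computes a few summary figures (read count, read length range, GC content, mean base quality) from reads held in memory. It is not a replacement for a dedicated QC tool; it assumes Phred+33 quality encoding, and the function name `summarise_reads` is hypothetical.

```python
from typing import Iterable, Tuple


def summarise_reads(reads: Iterable[Tuple[str, str]], phred_offset: int = 33) -> dict:
    """Summarise (sequence, quality string) pairs; Phred+33 encoding is an assumption."""
    n_reads = 0
    n_bases = 0
    gc_bases = 0
    quality_sum = 0
    min_len = None
    max_len = 0
    for sequence, quality in reads:
        n_reads += 1
        n_bases += len(sequence)
        upper = sequence.upper()
        gc_bases += upper.count("G") + upper.count("C")
        # Convert each quality character to its numeric Phred score.
        quality_sum += sum(ord(ch) - phred_offset for ch in quality)
        min_len = len(sequence) if min_len is None else min(min_len, len(sequence))
        max_len = max(max_len, len(sequence))
    if n_bases == 0:
        raise ValueError("no bases found in input")
    return {
        "reads": n_reads,
        "min_length": min_len,
        "max_length": max_len,
        "gc_percent": 100.0 * gc_bases / n_bases,
        "mean_quality": quality_sum / n_bases,
    }
```

Running this separately on each read file gives a quick way to see whether the files differ in length distribution or overall quality.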
Steps involved and suggested tools:
For FastQ files (the most common), the suggested tool is FastQC. Details can be found here. FastQC can be run from within Galaxy or from the command line. The command line version also provides a GUI.
Some of the important outputs of FastQC for our purposes are:
Now that you have some knowledge about the raw data, it is important to use this information to clean up and trim the reads to improve their overall quality before alignment. There are a number of tools available in Galaxy and on the command line that can perform this step (to varying degrees), but you’ll need one that can handle read pairing if you’ve got paired-end reads. If one of the ends of a pair is removed, the orphaned read needs to be put into a separate “orphaned reads” file. This maintains the paired ordering of the reads in the paired read files so the alignment software can use them correctly. The suggested tool for this is a pair-aware read trimmer called Trimmomatic. Details on Trimmomatic can be found here.
Trimmomatic on GVL systems:
Trimmomatic can perform many read trimming functions sequentially.
Suggested Trimmomatic functions to use:
Things to look for in the output:
Trimmomatic should produce 2 paired files (1 left and 1 right hand end) and 1 or 2 single “orphaned reads” files if you trimmed a pair of read files using paired-end mode. It only produces 1 output read file if you used it in single-ended mode. Each read library (2 paired files or 1 single-ended file) should be trimmed separately, with parameters chosen according to its own FastQC report. The output files are the ones you should use for alignment.
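The pair-aware behaviour described in this section can be sketched as follows. This is only an illustration of the idea (trim low quality bases from the 3' end, drop reads that become too short, and route a surviving mate to an orphan list); it is not Trimmomatic's implementation. The thresholds, the Phred+33 encoding and the names `trim_trailing` and `trim_pairs` are all assumptions made for the example.

```python
from typing import Iterable, List, Optional, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def trim_trailing(read: Read, min_quality: int = 20, min_length: int = 36,
                  phred_offset: int = 33) -> Optional[Read]:
    """Cut low quality bases off the 3' end; return None if the read becomes too short."""
    name, sequence, quality = read
    keep = len(quality)
    while keep > 0 and ord(quality[keep - 1]) - phred_offset < min_quality:
        keep -= 1
    if keep < min_length:
        return None
    return (name, sequence[:keep], quality[:keep])


def trim_pairs(pairs: Iterable[Tuple[Read, Read]], **trim_options):
    """Trim both mates and keep the two paired outputs in the same order."""
    paired_left: List[Read] = []
    paired_right: List[Read] = []
    orphans_left: List[Read] = []
    orphans_right: List[Read] = []
    for left, right in pairs:
        left_trimmed = trim_trailing(left, **trim_options)
        right_trimmed = trim_trailing(right, **trim_options)
        if left_trimmed is not None and right_trimmed is not None:
            paired_left.append(left_trimmed)
            paired_right.append(right_trimmed)
        elif left_trimmed is not None:
            orphans_left.append(left_trimmed)    # right mate was removed
        elif right_trimmed is not None:
            orphans_right.append(right_trimmed)  # left mate was removed
        # If neither mate survives, the whole pair is dropped.
    return paired_left, paired_right, orphans_left, orphans_right
```

Because a read only enters the paired lists when its mate also survives, the n-th entry of the left list always belongs with the n-th entry of the right list, which is what the alignment software relies on.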
Possible alternate tools:
Section 2: Genome Alignment
Once the reads have passed quality control, it is time to align them to the reference genome. In this section we’ll discuss the importance of genome alignment and the tools available for it.
Why do we need to align the reads to the reference?
Although we are trying to pick up variants between the normal and tumor samples, we require some guide to know what “pieces” we need to compare. By aligning both samples to the reference genome, we will know what reads to compare as they’ll match the reference at similar positions.
BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a large sequence database, such as the human reference genome. Details can be found here.
BWA on GVL systems:
Usage: bwa <command> [options]
    aln        gapped/ungapped alignment
    samse      generate alignment (single ended)
    sampe      generate alignment (paired ended)
    bwasw      BWA-SW for long queries
    fastmap    identify super-maximal exact matches
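As one possible way to chain the commands listed above for a paired-end library, the sketch below builds the `aln` and `sampe` invocations and runs them from Python. It assumes the reference has already been indexed with `bwa index`, that `bwa` is on the PATH, and that output is captured from standard output; the file names and the helper names `paired_end_commands` and `run_commands` are placeholders for illustration.

```python
import subprocess
from typing import List, Tuple

Command = Tuple[List[str], str]  # (argument list, file that receives stdout)


def paired_end_commands(reference: str, left_fastq: str, right_fastq: str,
                        prefix: str) -> List[Command]:
    """Build the aln/aln/sampe sequence for one paired-end library."""
    left_sai = f"{prefix}_1.sai"
    right_sai = f"{prefix}_2.sai"
    return [
        # One 'aln' run per read file; each writes an intermediate .sai file.
        (["bwa", "aln", reference, left_fastq], left_sai),
        (["bwa", "aln", reference, right_fastq], right_sai),
        # 'sampe' combines both .sai files with the reads into a SAM file.
        (["bwa", "sampe", reference, left_sai, right_sai, left_fastq, right_fastq],
         f"{prefix}.sam"),
    ]


def run_commands(commands: List[Command]) -> None:
    """Run each command in order, redirecting its stdout to the given file."""
    for args, output_path in commands:
        with open(output_path, "wb") as out:
            subprocess.run(args, stdout=out, check=True)
```

The same sequence would be run once for the normal sample and once for the tumor sample, so that both end up aligned against the same reference.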
After alignment is performed, we need to do a little tidying up of our files. We will perform three tasks using Picard and Samtools. Although these steps are optional, they are highly recommended to improve the quality of the overall results:
Section 3: Variant Calling
In this section we’ll discuss the tools we’ll use to find variations/mutations in our data. In our case, the following tools are designed specifically to find somatic variants, whereas other callers report variants of any type. Generally, you’ll want to use several methods to have more confidence in the final results.
Both JointSNVMix and SomaticSniper take many parameters for tailoring each run depending on what information is available and what output is expected. Please refer to their documentation in order to obtain optimal results.
The VCF format is used as the standard across variant callers to report results.
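For orientation, a VCF file consists of header lines starting with "#" followed by tab-separated data lines whose first eight columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below parses only those fixed columns and ignores any per-sample columns; `VcfRecord`, `parse_vcf_line` and `read_vcf` are names invented for this example.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int              # 1-based position
    id: str
    ref: str
    alts: List[str]       # ALT may list several comma-separated alleles
    qual: str             # kept as text; "." means missing
    filter: str
    info: Dict[str, str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, var_id, ref, alt, qual, filt, info_text = fields[:8]
    info: Dict[str, str] = {}
    if info_text != ".":
        for entry in info_text.split(";"):
            if not entry:
                continue
            key, _, value = entry.partition("=")
            info[key] = value  # flag entries without "=" get an empty value
    return VcfRecord(chrom, int(pos), var_id, ref, alt.split(","), qual, filt, info)


def read_vcf(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Skip header and blank lines, yielding one record per data line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)
```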
We will combine these VCF files into a single one for further analysis. For this task we’ll use a GATK tool called CombineVariants.
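The idea behind combining call sets can be illustrated with a small sketch that records which callers reported each variant. This is not how CombineVariants works internally; it simply keys each call on chromosome, position, reference allele and alternate allele, and the function names `combine_calls` and `supported_by` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def combine_calls(calls_by_caller: Dict[str, Iterable[VariantKey]]) -> Dict[VariantKey, Set[str]]:
    """Map each variant to the set of callers that reported it."""
    support: Dict[VariantKey, Set[str]] = {}
    for caller, calls in calls_by_caller.items():
        for key in calls:
            support.setdefault(key, set()).add(caller)
    return support


def supported_by(support: Dict[VariantKey, Set[str]], min_callers: int) -> List[VariantKey]:
    """Return variants reported by at least min_callers callers.

    Sorting is plain tuple order, so chromosome names sort as text.
    """
    return sorted(key for key, callers in support.items() if len(callers) >= min_callers)
```

Requiring support from more than one caller is one simple way to act on the earlier advice about using several methods to gain confidence in the results.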
Possible alternative software:
Section 4: Annotation
In this section we’ll discuss how to annotate the results found by the variant callers.
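As a toy example of what combining variant calls with auxiliary data means, the sketch below labels each variant with the names of any gene intervals that contain it. Real annotators draw on far richer resources; the interval layout (inclusive, 1-based coordinates), the function name `annotate_with_genes` and any gene names passed to it are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Variant = Tuple[str, int]                 # (chromosome, 1-based position)
GeneInterval = Tuple[str, int, int, str]  # (chromosome, start, end, gene name), inclusive


def annotate_with_genes(variants: Iterable[Variant],
                        genes: List[GeneInterval]) -> Dict[Variant, List[str]]:
    """Attach the names of overlapping gene intervals to each variant."""
    annotations: Dict[Variant, List[str]] = {}
    for chrom, pos in variants:
        hits = [name for g_chrom, start, end, name in genes
                if g_chrom == chrom and start <= pos <= end]
        annotations[(chrom, pos)] = hits  # empty list means no overlapping gene
    return annotations
```

This linear scan is fine for a handful of intervals; for genome-wide annotation an indexed lookup would be needed.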
To learn about best practices for variant detection using GATK, see this overview.
Variant Annotator on GVL systems:
Encyclopaedia Britannica: Somatic Mutation. http://www.britannica.com/EBchecked/topic/553940/somatic-mutation