Somatic Variant Calling



Protocol Overview / Introduction

Section 1: Read Quality Control

Section 2: Genome Alignment

Section 3: Variant Calling

Section 4: Annotation


Protocol Overview / Introduction

In this protocol we discuss and outline the process of identifying somatic variants/mutations.

What is a variant?

A variant is a difference between an observed sequence and a reference or standard. These variations can be single-nucleotide variants (SNVs), including single-nucleotide polymorphisms (SNPs) and small insertions or deletions (INDELs), or gross changes in the genome sequence (structural variants).

These variations occur all the time in all living things. They are the mechanism of evolution and drive species delineation and adaptation. The measurement and analysis of genome variation can be used as a tool to understand disease and aid in the design and application of new treatments. It can answer questions such as why a bacterium suddenly became drug resistant or why a virus has suddenly become infective again.

Somatic Mutations

Somatic mutations are genetic alterations acquired by a cell that can be passed to the progeny of the mutated cell in the course of cell division. They are frequently caused by environmental factors, such as exposure to ultraviolet radiation or to certain chemicals. They differ from germline mutations, which are inherited through the germ cells (i.e., sperm and oocytes).

Somatic mutations may occur in any cell division from the first cleavage of the fertilized egg to the cell divisions that replace cells in a senile individual. The mutation affects all cells descended from the mutated cell. A major part of an organism, such as the branch of a tree or a complete tissue layer of an animal, may carry the mutation; it may or may not be expressed visibly. Somatic mutations can give rise to various diseases, such as cancer.[1]

Variant Annotation

Variant annotation is performed to combine the raw putative variant calls with auxiliary data to add meaning ("annotation") to the variants.

In many cases, the variant detection tool itself will add certain elements of annotation, such as a definition of the variant, a genotype call, a measure of likelihood, a haplotype score, and other measures of the raw data useful for reducing false positives. In other cases, the annotator will only require a VCF file combined with other auxiliary data.

Protocol Purpose

This protocol aims to identify mutations that could potentially be correlated with cancer. For that purpose, normal and tumor samples are analyzed. As not all mutations will be cancer-related, a statistical analysis is needed to determine the relevance of results.

The protocol in a nutshell:

  • Obtain sequence read file(s) from sequencing machine(s). Tumour and Normal.
  • Look at the reads - get an understanding of what you’ve got and what the quality is like.
  • Raw data cleanup/quality trimming if necessary.
  • Align reads to reference genome.
  • Mark Duplicates (optional).
  • Combine Tumour and Normal samples into a single file (optional).
  • Realign reads around insertions and deletions (indels)(optional).
  • Correct inaccurate base qualities (optional).
  • Separate Tumour and Normal samples (somatic variant callers generally take two files).
  • Make variant calls (SNVs, somatic mutations) using different tools.
  • Combine results into single variant file.
  • Annotate results.

Protocol Pipeline

Flow chart describing a variant calling pipeline (subject to changes in the tools used).

Raw read sequence file formats

Raw read sequences can be stored in a variety of formats. The reads can be stored as plain sequences in a Fasta file or together with their qualities in a FastQ file. They can also be stored as alignments to a reference in formats such as SAM or its compressed binary implementation, BAM. All of these formats (with the exception of the binary BAM format) compress easily and are often stored compressed (.gz for gzipped files).

The most common read file format is FastQ as this is what is produced by the Illumina sequencing pipeline. This will be the focus of our discussion henceforth.
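The FastQ format stores each read as four lines: a name, the sequence, a separator, and a quality string in which each character encodes the Phred quality of the corresponding base (Illumina now uses the Phred+33 encoding). A minimal sketch of decoding one record; the read shown is invented:

```python
# Minimal sketch of reading a FastQ record and decoding its Phred+33
# qualities. The four-line record below is invented for illustration.
record = [
    "@read1",
    "GATTACAGATTACA",
    "+",
    "IIIIIIIIII!!!!",  # 'I' = Q40, '!' = Q0 in Phred+33 encoding
]
name, seq, _, qual_line = record

# Phred+33: quality score = ASCII code of the character minus 33
quals = [ord(c) - 33 for c in qual_line]
print(name, len(seq), quals[:3], quals[-3:])  # @read1 14 [40, 40, 40] [0, 0, 0]
```

Knowing which encoding your file uses matters because quality-trimming tools interpret the same characters differently under older Phred+64 encodings; FastQC reports the detected encoding.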

Bioinformatics tools for this protocol

There are a number of tools available for each protocol step. These tools all have strengths and weaknesses and have their own application space. Suggestions rather than prescriptions for tools will be made for each of the steps. Other tools could be substituted in each case depending on user preference, experience or problem type.

Genomics Virtual Laboratory resources for this protocol.

Depending on your requirements and skill base there are two options for running this protocol using GVL computing resources. All of the suggested tools for this protocol are either installed or available from the Galaxy toolshed.  

  • Run your own GVL Galaxy or Cloud BioLinux instance on the Research Cloud. You can install and configure the tools you’d prefer.

You can also use your own computing resources.

Section 1: Read Quality Control


The purpose of this section of the protocol is to show you how to understand your raw data, make informed decisions on how to handle it and maximise your chances of getting a good quality alignment. Knowledge of the read types, the number of reads, their GC content, possible contamination and other issues are important. This information will give you an idea of any quality issues with the data and guide you on the choice of data trimming/cleanup methods to use. Cleaning up the raw data before alignment can lead to much better results as contamination and low quality error prone reads will have been removed. It will also give you a better guide as to setting appropriate input parameters for the alignment software. It is a good idea to perform these steps on all of your read files as they could have very different qualities.

Steps involved and suggested tools:

  1. Examine the quality of your raw read files.

For FastQ files (the most common), the suggested tool is FastQC. Details can be found here. FastQC can be run from within Galaxy or from the command line; the command-line version also provides a GUI.

  • FastQC on GVL USE Galaxy is located in: NGS: QC and Manipulation → FastQC
  • Command line: fastqc. Details on installation and usage can be found here.

Some of the important outputs of FastQC for our purposes are:

  • Quality encoding type - Important for quality trimming software.
  • Total number of reads - Gives you an idea of coverage.
  • Dips in quality near the beginning, middle or end of the reads - Determines possible trimming/cleanup methods and parameters and may indicate technical problems with the sequencing process/machine run.
  • Presence of highly recurring k-mers - May point to contamination of reads with barcodes, adapter sequences etc.
  • Presence of large numbers of N’s in reads - May point to poor quality sequencing run. You need to trim these reads to remove N’s.
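Several of these summary numbers are straightforward to compute by hand, which helps when interpreting the FastQC report. A toy sketch over a few invented reads (sequence plus Phred+33 quality string):

```python
# Toy approximation of a few FastQC-style summary numbers over a
# small batch of invented reads (sequence, Phred+33 quality string).
reads = [
    ("GATCGGAAGAGC", "IIIIIIIIIIII"),
    ("ACGTNNACGTAC", "IIIII!!IIIII"),
    ("TTTTACGTACGA", "IIIIIIIIII##"),
]
total = len(reads)
n_bases = sum(seq.count("N") for seq, _ in reads)
gc = sum(seq.count("G") + seq.count("C") for seq, _ in reads)
length = sum(len(seq) for seq, _ in reads)

# Mean quality per position across all reads (assumes equal read lengths)
per_pos = [
    sum(ord(q[i]) - 33 for _, q in reads) / total
    for i in range(len(reads[0][0]))
]
print("reads:", total)                                  # coverage estimate input
print("N bases:", n_bases)                              # poor-quality cycles
print("GC fraction: %.2f" % (gc / length))              # contamination hint
print("mean quality at last position: %.1f" % per_pos[-1])  # 3' quality dip
```

A dip in the per-position means toward the end of the reads is the classic signature that trailing-base quality trimming is warranted.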

  2. Quality trimming/cleanup of read files.

Now that you have some knowledge about the raw data, it is important to use this information to clean up and trim the reads to improve their overall quality before alignment. There are a number of tools available in Galaxy and on the command line that can perform this step (to varying degrees), but you’ll need one that can handle read pairing if you’ve got paired-end reads. If one end of a pair is removed, the orphaned read needs to be put into a separate “orphaned reads” file. This maintains the paired ordering of the reads in the paired read files so the alignment software can use them correctly. The suggested tool for this is a pair-aware read trimmer called Trimmomatic. Details on Trimmomatic can be found here.
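The pair/orphan bookkeeping described above can be sketched with set operations: reads surviving in both files stay paired, reads surviving in only one file become orphans. The read names below are invented:

```python
# Sketch of pair-aware bookkeeping after trimming: a read that survives
# without its mate goes to an "orphans" list so that the two paired
# output files stay in sync. Read names are invented for illustration.
survived_r1 = {"read1", "read2", "read4"}  # names kept in the left-end file
survived_r2 = {"read1", "read3", "read4"}  # names kept in the right-end file

paired = sorted(survived_r1 & survived_r2)   # keep in both paired files
orphans = sorted(survived_r1 ^ survived_r2)  # write to the orphan file(s)
print("paired:", paired)    # ['read1', 'read4']
print("orphans:", orphans)  # ['read2', 'read3']
```

Trimmomatic performs this bookkeeping for you in paired-end mode; the sketch only shows why the orphan files exist.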

Trimmomatic on GVL systems:

  • GVL USE Galaxy: NGS: QC and manipulation → Trimmomatic
  • Command line: details and examples here.
  • java -cp <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticPE for Paired End Files
  • java -cp <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticSE for Single End Files

Trimmomatic can perform many read trimming functions sequentially.

Suggested Trimmomatic functions to use:

  • Adapter trimming.
      • This function trims adapters, barcodes and other contaminants from the reads.
      • You need to supply a fasta file of possible adapter sequences, barcodes etc. to trim. See the Trimmomatic website for detailed instructions.
      • The default quality settings are sensible.
      • This should always be the first trimming step if it is used.
  • Sliding window trimming.
      • This function uses a sliding window to measure average quality and trims accordingly.
      • The default quality parameters are sensible for this step.
  • Trailing bases quality trimming.
      • This function trims bases from the end of a read if they drop below a quality threshold. If base 69 of 75 drops below the threshold, the read is cut to 68 bases.
      • Use the FastQC report to decide whether this step is warranted and what quality value to use. A quality threshold of 10-15 is a good starting point.
  • Leading bases quality trimming.
      • This function works like trailing bases trimming except that it operates at the start of the reads.
      • Use the FastQC report to determine whether this step is warranted. It may be necessary if the quality of bases is poor at the beginning of reads.
  • Minimum read length.
      • Once all trimming steps are complete, this function checks that the reads are still longer than this value. If not, the read is removed from the file and its pair is put into the orphan file.
      • The most appropriate value for this parameter will depend on the FastQC report, specifically the length of the high-quality section of the Per Base Sequence Quality graph.
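To make the sliding-window step concrete, here is a simplified sketch in the spirit of Trimmomatic's SLIDINGWINDOW function (not its exact algorithm): cut the read at the first window whose mean quality drops below the threshold.

```python
# Simplified sliding-window quality trim, in the spirit of Trimmomatic's
# SLIDINGWINDOW step (not its exact algorithm): scan windows from the
# 5' end and cut at the first window whose mean quality is too low.
def sliding_window_trim(quals, window=4, threshold=15):
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return quals[:i]  # cut at the start of the failing window
    return quals

good = [30] * 10
tail = [30] * 6 + [5, 5, 5, 5]          # quality collapses at the end
print(len(sliding_window_trim(good)))   # 10: nothing trimmed
print(len(sliding_window_trim(tail)))   # 5: cut before the low-quality tail
```

Note how a single stray low base inside an otherwise good window survives, which is exactly why window averaging is gentler than trimming on individual base qualities.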

Things to look for in the output:

  • Number of reads orphaned by the trimming / cleanup process.
  • Number of pairs lost totally.

Trimmomatic should produce two paired files (one left-hand end and one right-hand end) and one or two single “orphaned reads” files if you trimmed a pair of read files using paired-end mode. It only produces one output read file if you used it in single-ended mode. Each read library (two paired files or one single-ended file) should be trimmed separately with parameters dependent on its own FastQC report. The output files are the ones you should use for alignment.

Possible alternate tools:

Section 2: Genome Alignment


Once the reads have passed quality control, it is time to align them to the reference genome. In this section we’ll discuss the importance of genome alignment and the tools available for it.

Why do we need to align the reads to the reference?

Although we are trying to pick up variants between the normal and tumor samples, we require some guide to know what “pieces” we need to compare. By aligning both samples to the reference genome, we will know what reads to compare as they’ll match the reference at similar positions.

Suggested tools:

BWA is a fast, lightweight tool that aligns relatively short sequences (queries) against a large sequence database, such as the human reference genome. Details can be found here.

BWA on GVL systems:

  • In GVL USE Galaxy: NGS: Mapping → BWA

Command line:

     Usage:   bwa <command> [options]

              aln          gapped/ungapped alignment
              samse        generate alignment (single ended)
              sampe        generate alignment (paired ended)
              bwasw        BWA-SW for long queries
              fastmap      identify super-maximal exact matches

After alignment is performed, we need to do a little tidying up of our files. We will perform several tasks using Picard, Samtools and GATK. Although these steps are optional, they are highly recommended to improve the quality of the overall results:

  1. MarkDuplicates: due to biases during amplification, some reads could be overrepresented; we should flag or delete these duplicate reads.
  • MarkDuplicates on GVL USE Galaxy: NGS: Picard → MarkDuplicates


  2. MergeSamFiles (optional): at this point we can merge the normal and tumour samples into a single SAM/BAM file. The indel realigner in the next step will yield better results as it is given more information.
  • MergeSamFiles on GVL USE Galaxy: NGS: SAM Tools → Merge BAM Files

  3. Realign around indels: focuses on improving the original alignment around insertions and deletions.
  • Indel Realigner on GVL USE Galaxy: NGS: GATK Tools → Indel Realigner

  4. Correct inaccurate base qualities: after an accurate genome alignment it is possible to reassess the quality of our reads. This is a two-step process:
  • GVL USE Galaxy: NGS: GATK Tools → Count Covariates, then NGS: GATK Tools → Table Recalibration

  5. Separate Normal and Tumour samples: in order to perform the next step, samples need to be separated into tumour and normal files again.
  • In GVL USE Galaxy: Filter and Sort → Filter. Filter by sample name in the SAM file (convert BAM to SAM if necessary).
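MarkDuplicates records its result in the SAM FLAG field: duplicates carry bit 0x400 (1024). A sketch of checking that bit on SAM-format lines; the two records are invented and truncated to four columns:

```python
# After Picard MarkDuplicates, duplicate reads carry bit 0x400 in the
# SAM FLAG field (column 2). Sketch of filtering them out of SAM-format
# lines; the records below are invented and truncated to four columns.
DUPLICATE = 0x400

sam_lines = [
    "read1\t99\tchr1\t10468",    # flag 99: properly paired, not a duplicate
    "read2\t1123\tchr1\t10468",  # flag 1123 = 99 + 1024: marked duplicate
]

def is_duplicate(line):
    flag = int(line.split("\t")[1])
    return bool(flag & DUPLICATE)

kept = [line for line in sam_lines if not is_duplicate(line)]
print(len(kept))  # 1
```

In practice you would leave the filtering to downstream tools (most variant callers skip flagged duplicates automatically); the sketch just shows where the information lives.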

Section 3: Variant Calling


In this section we’ll discuss the tools we’ll use to find variations/mutations in our data. Some of the following tools are designed specifically to find somatic variants, while others call variants of any type. Generally, you’ll want to use several methods to have more confidence in the final results.

Suggested tools:

  1. JointSNVMix implements a probabilistic graphical model to analyse sequence data from tumour/normal pairs. The model draws statistical strength by analysing both genomes jointly to more accurately classify germline and somatic mutations.
     Details can be found here.
     Command line: details and examples here.
     JointSNVMix’s output specification can be found here. For our purposes, we will translate this output into the more standard VCF format.
     It can be downloaded for Galaxy use from the Galaxy Toolshed.

  2. SomaticSniper: the purpose of this program is to identify single-nucleotide positions that are different between tumor and normal samples (or, in theory, any two BAM files). For more details go here. It can be downloaded for Galaxy use from the Galaxy Toolshed.
     Command line: details and examples here.
     bam-somaticsniper [options] -f <ref.fasta> <tumor.bam> <normal.bam> <snv_output_file>
  3. Unified Genotyper: SNP and indel caller available on Galaxy.
  • In GVL USE Galaxy: NGS: GATK Tools → Unified Genotyper

Both JointSNVMix and SomaticSniper take many parameters for tailoring each run depending on what information is available and what output is expected. Please refer to the documentation in order to obtain optimal results.

The VCF format is used as the standard across variant callers to report results.

We will combine these VCF files into a single file for further analysis. For this task we’ll use a GATK tool called CombineVariants.

  • In GVL USE Galaxy: NGS: GATK Tools → CombineVariants
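Conceptually, combining caller outputs means taking the union of calls keyed by position and alleles, while remembering which caller reported each one; variants seen by more than one caller deserve more confidence. A sketch with invented records (real VCFs carry headers and many more columns):

```python
# Conceptual sketch of combining per-caller variant records, keyed by
# (chromosome, position, ref, alt) and tracking which caller reported
# each. The records are invented; real VCF files have headers, quality
# scores and genotype columns that CombineVariants also reconciles.
sniper = {("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")}
jointsnvmix = {("chr1", 1000, "A", "T"), ("chr3", 42, "C", "G")}

combined = {}
for caller, calls in [("SomaticSniper", sniper), ("JointSNVMix", jointsnvmix)]:
    for variant in calls:
        combined.setdefault(variant, []).append(caller)

for variant in sorted(combined):
    print(variant, combined[variant])
# ('chr1', 1000, 'A', 'T') is reported by both callers -> higher confidence
```

CombineVariants does the real work on actual VCF records (including the "set" annotation recording caller agreement); the sketch only illustrates the union-and-provenance idea.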


Possible alternative software:

Section 4: Annotation


In this section we’ll discuss how to annotate the results produced by the variant callers.

Suggested tools:

  • GATK Variant Annotator: annotates variant calls with context information. Users can specify which of the available annotations to use.

For more information on using the VariantAnnotator, see this tool specific page.

To learn about best practices for variant detection using GATK, see this overview.

Variant Annotator on GVL systems:

  • GVL USE Galaxy: NGS: GATK Tools → Variant Annotator
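At its core, annotation intersects each variant's position with auxiliary interval data such as gene coordinates. A sketch with invented coordinates (real annotators also use transcript models, effect prediction and population databases):

```python
# The core operation of variant annotation: intersect variant positions
# with auxiliary interval data. Gene names and coordinates below are
# invented for illustration; real annotation uses curated databases.
genes = [  # (chrom, start, end, name), 1-based inclusive coordinates
    ("chr1", 900, 1200, "GENE_A"),
    ("chr2", 400, 450, "GENE_B"),
]
variants = [("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")]

def annotate(chrom, pos):
    hits = [name for c, s, e, name in genes if c == chrom and s <= pos <= e]
    return hits or ["intergenic"]

for chrom, pos, ref, alt in variants:
    print(chrom, pos, ref + ">" + alt, annotate(chrom, pos))
```

A linear scan like this is fine for a handful of intervals; real annotators use indexed interval structures to handle genome-scale data efficiently.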


[1] Encyclopedia Britannica: Somatic Mutation