Familial Variant Calling

Protocol

http://genome.edu.au

http://petermac.org

http://vlsci.org.au

Contents

Protocol Overview / Introduction

Section 1: Read Quality Control

Section 2: Genome Alignment and combine results

Section 3: Variant Callers

Section 4: Annotation

References

Protocol Overview / Introduction

In this protocol we discuss and outline the process of calling familial related mutations.

What is a variant?

A variant is something that is different from standard or type. These variations can be simple nucleotide variations (SNVs), including single-nucleotide polymorphisms (SNPs) and small insertions or deletions (INDELs) or even gross changes in the genome sequence (structural variants.)

These variations occur all the time in all living things. It is the mechanism of evolution and drives species delineation and adaptation. The measurement and analysis of genome variation can be used as a tool to understand disease and aid in the design and application of new treatments.

Germline Mutations

A germline mutation is any detectable and heritable variation in the lineage of germ cells. Mutations in these cells are transmitted to offspring, while, on the other hand, those in somatic cells are not. A germline mutation gives rise to a constitutional mutation in the offspring, that is, a mutation that is present in virtually every cell. [1]

Protocol Pipeline

Flow chart describing a familial variant calling pipeline (subject to changes in the tools used)

The protocol in a nutshell:

  • Inspect the reads - get an understanding of what you’ve got and what the quality is like.
  • Raw data cleanup/quality trimming if necessary.
  • Align reads to reference genome
  • Mark duplicates (optional)
  • Combine samples into a single file
  • Realign reads around insertions and deletions (indels) (optional)
  • Correct inaccurate base qualities (optional)
  • Make variant calls, SNV’s, indels
  • Annotate the results

Raw read sequence file formats.

Raw read sequences can be stored in a variety of formats. The reads can be stored as text in a Fasta file or with their qualities as a FastQ file. They can also be stored as alignments to references in other formats such as SAM or its binary compressed implementation BAM. All of the file formats (with the exception of the binary BAM format) can be compressed easily and often are stored so (.gz for gzipped files.)

The most common read file format is FastQ as this is what is produced by the Illumina sequencing pipeline. This will be the focus of our discussion henceforth.

Bioinformatics tools for this protocol.

There are a number of tools available for each protocol step. These tools all have strengths and weaknesses and have their own application space. Suggestions rather than prescriptions for tools will be made for each of the steps. Other tools could be substituted in each case depending on user preference, experience or problem type.

Genomics Virtual Laboratory resources for this protocol.

Depending on your requirements and skill base there are two options for running this protocol using GVL computing resources.

  • All of the suggested tools for this protocol are installed and available.

  • Run your own GVL Galaxy or Cloud BioLinux instance on the Research Cloud.
  • All of the suggested tools for this protocol are installed and available.  You can install and configure the tools you’d prefer.

You can also use your own computing resources.

Section 1: Read Quality Control

Purpose:

The purpose of this section of the protocol is to show you how to understand your raw data, make informed decisions on how to handle it and maximise your chances of getting a good quality alignment. Knowledge of the read types, the number of reads, their GC content, possible contamination and other issues are important. This information will give you an idea of any quality issues with the data and guide you on the choice of data trimming/cleanup methods to use. Cleaning up the raw data before alignment can lead to much better results as contamination and low quality error prone reads will have been removed. It will also give you a better guide as to setting appropriate input parameters for the alignment software. It is a good idea to perform these steps on all of your read files as they could have very different qualities.

Steps involved and suggested tools:

1. Examine the quality of your raw read files.

For FastQ files (the most common), the suggested tool is FastQC. Details can be found here. FastQC can be run from within Galaxy (it has a GUI interface for the command line version) or by command line.

  • FastQC on GVL USE Galaxy is located in: NGS: QC and Manipulation → FastQC
  • Command line: fastqc. Details on installation and using can be found here.

Some of the important outputs of FastQC for our purposes are:

  • Quality encoding type - Important for quality trimming software
  • Total number of reads - Gives you an idea of coverage..
  • Dips in quality near the beginning, middle or end of the reads - Determines possible trimming/cleanup methods and parameters and may indicate technical problems with the sequencing process/machine run.
  • Presence of highly recurring k-mers - May point to contamination of reads with barcodes, adapter sequences etc.
  • Presence of large numbers of N’s in reads - May point to poor quality sequencing run. You need to trim these reads to remove N’s.

2. Quality trimming/cleanup of read files.

Now that you have some knowledge about the raw data, it is important to use this information to clean up and trim the reads to improve its overall quality before alignment. There are a number of tools available in Galaxy and by command line that can perform this step (to varying degrees) but you’ll need one that can handle read pairing if you’ve got paired end reads. If one of the ends of a pair is removed, the orphaned read needs to be put into a separate “orphaned reads” file. This maintains the paired ordering of the reads in the paired read files so the alignment software can use them correctly. The suggested tool for this is a pair aware read trimmer called Trimmomatic. Details on Trimmomatic can be found here.

Trimmomatic on GVL systems:

  • GVL USE Galaxy: NGS: QC and manipulation → Trimmomatic
  • Command line: details and examples here.
  • java -cp <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticPE for Paired End Files
  • java -cp <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticSE for Single End Files

Trimmomatic can perform many read trimming functions sequentially.

Suggested Trimmomatic functions to use:

  • Adapter trimming.
  • This function trims adapters, barcodes and other contaminants from the reads.
  • You need to supply a fasta file of possible adapter sequences, barcodes etc to trim. See Trimmomatic website for detailed instructions.
  • The default quality settings are sensible.
  • This should always be the first trimming step if it is used.
  • Sliding window trimming.
  • This function uses a sliding window to measure average quality and trims accordingly.
  • The default quality parameters are sensible for this step.
  • Trailing bases quality trimming.
  • This function trims bases from the end of a read if they drop below a quality threshold. If base 69 of 75 drops below the threshold, the read is cut to 68 bases.
  • Use FastQC report to decide whether this step is warranted and what quality value to use. A quality threshold value of 10-15 is a good starting point.
  • Leading bases quality trimming.
  • This function works in a similar fashion to trailing bases trimming except it performs it at the start of the reads.
  • Use FastQC report to determine if this step is warranted. If the quality of bases is poor at the beginning of reads it might be necessary.
  • Minimum read length.
  • Once all trimming steps are complete, this function makes sure that the reads are still longer than this value. If not, the read is removed from the file and its pair is put into the orphan file.
  • The most appropriate value for this parameter will depend on the FastQC report, specifically the length of the high quality section of the Per Base Sequence Quality graph.

Things to look for in the output:

  • Number of reads orphaned by the trimming / cleanup process.
  • Number of pairs lost totally.

Trimmomatic should produce 2 pairs files (1 left and 1 right hand end) and 1 or 2 single “orphaned reads” files if you trimmed a pair of read files using paired end mode. It only produces 1 output read file if you used it in single ended mode. Each read library (2 paired files or 1 single ended file) should be trimmed separately with parameters dependent on their own FastQC reports. The output files are the ones you should use for alignment.

Possible alternative tools:

Section 2: Genome Alignment

Purpose:

Once the reads have passed quality controls is time to align them to the reference genome. In this section we’ll discuss the importance of alignment and the tools available for it.

Why do we need to align the reads to the reference genome?

Although we are trying to pick up variants between the samples, we require some guide to know what “pieces” we need to compare. By aligning both samples to the reference genome, we will know what reads to compare as they’ll match the reference sequence at similar positions.

Suggested tools:

BWA is a fast light-weighted tool that aligns relatively short sequences (queries) to a sequence database (large), such as the human reference genome.Details can be found here

BWA on GVL USE Galaxy

  • NGS: Mapping → BWA

Command line:

    Usage:   bwa <command> [options]

    Command: aln                           gapped/ungapped alignment

            samse             generate alignment (single ended)

            sampe             generate alignment (paired ended)

            bwasw             BWA-SW for long queries

            fastmap           identify super-maximal exact matches

After alignment is performed, we need to do a little tidying up of our files. We will perform three tasks using Picard and Samtools. Although these steps are optional, they are highly recommended to improve the quality of the overall results:

  • MarkDuplicates: due to biases during amplification, some reads could be overrepresented, we should flag or delete read duplicates.
  • MarkDuplicates on GVL USE Galaxy:
  • NGS: Picard → MarkDuplicates

  • MergeSamFiles (Optional): at this point we can merge the samples in a single SAM/BAM file. The indel realigner in the next step will yield better results as more information is given.
  • MergeSamFiles on GVL USE Galaxy:
  • NGS: SAM Tools → Merge BAM Files

  • Realign around indels: focuses on improving the original alignment around indels.
  • Indel Realigner on GVL USE Galaxy:
  • NGS: GATK Tools →Indel Realigner

  • Correct inaccurate base qualities: it is possible after an accurate genome alignment to reassess the quality of our reads. This is a 2 step process:
  • GVL USE Galaxy:
  • NGS: GATK Tools →Count Covariates
  • NGS: GATK Tools →Table Recalibration

Section 3: Variant Calling

Purpose:

in this section we’ll discuss the tools we’ll use to find variations/mutations in our data.

Suggested tools:

  • GATK’s Unified Genotyper: a variant caller which unifies the approaches of several disparate callers. Works for single-sample and multi-sample data. The user can choose from several different incorporated calculation models.

            For more information on the GATK Unified Genotyper, see this tool specific page.

            To learn about best practices for variant detection using GATK, see this overview.

            If you encounter errors, please view the GATK FAQ.

  • GVL USE Galaxy: NGS: GATK Tools → Unified Genotyper

Possible alternative software:

Section 4: Annotation

Purpose:

in this section we’ll discuss how to annotate the results found by the variant callers

Suggested tools:

  • GATK Variant Annotator: annotates variant calls with context information. Users can specify which of the available annotations to use.

For more information on using the VariantAnnotator, see this tool specific page.

To learn about best practices for variant detection using GATK, see this overview.

Variant Annotator on GVL systems:

  • GVL USE Galaxy: NGS: GATK Tools → Variant Annotator

References

[1] Wikipedia: Germline Mutation-http://en.wikipedia.org/wiki/Germline_mutation