
Custom Amplicon Alignment

Protocol

http://genome.edu.au

http://petermac.org

http://vlsci.org.au

Contents

Protocol Overview / Introduction

Section 1: Custom Amplicon Alignment

References

Protocol Overview / Introduction

In this protocol we discuss and outline the process of aligning custom amplicons using primers for high precision.

There are a few concepts you should understand in order to have a clear idea of the protocol’s functionality:

Amplicons

An amplicon is a piece of DNA or RNA that is the source and/or product of natural or artificial amplification or replication events. It can be formed using various methods including polymerase chain reactions (PCR). In this context, "amplification" refers to the production of one or more copies of a genetic fragment or target sequence, specifically the amplicon. As the product of an amplification reaction, "Amplicon" is used interchangeably with common laboratory terms, such as PCR product.[1]

Primers

A primer is a strand of nucleic acid that serves as a starting point for DNA synthesis. It is required for DNA replication because the enzymes that catalyze this process, DNA polymerases, can only add new nucleotides to an existing strand of DNA.[2]

More importantly, specific primers are selected in order to produce (amplify) the targeted amplicons. These primers are present in both the amplicons and the reads, and they are used to make the amplicon-read alignment quicker and more accurate.

Pairwise Alignment

Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). [3]
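As an illustration of global pairwise alignment by dynamic programming (the Needleman-Wunsch scheme), here is a minimal Python sketch. The function name and the scoring values (match 1, mismatch -1, gap -1) are assumptions chosen for the example; they are not the parameters used by the tool.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The table has (len(a) + 1) x (len(b) + 1) cells, which is why this approach is practical for short sequences such as reads and amplicons.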

Edit Distance

Edit distance is a string metric for measuring the difference between two sequences. Informally, the edit distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.[4]
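The definition can be made concrete with a short Python sketch of the Levenshtein distance. The function name is illustrative, and this is not the tool's own implementation.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning s into t."""
    # prev[j] is the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("ACGTACGT", "ACGTTCGT"))  # 1
```

In the context of this protocol, a small distance between the start of a read and an amplicon's primer suggests that the read was produced from that amplicon.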

Reads must be paired-end. The tool has been tested with reads of up to 150 x 150 bp and targeted amplicon regions of up to 250 bp.

Tool Purpose

Some researchers are interested in rapid sequencing of certain regions of a genome. For example, exons of cancer-related genes can be screened for somatic mutations that have arisen in the tumour. Amplicon sequencing is one of the commonly used approaches for such a task. In this approach, regions of interest are amplified by PCR, and the resulting DNA fragments are sequenced on a next-generation sequencing machine. Each amplicon is defined by two primers, so the size and position in the genome are known for each amplicon. The protocol is designed for the analysis of paired-end sequencing data, where both ends of each fragment are sequenced. It is also assumed that barcode adapters have been stripped from the reads.

In a nutshell, the protocol takes the sequencing data, aligns reads to the amplicons using primers for high accuracy, and then clips those primers for further analysis of the aligned reads.

Aimed Technology

This tool was designed for Illumina targeted amplicon sequencing, where primers are left attached to the reads after demultiplexing.

Protocol Pipeline

Read our flowchart of the Protocol.

Tool Overview

The Custom Amplicon Alignment Tool performs the following steps:

Read mates FastQ files F1,F2

Perform BWA adaptive Q30 trimming of the mate reads R1 and R2 from F1 and F2, giving trimmed reads T1 and T2
Test whether the length of each trimmed read (T1, T2) is at least half that of its original (R1, R2); discard if not.
Search amplicons using only primers for high accuracy (primers for amplicons and reads are compared using edit distance)
List closest matches for each read:

If a common amplicon is found for the R1 and R2 primers, assign the reads to that amplicon
If not, search the reference with the whole reads (primer + read) and repeat this step

Check if reads overlap each other using a sliding window algorithm (considering reads may have gaps)

If the reads overlap, trim R1 and R2 with Q15 (a less strict trimming; the extra noise is acceptable because the alignment gains information from using both reads)
If not, treat T1 and T2 as separate reads

Align reads to amplicons (using dynamic programming pairwise alignment [5])

If reads R1 and R2 overlap, align both to the amplicon in a 3-way alignment (also with dynamic programming)

Trim primers from each read
Check the edit distance between reads and amplicons

If the distance is bigger than the threshold, mark the mapping quality as 0

Generate SAM file line including CIGAR
Write to BAM file
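The quality trimming and length test in the steps above can be sketched as follows. This is an illustration of the 3'-end trimming rule used by BWA (`bwa aln -q`), assuming Phred+33 quality strings. The function names and the way the half-length test is expressed are assumptions made for the example, not the tool's actual code.

```python
def bwa_trim_length(quals, threshold=30):
    """Return how many bases to keep after BWA-style 3'-end trimming.

    Walking from the 3' end, accumulate (threshold - quality) and cut at
    the position where that running sum is largest; stop as soon as the
    sum becomes negative.
    """
    running, best, keep = 0, 0, len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += threshold - quals[i]
        if running < 0:
            break
        if running > best:
            best, keep = running, i
    return keep


def trim_read(seq, qual_string, threshold=30, offset=33):
    """Trim a read and its quality string (Phred+33 assumed)."""
    quals = [ord(c) - offset for c in qual_string]
    keep = bwa_trim_length(quals, threshold)
    return seq[:keep], qual_string[:keep]


def passes_length_check(original_seq, trimmed_seq):
    """True if the trimmed read is at least half as long as the original."""
    return 2 * len(trimmed_seq) >= len(original_seq)


# 'I' encodes Q40 and '+' encodes Q10 in Phred+33.
trimmed, quals = trim_read("ACGTAC", "IIII++", threshold=30)
print(trimmed, passes_length_check("ACGTAC", trimmed))  # ACGT True
```

Calling `trim_read` with `threshold=15` corresponds to the less strict Q15 trimming described for overlapping reads.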

Section 1: Custom Amplicon Alignment

Purpose:

The purpose of this section is to show how to use the Custom Amplicon Alignment tool, understand the input data and the best practices to obtain optimal results.

The tool is available in the Galaxy Toolshed; search for “GVL” or “Align Custom Amplicon”.

You’ll need 4 inputs:

The reference genome depending on the organism you’re working on
FastQ file with Mate1 reads
FastQ file with Mate2 reads

Each sequence identifier must match an entry in the amplicon location file and use the following format:

>Chr:start-end

Example

>1:115256345-115256520

ACTGACTGACTGACTGACTG......

…....

>1:115256436-115256606

GACTGACTGACTGACTGACT......

…....

TSV file (no header) with amplicon locations, primer lengths and, optionally, the name of the amplicon

Note that the amplicon locations include the primers.
Example:

#Name_of_amplicon length_left length_right amplicon_name
1:115256345-115256520 23 23 XDW-11
1:115256436-115256606 25 22 RTB-2
1:115256530-115256724 23 23 C1
1:115256532-115256723 23 23 RCW-X
4:55151914-55152086 21 23 TYC-8
4:55151935-55152132 20 23 YU-9
4:55151991-55152182 23 24 V7
4:55591944-55592136 23 24 W4X
4:55592065-55592263 20 23 RXT-22
4:55593504-55593674 24 25 RX-BAN
...
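For readers who want to check an amplicon location file before uploading it, the following Python sketch parses rows in the format shown above. The names `Amplicon`, `parse_amplicon_line` and `parse_amplicon_file` are hypothetical. Splitting on any whitespace and skipping lines that start with `#` are assumptions made for convenience; the tool itself expects a tab-separated file without a header.

```python
import re
from typing import NamedTuple, Optional

# Location in Chr:start-end form, e.g. 1:115256345-115256520
LOCATION = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


class Amplicon(NamedTuple):
    chrom: str
    start: int
    end: int
    left_primer_len: int
    right_primer_len: int
    name: Optional[str]


def parse_amplicon_line(line):
    """Parse one row: location, left primer length, right primer length, [name]."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns: {line!r}")
    match = LOCATION.match(fields[0])
    if match is None:
        raise ValueError(f"location not in Chr:start-end format: {fields[0]!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is after end: {fields[0]!r}")
    name = fields[3] if len(fields) > 3 else None
    return Amplicon(chrom, start, end, int(fields[1]), int(fields[2]), name)


def parse_amplicon_file(lines):
    """Parse all non-empty, non-comment lines."""
    return [
        parse_amplicon_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


rows = parse_amplicon_file([
    "1:115256345-115256520\t23\t23\tXDW-11",
    "4:55151914-55152086\t21\t23",
])
print(rows[0].name, rows[1].left_primer_len)  # XDW-11 21
```

Because the locations include the primers, the two primer lengths tell you how many bases at each end of the amplicon belong to primer sequence rather than to the targeted region.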

The output is a BAM file containing the alignment of the reads with the reference.
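Each alignment record in the output carries a CIGAR string that summarises how the read lines up against the reference. As a rough sketch, a CIGAR string can be derived from a gapped pairwise alignment as shown below. The function name is hypothetical, and how the tool represents clipped primers in its CIGAR strings is not covered here.

```python
from itertools import groupby


def cigar_from_alignment(aligned_ref, aligned_read):
    """Build a CIGAR string from two gapped strings of equal length.

    M = aligned base (match or mismatch), I = base present only in the
    read, D = base present only in the reference.
    """
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned sequences must have the same length")
    ops = []
    for ref_base, read_base in zip(aligned_ref, aligned_read):
        if ref_base == "-" and read_base == "-":
            continue  # a column with two gaps carries no information
        if ref_base == "-":
            ops.append("I")
        elif read_base == "-":
            ops.append("D")
        else:
            ops.append("M")
    # Collapse runs of the same operation into <length><operation>.
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))


print(cigar_from_alignment("ACGTACGT", "AC-TAGGT"))  # 2M1D5M
print(cigar_from_alignment("AC-GT", "ACTGT"))        # 2M1I2M
```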

Although the Custom Amplicon Alignment Tool performs quality control checks and sequence trimming, it is possible to apply stricter filters and controls if considered necessary.

Quality control/Trimming tools:

Quality Control:

FastQC: read QC. Available by default on Galaxy, details here.

Read quality trimming:

Trimmomatic. Available from the Galaxy Toolshed, details here.
Nesoni clip, part of the nesoni suite of bioinformatics tools available from the Victorian Bioinformatics Consortium.

References

[1] Amplicon - Wikipedia: http://en.wikipedia.org/wiki/Amplicon

[2] Primer - Wikipedia: http://en.wikipedia.org/wiki/Primer_%28molecular_biology%29

[3] Sequence Alignment - Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

[4] Levenshtein Distance - Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance

[5] Needleman-Wunsch Algorithm - Wikipedia: http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm