Microbial de novo Assembly for Illumina Data
In this tutorial we cover the concepts of Microbial de novo assembly using a very small synthetic dataset from a well studied organism.
What’s not covered
This tutorial covers the basic aspects of microbial de novo assembly from Illumina paired end or single end reads.
It does not cover more complicated aspects of assembly such as:
Background [15 min]
Read the background to the workshop here
Where is the data in this tutorial from?
The data for this tutorial is from a whole genome sequencing experiment of a multi-drug resistant strain of the bacterium Staphylococcus aureus. The DNA was sequenced using an Illumina GAII sequencing machine. The data we are going to use consists of about 4 million x 75 base-pair, paired end reads (two FASTQ read files, one for each end of a DNA fragment.) The data was downloaded from the NCBI Short Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra/). The specific sample is a public dataset published in April 2012 with SRA accession number ERR048396.
We will also use a FASTA file containing the sequences of the Illumina adapters used in the sequencing process. It is desirable to remove these as they are artificial sequences and not part of the bacterium that was sequenced.
We will use software called Velvet (Zerbino et al 2008) for the main de novo assembly, as well as some other peripheral software for pre- and post-processing of the data. Details of these can be found in the background document linked above.
We are performing a de novo assembly of the read data into contigs and then into scaffolds (appropriately positioned contigs loosely linked together). We firstly need to check the quality of the input data as this will help us choose the most appropriate range of input parameters for the assembly and will guide us on an appropriate quality trimming/cleanup strategy. We will then use an iterative method to assemble the reads using the Velvet Optimiser (a program that performs lots of Velvet assemblies searching for an optimum outcome.) Once this is complete we will obtain summary statistics on the final results (contigs) of the assembly.
The protocol in a nutshell:
Preparation [15 min]
NOTE: If you log out of Galaxy and log back at a later time your data and results from previous experiments will be available in the right panel of your screen called the ‘History’
Completed Galaxy history for this section (in SharedData>Published Histories): Microbial_assembly_complete
Section 1: Quality control [15 mins]
The basic process here is to collect statistics about the quality of the reads in the sample FASTQ readsets. We will then evaluate their quality and choose an appropriate regime for quality filtering using Trimmomatic (a fastq read quality trimmer.)
From the tools menu in the left hand panel of Galaxy, select NGS QC and manipulation > FastQ QC -> FastQC:ReadQC (down the bottom of this category) and run with these parameters:
Note: This may take a few minutes, depending on how busy Galaxy is.
Now repeat the above process on the second read file: ERR048396_2.fastq
It is important to do both read files as the quality can be very different between them.
You should have two output objects from the first step:
These are a html outputs which show the results of all of the tests FastQC performed on the read files.
Click on the “eye” of each of these objects in turn to see the FastQC output.
The main parts of the output to evaluate are:
From the tools menu in the left hand panel of Galaxy, select NGS QC and manipulation > Trimmomatic (down the bottom of this category) and run with these parameters (only the non-default selections are listed here):
You should have 3 objects from the output of Trimmomatic:
Click on the “eye” symbol on one of the objects to look at its contents. You’ll notice that not all of the reads are the same length now, as they have had the illumina adapters cut out of them and they’ve been quality trimmed.
Completed Galaxy history for this section (in SharedData>Published Histories): Microbial_assembly_section1
Section 2: Assemble reads into contigs with Velvet and the Velvet Optimiser [35 min]
The aim here is to:
We will use a single tool: Velvet Optimiser, which takes the trimmed reads from Trimmomatic and performs numerous Velvet assemblies to find the best one. We need to add the reads in two separate libraries. One for the still paired reads and the other for the singleton reads orphaned from their pairs by the trimming process.
From the tools menu in the left hand panel of Galaxy, select NGS: Assembly > Velvet Optimiser VLSCI and run with these parameters (only the non-default selections are listed here):
Once step 1 is complete, you should now have 3 new objects in your history:
Click on the “eye” icon of the various objects.
Contigs: You’ll see the first MB of the file. Note that the contigs are named NODE_XX_length_XXXX_cov_XXX.XXX. This information tells you how long (in k-mer length) each contig is and what it’s average k-mer coverage is. (See detailed explanation of Velvet and Velvet Optimiser for explanation of k-mer coverage and k-mer length.)
Contig stats: This shows a table of the contigs and their k-mer coverages and which read library contributed to the coverage. It is interesting to note that some of them have much higher coverage than the average. These are most likely to be repeated contigs. (Things like ribosomal RNA and IS elements.)
VelvetOptimiser Logfile: Shows detailed information about each Velvet run it performed during the optimisation. Scroll to the very bottom of the file to see a summary of the final assembly.
From the tools menu in the left hand panel of Galaxy, select FASTA Manipulation > Fasta Statistics and run with these parameters:
You should now have one more object in your history:
Click on the “eye” symbol next to this object and have a look at the output. You’ll see a statistical summary of the contigs including various length stats, the % GC content, the n50 as well as the number of contigs and the number of N bases contained in them.
Completed Galaxy history for this section (in SharedData>Published Histories): Microbial_assembly_section2
Section 3: Extension.
Examine the contig coverage depth and blast a high coverage contig against a protein database.
Look at the Contig Stats data (Velvet Optimiser vlsci on data 8, data 9, and data 7: Contig stats) by clicking on the “eye” symbol. Note that column 2 contig length (lgth), shows a number of very short contigs (some are length 1). We can easily filter out these short contigs from this information list by using the Filter and Sort > Filter tool.
The new data object in the history is called: Filter on data 14. Click on its “eye” icon to view it. Look through the list taking note of the coverages. Note that the average of the coverages (column 6) seems to be somewhere between 16 and 32. There are a lot of contigs with coverage 16. We could say that these contigs only appear once in the genome of the bacteria. Therefore, contigs with double this coverage would appear twice. Note that some of the coverages are >400! These contigs will appear in the genome more than 20 times! Lets have a look at one of these contigs and see if we can find out what it is.
Note the contig number (column 1 in the Contig stats file) of a contig with a coverage of over 300. There should be a few of them. We need to extract the fasta sequence of this contig from the contigs multifasta so we can see it more easily. To do this we will use the tool: Fasta manipulation > Fasta Extract Sequence
The new data object in the history is called: Fasta Extract Sequence on data 13: Fasta. Click on its “eye” icon to view it. It is a single sequence in fasta format.
We want to find out what this contig is or what kind of coding sequence (if any) it contains. So we will blast the sequence using the NCBI blast website. (External to Galaxy). To do this:
After a while the website will present a report of the blast run. Note that the sequence we blasted (if you chose NODE_86) is identical to a transposase gene from similar Staphylococcus aureus bacteria. These transposases occur frequently as repeats in bacterial genomes and so we shouldn’t be surprised at its very high coverage.
Completed Galaxy history for this tutorial (in SharedData>Published Histories): Microbial_assembly_complete