Microbial de novo Assembly for Illumina Data

Introductory tutorial

Simon Gladman

http://genome.edu.au

http://vlsci.org.au

Contents

Tutorial Overview

Background [15 min]

Preparation [15 min]

Section 1: Quality control [15 min]

Section 2: Assemble reads into contigs with Velvet and the Velvet Optimiser [35 min]

Section 3: Extension [20 min]

References

Tutorial Overview

In this tutorial we cover the concepts of microbial de novo assembly using a small dataset from a well-studied organism.

What’s not covered

This tutorial covers the basic aspects of microbial de novo assembly from Illumina paired-end or single-end reads.

It does not cover more complicated aspects of assembly such as:

  • Incorporation of other raw data types (454 reads, Sanger reads)
  • Gap filling techniques for “finishing” an assembly
  • Measuring the accuracy of assemblies

Background [15 min]

Read the background to the workshop here

Where is the data in this tutorial from?

The data for this tutorial is from a whole genome sequencing experiment of a multi-drug resistant strain of the bacterium Staphylococcus aureus. The DNA was sequenced using an Illumina GAII sequencing machine. The data we are going to use consists of about 4 million x 75 base-pair paired-end reads (two FASTQ read files, one for each end of a DNA fragment). The data was downloaded from the NCBI Short Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra/). The specific sample is a public dataset published in April 2012 with SRA accession number ERR048396.
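As a rough sanity check, this read count implies deep coverage. The sketch below assumes a genome size of about 2.8 Mbp (a typical figure for S. aureus; the actual genome size is not stated in this tutorial):

```python
# Back-of-the-envelope depth estimate: total sequenced bases divided
# by genome length. The 2.8 Mbp genome size is an assumption typical
# for S. aureus, not a value taken from this dataset.
def expected_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome length."""
    return n_reads * read_length / genome_size

depth = expected_coverage(n_reads=4_000_000, read_length=75,
                          genome_size=2_800_000)
print(f"~{depth:.0f}x expected coverage")  # ~107x
```

Depths of this order are more than enough for a short-read microbial assembly.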

We will also use a FASTA file containing the sequences of the Illumina adapters used in the sequencing process. It is desirable to remove these as they are artificial sequences and not part of the bacterium that was sequenced.

We will use software called Velvet (Zerbino et al 2008) for the main de novo assembly, as well as some other peripheral software for pre- and post-processing of the data. Details of these can be found in the background document linked above.

The protocol:

We are performing a de novo assembly of the read data into contigs and then into scaffolds (appropriately positioned contigs loosely linked together). We firstly need to check the quality of the input data as this will help us choose the most appropriate range of input parameters for the assembly and will guide us on an appropriate quality trimming/cleanup strategy. We will then use an iterative method to assemble the reads using the Velvet Optimiser (a program that performs lots of Velvet assemblies searching for an optimum outcome.) Once this is complete we will obtain summary statistics on the final results (contigs) of the assembly.

Follow this link for an overview of the protocol

The protocol in a nutshell:

  • Input: Raw reads from sequencer run on microbial DNA sample.
  • Output: File of assembled scaffolds/contigs and associated information.

Preparation [15 min]

  1. Register as a new user in Galaxy if you don’t already have an account (what is Galaxy?)
     1. Open a browser and go to the Galaxy server: http://galaxy-tut.genome.edu.au
     2. NOTE: Firefox/Safari/Chrome all work well; Internet Explorer not so well
     3. Register as a new user (User > Register) or log in if you already have an account
  2. Import the DNA read data for the workshop. You can do this in a few ways, of which by far the easiest is:
     1. Go to Shared Data > Published Histories and click on ‘Microbial_assembly_input_data’. Then click ‘Import History’ at top right, wait for the history to be imported to your account, and then ‘start using this history’.
     2. This will create a new Galaxy history in your account with all of the required data files
     3. Proceed to step 3.

Alternately:

  1. You can download the data from the GVL repository using the following URLs, then upload it to Galaxy for use.
  2. Download the following files:
     1. https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/Assembly/ERR048396_1.fastq.gz
     2. https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/Assembly/ERR048396_2.fastq.gz
     3. https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/Assembly/illumina_adapters.fna
  3. Once the files are downloaded, for each of them:
     1. In the Galaxy tools panel, click on Get Data > Upload File
     2. Click ‘Choose File’, locate the local copy of the file and upload it
     3. For the ‘.fastq.gz’ files, make sure you select ‘fastqsanger’ as the file format to tell Galaxy this is a FASTQ file. Galaxy will unzip it automatically. For the ‘.fna’ file, make sure you select ‘fasta’.
     4. Click ‘Execute’ to upload. (This may take a couple of minutes for each file.)

Or:

  1. You can upload the data directly to Galaxy using the file URLs.
     1. In the Galaxy tools panel, click on Get Data > Upload File
     2. Copy the URLs from the previous step into the URL/Text box one at a time and Galaxy will download and import them directly. Remember to select the ‘fastqsanger’ file format for the sequence files and ‘fasta’ for the ‘.fna’ file.
     3. When the files have finished uploading, rename them to ‘ERR048396_1.fastq’, ‘ERR048396_2.fastq’ and ‘illumina_adapters.fna’ respectively by clicking on the pencil icon to the top right of the file name in the right-hand Galaxy panel (the history panel)

  3. You should now have the following files in your Galaxy history:
     • A set of paired-end reads consisting of 2 files:
       • ERR048396_1.fastq
       • ERR048396_2.fastq
     • The Illumina adapter list as illumina_adapters.fna

  4. View the FASTQ files
     1. Click on the eye icon to the top right of each FASTQ file to view the first part of the file
     2. If you’re not familiar with the FASTQ format, click here for an overview
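If it helps to see the format concretely, here is a minimal sketch of parsing a FASTQ record in Python. The record shown is a made-up example, not real data from ERR048396:

```python
# Minimal FASTQ parser: each record is exactly 4 lines (header,
# sequence, '+' separator, quality string). Quality characters are
# Phred+33 encoded (the 'fastqsanger' format selected during upload):
# score = ord(char) - 33.
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_scores) tuples."""
    for i in range(0, len(lines), 4):
        header, seq, _, qual = lines[i:i + 4]
        scores = [ord(c) - 33 for c in qual.strip()]
        yield header[1:].strip(), seq.strip(), scores

record = list(parse_fastq([
    "@ERR048396.1 example",  # hypothetical read ID
    "GATTACA",
    "+",
    "IIIIIII",               # 'I' is ASCII 73, so Phred score 40
]))[0]
print(record)
```

Real FASTQ parsing should use an established library, but the 4-line structure and the Phred+33 offset are the essentials.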

NOTE: If you log out of Galaxy and log back in at a later time, your data and results from previous experiments will be available in the right-hand panel of your screen, called the ‘History’

Completed Galaxy history for this section (in SharedData>Published Histories): Microbial_assembly_complete

Section 1: Quality control [15 min]

The basic process here is to collect statistics about the quality of the reads in the sample FASTQ readsets. We will then evaluate their quality and choose an appropriate regime for quality filtering using Trimmomatic (a fastq read quality trimmer.)

More detailed description of FastQC quality analysis can be found here.

More detailed description of Trimmomatic read quality filtering can be found here.

  1. Run FastQC on both input read files

From the tools menu in the left hand panel of Galaxy, select NGS QC and manipulation > FastQ QC > FastQC:ReadQC (at the bottom of this category) and run with these parameters:

  1. FASTQ reads: e.g. ERR048396_1.fastq
  2. Use the defaults for the other fields
  3. Execute

Screenshot of this process can be seen here.

Note: This may take a few minutes, depending on how busy Galaxy is.

Now repeat the above process on the second read file: ERR048396_2.fastq

It is important to do both read files as the quality can be very different between them.

  2. Examine the FastQC output

You should have two output objects from the first step:

  • FastQC_ERR048396_1.fastqc.html
  • FastQC_ERR048396_2.fastqc.html

These are HTML outputs showing the results of all of the tests FastQC performed on the read files.

Click on the “eye” of each of these objects in turn to see the FastQC output.

The main parts of the output to evaluate are:

  1. Basic statistics. This section tells us that the ASCII quality encoding format used was Sanger/Illumina 1.9, the reads are 75 bases long, and the GC content of the entire file is 35%.
  2. Per base sequence quality. In the plot you should see that most of the early bases are up around the '32' mark and then increase to 38-40, which is very high quality; the spread of quality values for the last few bases increases and some of the outliers have quality scores of less than 30. This is a very good quality dataset. A score of 20 is often used as the cutoff for reliable quality.
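FastQC's per-base quality plot is essentially a summary of the Phred scores at each read position. A minimal sketch of that calculation (using tiny made-up quality strings, and assuming all reads have the same length):

```python
# Mean Phred quality at each base position across a set of reads,
# mirroring the idea behind FastQC's "Per base sequence quality" plot
# (FastQC also reports medians and quartiles, omitted here).
def per_base_mean_quality(qual_strings, offset=33):
    positions = zip(*qual_strings)  # transpose: one tuple per position
    return [sum(ord(c) - offset for c in col) / len(col)
            for col in positions]

# Two toy 4-base quality strings (hypothetical, not from ERR048396):
# 'I' = score 40, 'C' = score 34, '!' = score 0
means = per_base_mean_quality(["IIII", "IIC!"])
print(means)  # [40.0, 40.0, 37.0, 20.0]
```

Note that `zip` truncates to the shortest string, so this sketch only makes sense for fixed-length (untrimmed) reads.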

Screenshot can be seen here

  3. Quality trim the reads using Trimmomatic.

From the tools menu in the left hand panel of Galaxy, select NGS QC and manipulation > Trimmomatic (at the bottom of this category) and run with these parameters (only the non-default selections are listed here):

  1. Direction 1 fastq reads to trim: ERR048396_1.fastq
  2. Direction 2 fastq reads to trim: ERR048396_2.fastq
  3. Quality encoding: phred33
  4. Clip Illumina adapters? checkbox: checked
  5. Fasta of adapters to clip: illumina_adapters.fna
  6. Trim leading bases, minimum quality: 15
  7. Trim trailing bases, minimum quality: 15
  8. Minimum length read: 35
  9. Execute

Screenshot of this process can be seen here.

  4. Examine the Trimmomatic output FastQ files.

You should have 3 objects from the output of Trimmomatic:

  • Trimmomatic on data 3, data 1, and data 2: Dir1 trimmed pairs
  • Trimmomatic on data 3, data 1, and data 2: Dir2 trimmed pairs
  • Trimmomatic on data 3, data 1, and data 2: trimmed reads

Click on the “eye” symbol on one of the objects to look at its contents. You’ll notice that not all of the reads are the same length now, as they have had the Illumina adapters cut out of them and they’ve been quality trimmed.
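As a rough illustration of why read lengths now vary, the sketch below imitates the leading/trailing quality trim (threshold 15) and minimum-length filter (35) chosen above. It is not Trimmomatic's actual implementation: adapter clipping and sliding-window trimming are omitted.

```python
# Sketch of LEADING/TRAILING quality trimming with a minimum-length
# filter, loosely imitating the Trimmomatic settings used above.
def quality_trim(seq, scores, min_q=15, min_len=35):
    # Advance past low-quality bases at the start...
    start = 0
    while start < len(scores) and scores[start] < min_q:
        start += 1
    # ...and back past low-quality bases at the end.
    end = len(scores)
    while end > start and scores[end - 1] < min_q:
        end -= 1
    trimmed = seq[start:end]
    # Reads shorter than min_len after trimming are dropped entirely.
    return trimmed if len(trimmed) >= min_len else None

# Toy read: 40 bases with 2 bad bases (score 2) on each end.
trimmed = quality_trim("A" * 40, [2, 2] + [30] * 36 + [2, 2])
print(len(trimmed))  # 36 bases survive
```

Dropped reads are what break up pairs, which is why Trimmomatic also produces a separate file of orphaned singleton reads.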

Completed Galaxy history for this section (in SharedData>Published Histories): Microbial_assembly_section1

Section 2: Assemble reads into contigs with Velvet and the Velvet Optimiser [35 min]

The aim here is to:

  • Assemble the trimmed reads into contigs/scaffolds using Velvet and the Velvet Optimiser.

We will use a single tool: Velvet Optimiser, which takes the trimmed reads from Trimmomatic and performs numerous Velvet assemblies to find the best one. We need to add the reads in two separate libraries. One for the still paired reads and the other for the singleton reads orphaned from their pairs by the trimming process.

More detailed explanation of Velvet assemblies and the Velvet Optimiser
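One detail worth knowing before choosing a k-mer range: Velvet reports coverage in k-mer units, which are lower than base coverage. The standard conversion, from the Velvet documentation, is Ck = C × (L − k + 1) / L, where L is the read length and k the k-mer size:

```python
# Convert base coverage (C) to Velvet's k-mer coverage (Ck) for reads
# of length L assembled at k-mer size k. A read of length L contains
# L - k + 1 overlapping k-mers, hence the scaling factor.
def kmer_coverage(base_cov, read_len, k):
    return base_cov * (read_len - k + 1) / read_len

# e.g. 100x base coverage, 75 bp reads, k = 55 (the start of our range):
print(kmer_coverage(100, 75, 55))  # 100 * 21 / 75 = 28.0
```

This is why the coverages reported in the Contig stats output later look much lower than the raw sequencing depth.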

  1. De novo assembly of the reads into contigs

From the tools menu in the left hand panel of Galaxy, select NGS: Assembly > Velvet Optimiser VLSCI and run with these parameters (only the non-default selections are listed here):

  1. Start k-mer value: 55
  2. End k-mer value: 69
  3. Click the “Add new Input read libraries” button.
  4. File type selector: shortPaired
  5. Are the reads paired and in two separate files? checkbox: checked
  6. Read dataset for direction 1: 7: Trimmomatic on da..immed pairs
  7. Read dataset for direction 2: 8: Trimmomatic on da..immed pairs
  8. Click the “Add new Input read libraries” button.
  9. Read dataset: 9: Trimmomatic on da..immed reads
  10. Execute

Screenshot of this process can be seen here.

  2. Examine the assembly output

Once step 1 is complete, you should now have 3 new objects in your history:

  • Velvet Optimiser vlsci on data 8, data 9, and data 7: Contigs
  • Velvet Optimiser vlsci on data 8, data 9, and data 7: Contig stats
  • Velvet Optimiser vlsci on data 8, data 9, and data 7: VelvetOptimiser Logfile

Click on the “eye” icon of the various objects.

Contigs: You’ll see the first MB of the file. Note that the contigs are named NODE_XX_length_XXXX_cov_XXX.XXX. This tells you how long (in k-mer length) each contig is and what its average k-mer coverage is. (See the detailed explanation of Velvet and the Velvet Optimiser for an explanation of k-mer coverage and k-mer length.)
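A small sketch of pulling those fields out of a contig name (the example name is made up):

```python
# Split a Velvet contig name of the form NODE_XX_length_XXXX_cov_XXX.XXX
# into its components. Note that both the length and the coverage are
# in k-mer units, not bases.
def parse_contig_name(name):
    parts = name.split("_")
    # e.g. ['NODE', '12', 'length', '4301', 'cov', '27.500000']
    return {"node": int(parts[1]),
            "kmer_length": int(parts[3]),
            "kmer_cov": float(parts[5])}

print(parse_contig_name("NODE_12_length_4301_cov_27.500000"))
```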

Contig stats: This shows a table of the contigs, their k-mer coverages and which read library contributed to the coverage. It is interesting to note that some of them have much higher coverage than the average. These are most likely repeated contigs (things like ribosomal RNA and IS elements).
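A rough way to turn coverage into an estimated copy number is to divide a contig's coverage by the typical single-copy coverage. The numbers below are illustrative, chosen to echo the pattern explored in Section 3:

```python
# Rough copy-number estimate: a repeat present N times in the genome
# accumulates roughly N times the single-copy coverage. The values
# here are made-up illustrations, not results from this assembly.
def estimated_copies(contig_cov, single_copy_cov):
    return round(contig_cov / single_copy_cov)

print(estimated_copies(420.0, 16.0))  # a 420x contig vs a 16x baseline -> 26
```

This is only a heuristic: GC bias and collapsed repeats can distort coverage, so treat the result as a hint, not a count.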

VelvetOptimiser Logfile: Shows detailed information about each Velvet run it performed during the optimisation. Scroll to the very bottom of the file to see a summary of the final assembly.

Screenshots can be seen here.

  3. Calculate some statistics on the assembled contigs

From the tools menu in the left hand panel of Galaxy, select FASTA Manipulation > Fasta Statistics and run with these parameters:

  1. Fasta or multifasta file: Velvet Optimiser .. 7: Contigs
  2. Execute


  4. Examine the Fasta Stats output

You should now have one more object in your history:

  • Fasta Statistics on data 13: Fasta summary stats

Click on the “eye” symbol next to this object and have a look at the output. You’ll see a statistical summary of the contigs including various length statistics, the % GC content and the N50, as well as the number of contigs and the number of N bases contained in them.
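The N50 is probably the least self-explanatory of these statistics. A minimal sketch of how it is computed (the lengths are toy values, not this assembly's contigs):

```python
# N50: the contig length at which contigs of that length or longer
# account for at least half of the total assembly size. Larger N50
# generally means a less fragmented assembly.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy example: total = 1000; 400 + 300 = 700 >= 500, so N50 = 300.
print(n50([100, 200, 300, 400]))  # 300
```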

Completed Galaxy history for this section (in SharedData>Published Histories): Microbial_assembly_section2

Section 3: Extension [20 min]

Examine the contig coverage depth and blast a high coverage contig against a protein database.

  1. Examine the contig coverage depth.

Look at the Contig Stats data (Velvet Optimiser vlsci on data 8, data 9, and data 7: Contig stats) by clicking on the “eye” symbol. Note that column 2, contig length (lgth), shows a number of very short contigs (some are length 1). We can easily filter these short contigs out of the list using the Filter and Sort > Filter tool.

  1. Filter: Velvet Optimiser vlsci on data 8, data 9, and data 7: Contig stats
  2. With the following condition: c2 > 100
  3. Execute

The new data object in the history is called: Filter on data 14. Click on its “eye” icon to view it. Look through the list, taking note of the coverages. Note that the average of the coverages (column 6) seems to be somewhere between 16 and 32. There are a lot of contigs with coverage 16. We could say that these contigs appear only once in the genome of the bacterium; contigs with double this coverage would therefore appear twice. Note that some of the coverages are >400! These contigs will appear in the genome more than 20 times! Let’s have a look at one of these contigs and see if we can find out what it is.
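For reference, the Filter tool's condition c2 > 100 amounts to the following on tab-separated rows (the rows shown are toy data, and handling of any header line is omitted):

```python
# Sketch of the Galaxy Filter step (condition c2 > 100) on
# tab-separated contig-stats rows: keep rows whose second column
# (contig length, in k-mers) exceeds the threshold.
def filter_rows(lines, min_length=100):
    kept = []
    for line in lines:
        cols = line.split("\t")
        if int(cols[1]) > min_length:
            kept.append(line)
    return kept

rows = ["1\t4301\t27.5", "2\t1\t1.0", "3\t250\t16.0"]  # toy rows
print(filter_rows(rows))  # the length-1 contig is dropped
```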

  2. Extract a single sequence from the contigs file.

Note the contig number (column 1 in the Contig stats file) of a contig with a coverage of over 300. There should be a few of them. We need to extract the fasta sequence of this contig from the contigs multifasta so we can see it more easily. To do this we will use the tool: Fasta manipulation > Fasta Extract Sequence

  1. Fasta or multifasta file: Velvet Optimiser .. 7: Contigs
  2. Sequence ID (or partial): NODE_1_ (for example)
  3. Execute

The new data object in the history is called: Fasta Extract Sequence on data 13: Fasta. Click on its “eye” icon to view it. It is a single sequence in fasta format.
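A sketch of what the extraction does under the hood: return the first record whose header begins with the partial ID (the records shown are toy data):

```python
# Sketch of "Fasta Extract Sequence": scan a multifasta and return the
# first record whose header starts with the given partial ID. Using a
# trailing underscore (e.g. "NODE_1_") avoids accidentally matching
# NODE_10, NODE_11, etc.
def extract_sequence(fasta_lines, partial_id):
    record, keep = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            if keep:
                break  # reached the next record; stop collecting
            keep = line[1:].startswith(partial_id)
        if keep:
            record.append(line)
    return "\n".join(record) if record else None

fasta = [">NODE_1_length_99_cov_16.0", "ACGTACGT",
         ">NODE_2_length_50_cov_400.0", "TTTTAAAA"]  # toy records
print(extract_sequence(fasta, "NODE_2_"))
```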

  3. Blast the sequence to determine what it contains.

We want to find out what this contig is or what kind of coding sequence (if any) it contains. So we will blast the sequence using the NCBI blast website. (External to Galaxy). To do this:

  1. Bring up the sequence of the contig into the main window of the browser by clicking on the “eye” icon if it isn’t already.
  2. Select the entire sequence by clicking and dragging with the mouse or by pressing ctrl-a in the browser.
  3. Copy the selected sequence to the clipboard.
  4. Open a new tab of your browser and point it to: http://blast.ncbi.nlm.nih.gov/Blast.cgi
  5. Under the BASIC BLAST section, click “blastx”.
  6. Paste the sequence into the large text box labelled: Enter Accession number(s), gi(s) or FASTA sequence(s).
  7. Change the Genetic code to: Bacteria and Archaea (11)
  8. Click the button labelled: BLAST

After a while the website will present a report of the blast run. Note that the sequence we blasted (if you chose NODE_86) is identical to a transposase gene from similar Staphylococcus aureus bacteria. These transposases occur frequently as repeats in bacterial genomes, so we shouldn’t be surprised by its very high coverage.

A screenshot of the output can be seen here

Completed Galaxy history for this tutorial (in SharedData>Published Histories): Microbial_assembly_complete

References

Zerbino, D.R. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821–829.