Information provided by Dr. Umer Zeeshan Ijaz

For previous tutorials, please check

Tutorials (Click to go to the relevant tutorial):

28/09/2016: Tutorial: Illumina Amplicons Processing using DADA2

16/02/2017: Tutorial: Marker-gene (MetaPhlAn2 pipelines) based whole-genome shotgun sequencing data analysis

18/05/2017: Some more software for metagenomics analysis (Acknowledgment:  Zihan Dai)

13/06/2017: Tutorial: Making a custom taxonomy reference database with BOLDSYSTEMS to work with Qiime

19/06/2017: Tutorial: RNAseq pipeline for generating and processing abundance tables using STAR/TOPHAT2/HISAT2 (human transcripts)

28/09/2016: Tutorial: Illumina Amplicons Processing using DADA2

Required R Package: DADA2

Relevant Paper: DADA2: High-resolution sample inference from Illumina amplicon data

Here is the workflow you can try on Orion cluster ( on my Crohn's disease dataset using DADA2 pipeline which is an OTU free approach and resolves variants by as little as one nucleotide. I have slightly modified the tutorial given at

The dataset is the same as used in the previous tutorial: 09,16/04/2015: Tutorial: Illumina Amplicons OTU Construction with Noise Removal

[uzi@hammett ~]$ mkdir DADA2_TUTORIAL
[uzi@hammett ~]$ cd DADA2_TUTORIAL
[uzi@hammett ~/DADA2_TUTORIAL]$
[uzi@hammett ~/DADA2_TUTORIAL]$ cp /home/opt/tutorials/Raw/*.fastq .

The reference databases for assigning taxonomy are located at:

[uzi@hammett ~/DADA2_TUTORIAL]$ ls -1 /home/opt/DADA2_assignTaxonomy

Run R inside DADA2_TUTORIAL to pick files from the current directory

[uzi@hammett ~/DADA2_TUTORIAL]$ R

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


The code is:


#Get the list of all files in the current directory and sort to ensure forward/reverse reads are in the same order

#Extract the forward and reverse reads
<- fastqs[grepl("_R1", fastqs)]
<- fastqs[grepl("_R2", fastqs)]

#Get the sample names
<- sapply(strsplit(fnFs, "_"), `[`, 1)

# Fully specify the path for the fnFs and fnRs
<- file.path(".", fnFs)
<- file.path(".", fnRs)

#Filter the forward and reverse reads
# Make directory and filenames for the filtered fastqs
<- file.path(".", "filtered")
if(!file_test("-d", filt_path)) dir.create(filt_path)
<- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
<- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))
# Filter
for(i in seq_along(fnFs)) {
 fastqPairedFilter(c(fnFs[i], fnRs[i]), c(filtFs[i], filtRs[i]),
=c(10, 10), truncLen=c(240,160),
=0, maxEE=2, truncQ=2,
=TRUE, verbose=TRUE)

#Dereplicate the filtered fastq reads
<- derepFastq(filtFs, verbose=TRUE)
<- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
<- sample.names
<- sample.names

#Perform joint sample inference and error rate estimation (takes a few minutes):
<- dada(derepFs, err=NULL, selfConsist = TRUE)
<- dada(derepRs, err=NULL, selfConsist = TRUE)

#The learned error rates are as follows

errF <- dadaFs[[1]]$err_out

errR <- dadaRs[[1]]$err_out

#Rerun with the error rates

dadaFs <- dada(derepFs, err=errF)
<- dada(derepRs, err=errR)

#Merge the denoise forward and reverse reads:
<- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

#Construct sequence table
<- makeSequenceTable(mergers)

#Remove chimeric sequences
<- removeBimeraDenovo(seqtab, verbose=TRUE)

#Assign taxonomy:
<- assignTaxonomy(seqtab.nochim, "/home/opt/DADA2_assignTaxonomy/silva_nr_v123_train_set.fa.gz")
<- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
#Save sequence file:

#Save the sequence table:

#Save the sequence taxonomy:

Come out of R and check the files:

[uzi@hammett ~/DADA2_TUTORIAL]$ ls seq*
seqs.fa  seq_table.csv  seq_Taxonomy.csv
[uzi@hammett ~/DADA2_TUTORIAL]$ head seqs.fa
[uzi@hammett ~/DADA2_TUTORIAL]$ head seq_table.csv
[uzi@hammett ~/DADA2_TUTORIAL]$ head seq_Taxonomy.csv
[uzi@hammett ~/DADA2_TUTORIAL]$

16/02/2017: Tutorial: Marker-gene (MetaPhlAn2 pipelines) based whole-genome shotgun sequencing data analysis

This tutorial gives a step-by-step description of the whole-genome shotgun metagenomics sequencing data analysis. Some steps of this tutorial were inspired/adopted from MetaPhlAn2 Pipelines Tutorial ( The main goal is to perform taxonomic assignment using a marker gene method (MetaPhlAn2). Additionally, we also discuss auxillary tools (GraPhlAn:; LEfSe:, and HUMAnN2: that provide different visualisation options for taxonomic assignments, and accurately determine (HUMAnN2) the presence/absence and abundance of microbial pathways in a community from metagenomic data. Please note that this is an updated version of "Whole-genome Shotgun Metagenomics Sequencing Data Analysis" tutorial shared at with the updated version of the above mentioned software.

Since the data generated from whole-genome shotgun sequencing experiments are often huge, and comprising millions of reads for a single sample, it takes very long to process the data. So, for demonstration purposes, we subsample our datasets (20 fecal samples, 10 from healthy children, and 10 from those diagnosed with Crohns disease) to smaller number of reads that can be processed in few hours. We have tried to generalize the processing steps as much as we could by automating the workflow with command-line bash scripting. Thus, it is anticipated that the instructions are repeatable (with minor modifications) on your datasets provided if the input files follow a consistent naming conventions like what we used.

The data is from the following publications (pre-reading required). The sequencing data are available on the European Nucleotide Archive under the study accession number: PRJEB18780 ( with

information about the samples given in Supplementary Table S1 of [2]:

[1] C Quince, UZ Ijaz, N Loman, AM Eren, D Saulnier, J Russell, SJ Haig, ST Calus, J Quick, A Barclay, M Bertz, M Blaut, R Hansen, P McGrogan, RK Russell, C Edwards, and K Gerasimidis. Extensive modulation of the fecal metagenome in children with Crohn's disease during exclusive enteral nutrition. American Journal of Gastroenterology, 110:1718-1729, 2015. DOI:10.1038/ajg.2015.357

[2] UZ Ijaz, C Quince, L Hanske, N Loman, ST Calus, M Bertz, CA Edwards, DR Gaya, R Hansen, P McGrogan, RK Russell, and K Gerasimidis. The microbial 'dysbiosis' of Crohn's disease patients does not occur in their unaffected, genetically linked kindred. PLoS ONE, accepted for publication (2017)

Also note that you don't need to run the commands for "Whole data" as they will take forever to run. Instead use the alternative commands listed under "Subsampled data"

Initial setup and datasets for this tutorial:

On Windows machine (including teaching labs at the University), you will use mobaxterm (google it and download the portable version if you have not done so already) AND NOT putty to log on to the server as it has XServer built into it, allows X11 forwarding, and you can run the softwares that have a GUI. If you are on linux/Mac OS, you can use -Y or -X switch in the ssh command below to allow the same capability.

You can log on to Orion cluster through either of the compute nodes (outside the university, you will need to establish a VPN connection)

ssh -Y

ssh -Y

ssh -Y

When you have successfully logged on, create a folder [YourName], go inside it and do all your processing there!

[MScBioinf@becker ~]$ mkdir uzi

[MScBioinf@becker ~]$ mkdir uzi/WGSTutorial

[MScBioinf@becker ~]$ cd uzi/WGSTutorial

[MScBioinf@becker ~/uzi/WGSTutorial]$

You can find the required files in the folder /home/opt/Mtutorial/Raw/. These files have been preprocessed to ensure high read quality, and have the names [C/H]ID_R[1/2].fasta. You can see that the total number of reads in *_R1.fasta matches the total number of reads in *_R2.fasta for each sample.

[MScBioinf@becker ~/uzi/WGSTutorial]$ grep -c ">" /home/opt/Mtutorial/Raw/*.fasta









































See how we are using match() function to extract the sample names. RSTART and RLENGTH global variables are set when a match is found:

Whole data:

[MScBioinf@becker ~/uzi/WGSTutorial]$ for i in $(ls /home/opt/Mtutorial/Raw/*.fasta | awk 'match($0,/[HC][0-9]+/){ print substr( $0, RSTART, RLENGTH )}'| sort | uniq); do echo $i; done

Subsampled data:

[MScBioinf@becker ~/uzi/WGSTutorial]$ for i in $(ls /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/*/Raw/*.fasta | awk 'match($0,/[HC][0-9]+/){ print substr( $0, RSTART, RLENGTH )}'| sort | uniq); do echo $i; done





















The following instructions change the version of the python as MetaPhlAn2 pipelines are available only for 3.4.0.

[MScBioinf@becker ~/uzi/WGSTutorial]$ export PYENV_ROOT="/home/opt/.pyenv"

[MScBioinf@becker ~/uzi/WGSTutorial]$ export PATH="$PYENV_ROOT/bin:$PATH"

[MScBioinf@becker ~/uzi/WGSTutorial]$ eval "$(pyenv init -)"

[MScBioinf@becker ~/uzi/WGSTutorial]$ pyenv local 3.4.0

[MScBioinf@becker ~/uzi/WGSTutorial]$ python -V

Python 2.7.11 :: Continuum Analytics, Inc.

MetaPhlAn2 offers species-level resolution for bacterial and archaeal organisms by looking for clade-specific markers identified from 7K (check the website for latest count) reference genomes. Run the following commands

Whole data:

[MScBioinf@becker ~/uzi/WGSTutorial]$ for i in $(ls /home/opt/Mtutorial/Raw/*.fasta | awk 'match($0,/[HC][0-9]+/){ print substr( $0, RSTART, RLENGTH )}'| sort | uniq); do echo Processing $i; /home/opt/Mtutorial/Raw/${i}_R1.fasta,/home/opt/Mtutorial/Raw/${i}_R2.fasta -o ${i}_profile.txt --input_type multifasta --nproc 2 --bowtie2out ${i}_metagenome.bowtie2.bz2; done

Subsampled data:

[MScBioinf@becker ~/uzi/WGSTutorial]$ for i in $(ls /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/*/Raw/*.fasta | awk 'match($0,/[HC][0-9]+/){ print substr( $0, RSTART, RLENGTH )}'| sort | uniq); do echo Processing $i; /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/${i}/Raw/${i}_R12.fasta -o ${i}_profile.txt --input_type fasta --nproc 2 --bowtie2out ${i}_metagenome.bowtie2.bz2; done

Finally, the MetaPhlAn2 distribution includes a utility script that will create a single tab-delimited table from these files

[MScBioinf@becker ~/uzi/WGSTutorial]$ /home/opt/metaphlan2/utils/  *_profile.txt > merged_abundance_table.txt

Run the following command to create a species only abundance table, providing the abundance table

[MScBioinf@becker ~/uzi/WGSTutorial]$ grep -E "(s__)|(^ID)" merged_abundance_table.txt | grep -v "t__" | sed 's/^.*s__//g' > merged_abundance_table_species.txt

There are three parts to this command. The first grep searches the file for the regular expression "(s__)|(^ID)" which matches to those lines with species information and also to the header. The second grep does not print out any lines with strain information (labeled as t__). The sed removes the full taxonomy from each line so the first column only includes the species name.

Next generate the species only heatmap by running the following command:

[MScBioinf@becker ~/uzi/WGSTutorial]$ /home/opt/metaphlan2/utils/hclust2/ -i merged_abundance_table_species.txt -o abundance_heatmap_species.png --ftop 50 --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 0.5 -l --flabel_size 6 --slabel_size 6 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 300


The command above includes options to select the top 50 features, use Bray-Curtis as the distance measure both between samples and between features (microbes), set the ratio between the width/height of cells to 0.5, use a log scale for assigning heatmap colors, set the sample and feature label size to 6, set the max sample and feature label length to 100, select the minimum value to display as 0.1, and select an image resolution of 300.

GraPhlAn requires two inputs: (i) a tree structure to represent and (ii) graphical annotation options for the tree.

Run the following command to generate the two input files for GraPhlAn (the tree and annotation files) providing the abundance table (merged_abundance_table.txt) created in prior tutorial steps:

[MScBioinf@becker ~/uzi/WGSTutorial]$ /home/opt/metaphlan2/utils/export2graphlan/ --skip_rows 1,2 -i merged_abundance_table.txt --tree merged_abundance.tree.txt --annotation merged_abundance.annot.txt --most_abundant 100 --abundance_threshold 1 --least_biomarkers 10 --annotations 5,6 --external_annotations 7 --min_clade_size 1

The command above has options to skip rows 1 and 2, select the top 100 most abundance clades to highlight, set a minimum abundance threshold for clades to be annotated, extract a minimum of 10 biomarkers, select taxonomic levels 5 and 6 to be annotated within the tree, select taxonomic level 7 to be used in the external legend, and set the minimum size of clades annotated as biomarkers to 1. The output files created are merged_abundance.tree.txt and merged_abunance.annot.txt.

Run the following commands to generate the cladogram providing the tree ( merged_abundance.tree.txt ) and annotation ( merged_abundance.annot.txt ) files from the prior step

[MScBioinf@becker ~/uzi/WGSTutorial]$ --annot merged_abundance.annot.txt merged_abundance.tree.txt merged_abundance.xml

[MScBioinf@becker ~/uzi/WGSTutorial]$ --dpi 300 merged_abundance.xml merged_abundance.png --external_legends




The first command creates an xml file from the tree and annotation inputs. The second command creates the image, setting the image resolution to 300 and requesting external legends.

To do Lefse analysis we need to reformat the data by changing sample names to reflect the conditions only. This can be done using the following command:

[MScBioinf@becker ~/uzi/WGSTutorial]$ sed 's/\([CH]\)[0-9]*_profile/\1/g' merged_abundance_table.txt > merged_abundance_table_lefse.txt

[MScBioinf@becker ~/uzi/WGSTutorial]$ /home/opt/lefse/ merged_abundance_table_lefse.txt -c 1 -s 2 -u 3 -o 1000000

[MScBioinf@becker ~/uzi/WGSTutorial]$ merged_abundance_table_lefse.res

[MScBioinf@becker ~/uzi/WGSTutorial]$ /home/opt/lefse/ merged_abundance_table_lefse.res merged_abundance_table_lefse.png


[MScBioinf@becker ~/uzi/WGSTutorial]$ /home/opt/lefse/ merged_abundance_table_lefse.res merged_abundance_table_lefse.cladogram.png --format png


HUMAnN2 is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data. The wiki is located at We will run the processing in another folder WGSTutorial2.

Whole data:

[MScBioinf@becker ~/uzi/WGSTutorial2]$ for i in $(ls /home/opt/Mtutorial/Raw/*.fasta | awk 'match($0,/[HC][0-9]+/){ print substr( $0, RSTART, RLENGTH )}'| sort | uniq); do echo Processing $i; cat /home/opt/Mtutorial/Raw/${i}_R1.fasta /home/opt/Mtutorial/Raw/${i}_R2.fasta > tmp.fa; humann2 --input tmp.fa --protein-database /home/opt/humann2_databases/uniref --nucleotide-database /home/opt/humann2_databases/chocophlan --output ${i}_humann2 --threads 20; rm tmp.fa; done

Subsampled data:

[MScBioinf@becker ~/uzi/WGSTutorial2]$ for i in $(ls /home/opt/Mtutorial/Raw/*.fasta | awk 'match($0,/[HC][0-9]+/){ print substr( $0, RSTART, RLENGTH )}'| sort | uniq); do echo Processing $i; humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/${i}/Raw/${i}_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output ${i}_humann2 --threads 2; done

It takes a bit of time to run HUMAnN2 on subsampled data therefore, each student will run only one sample just to get familiar with the syntax. If I run the above with an "echo" statement, I can print the individual commands that students can copy/paste to their terminals

[MScBioinf@becker ~/uzi/WGSTutorial2]$ for i in $(ls /home/opt/Mtutorial/Raw/*.fasta | awk 'match($0,/[HC][0-9]+/){ print substr( $0, RSTART, RLENGTH )}'| sort | uniq); do echo humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/${i}/Raw/${i}_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output ${i}_humann2 --threads 2; done

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C1/Raw/C1_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C1_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C13/Raw/C13_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C13_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C27/Raw/C27_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C27_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C32/Raw/C32_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C32_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C45/Raw/C45_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C45_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C51/Raw/C51_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C51_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C56/Raw/C56_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C56_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C62/Raw/C62_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C62_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C68/Raw/C68_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C68_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/C80/Raw/C80_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output C80_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H117/Raw/H117_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H117_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H119/Raw/H119_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H119_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H128/Raw/H128_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H128_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H130/Raw/H130_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H130_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H132/Raw/H132_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H132_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H134/Raw/H134_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H134_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H139/Raw/H139_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H139_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H140/Raw/H140_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H140_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H142/Raw/H142_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H142_humann2 --threads 2

humann2 --input /home/opt/Mtutorial/MTutorial_Umer/subsampled_s100000/H144/Raw/H144_R12.fasta --protein-database /home/opt/humann2_databases/uniref_ecfiltered --nucleotide-database /home/opt/humann2_databases/chocophlan --output H144_humann2 --threads 2

Now collate all the data using humann2_join_tables. -s switch below searches subdirectories for any files with the name given in --file_name

[MScBioinf@becker ~/uzi/WGSTutorial2]$ humann2_join_tables -s  --input ./ -o collated_pathway_coverage.tsv --file_name pathcoverage

[MScBioinf@becker ~/uzi/WGSTutorial2]$ humann2_join_tables -s  --input ./ -o collated_pathway_abundance.tsv --file_name pathabundance

[MScBioinf@becker ~/uzi/WGSTutorial2]$ humann2_join_tables -s  --input ./ -o collated_gene_families.tsv --file_name genefamilies

You can normalise the tables using humann2_renorm_table. This script provides the choice to normalize to relative abundance or copies per million (CPM) units. Both of these represent "total sum scaling (TSS)"-style normalization: in the former case, each sample is constrained to sum to 1, whereas in the latter case (CPMs) samples are constrained to sum to 1 million. Units out of 1 million are often more convenient for tables with many, many features (such as genefamilies.tsv tables). CPM as used here does not refer to unnormalized COUNTS per million, but rather copies per million. CPMs as used here are a generic analog of the TPM (transcript per million) unit in RNA-seq.

[MScBioinf@becker ~/uzi/WGSTutorial2]$ humann2_renorm_table --input collated_gene_families.tsv --output collated_gene_families_relab.tsv --units relab

To further simplify the exploration of gene family abundance data, users can regroup gene families into other functional categories using humann2_regroup_table. This script takes as arguments a gene family abundance table and a mapping (groups) file that indicates which gene families belong to which groups. Mappings are available for both UniRef90 and UniRef50 gene families to the following systems:

[MScBioinf@becker ~/uzi/WGSTutorial2]$ humann2_regroup_table -i collated_gene_families.tsv -g uniref90_go -o collated_gene_families_go.tsv

[MScBioinf@becker ~/uzi/WGSTutorial2]$ humann2_regroup_table -i collated_gene_families.tsv -g uniref90_ko -o collated_gene_families_ko.tsv

[MScBioinf@becker ~/uzi/WGSTutorial2]$ humann2_regroup_table -i collated_gene_families.tsv -g uniref90_pfam -o collated_gene_families_pfam.tsv

Please note that the final directory structure should look like these folders:

[MScBioinf@becker ~/uzi/WGSTutorial2]$ ls /home/opt/Mtutorial/metaphlan2_whole/WGSTutorial

abundance_heatmap_species.png  C45_metagenome.bowtie2.bz2  C68_profile.txt                  H130_metagenome.bowtie2.bz2  H140_profile.txt                 merged_abundance_table_lefse.cladogram.png

C13_metagenome.bowtie2.bz2         C45_profile.txt                 C80_metagenome.bowtie2.bz2   H130_profile.txt                 H142_metagenome.bowtie2.bz2

C13_profile.txt                    C51_metagenome.bowtie2.bz2  C80_profile.txt                  H132_metagenome.bowtie2.bz2  H142_profile.txt                 merged_abundance_table_lefse.png

C1_metagenome.bowtie2.bz2          C51_profile.txt                 H117_metagenome.bowtie2.bz2  H132_profile.txt                 H144_metagenome.bowtie2.bz2  merged_abundance_table_lefse.res

C1_profile.txt                     C56_metagenome.bowtie2.bz2  H117_profile.txt                 H134_metagenome.bowtie2.bz2  H144_profile.txt                 merged_abundance_table_lefse.txt

C27_metagenome.bowtie2.bz2         C56_profile.txt                 H119_metagenome.bowtie2.bz2  H134_profile.txt                 merged_abundance_annot.png   merged_abundance_table_species.txt

C27_profile.txt                    C62_metagenome.bowtie2.bz2  H119_profile.txt                 H139_metagenome.bowtie2.bz2  merged_abundance.annot.txt   merged_abundance_table.txt

C32_metagenome.bowtie2.bz2         C62_profile.txt                 H128_metagenome.bowtie2.bz2  H139_profile.txt                 merged_abundance_legend.png  merged_abundance.tree.txt

C32_profile.txt                    C68_metagenome.bowtie2.bz2  H128_profile.txt                 H140_metagenome.bowtie2.bz2  merged_abundance.png             merged_abundance.xml

[MScBioinf@becker ~/uzi/WGSTutorial2]$ ls /home/opt/Mtutorial/metaphlan2_whole/WGSTutorial2

C13_humann2  C32_humann2  C56_humann2  C80_humann2                        collated_gene_families_pfam.tsv   collated_pathway_abundance.tsv  H119_humann2  H132_humann2  H140_humann2

C1_humann2   C45_humann2  C62_humann2  collated_gene_families_go.tsv  collated_gene_families_relab.tsv  collated_pathway_coverage.tsv   H128_humann2  H134_humann2  H142_humann2

C27_humann2  C51_humann2  C68_humann2  collated_gene_families_ko.tsv  collated_gene_families.tsv            H117_humann2                        H130_humann2  H139_humann2  H144_humann2

[MScBioinf@becker ~/uzi/WGSTutorial2]$ ls /home/opt/Mtutorial/metaphlan2_subsampled/WGSTutorial

abundance_heatmap_species.png  C45_metagenome.bowtie2.bz2  C68_profile.txt                  H130_metagenome.bowtie2.bz2  H140_profile.txt                 merged_abundance_table_lefse.cladogram.png

C13_metagenome.bowtie2.bz2         C45_profile.txt                 C80_metagenome.bowtie2.bz2   H130_profile.txt                 H142_metagenome.bowtie2.bz2

C13_profile.txt                    C51_metagenome.bowtie2.bz2  C80_profile.txt                  H132_metagenome.bowtie2.bz2  H142_profile.txt                 merged_abundance_table_lefse.png

C1_metagenome.bowtie2.bz2          C51_profile.txt                 H117_metagenome.bowtie2.bz2  H132_profile.txt                 H144_metagenome.bowtie2.bz2  merged_abundance_table_lefse.res

C1_profile.txt                     C56_metagenome.bowtie2.bz2  H117_profile.txt                 H134_metagenome.bowtie2.bz2  H144_profile.txt                 merged_abundance_table_lefse.txt

C27_metagenome.bowtie2.bz2         C56_profile.txt                 H119_metagenome.bowtie2.bz2  H134_profile.txt                 merged_abundance_annot.png   merged_abundance_table_species.txt

C27_profile.txt                    C62_metagenome.bowtie2.bz2  H119_profile.txt                 H139_metagenome.bowtie2.bz2  merged_abundance.annot.txt   merged_abundance_table.txt

C32_metagenome.bowtie2.bz2         C62_profile.txt                 H128_metagenome.bowtie2.bz2  H139_profile.txt                 merged_abundance_legend.png  merged_abundance.tree.txt

C32_profile.txt                    C68_metagenome.bowtie2.bz2  H128_profile.txt                 H140_metagenome.bowtie2.bz2  merged_abundance.png             merged_abundance.xml

[MScBioinf@becker ~/uzi/WGSTutorial2]$ ls /home/opt/Mtutorial/metaphlan2_subsampled/WGSTutorial2

C13_humann2  C32_humann2  C56_humann2  C80_humann2                        collated_gene_families_pfam.tsv   collated_pathway_abundance.tsv  H119_humann2  H132_humann2  H140_humann2

C1_humann2   C45_humann2  C62_humann2  collated_gene_families_go.tsv  collated_gene_families_relab.tsv  collated_pathway_coverage.tsv   H128_humann2  H134_humann2  H142_humann2

C27_humann2  C51_humann2  C68_humann2  collated_gene_families_ko.tsv  collated_gene_families.tsv            H117_humann2                        H130_humann2  H139_humann2  H144_humann2

18/05/2017: Some more software for metagenomics analysis (Acknowledgment:  Zihan Dai)


Fast alignment tool to align DNA reads or protein sequences against a protein reference database:



[zihan@becker ~]$ diamond help

diamond v0.8.2.64 | by Benjamin Buchfink <>

Check for updates.


Syntax: diamond COMMAND [OPTIONS]



makedb Build DIAMOND database from a FASTA file

blastp Align amino acid query sequences against a protein reference database

blastx Align DNA query sequences against a protein reference database

view View DIAMOND alignment archive (DAA) formatted file

help Produce help message

version Display version information


General options:

--threads (-p)             number of CPU threads

--db (-d)              database file

--daa (-a)                 DIAMOND alignment archive (DAA) file

--verbose (-v)         verbose console output

--log                  enable debug log

--quiet                disable console output


Makedb options:

--in                   input reference file in FASTA format

--block-size (-b)          sequence block size in billions of letters (default=2)


Aligner options:

--query (-q)               input query file

--max-target-seqs (-k) maximum number of target sequences to report alignments for

--top                  report alignments within this percentage range of top alignment score (overrides --max-target-seqs)

--compress             compression for output files (0=none, 1=gzip)

--evalue (-e)              maximum e-value to report alignments

--min-score            minimum bit score to report alignments (overrides e-value setting)

--id                   minimum identity% to report an alignment

--query-cover              minimum query cover% to report an alignment

--sensitive            enable sensitive mode (default: fast)

--index-chunks (-c)    number of chunks for index processing

--tmpdir (-t)              directory for temporary files

--gapopen              gap open penalty (default=11 for protein)

--gapextend            gap extension penalty (default=1 for protein)

--matrix               score matrix for protein alignment

--seg                  enable SEG masking of queries (yes/no)

--salltitles               print full subject titles in output files


Advanced options:

--seed-freq            maximum seed frequency

--run-len (-l)             mask runs between stop codons shorter than this length

--max-hits (-C)            maximum number of hits to consider for one seed

--id2                  minimum number of identities for stage 1 hit

--window (-w)              window size for local hit search

--xdrop (-x)           xdrop for ungapped alignment

--gapped-xdrop (-X)    xdrop for gapped alignment in bits

--ungapped-score           minimum raw alignment score to continue local extension

--hit-band             band for hit verification

--hit-score            minimum score to keep a tentative alignment

--band                 band for dynamic programming computation

--shapes (-s)              number of seed shapes (0 = all available)

--index-mode               index mode (0=4x12, 1=16x9)

--fetch-size               trace point fetch size

--single-domain            Discard secondary domains within one target sequence

--dbsize               effective database size (in letters)

--no-auto-append           disable auto appending of DAA and DMND file extensions


View options:

--out (-o)             output file

--outfmt (-f)              output format (tab/sam)

--forwardonly              only show alignments of forward strand



Taxonomic ssequence classifier that assigns taxonomic labels to short DNA reads:



[zihan@becker ~]$ export PATH=/home/opt/kraken/:$PATH

[zihan@becker ~]$ export PATH=/home/opt/jellyfish-1.1.11/bin/:$PATH


[zihan@becker ~]$ kraken --help

Usage: kraken [options] <filename(s)>



  --db NAME               Name for Kraken DB

                          (default: none)

  --threads NUM               Number of threads (default: 1)

  --fasta-input               Input is FASTA format

  --fastq-input               Input is FASTQ format

  --gzip-compressed           Input is gzip compressed

  --bzip2-compressed          Input is bzip2 compressed

  --quick                 Quick operation (use first hit or hits)

  --min-hits NUM              In quick op., number of hits req'd for classification

                          NOTE: this is ignored if --quick is not specified

  --unclassified-out FILENAME

                          Print unclassified sequences to filename

  --classified-out FILENAME

                          Print classified sequences to filename

  --output FILENAME           Print output to filename (default: stdout); "-" will

                          suppress normal output


                          Print no Kraken output for unclassified sequences

  --preload               Loads DB into memory before classification

  --paired                The two filenames provided are paired-end reads

  --check-names               Ensure each pair of reads have names that agree

                          with each other; ignored if --paired is not specified

  --help                  Print this message

  --version               Print version information


If none of the *-input or *-compressed flags are specified, and the

file is a regular file, automatic format detection is attempted.



A platform to analyse and visualize different types of 'omics data:



[zihan@becker ~]$ export PATH=/home/opt/hmmer-3.1b2/src:$PATH

[zihan@becker ~]$ export PATH=/home/opt/Prodigal-2.6.3/:$PATH


[zihan@becker ~]$ export PYENV_ROOT="/home/opt/.pyenv"

[zihan@becker ~]$ export PATH="$PYENV_ROOT/bin:$PATH"

[zihan@becker ~]$ eval "$(pyenv init -)"

[zihan@becker ~]$ pyenv local 3.5.0

[zihan@becker ~]$ anvi-interactive -v

Anvi'o version ...............................: 2.2.2

Profile DB version ...........................: 20

Contigs DB version ...........................: 8

Pan DB version ...............................: 5

Samples information DB version ...............: 2

Genome data storage version ..................: 1

Auxiliary data storage version ...............: 3

Anvi'server users data storage version .......: 1



Tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes:



[zihan@becker ~]$ export PATH=/home/opt/qiime_software/pplacer-1.1-release:$PATH


[zihan@becker ~]$ export PYENV_ROOT="/home/opt/.pyenv"

[zihan@becker ~]$ export PATH="$PYENV_ROOT/bin:$PATH"

[zihan@becker ~]$ eval "$(pyenv init -)"

[zihan@becker ~]$ pyenv local 2.7.9


[zihan@becker ~]$ checkm


                ...::: CheckM v1.0.7 :::...


  Lineage-specific marker set:

        tree         -> Place bins in the reference genome tree

        tree_qa          -> Assess phylogenetic markers found in each bin

    lineage_set  -> Infer lineage-specific marker sets for each bin


  Taxonomic-specific marker set:

    taxon_list   -> List available taxonomic-specific marker sets

        taxon_set        -> Generate taxonomic-specific marker set


  Apply marker set to genome bins:

        analyze          -> Identify marker genes in bins

        qa               -> Assess bins for contamination and completeness


  Common workflows (combines above commands):

    lineage_wf   -> Runs tree, lineage_set, analyze, qa

    taxonomy_wf  -> Runs taxon_set, analyze, qa


  Bin QA plots:

    bin_qa_plot  -> Bar plot of bin completeness, contamination, and strain heterogeneity


  Reference distribution plots:

        gc_plot          -> Create GC histogram and delta-GC plot

    coding_plot  -> Create coding density (CD) histogram and delta-CD plot

    tetra_plot   -> Create tetranucleotide distance (TD) histogram and delta-TD plot

        dist_plot        -> Create image with GC, CD, and TD distribution plots together


  General plots:

        nx_plot          -> Create Nx-plots

        len_plot         -> Cumulative sequence length plot

        len_hist         -> Sequence length histogram

    marker_plot  -> Plot position of marker genes on sequences

        par_plot         -> Parallel coordinate plot of GC and coverage

        gc_bias_plot -> Plot bin coverage as a function of GC


  Sequence subspace plots:

        cov_pca          -> PCA plot of coverage profiles

        tetra_pca        -> PCA plot of tetranucleotide signatures


  Bin exploration and modification:

        unique           -> Ensure no sequences are assigned to multiple bins

        merge        -> Identify bins with complementary sets of marker genes

    bin_compare  -> Compare two sets of bins (e.g., from alternative binning methods)

        bin_union        -> [Experimental] Merge multiple binning efforts into a single bin set

        modify           -> [Experimental] Modify sequences in a bin

        outliers         -> [Experimental] Identify outlier in bins relative to reference distributions


  Utility functions:

        unbinned         -> Identify unbinned sequences

        coverage         -> Calculate coverage of sequences

        tetra        -> Calculate tetranucleotide signature of sequences

        profile          -> Calculate percentage of reads mapped to each bin

    join_tables  -> Join tab-separated value tables containing bin information

        ssu_finder   -> Identify SSU (16S/18S) rRNAs in sequences


  Use: 'checkm data' to find, download and install database updates


  Use: checkm <command> -h for command specific help



An automated phylogenomic inference application for large-scale protein phylogenetic analysis:



[zihan@becker ~]$ perl /home/opt/AMPHORA2/Scripts/ -h


This tool will search for bacterial and archaeal phylogenetic markers for a given fasta sequence file.

The output of this tool is a collection of marker protein sequences in fasta format. For example, rpoB.pep, dnaG.pep.


When DNA sequences are used, this program first identifies ORFs longer than 100 bp in all six reading frames, then scans the translated peptide sequences for the phylogenetic markers.


The default is assuming that the query sequences contain both bacterial and archaeal sequences. If you know your query sequences only contain bacterial sequences, then use the -Bacteria option. If you know your query sequences only contain archaeal sequences, then use the -Archaea option. This makes the program run faster and the results will be more accurate.


Usage: /home/opt/AMPHORA2/Scripts/ <options> sequence-file



-DNA: input sequences are DNA. Default: no.

-Evalue: HMMER evalue cutoff. Default: 1e-7

-Bacteria: input sequences are Bacterial sequences

-Archaea: input sequences are Archaeal sequences

-ReferenceDirectory: the file directory that contain the reference alignments, hmms and masks. Default: /home/opt/AMPHORA2/Marker

-Help: print help;

[zihan@becker ~]$ perl /home/opt/AMPHORA2/Scripts/ -h


Usage:  /home/opt/AMPHORA2/Scripts/ <options>


Assume the Phylogenetic Marker Database is located at /home/opt/AMPHORA2



-Trim: trim the alignment using masks embedded with the marker database. Default: no

-Cutoff: the Zorro masking confidence cutoff value (0 - 1.0; default: 0.4);

-ReferenceDirectory: the file directory that contain the reference alignments, hmms and masks. Default: /home/opt/AMPHORA2/Marker

-Directory: the file directory where sequences to be aligned are located. Default: current directory

-OutputFormat:  output alignment format. Default: phylip. Other supported formats include: fasta, stockholm, selex, clustal

-WithReference:  keep the reference sequences in the alignment. Default: no

-Help: print help


[zihan@becker ~]$ perl /home/opt/AMPHORA2/Scripts/ -h


This tool will assign each identified marker sequence a phylotype using the evolutionary placment algorithm of raxml


Usage: /home/opt/AMPHORA2/Scripts/ <options>



-Method: use 'maximum likelihood' (ml) or 'maximum parsimony' (mp) for phylotyping. Default: ml

-CPUs: turn on the multiple thread option and specify the number of CPUs/cores to use. If your computer has multiple CPUs/cores, you can speed up the phylotyping process by running this script on multiple cores. Important: Make sure raxmlHPC-PTHREADs is installed. If the number specified here is larger than the number of cores that are free and available, it will actually slow down the script.

-Help: print help;

13/06/2017: Tutorial: Making a custom taxonomy reference database with BOLDSYSTEMS to work with Qiime

Acknowledgement: John Carey (NUI Galway)

The BOLD platform contains a set of integrated databases that offer users access to primary BOLD data as well as supplementary information important for their research. It has sequences from the 5’ region of the mitochondrial gene COI, ITS  for fungal barcodes and rbcL and matK  for plant barcodes. To download the sequences and make a custom taxonomy file to work with Qiime, follow these steps:

Step 1: Go to BOLDSYSTEMS website and click on “Public Data Portal”


Step 2: Search for taxonomic level of interest, for example, here we would like to obtain all the sequences associated with the phylum Arthropoda


Step 3: Download the FASTA sequences (fasta.fas) and their annotation (bold_data.txt) by clicking on the links


Step 4: Format the data using the following one-liners so that it is acceptable by Qiime, i.e., we generate arthropoda.fasta and the corresponding taxonomy file

First get rid of “-” as some of these are aligned sequences, and just retain the IDs

bioawk -cfastx '{gsub("\\|.*","",$name);gsub("-","",$seq);print ">"$name"\n"$seq}' fasta.fas > arthropoda_tmp.fasta

Now pull out the taxonomy from the annotation file

awk -F"\t" 'NR>1{a=$1"\t"$9";"$11";"$13";"$15";"$17";"$19";"$21";";gsub("[; ]+$","",a);print a}' bold_data.txt >

We then filter out the records from taxonomy file that don’t exist in arthropoda.fasta, as well as remove any redundant sequences (i.e., where the FASTA file contains multiple sequences with the same ID, we only consider the first one)

bioawk -cfastx '{print $name}' arthropoda_tmp.fasta > IDS.txt

awk -F"\t" 'BEGIN{while((getline k < "IDS.txt")>0)i[k]=1}{if(i[$1]) print $0}' >

awk -F"\t" '{print $1}' > IDS.txt

bioawk -cfastx 'BEGIN{while((getline k < "IDS.txt")>0)i[k]=1}{if(i[$1] && (!j[$1]) ) {print ">"$name"\n"$seq; j[$1]=1;}}' arthropoda_tmp.fasta > arthropoda.fasta

And then finally delete the temporary files

rm IDS.txt; rm arthropoda_tmp.fasta; rm 

Step 5: Generate BLAST database

makeblastdb -in arthropoda.fasta -dbtype 'nucl'

Step 6: If you have an abundance table, otu_table.csv, and corresponding otu sequences, otus.fa,  then assign taxonomy in Qiime as -i otus.fa -b /path_to_ref_data/arthropoda.fasta -t /path_to_ref_data/ -m blast -o rep_set_taxonomy

Note: We have two amplicon workflows, OTUs based: 09,16/04/2015: Tutorial: Illumina Amplicons OTU Construction with Noise Removal, and OTUs free: 28/09/2016: Tutorial: Illumina Amplicons Processing using DADA2. Depending on which workflow you have used, change the data accordingly.

Step 7: We can’t use our abundance table with Qiime as is, so we convert the table in a format that understands. Here is my workaround:

awk -F"," 'NR==1{for(i=2;i<=NF;i++){conv[i]=$i}}NR>1{printf $1;for(i=2;i<=NF;i++){for(j=1;j<=$i;j++){printf "\t"conv[i];}}printf "\n";}' otu_table.csv | tr -d "\r" > map.txt

Step 8: Generate the biome file -i map.txt -t rep_set_taxonomy/otus_tax_assignments.txt  -o otu_table.biom

19/06/2017: Tutorial: RNAseq pipeline for generating and processing abundance tables using STAR/TOPHAT2/HISAT2 (human transcripts)

Acknowledgements: Elizabeth McDonald, Hussain Jaffery, Kim Bain

Here is the summary of the workflow:


We trim and filter the paired-end sequencing reads using Sickle by applying a sliding window approach and trimming regions where the average base quality drops below 20 and also discarded reads where length went below 10bp after trimming. We then used either of the STAR, TOPHAT2 or HISAT2, considered as reputable splice-aware alignment algorithms, by aligning reads against the Human Genome Release 25 (GRCh38.p7) comprising a total of 58037 genes (198093 transcripts) including non-coding RNA genes. The sorted alignments and the GTF file for Release 25 are then used with htseq-count from HTSeq with the settings -m union -i gene_id -a 10 --stranded=no (as per recommendations given in the manual) to generate sample-wise transcripts counts with ENSEMBL gene id as identifiers. We then use a custom script ( to collate the transcripts from all samples together to generate an N X P transcripts abundance table, which can then be subjected to downstream statistical analysis.


The statistical analysis is performed in R. The abundance table is first subjected to differential analysis using the DESeq2 package, which is based on a model using the negative binomial distribution to report transcripts that are log-fold different between treatment groups after using Wald statistics. To get the relative importance of these differentially-expressed transcripts, we train a Random Forest classifier using the randomForest package and then generate statistics such as Mean Decreasing Accuracy or Mean Gini Index to assign importance. Afterwards, we use the biomaRt package to map the differentially-expressed transcripts to entrez gene ids and use them in topGo package, a tool for enrichment analysis of GO terms, to obtain GO terms that were over-expressed as a result of enrichment of transcripts while accounting for the topology of the GO graph. Next, the transcripts IDs are mapped to KEGG K enzyme IDs and a list of reference pathways are obtained using the kegg.sets.hs database  from the gageData package. Finally, the  pathview package is used to render pathway graphs with the mapped enzymes showing log fold changes using the list of reference pathways generated earlier.

Code (you can run these instructions on Orion cluster):

My philosophy is to have a folder structure where the folder name are sample names, and within each folder I have a subfolder called "Raw" where I keep the raw paired-end FASTQ files (can be from multiple runs). I use bash scripting by embedding commands in for i in $(ls -d *); do cd $i; <COMMAND>; cd ..; done structure so that they get applied to each folder. Here is the RNAseq pipeline then:

Step 1: Unzipping all the files by going through each folder:

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i; gunzip *.gz; cd ..; done

Step 2: Organize FASTQ files in a Raw folder within each folder

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i; mkdir Raw; mv *.fastq Raw/. ;cd ..; done

Please note that Steps 1 and 2 are for the case when you already have respective folders for each sample containing all the forward and reverse sequences in the zipped format. However, if the forward and reverse sequences for all the samples are zipped individually and dumped in a single folder, then you can use this one-liner instead instead to generate the folders from the names of the files, and move the sequences to the Raw folder within each folder: for i in $(awk -F"_" '{print $1}' <(ls *.fastq) | sort | uniq); do mkdir $i; mkdir $i/Raw; mv $i*.fastq $i/Raw/.; done

Step 3: We need to combine all the lane information for forward and reverse reads:

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i; cat Raw/*_L00?_R1_*.fastq > ${i}_R1.fastq; cat Raw/*_L00?_R2_*.fastq > ${i}_R2.fastq; cd ..; done

Step 4: Check if we require trimming (if you come up with a distribution then it means that your data is already trimmed. Some sequencing centres trim the data for you).

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do echo $i":";bioawk -cfastx '{i[length($seq)]++}END{for(j in i)print j","i[j]}' $i/*_R1.fastq | sort -nrk2 -t",";done

Step 5: If your dataset requires trimming then use sickle and chop off the end bits when the average Phred Quality Score goes below 20.

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i; R1=$(ls *_R1.fastq); R2=$(ls *_R2.fastq); cd .. ; sickle pe -f $R1 -r $R2 -o ${R1%.*}_trim.fastq -p ${R2%.*}_trim.fastq -s ${R1%.*}_singlet.fastq -q 20 -l 10 -t "sanger";cd ..; done

Step 6: Now change python version for STAR/TopHat2/HISAT2/HTSeq to work.

[uzi@becker ~/Tutorial]$ bash

[uzi@becker ~/Tutorial]$ python -V

Python 2.6.6

[uzi@becker ~/Tutorial]$ export PYENV_ROOT="/home/opt/.pyenv"

[uzi@becker ~/Tutorial]$ export PATH="$PYENV_ROOT/bin:$PATH"

[uzi@becker ~/Tutorial]$ eval "$(pyenv init -)"

[uzi@becker ~/Tutorial]$ pyenv version 3.4.0

The reference FASTA (sequence) and GTF file (annotation) for the latest release are located here (that I used in HISAT2):



Note that for other software, I am using the previous version of the human database.

We can also use this draft pipeline to analyse transcripts from Mus musculus, Rattus norvegicus, Bos taurus, Canis familiaris, Gallus galls, Drosophilia melanogaster, Arabidopsis thaliana, Caenorhabditis elegant, and Saccharomyces cerevisiae. The reference FASTA and GTF files are available at

Now use any of the following three steps:

Step 7A: Run TOPHAT2 with 20 processors

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i;tophat2 -p 20 /home/opt/Tophat_Reference_DB/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome *_R1.fastq *_R2.fastq;cd ..; done

Step 7B: Run TOPHAT2 with 20 processors using parameters recommendations given at

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i; tophat2 -p 20 --read-mismatches 2  --b2-very-fast --mate-inner-dist 200 --splice-mismatches 1 --microexon-search --no-novel-juncs /home/opt/Tophat_Reference_DB/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome *_R1.fastq *_R2.fastq;cd ..; done

Step 7C: Run STAR with 20 processors

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i; STAR --genomeDir  /home/opt/Tophat_Reference_DB/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex --readFilesIn *_R1.fastq *_R2.fastq --runThreadN 40 --outSAMtype BAM SortedByCoordinate;cd ..; done

Step 8: (Only required for TOPHAT2) We need to sort the hits file for HTSeq to work

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i;samtools view -H tophat_out/accepted_hits.bam > tophat_out/header.sam; samtools reheader tophat_out/header.sam tophat_out/accepted_hits.bam > tophat_out/output.bam;samtools sort tophat_out/output.bam tophat_out/sorted_output;cd ..; done

Step 9: Now generate the transcripts count

The settings I have used are as follows:

-m union: union is one of three options for htseq count for binning of reads to features (genes/exons).  it is the recommended option per htseq count’s manual (  It is the most stringent option as it will only count a read if it aligns to only 1 gene.

-i gene_id : tells htseq-count to use the ENSEMBL gene id in output file

-a 10 : tells htseq-count to skip reads with alignment score less than 10.

–stranded=no : the RNAseq reads are not strand specific

Here is the visual description of the -m union switch (from the above link):



[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i;htseq-count -f bam -m union -i gene_id -a 10 --stranded=no tophat_out/sorted_output.bam /home/opt/Tophat_Reference_DB/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > output_basename.counts;cd ..; done


[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i;htseq-count -f bam -m union -i gene_id -a 10 --stranded=no Aligned.sortedByCoord.out.bam /home/opt/Tophat_Reference_DB/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > output_basename.counts;cd ..; done

The count file output_basename_counts generated in each folder looks like this:

[uzi@becker ~/Tutorial/S1]$ head output_basename.counts

ENSG00000000003 0

ENSG00000000005 0

ENSG00000000419 121

ENSG00000000457 20

ENSG00000000460 41

ENSG00000000938 932

ENSG00000000971 38

ENSG00000001036 252

ENSG00000001084 187

ENSG00000001167 395

Step 7-9 all in one step for HISAT2 (with the new database): Run HISAT2 with 20 processors

[uzi@becker ~/Tutorial]$ for i in $(ls -d *); do cd $i;hisat2 /home/opt/Tophat_Reference_DB/GRCh38/Hisat2/GRCh38.p7 -p 20 -1 *_R1.fastq -2 *_R2.fastq > hisat2_alignments.sam;samtools view -b -S hisat2_alignments.sam | samtools sort - hisat2_alignments.sorted;htseq-count -f bam -m union -i gene_id -a 10 --stranded=no hisat2_alignments.sorted.bam /home/opt/Tophat_Reference_DB/GRCh38/gencode.v25.chr_patch_hapl_scaff.annotation.gtf > output_basename_HISAT2.counts;cd ..; done

Step 10: Collate all the abundance counts together. We can use my to collate all the counts together to generate a single abundance table. Check  for usage information.

[uzi@becker ~/Tutorial]$ (for i in $(ls -d *); do cd $i; if [ -f output_basename.counts ]; then awk -v k=$i '{if ($2>0){print k"\t"$1"\t"$2}}' output_basename.counts; fi; cd ..; done) | | tr "\t" "\n" > ~/abund_table.csv

Your abundance table (abund_table.csv) should look like this (if you open it up in Microsoft Excel):


Imagine we also have a meta data file (meta_table.csv) that contains categorical information about which treatment groups these samples belong to:


We next perform the statistical analysis in R by comparing these two treatment groups. Here is the code (and in between the output it generates):

#Load abundance table



#Load metadata table with treatment groups info


#Extract samples that are common and in same order in both tables

abund_table<-abund_table[rownames(abund_table) %in% rownames(meta_table),]



# We have used the following awk script to generate transcript lengths (biomart didn't return lengths for some of the transcripts and hence using this awk code)

# awk '$3=="exon"{gsub(";|\"","",$10);gsub("\\..*$","",$10);L[$10]+=($5-$4+1);M[$10]+=1} END {for (i in L){print i"\t"L[i]"\t"M[i]}}' gencode.v25.chr_patch_hapl_scaff.annotation.gtf > Length_Transcripts.tsv



# counts versus TPM/RPKM/FPKM







for(i in rownames(abund_table)){




  # Calculate RPKM using ddply

  rpkm.X<-ddply(X, .(gene), summarize, rpkm = (count*1e6)/((sum(X$count)*length)))




  # Calculate TPM

  (Tn.X <- sum(ddply(X, .(gene), summarize, Tn = count/length)[2]))

  tpm.X <- ddply(X, .(gene), summarize, tpm = (count*1e6)/(Tn.X*length))





  if(is.null(abund_table_RPKM)){abund_table_RPKM<-rpkm.X} else {abund_table_RPKM<-cbind(abund_table_RPKM,rpkm.X)}

  if(is.null(abund_table_TPM)){abund_table_TPM<-tpm.X} else {abund_table_TPM<-cbind(abund_table_TPM,tpm.X)}






#/counts versus TPM/RPKM/FPKM

#We create a dummy variable called Groups to which we assign the column

#containing treatment types of the samples


#Parameters that you can play with########

sig = 0.05 #Adjusted p-value

fold = 2   #Minimum log-fold differences



#We will convert our table to DESeqDataSet object

countData = round(as(abund_table, "matrix"), digits = 0)

# We will add 1 to the countData otherwise DESeq will fail with the error:

# estimating size factors

# Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :

# every gene contains at least one zero, cannot compute log geometric means


dds <- DESeqDataSetFromMatrix(countData, meta_table, as.formula(~ Groups))


#Differential expression analysis based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution

#Some reason this doesn't work: data_deseq_test = DESeq(dds, test="wald", fitType="parametric")

#If it doesn't work use betaPrior=F

data_deseq_test = DESeq(dds,betaPrior=F)

#data_deseq_test = DESeq(dds)

## Extract the results

res = results(data_deseq_test, cooksCutoff = FALSE)

res_tax = cbind(, as.matrix(countData[rownames(res), ]), TRANSCRIPT = rownames(res))

plot.point.size = 2


tax.display = NULL

tax.aggregate = "TRANSCRIPT"

res_tax_sig = subset(res_tax, padj < sig & fold < abs(log2FoldChange))

res_tax_sig <- res_tax_sig[order(res_tax_sig$padj),]


## Plot the data

### MA plot

res_tax$Significant <- ifelse(rownames(res_tax) %in% rownames(res_tax_sig) , "Yes", "No")

res_tax$Significant[$Significant)] <- "No"

p1 <- ggplot(data = res_tax, aes(x = baseMean, y = log2FoldChange, color = Significant)) +

  geom_point(size = plot.point.size) +

  scale_x_log10() +

  scale_color_manual(values=c("black", "red")) +

  labs(x = "Mean abundance", y = "Log2 fold change")+theme_bw()

if(label == T){

  if (!is.null(tax.display)){

        rlab <- data.frame(res_tax, Display = apply(res_tax[,c(tax.display, tax.aggregate)], 1, paste, collapse="; "))

  } else {

        rlab <- data.frame(res_tax, Display = res_tax[,tax.aggregate])


  p1 <- p1 + geom_text(data = subset(rlab, Significant == "Yes"), aes(label = Display), size = 4, vjust = 1)





res_tax_sig_abund = cbind([rownames(res_tax_sig), ]), TRANSCRIPT = rownames(res_tax_sig), padj = res_tax[rownames(res_tax_sig),"padj"])

#For Visualisation, Apply normalisation (use either of the log-relative/RPKM/TPM transformation)








ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="")

mart<- useDataset("hsapiens_gene_ensembl", ensembl)

#use listAttributes(mart) or listFilters(mart)

mapping_table<-getBM(filters="ensembl_gene_id", attributes= c("ensembl_gene_id", "entrezgene","external_gene_name", "description"),values=colnames(abund_table),mart=mart)

mapping_table$description<-gsub(" \\[Source:.*","",mapping_table$description)

#Find the ENSEMBL IDs that match to kegg database and make a mapping table


kegg_mapping_table<-kegg_mapping_table[!grepl(" NA$",kegg_mapping_table)]

kegg_mapping_table<-data.frame(row.names=NULL,t(," "))))



#Now we plot transcripts significantly different between the categories


for(i in sig_transcripts){


  tmp<-data.frame(data[,i],meta_table$Groups,rep(paste(i," padj = ",sprintf("%.5g",res_tax[i,"padj"]),"\n",mapping_table[mapping_table$ensembl_gene_id==i,"external_gene_name"]," ",mapping_table[mapping_table$ensembl_gene_id==i,"description"],sep=""),dim(data)[1])) 

  if(is.null(df)){df<-tmp} else { df<-rbind(df,tmp)}




p<-p+geom_boxplot(outlier.size = 0)+geom_jitter(position = position_jitter(height = 0, width=0),alpha=0.5,outlier.colour = NULL )+theme_bw()+

  facet_wrap( ~ Transcripts , scales="free",nrow=1)

p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+theme(strip.text.x = element_text(size = 16, colour = "black", angle = 90))



I am just showing the left half of the PDF:


#Now we will use Breiman's random forest algorithm that can be used in unsupervised mode for assessing proximities

#among data points and will be used to rank the differentially abundant transcripts in terms of their importance.

#For this purpose we will use the normalised data (you can use the other possible normalisations as well).


#Extract only those transcripts that were found to be significant in the previous step<-data[,as.character(res_tax[rownames(res_tax_sig),"TRANSCRIPT"])]

#Generate a mapping from ENSEMBL IDs to their detailed description (will be required for visualisation)

IDs_map<-data.frame(row.names=colnames(,To=as.character(sapply(colnames(, function(x) paste(x,mapping_table[mapping_table$ensembl_gene_id==x,"external_gene_name"]," ",mapping_table[mapping_table$ensembl_gene_id==x,"description"],sep=""))))


val<-randomForest( meta_table$Groups ~ .,, importance=T, proximity=T,ntree=1500,keep.forest=F)

#We then extract the importance measures as produced by randomForest()

#From the manual of importance():

#"The first measure is computed from permuting OOB data:

#For each tree, the prediction error on the out-of-bag portion of the data is recorded

#(error rate for classification, MSE for regression). Then the same is done after

#permuting each predictor variable. The difference between the two are then averaged

#over all trees, and normalized by the standard deviation of the differences. If the

#standard deviation of the differences is equal to 0 for a variable, the division is

#not done (but the average is almost always equal to 0 in that case).

#The second measure is the total decrease in node impurities from splitting on the

#variable, averaged over all trees. For classification, the node impurity is measured

#by the Gini index. For regression, it is measured by residual sum of squares."


df_accuracy<-data.frame(row.names=NULL,Sample=rownames(imp),Value=abs(as.numeric(imp[,"MeanDecreaseAccuracy"])),Index=rep("Mean Decrease Accuracy",dim(imp)[1]))

df_gini<-data.frame(row.names=NULL,Sample=rownames(imp),Value=as.numeric(imp[,"MeanDecreaseGini"]),Index=rep("Mean Decrease Gini",dim(imp)[1]))

#Rearrange the features in terms of importance for ggplot2 by changing factor levels







#Plot Mean Decrease Accuracy statistic



r<-r+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+ylab("Mean Decrease Accuracy")




I am just showing the left half of the PDF:


#Plot "Mean Decrease Gini" Statistic



s<-s+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+ylab("Mean Decrease Gini")




I am just showing the left half of the PDF:






GOdata <- new("topGOdata", ontology = "BP", allGenes = setNames(res[,"padj"], rownames(res)), geneSel = function(p) p <= 0.05, description = "Test", annot =, mapping="", ID="Ensembl")

resultFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")

GenTable(GOdata, classicFisher = resultFisher, topNodes = 20)

Here is the output from running the above two commands:

> GOdata <- new("topGOdata", ontology = "BP", allGenes = setNames(res[,"padj"], rownames(res)), geneSel = function(p) p <= 0.05, description = "Test", .... [TRUNCATED]

Building most specific GOs .....    ( 9600 GO terms found. )

Build GO DAG topology ..........    ( 12994 GO terms and 30228 relations. )

Annotating nodes ...............    ( 14383 genes annotated to the GO terms. )

> resultFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")

                     -- Classic Algorithm --

             the algorithm is scoring 2653 nontrivial nodes


                     test statistic:  fisher

> GenTable(GOdata, classicFisher = resultFisher, topNodes = 20)

            GO.ID                                            Term Annotated Significant Expected

1  GO:0044707           single-multicellular organism process          5392              83           NA

2  GO:0048731                              system development          3552              62           NA

3  GO:0032501                multicellular organismal process          5564              83           NA

4  GO:0007275            multicellular organismal development          4046              67           NA

5  GO:0007423                       sensory organ development           425              17           NA

6  GO:0009887                             organ morphogenesis           754              23           NA

7  GO:0022409 positive regulation of cell-cell adhesio...            36               6           NA

8  GO:0098602                   single organism cell adhesion           352              15           NA

9  GO:0007155                                   cell adhesion           973              26           NA

10 GO:0048513                               organ development          2595              48           NA

11 GO:0022610                             biological adhesion           977              26           NA

12 GO:0009653              anatomical structure morphogenesis          2154              42           NA

13 GO:0098609                              cell-cell adhesion           158              10           NA

14 GO:0045165                            cell fate commitment           202              11           NA

15 GO:0042471                               ear morphogenesis           102               8           NA

16 GO:0043583                                 ear development           177              10           NA

17 GO:0016337            single organismal cell-cell adhesion           308              13           NA

18 GO:0097186                                    amelogenesis            15               4           NA

19 GO:0044767           single-organism developmental process          4615              68           NA

20 GO:0007399                      nervous system development          1771              35           NA


1            3.1e-08

2            1.1e-07

3            1.6e-07

4            1.9e-07

5            5.9e-07

6            6.6e-07

7            1.0e-06

8            1.3e-06

9            1.4e-06

10           1.4e-06

11           1.5e-06

12           2.3e-06

13           2.7e-06

14           3.6e-06

15           5.7e-06

16           7.4e-06

17           7.7e-06

18           9.9e-06

19           1.4e-05

20           1.6e-05




log2FoldChangeTable<-res_tax_sig[rownames(res_tax_sig) %in% kegg_mapping_table$ensembl_gene_id,c("log2FoldChange"),drop=F]

log2FoldChangeTable<-log2FoldChangeTable[rownames(log2FoldChangeTable) %in% kegg_mapping_table$ensembl_gene_id,,drop=F]




significant_pathways<-data.frame(row.names=gsub(" .*","",names(kegg.sets.hs)), significant=as.character(sapply(names(kegg.sets.hs),function(x){sum(kegg.sets.hs[[x]] %in% rownames(log2FoldChangeTable))/length(kegg.sets.hs[[x]])})))


#Extract significant pathways that have the most changed enzymes from the analysis done




for(i in pathways_to_draw){


  if (file.exists(directory_to_create)){

  } else {

        dir.create(file.path(".", directory_to_create)) 




  tryCatch(pv.out <- pathview( = log2FoldChangeTable,

                                  species = "hsa",

                                  out.suffix = paste(i,"_",paste(levels(meta_table$Groups),collapse="_vs_"),sep=""),



                                  limit = list(gene = max(abs(min(log2FoldChangeTable)),max(log2FoldChangeTable)), cpd = 1),

                                  kegg.native = T,

                                  low=list(gene = "red", cpd = "red"),

                                  mid=list(gene = "white", cpd = "white"),

                                  high=list(gene = "blue", cpd = "blue"),




                                  res=600,new.signature=F),error=function(e) NULL)





The above commands should generate a folder where you can see the pathways (as separate images) with enzymes that are log2 fold different, some of these images are as follows: