RNAseq graphical output with cummeRbund

Introductory tutorial

Mark Crowe

http://genome.edu.au

http://qfab.org

Contents

Tutorial Overview

Background [15 min]

Preparation [30 min]

Section 1: Generate graphs with cummeRbund [30 mins]

Section 2: Repeat with a larger dataset [20 min]

References

Tutorial Overview

In this tutorial we introduce how to produce graphical output from RNA-Seq analysis using the R package cummeRbund. Visualising RNA-Seq data in this way may help in the interpretation of features of the analysis, and the images can also be used to supplement publications or posters describing the work.

What’s not covered

  • The tutorial is designed to introduce some basic commands using the cummeRbund package of R programming language, but we will not cover the use of R for other analysis, or indeed the full range of cummeRbund functions.
  • The input data required are the analysis files from a Cuffdiff analysis and it is unlikely that other RNA-Seq analysis software will be compatible with cummeRbund. We will not discuss options for the visualisation of RNA-Seq data analysed with other software.
  • This tutorial assumes that you have previously completed the GVL “RNAseq Differential Gene Expression Introductory tutorial”, or at the very least are comfortable using the Galaxy interface.

Background [15 min]

Read the background to the workshop here

Where is the data in this tutorial from?

The data used for this tutorial is the output from the GVL RNAseq Differential Gene Expression Basic tutorial - a D. melanogaster (fruit fly) differential gene expression analysis experiment of three replicates each of two experimental conditions. This data is described in more detail here. A second, larger, set of data is provided for further practice in the techniques covered. This comes from the paper: “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks (Trapnell et al 2012), which was written by the developers of both the Cuffdiff pipeline and the cummeRbund package.

The protocol:

The protocols are described in the Trapnell paper mentioned above, the inspiration for this tutorial. This paper also describes the command line protocol for Tophat and Cuffdiff, which was used to generate the practice data set used in section 2.

Preparation [30 min]

  1. Set up R and RStudio
  1. Option 1 - run R and RStudio on your PC
  1. Open a browser and go to “The Comprehensive R Archive Network”: http://cran.r-project.org (Australian mirror at http://cran.csiro.au)
  1. Follow the relevant download link for your operating system (Linux, Windows, or Mac)
  2. Download and run the most recent version of the installer, accepting default settings. If you do not have administrative rights to your computer, you may need assistance from your local IT section. In this case, you will need to ask them to ensure that you have full permissions to the R directories created, otherwise you may not be able to install or run cummeRbund.
  1. Return to the browser and go to the RStudio download page at http://www.rstudio.com/ide/download/desktop
  1. Download and run the appropriate RStudio Desktop installer version. Do not run the installer until the R installation has completed successfully, otherwise RStudio may not be able to find the R executables.
  1. Option 2 - run R and RStudio using a personal GVL instance
  1. Launch a GVL instance using your personal NeCTAR research cloud allocation (N.B. available only to Australian researchers at institutions subscribing to AAF).
  1. Follow the Get GVL guide or the Get GVL video guide to start up a GVL instance
  1. Once your GVL instance is running, start R-Studio and other GVL apps following the Get the GVL workbench guide

N.B. VNC is not supported by all universities, and you may receive an “error 1006”. If this occurs, you can access an equivalent interface through the NeCTAR dashboard:

  1. Open the NeCTAR dashboard interface and click on the “Instances” tab
  2. Click on the Instance Name of your GVL image
  3. Click on the “Console” tab
  4. Continue with the Get the GVL workbench process from step 2

  1. Start RStudio
  1. Option 1 - RStudio on your PC
  1. Start RStudio by double-clicking on the program icon.
  1. Option 2 - RStudio using a personal GVL instance
  1. Open the GVL RStudio interface by opening a browser window to http://<ip address>/rstudio, where <ip address> is the address of your GVL instance. Log in using the username “researcher” and the password you chose when starting RStudio in step 1b.

  1. Install the cummeRbund package
  1. Run RStudio through your standard user account (i.e. log out as an administrator if you have used that to install R/RStudio). It should open up a window with three panes. In the main (left) pane, type the following three commands while connected to a network - make sure capitalisation and quotes are exactly as given here:

source('http://bioconductor.org/biocLite.R')

biocLite('cummeRbund')

library(cummeRbund)

  1. If you are asked to update packages, select "a" for all. If you have any errors about required permissions, you will need to get help from your IT support group. Unfortunately it is not always straightforward to identify errors from warnings (non-critical) and standard notifications in RStudio, since they all appear in red, so you may need to read through the output carefully. If in doubt, contact your local IT support.

  1. Download the Cuffdiff output files
  1. This is a rather slow and tedious process. To simplify this tutorial, a pre-prepared zip archive of already renamed Cuffdiff output files can be downloaded here. Extract this archive to your desktop.
  2. Otherwise, to generate cummeRbund graphs from your own dataset, you will need to download the files as follows:
  1. For each of the fifteen Cuffdiff output files from the ‘with replicates’ analysis, click on the entry name in the history to expand it and then on the disk icon to download it to your computer.
  2. Rename the Cuffdiff output files. Downloads from Galaxy have a name of the format:
    “Galaxy32-[Cuffdiff_on_data_27,_data_7,_and_others__splicing_differential_expression_testing].tabular”. While not essential, it simplifies later stages to trim this file name to something more manageable. For this tutorial, we will remove the initial text up to and including the double underscore (i.e.
    “Galaxy32-[Cuffdiff_on_data_27,_data_7,_and_others__”) and the final text from the closing square bracket (“].tabular”). The example filename mentioned above will then become “splicing_differential_expression_testing”. Repeat the comparable renaming process for all fifteen downloaded files
  3. Move all downloaded and renamed files to a single new folder named “diff_out” on your desktop. You may use a different folder name or location if you prefer, but you will need to modify subsequent instructions to use your alternatives.

  1. If you are using RStudio on GVL (“Preparation”, Option 2), you will need to upload the data to the GVL instance.
  1. Compress the diff_out folder to a zip archive (or use the pre-prepared zip archive directly)
  1. Right click on the folder and select ‘compress’ (Mac) or ‘send to->compressed (zipped) folder’ (Windows)
  1. In the RStudio window, click on the “Files” tab in the lower right window, the click “Upload”. Select your compress diff_out file. It will be uploaded and automatically expanded.

Section 1: Generate graphs with cummeRbund [30 mins]

For this stage, we will create an R object containing all the data from the Cuffdiff analysis files. We will then use some of the cummeRbund tools to generate graphs which display various aspects of this data.

  1. Upload analysis data into cummeRbund

If you have closed RStudio since Preparation step 2, reopen it and type the command library(cummeRbund)in the main window. The two commands which preceded it in the initial setup should not need to be repeated again.

  1. Direct RStudio to the location of the data for analysis
  1. In the bottom right window, select the “Files” tab.
  2. Navigate to the folder containing the diff_out folder, not into the diff_out folder itself. For example, if you created the diff_out folder on your desktop (as described in 4b(iii), then choose the Desktop folder from the file listing. The small panel at the top of the file window (with a house icon in it) should say something like “Home > Desktop”. If it says “Home > Desktop > diff_out” you are in the diff_out, and need to move one level up.
  3. Click on the cog icon, labelled “More” and select “Set As Working Directory”.

  1. Read in Cuffdiff data. Because the Cuffdiff interface in Galaxy generates non-standard output files, we need to tell cummeRbund exactly what data each file contains. This means a long command line, which must be typed in exactly. As we will see later, Cuffdiff from the command line integrates much more easily with cummeRbund.
  1. Enter the following command - it can be entered all at once, or one line at a time hitting the <return> key after each comma. R will recognise that the command is not complete until the last line because the opening bracket (parenthesis) in the first line is not closed until the end.
  • N.B. copy/paste of the whole command directly from your web browser is likely to fail; try going via a text editor like Notepad or TextEdit; or you may be able to right-click “paste as plain text”.

cuff_data <- readCufflinks (dir='diff_out',

geneFPKM = 'gene_FPKM_tracking',

geneRep = 'genes_read_group_tracking',

geneDiff = 'gene_differential_expression_testing',

isoformFPKM = 'transcript_FPKM_tracking',

isoformDiff = 'transcript_differential_expression_testing',

isoformRep = 'isoforms_read_group_tracking',

TSSFPKM = 'TSS_groups_FPKM_tracking',

TSSDiff = 'TSS_groups_differential_expression_testing',

TSSRep = 'tss_groups_read_group_tracking',

CDSFPKM = 'CDS_FPKM_tracking',

CDSExpDiff = 'CDS_FPKM_differential_expression_testing',

CDSDiff = 'CDS_overloading_diffential_expression_testing',

CDSRep = 'CDs_read_group_tracking',

promoterFile = 'promoters_differential_expression_testing',

splicingFile = 'splicing_differential_expression_testing')

The <- notation is used in R to enter data (in this case, the output of the ‘readCufflinks’ function) into a variable. So the command above runs readCufflinks on the data from our fifteen input files, which are in the diff_out directory as specified in the first line, and enters the output of that process into the cuff_data variable.

Note: this command creates a database file called “cuffData.db” in your diff_out directory.  If, for any reason, you need to re-run this command, first delete cuffData.db, otherwise cuff_data will not be correctly re-created.

  1. Draw a density plot

A density plot provides an indication of the distribution of abundance of transcripts in your samples. While there is no defined standard shape that should be expected from a density plot, if one condition in your experiment looks noticeably different from the others, that may indicate that there is a problem with the data, or at the very least that extra caution is required in interpreting results associated with that condition. More information on density plots.

  1. To generate a gene-level density plot, use the command:

csDensity(genes(cuff_data))

  1. For a transcript-level plot, instead use:

csDensity(isoforms(cuff_data))

  1. A density plot can also be run using transcription start site data, if that was provided in the original gene annotation file used by Cuffdiff.

csDensity(TSS(cuff_data))

Screenshot: gene, transcript and transcription start site density plots

  1. Draw a scatter plot

A scatter plot is a gene-by-gene plot of the apparent expression level under condition one against the level under condition two for all genes in the experiment. Typically, most genes will have similar expression levels and will lie on a single line. Genes with are over- or under-expressed in one sample will lie above or below that line.

  1. Enter the command csScatter(genes(cuff_data), 'C1', 'C2')
  2. Once again, you can repeat with ‘isoforms’ or ‘TSS’ replacing ‘genes’

C1 and C2 are the names of the experimental treatments defined during the Cuffdiff analysis. If you have carried out the analysis on your own data and have different condition names, you will need to change them accordingly.

A single scatter plot can only compare two experimental conditions. CummeRbund provides an additional command csScatterMatrix, which will automatically draw a scatter plot for all possible comparisons in experiments with more than two conditions, along with density plots for each condition. The format of the csScatterMatrix command is simply csScatterMatrix(genes(cuff_data))

Screenshot: scatter plot of C1 against C2

  1. Draw a volcano plot

A volcano plot shows the relationship between the fold-change of genes and the statistical significance of that change. Intuitively, it is evident that a larger fold change (either up- or down-regulation) is likely to be associated with a higher significance, resulting in the characteristic ‘volcano’ shape of these graphs.

  1. Enter the command csVolcano(genes(cuff_data), 'C1', 'C2')
  1. The resulting graph is not obviously a volcano shape. This is because the horizontal axis is inappropriately scaled - all the data is shown closely clustered around zero. To address this, we need to manually specify a suitable scale. Unfortunately, by setting this, the default option of highlighting statistically significant genes is disabled, so we also need to re-enable that setting.

  1. Enter the command csVolcano(genes(cuff_data),'C1','C2', xlimits = c(-1.5,1.5), alpha=0.05, showSignificant=TRUE)
  1. xlimits sets the minimum and maximum values for the X-axis. The values do not need to be equal either side of zero, but we will make them so since we expect our data to be approximately symmetrical about the Y-axis
  2. showSignificant=TRUE tells cummeRbund to highlight significant genes. We also need to specify what significance threshold we are using with alpha=0.05 (the conventional threshold value of 0.05).
  3. The graph now appears more volcano shaped, with the two significantly differentially expressed genes marked in red in the top right corner.

Screenshot: volcano plot with default and manually-selected horizontal scaling

  1. Draw a bar plot

The graphs in the previous three steps have all been summary graphs of all of the analysis data. CummeRbund also allows us to look at individual genes with the bar plot function

  1. Enter the commands as follows
  1. mygene<-getGene(cuff_data,'CG2177')
  2. expressionBarplot(mygene)
  3. The first command gets the data for the single gene CG2177 (one of those identified as differentially expressed by Cuffdiff) and records it in the variable ‘mygene’. The second command generates a bar plot with error bars, illustrating the difference in expression of CG2177 between the two conditions.
  4. As with the summary plots, bar plots can also be prepared for isoforms or transcription start sites.
  1. expressionBarplot(isoforms(mygene))
  2. expressionBarplot(TSS(mygene))

Screenshot: bar plots of genes and isoforms for CG2177

Section 2: Repeat with a larger dataset [20 min]

For further experience with cummeRbund, you may like to carry out a similar analysis to that described above with a second, larger, Cuffdiff output dataset. This particular dataset was generated with the command line version of Cuffdiff and not through Galaxy. This greatly simplifies step 1, the upload of data into cummeRbund, but all subsequent stages are similar. The main difference will be that, because of the much larger size of the dataset (more than 14,000 genes, compared to 88 for the set used above), each step will take noticeably longer to complete.

  1. Download data and load it into cummeRbund
  1. Download the compressed folder archive from here, and extract it to a suitable location - for example the same folder in which you saved diff_out in section 1. The folder should extract with the name “diff_out_section2”, to distinguish it from the one used earlier.

  1. Repeat Section 1 step 1a to navigate to the directory containing diff_out_section2.

  1. Read in Cuffdiff data using the command cuff_data_2<-readCufflinks('diff_out_section2')
  1. This much shorter command is possible because the file structure in diff_out_section2 is the default recognised by readCufflinks, so it is not necessary to define the data contained in each file.

  1. Repeat plots used in Section 1
  1. Remember to replace ‘cuff_data’ (which still contains the data from diff_out) with ‘cuff_data_2’, the name of the new variable we have created.
  2. You may want to experiment with different xlimits values for the volcano plot
  3. CG2177 is not differentially expressed in this dataset. You may instead want to look at ‘regucalcin’ with the bar plot function.

References

Trapnell C, Roberts A, Pachter L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols [serial online]. March 1, 2012;7(3):562-578.

The cummeRbund Manual - http://compbio.mit.edu/cummeRbund/manual_2_0.html