RNAseq graphical output with cummeRbund
Introductory tutorial
Mark Crowe
Contents
Section 1: Generate graphs with cummeRbund [30 mins]
Section 2: Repeat with a larger dataset [20 min]
Tutorial Overview
This tutorial shows how to produce graphical output from RNA-Seq analysis using the R package cummeRbund. Visualising RNA-Seq data in this way can help in interpreting features of the analysis, and the images can also be used to supplement publications or posters describing the work.
What’s not covered
Background [15 min]
Read the background to the workshop here

Where is the data in this tutorial from?
The data used for this tutorial is the output from the GVL RNAseq Differential Gene Expression Basic tutorial - a D. melanogaster (fruit fly) differential gene expression analysis experiment with three replicates of each of two experimental conditions. This data is described in more detail here.
A second, larger, set of data is provided for further practice in the techniques covered. This comes from the paper “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks” (Trapnell et al. 2012), which was written by the developers of both the Cuffdiff pipeline and the cummeRbund package.

The protocol:
The protocols are described in the Trapnell paper mentioned above, which was the inspiration for this tutorial. That paper also describes the command line protocol for TopHat and Cuffdiff, which was used to generate the practice data set used in Section 2.
Preparation [30 min]
N.B. VNC is not supported by all universities, and you may receive an “error 1006”. If this occurs, you can access an equivalent interface through the NeCTAR dashboard:
source('http://bioconductor.org/biocLite.R')
biocLite('cummeRbund')
library(cummeRbund)
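On recent R installations the biocLite script is no longer available, because Bioconductor replaced it with the BiocManager package from release 3.8 onwards. The following is a minimal sketch of an equivalent installation, assuming cummeRbund is available for your Bioconductor release. The helper name install_cummerbund is purely illustrative and is not part of any package.

```r
# Hypothetical helper (not part of cummeRbund or Bioconductor): installs the
# package through BiocManager only if it is not already available.
install_cummerbund <- function() {
  if (!requireNamespace('BiocManager', quietly = TRUE)) {
    install.packages('BiocManager')
  }
  if (!requireNamespace('cummeRbund', quietly = TRUE)) {
    BiocManager::install('cummeRbund')
  }
  # TRUE if the package can now be loaded
  invisible(requireNamespace('cummeRbund', quietly = TRUE))
}

# Usage: run once, then load the package as in the commands above.
# install_cummerbund()
# library(cummeRbund)
```

As with the biocLite commands, installation only needs to be done once; library(cummeRbund) is then sufficient in later sessions.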
Section 1: Generate graphs with cummeRbund [30 mins]
For this stage, we will create an R object containing all the data from the Cuffdiff analysis files. We will then use some of the cummeRbund tools to generate graphs which display various aspects of this data.
If you have closed RStudio since Preparation step 2, reopen it and type the command library(cummeRbund) in the main window. The two commands which preceded it in the initial setup should not need to be repeated.
cuff_data <- readCufflinks(dir = 'diff_out',
    geneFPKM = 'gene_FPKM_tracking',
    geneRep = 'genes_read_group_tracking',
    geneDiff = 'gene_differential_expression_testing',
    isoformFPKM = 'transcript_FPKM_tracking',
    isoformDiff = 'transcript_differential_expression_testing',
    isoformRep = 'isoforms_read_group_tracking',
    TSSFPKM = 'TSS_groups_FPKM_tracking',
    TSSDiff = 'TSS_groups_differential_expression_testing',
    TSSRep = 'tss_groups_read_group_tracking',
    CDSFPKM = 'CDS_FPKM_tracking',
    CDSExpDiff = 'CDS_FPKM_differential_expression_testing',
    CDSDiff = 'CDS_overloading_diffential_expression_testing',
    CDSRep = 'CDs_read_group_tracking',
    promoterFile = 'promoters_differential_expression_testing',
    splicingFile = 'splicing_differential_expression_testing')

The <- notation is used in R to assign a value (in this case, the output of the ‘readCufflinks’ function) to a variable. So the command above runs readCufflinks on the data from our fifteen input files, which are in the diff_out directory as specified in the first line, and stores the output of that process in the cuff_data variable.
Note: this command creates a database file called “cuffData.db” in your diff_out directory. If, for any reason, you need to re-run this command, first delete cuffData.db, otherwise cuff_data will not be correctly re-created.
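Because readCufflinks depends on all fifteen files being present under exactly these names, it can help to check the directory before running it. The sketch below uses only base R. The two helper functions, missing_cuffdiff_files and remove_stale_db, are hypothetical names introduced for illustration, and the file names are copied from the command above.

```r
# File names copied from the readCufflinks command above.
cuffdiff_files <- c(
  'gene_FPKM_tracking', 'genes_read_group_tracking',
  'gene_differential_expression_testing',
  'transcript_FPKM_tracking', 'transcript_differential_expression_testing',
  'isoforms_read_group_tracking',
  'TSS_groups_FPKM_tracking', 'TSS_groups_differential_expression_testing',
  'tss_groups_read_group_tracking',
  'CDS_FPKM_tracking', 'CDS_FPKM_differential_expression_testing',
  'CDS_overloading_diffential_expression_testing', 'CDs_read_group_tracking',
  'promoters_differential_expression_testing',
  'splicing_differential_expression_testing'
)

# Hypothetical helper: returns the names of any expected files that are
# missing from `dir` (an empty character vector means all are present).
missing_cuffdiff_files <- function(dir = 'diff_out', files = cuffdiff_files) {
  files[!file.exists(file.path(dir, files))]
}

# Hypothetical helper: deletes an existing cuffData.db so that readCufflinks
# builds a fresh database; returns TRUE only if a file was removed.
remove_stale_db <- function(dir = 'diff_out') {
  db <- file.path(dir, 'cuffData.db')
  if (file.exists(db)) file.remove(db) else FALSE
}

# List anything that is missing from the default diff_out directory.
print(missing_cuffdiff_files())
```

If the printed vector is empty, all input files were found. Calling remove_stale_db() before re-running readCufflinks performs the deletion described in the note above. readCufflinks also has a rebuild argument intended for regenerating the database, but this tutorial keeps to deleting the file.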
A density plot provides an indication of the distribution of abundance of transcripts in your samples. While there is no defined standard shape that should be expected from a density plot, if one condition in your experiment looks noticeably different from the others, that may indicate that there is a problem with the data, or at the very least that extra caution is required in interpreting results associated with that condition. More information on density plots.
csDensity(genes(cuff_data))
csDensity(isoforms(cuff_data))
csDensity(TSS(cuff_data))
Screenshot: gene, transcript and transcription start site density plots
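When one condition looks different from the others, it can be useful to see whether a single replicate is responsible. As a sketch, csDensity accepts a replicates argument that draws one curve per replicate instead of one per condition. The guard and the have_cuff_data variable are additions for illustration, so that the block does nothing if cuff_data has not been created yet.

```r
# Runs only when cuff_data from step 1 exists in the current session.
have_cuff_data <- exists('cuff_data')

if (have_cuff_data) {
  library(cummeRbund)
  # replicates = TRUE draws one density curve per replicate
  print(csDensity(genes(cuff_data), replicates = TRUE))
}
```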
A scatter plot is a gene-by-gene plot of the apparent expression level under condition one against the level under condition two for all genes in the experiment. Typically, most genes will have similar expression levels under both conditions and will lie close to a single line. Genes which are over- or under-expressed in one sample will lie above or below that line.
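A plot of this kind can be drawn with the cummeRbund function csScatter, which takes the gene data followed by the names of the two conditions to compare. A minimal sketch, assuming the conditions are named C1 and C2; the conditions variable and the guard are additions for illustration.

```r
# Condition names as defined in the Cuffdiff analysis.
conditions <- c('C1', 'C2')

# Runs only when cuff_data from step 1 exists in the current session.
if (exists('cuff_data')) {
  library(cummeRbund)
  # First condition on the x-axis, second on the y-axis
  print(csScatter(genes(cuff_data), conditions[1], conditions[2]))
}
```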
C1 and C2 are the names of the experimental treatments defined during the Cuffdiff analysis. If you have carried out the analysis on your own data and have different condition names, you will need to change them accordingly.
A single scatter plot can only compare two experimental conditions. CummeRbund provides an additional command, csScatterMatrix, which automatically draws a scatter plot for every pairwise comparison in experiments with more than two conditions, along with density plots for each condition. The format of the csScatterMatrix command is simply
csScatterMatrix(genes(cuff_data))
Screenshot: scatter plot of C1 against C2
A volcano plot shows the relationship between the fold-change of genes and the statistical significance of that change. Intuitively, it is evident that a larger fold change (either up- or down-regulation) is likely to be associated with a higher significance, resulting in the characteristic ‘volcano’ shape of these graphs.
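The two axes of a volcano plot can be illustrated with a few made-up numbers: the horizontal axis is the log2 fold change and the vertical axis is the negative log10 of the p-value. The sketch below first computes these for three invented genes (the values are not from the tutorial dataset, and the fold change is taken as C2 relative to C1 by assumption). It then draws the plot with csVolcano, once with default scaling and once with a narrower horizontal range. The xlimits value of c(-10, 10) is an arbitrary example.

```r
# Illustrative values only (not taken from the tutorial dataset).
fpkm_c1 <- c(2, 10, 40)
fpkm_c2 <- c(8, 10, 5)
p_value <- c(0.001, 0.9, 0.01)

# x-axis: positive = higher in C2, negative = higher in C1, 0 = no change
log2_fold_change <- log2(fpkm_c2 / fpkm_c1)
# y-axis: larger values correspond to smaller p-values
significance <- -log10(p_value)

print(data.frame(log2_fold_change, significance))

# Runs only when cuff_data from step 1 exists in the current session.
if (exists('cuff_data')) {
  library(cummeRbund)
  # Default horizontal scaling
  print(csVolcano(genes(cuff_data), 'C1', 'C2'))
  # Manually selected horizontal scaling (example range)
  print(csVolcano(genes(cuff_data), 'C1', 'C2', xlimits = c(-10, 10)))
}
```

Genes with no change in expression sit at zero on the horizontal axis and low on the vertical axis, which is what gives the plot its shape.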
Screenshot: volcano plot with default and manually-selected horizontal scaling
The graphs in the previous three steps have all been summary graphs of all of the analysis data. CummeRbund also allows us to look at individual genes with the bar plot function
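As a sketch of how this can be done, the cummeRbund function getGene extracts the data for one gene, and expressionBarplot draws its expression level in each condition. The example uses CG2177, the gene shown in the screenshot for this step; the gene_id variable and the guard are additions for illustration.

```r
# Gene to plot; replace with any gene of interest from your own data.
gene_id <- 'CG2177'

# Runs only when cuff_data from step 1 exists in the current session.
if (exists('cuff_data')) {
  library(cummeRbund)
  my_gene <- getGene(cuff_data, gene_id)
  # Expression of the gene as a whole in each condition
  print(expressionBarplot(my_gene))
  # Expression of each individual isoform of the gene
  print(expressionBarplot(isoforms(my_gene)))
}
```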
Screenshot: bar plots of genes and isoforms for CG2177
Section 2: Repeat with a larger dataset [20 min]
For further experience with cummeRbund, you may like to carry out a similar analysis to that described above with a second, larger, Cuffdiff output dataset. This particular dataset was generated with the command line version of Cuffdiff and not through Galaxy. This greatly simplifies step 1, the upload of data into cummeRbund, but all subsequent stages are similar. The main difference will be that, because of the much larger size of the dataset (more than 14,000 genes, compared to 88 for the set used above), each step will take noticeably longer to complete.
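The simplification in step 1 arises because command line Cuffdiff writes its output under the default file names that readCufflinks looks for, so only the directory needs to be supplied. A minimal sketch follows; the directory name cuffdiff_large and the variable names are hypothetical and should be replaced with the location of the larger dataset on your system.

```r
# Hypothetical directory name: replace with the folder holding the larger
# Cuffdiff output.
large_dir <- 'cuffdiff_large'

# Runs only when that directory exists.
if (dir.exists(large_dir)) {
  library(cummeRbund)
  # No file name arguments are needed when the default names are in use.
  cuff_data_large <- readCufflinks(dir = large_dir)
  # Print a summary of what was loaded
  print(cuff_data_large)
}
```

The plotting commands from Section 1 can then be repeated with cuff_data_large in place of cuff_data, using the condition names defined in that dataset.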
References
Trapnell C, Roberts A, Pachter L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols [serial online]. March 1, 2012;7(3):562-578.
The cummeRbund Manual - http://compbio.mit.edu/cummeRbund/manual_2_0.html