RNAseq Differential Gene Expression Introductory tutorial Mahtab Mirmomeni & Andrew Lonie | ||
Contents Section 1: Align reads with Tophat [30 mins] Section 2: Test differential expression with Cufflinks [45 min] Section 3. Repeat without replicates [20 min] Section 4. Extensions [10 min] |
Tutorial OverviewIn this tutorial we cover the concepts of RNAseq differential gene expression (DGE) analysis using a very small synthetic dataset from a well studied organism. What’s not covered The tutorial is designed to introduce the tools, datatypes and workflow of DGE analysis. In practice the datasets would be much larger and would contain sequencing and alignment errors that may confound analysis. There are also a number of different tools and/or statistical approaches that are widely used in RNAseq DGE that we do not cover. Specifically, we have left out the following steps that you would almost certainly do in a real RNAseq DGE analysis:
These steps have been omitted because the data we use in this tutorial is synthetic and has no quality issues, unlike real data |
Background [15 min]Where is the data in this tutorial from? The data for this tutorial is from an RNA-seq experiment looking for differentially expressed genes in D. melanogaster (fruit fly) between two experimental conditions. The experiment and analysis protocol we will follow are derived from a paper in Nature Protocols by the research group responsible for one of the most widely used set of RNA-seq analysis tools: “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks” (Trapnell et al 2012). The sequence datasets are single-end Illumina synthetic short reads, filtered to only include chromosome 4 gene data to facilitate faster mapping (which would otherwise take ~6 hours). We’ll use data from 3 biological replicates from each of the two experimental conditions. The protocol: Two protocols are described in the paper inspiring this tutorial: the full ‘Tuxedo protocol’, in which a transcriptome is assembled from all the experimental sequence data and gene expression is measured against that, and a shorter ‘Alternate protocol’ in which expression is measured against an existing annotated transcriptome. In this tutorial we use the simpler protocol as the D. melanogaster transcriptome is very well characterised. Follow this link for an overview/background on this protocol The protocol in a nutshell:
|
Preparation [15 min] |
|
Alternatively:
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C1_R1.chr4.fq https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C1_R2.chr4.fq https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C1_R3.chr4.fq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C2_R1.chr4.fq https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C2_R2.chr4.fq https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C2_R3.chr4.fq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/ensembl_dm3.chr4.gtf
Or:
|
|
NOTE: since the reads in this dataset are synthetic, they do not have real quality scores. NOTE: If you log out of Galaxy and log back at a later time your data and results from previous experiments will be available in the right panel of your screen called the ‘History’ |
Completed Galaxy history for this section (in SharedData>Published Histories): RNAseq_DGE_BASIC_Prep OR Import this history: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Prep.tar.gz |
Section 1: Align reads with Tophat [30 mins]The basic process here is to map each of the individual reads in the sample FASTQ readsets to a reference genome, so that we can count which reads came from which gene transcript. These read counts are the basis for differential gene expression. As the reads are generated from mRNA rather than DNA, some of them will cross exon/intron boundaries in the genomic reference. The Tophat tool maps the read to the reference genome and uses the mapping information to identify exon/intron boundaries (splice junctions). More detailed description of Tophat-based read alignment/mapping |
From the tools menu in the left hand panel of Galaxy, select NGS: RNA Analysis>Tophat For Illumina and run with these parameters:
Screenshots of this process can be seen here. Note: This may take a few minutes, depending on how busy Galaxy is. |
You should have four output files for the alignment step:
|
Repeat step 1 and 2 for the each of the sequence files, generating four files for each for a total of 24 output files form the alignment step. |
The purpose of this step is to become aware of the quantitative exon-based peak nature of RNA-seq data and the differences between samples
Screenshot of the accepted_hits file loaded into IGV
Screenshot: splice junctions file loaded into IGV with the accepted_hits file already loaded. |
The aim of this tutorial is to statistically test differential expression, but first it’s useful to reassure ourselves that the data looks right at this stage by comparing the aligned reads for biological condition 1 (C1) and condition 2 (C2) in the context of the reference genome. We’ll try an inbuilt Galaxy genome browser for this activity, called Trackster. It’s not as powerful as IGV but is sometimes more convenient.
ALTERNATIVELY: Use IGV
The middle gene in this area clearly looks like it has many more reads mapped in condition 2 than condition 1, whereas for the surrounding genes the reads look about the same. The middle gene looks like it is differentially expressed. But, of course, it may be that there are many more reads in the readsets for C1 and C2, and the other genes are underexpressed in condition 2. So we need to statistically normalise the read counts before we can say anything definitive, which we will do in the next step. Note: You can cross compare other replicates of the conditions as well by adding them to the visualisation |
Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec1 OR Import this history: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec1.tar.gz |
Section 2: Test differential expression with Cuffdiff [45 min]The aim here is to:
All these steps are rolled up into a single tool: Cufflinks, which takes the aligned reads from Tophat and generates normalised read counts + a list of differentially expressed genes based on a reference transcriptome - in this case, the curated Ensembl list of D. melanogaster genes from chromosome 4 that we supply as a GTF (Gene Transfer Format) file More detailed explanation of Cufflinks DGE testing Note: Before starting the next section, leave the Trackster interface and return to the analysis view of Galaxy by clicking ‘Analyse Data’ at the top of the window. |
The ensembl_dm3.chr4.gtf file contains ~4900 features which together define the 92 known genes on chromosome 4 of Drosophila melanogaster Cufflinks uses the reference transcriptome to aggregate read counts per gene, transcript, transcription start site and CoDing Sequence (CDS) For this tutorial, we’ll only consider differential gene testing, but it is also possible to test for differential expression of transcripts or transcription start sites |
Galaxy tool panel: NGS: RNA Analysis>Cuffdiff
Screenshot: loading files in Cuffdiff
Output files generated are in the following categories
|
|
Note: You can rename this file (screenshot) by clicking on the pencil icon in front of the file name with ‘Edit Attribute’ caption to Significant_DiffExpGenes (or something similar) to make the filename more meaningful. |
What do the genes do? As this is a artificial data set, it doesn’t really matter; the point is to understand the process of differential gene expression testing using RNAseq data. |
Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec2 OR Import this history: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec2.tar.gz |
Section 3. Repeat without replicates [20 min]What happens if we only use one sample from each condition for differential gene expression testing? |
Galaxy tool panel: NGS->RNA Analysis ->Cuffdiff
Then filter the generated geneset for significantly differentially expressed genes (Filter with c14=’yes’ as for previous section) |
You should get _no_ differentially expressed genes at statistical significance of 0.05 The ‘Ank’ gene, which was found to be significantly expressed in our first analysis, is not identified in this analysis as differentially expressed. The ‘CG1277’ gene has a q_value of 0.091, just above the cutoff of 0.05 for statistical significance, so also is a ‘false negative’ in the sense that it doesn’t reach our definition of significance but we know it is differentially expressed if we have more data as in the previous section. If we ranked the genes by q_value, CG1277 would clearly come out as the _most_ significantly differentially expressed. There are no false positives; but this can depend on the version of Cuffdiff you are using, or in fact any differential expression analysis tool. For example, the previous version of Cufflinks identified 9 genes that were differentially expressed that were not identified when replicates are used i.e. false positives. As the algorithms for testing differential expression improve, it is likely that genes may slip into/out of significance - but you will always have more confidence if you have more replicates showing a difference in a genes expression |
Galaxy tool panel: NGS->RNA Analysis ->Cuffdiff
You now see ‘CG2177’ appear again in the list as significantly differentially expressed, but not Ank. How do we interpret this? There is more difference in CG1277 expression (after normalisation) between samples 1 and 5 than samples 1 and 4, even though samples 4 and 5 are replicates; enough of a difference, in fact, to get CG1277 over the ‘statistical significance’ line of 0.05 in the case of (sample 1 vs 5) but not in (1 vs 4). The size of the (normalised) difference in expression of Ank is much smaller, so we need to see it consistently across multiple replicates to be confident it actually exists. One replicate is not enough. The identification of significantly different genes is based on (at least) the size of the difference in expression and the observation of a difference across multiple replicates. This demonstrates how important it is to have biological replicates in differential gene expression experiments. |
What do you get? How many replicates do we need to identify Ank as differentially expressed? |
Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec3 OR Import this history: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec3.tar.gz |
Section 4. Extensions [10 min]The Full Tuxedo Protocol The full Tuxedo protocol includes two other tools, Cufflinks and Cuffmerge. Cufflinks and Cuffmerge build a new reference transcriptome by identify ALL transcripts and genes in the RNAseq dataset - ie using these tools will allow you to identify new genes and transcripts, and then analyse them for differential expression. This is critical for organisms in which the transcriptome is not well characterised. Read more on the full Tuxedo protocol here. Transcript-level differential expression One can think of a scenario in an experiment aiming to investigate the differences between two experimental conditions, where a gene had the same number of read counts in the two conditions but these read counts were derived from different transcripts; this gene would not be identified in a differential gene expression test, but would be in a differential transcript expression test. The choice of what ‘unit of aggregation’ to use in differential expression testing is one that should be made by the biological investigator, and will affect the bioinformatics analysis done (and probably the data generation too). |
The reason that we get generally the same results from the full Tuxedo protocol is, in this instance, a result of D. melanogaster having a well characterized transcriptome AND this being a synthetic dataset generated from known genes, so of course building the transcriptome again is unlikely to add extra information for finding the differentially expressed genes (and may actually lose information - note that our gene model as produced by cuffmerge has 3,183 transcripts identified in chr4, but the Ensembl gene model has 4.903 identified in the same chromosome). This however, is not true for other genomes that are not so well characterized or in cases such as cancer, where the transcriptomes could have changed significantly compared to the reference. In such cases the full Tuxedo protocol should be followed. |
Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec4 OR Import this history: https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec4.tar.gz |
ReferencesTrapnell C, Roberts A, Pachter L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols [serial online]. March 1, 2012;7(3):562-578. |