RNAseq Differential Gene Expression

Introductory tutorial

Mahtab Mirmomeni & Andrew Lonie

http://genome.edu.au

http://vlsci.org.au

Contents

Tutorial Overview

Background [15 min]

Preparation [15 min]

Section 1: Align reads with Tophat [30 mins]

Section 2: Test differential expression with Cufflinks [45 min]

Section 3. Repeat without replicates [20 min]

Section 4. Extensions [10 min]

References

Tutorial Overview

In this tutorial we cover the concepts of RNAseq differential gene expression (DGE) analysis using a very small synthetic dataset from a well studied organism.

What’s not covered

The tutorial is designed to introduce the tools, datatypes and workflow of DGE analysis. In practice the datasets would be much larger and would contain sequencing and alignment errors that may confound analysis. There are also a number of different tools and/or statistical approaches that are widely used in RNAseq DGE that we do not cover.

Specifically, we have left out the following steps that you would almost certainly do in a real RNAseq DGE analysis:

  • QC (quality checking) of the raw data
  • trimming the reads for quality and for adaptor fragments

These steps have been omitted because the data we use in this tutorial is synthetic and has no quality issues, unlike real data

Background [15 min]

Where is the data in this tutorial from?

The data for this tutorial is from an RNA-seq experiment looking for differentially expressed genes in D. melanogaster (fruit fly) between two experimental conditions. The experiment and analysis protocol we will follow are derived from a paper in Nature Protocols by the research group responsible for one of the most widely used set of RNA-seq analysis tools: “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks” (Trapnell et al 2012). The sequence datasets are single-end Illumina synthetic short reads, filtered to only include chromosome 4 gene data to facilitate faster mapping (which would otherwise take ~6 hours). We’ll use data from 3 biological replicates from each of the two experimental conditions.

The protocol:

Two protocols are described in the paper inspiring this tutorial: the full ‘Tuxedo protocol’, in which a transcriptome is assembled from all the experimental sequence data and gene expression is measured against that, and a shorter ‘Alternate protocol’ in which expression is measured against an existing annotated transcriptome.

In this tutorial we use the simpler protocol as the D. melanogaster transcriptome is very well characterised.

Follow this link for an overview/background on this protocol

The protocol in a nutshell:

  • Input: Raw RNA-Seq reads from different experimental conditions
  • Output: List of significantly differentially expressed genes

Preparation [15 min]

  1. Register as a new user in Galaxy if you don’t already have an account (what is Galaxy?)
  1. Open a browser and go to the the Galaxy server: http://galaxy-tut.genome.edu.au
  1. NOTE: Firefox/Safari/Chrome all work well, Internet Explorer not so well
  1. Register as a new user: User>Register or login if you already have an account
  1. Import the RNAseq data for the workshop. You can do this in a few ways, of which by far the easiest is:
  1. Go to shared Data -> Published Histories and click on  ‘RNAseq_DGE_BASIC_Prep. Then click 'Import History' at top right, wait for the history to be imported to your account, and then ‘start using this history’.
  1. This will create a new Galaxy history in your account with all of the required data files

Alternatively:

  1. Download the data directly to your computer using these URLs, then upload to Galaxy:
  1. Short read files for condition 1:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C1_R1.chr4.fq

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C1_R2.chr4.fq

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C1_R3.chr4.fq

  1. Short read files for condition 2:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C2_R1.chr4.fq

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C2_R2.chr4.fq

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/C2_R3.chr4.fq

  1. Once files are downloaded, in Galaxy click on Get data>Upload File
  2. Click 'Choose File', locate the local copy of each of the six fastq files and upload
  3. Make sure you select 'fastqsanger' under File Format or the next steps won't work!
  4. Click Execute and wait for the files to upload to Galaxy. It can take a couple of minutes.
  1. Download the annotated gene list reference for Drosophila melanogaster chromosome 4:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/ensembl_dm3.chr4.gtf

  1. Upload the ensembl_dm3.chr4.gtf file to Galaxy. Don’t select the File Format for this file, Galaxy will work it out itself

Or:

  1. Upload the data directly to Galaxy using the file URLs
  1. Follow the steps above but instead of downloading the files locally and then uploading, copy the URLs from the previous step into the URL/Text box and Galaxy will download and import them directly. Remember to select the ‘fastqsanger’ File Format for the sequence files (but not the annotated gene list file)

  1. You should now have the following files in your Galaxy history:
  1. 6 sets of single-ended reads:
  • C1_R1.chr4.fq
  • C1_R2.chr4.fq
  • C1_R3.chr4.fq
  • C2_R1.chr4.fq
  • C2_R2.chr4.fq
  • C2_R3.chr4.fq
  1. Gene annotation list as ensembl_dm3.chr4.gtf

  1. View the RNAseq fastq files
  1. Click on the eye icon to the top right of each fastq file to view the first part of the file
  2. If you’re not familiar with the FASTQ format, click here for an overview

NOTE: since the reads in this dataset are synthetic, they do not have real quality scores.

NOTE: If you log out of Galaxy and log back at a later time your data and results from previous experiments will be available in the right panel of your screen called the ‘History’

Completed Galaxy history for this section (in SharedData>Published Histories): RNAseq_DGE_BASIC_Prep

OR

Import this history:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Prep.tar.gz

Section 1: Align reads with Tophat [30 mins]

The basic process here is to map each of the individual reads in the sample FASTQ readsets to a reference genome, so that we can count which reads came from which gene transcript. These read counts are the basis for differential gene expression.

As the reads are generated from mRNA rather than DNA, some of them will cross exon/intron boundaries in the genomic reference. The Tophat tool maps the read to the reference genome and uses the mapping information to identify exon/intron boundaries (splice junctions).

More detailed description of Tophat-based read alignment/mapping

  1. Align the RNAseq short reads against a reference genome/transcriptome

From the tools menu in the left hand panel of Galaxy, select NGS: RNA Analysis>Tophat For Illumina and run with these parameters:

  1. RNA-Seq FASTQ file: eg. C1_R1.chr4.fq
  2. Use built in index
  3. Reference genome: D. melanogaster (dm3)
  4. Single-end
  5. Use defaults for other fields
  6. Execute

Screenshots of this process can be seen here.

Note: This may take a few minutes, depending on how busy Galaxy is.

  1. Examine the output files

You should have four output files for the alignment step:

  • Tophat for Illumina data 1:splice junctions
  • This file lists all the places where TopHat had to split a read into two pieces to span an exon junction.
  • Tophat for Illumina data 1:accepted_hits
  • This file shows the actual paired reads mapped to their position on the genome, and split across exon junctions.
  • Tophat for Illumina data 1:insertions
  • Tophat for Illumina data 1:deletions
  • These files records any small insertions or deletions found in the reads. Since we are working with synthetic reads you can ignore Tophat for Illumina data 1:insertions Tophat for Illumina data 1:deletions for now.

  1. Repeat for all sample readsets

Repeat step 1 and 2 for the each of the sequence files, generating four files for each for a total of 24 output files form the alignment step.

  1. Visualise the aligned reads

The purpose of this step is to become aware of the quantitative exon-based peak nature of RNA-seq data and the differences between samples

  1. Click on one of the Tophat accepted hits files, for example ’Tophat for Illumina on data 1: accepted_hits’ .
  2. Click on ‘Display with IGV’ ‘webcurrent’ or  ‘local’ (if you have IGV installed) which is located at the bottom of the section for this file
  3. Once IGV opens, it will show you the accepted_hits file. (Note: This may take a bit of time as the data is downloaded to IGV)

Screenshot of the accepted_hits file loaded into IGV

  1. Select chr4 from the second drop box under the toolbar. Zoom in to view alignments of reads to the reference genome
  1. You should see the characteristic distribution of RNAseq reads across the exons of the genes, with some split at intron/exon boundaries
  2. The number of reads aligned to a particular gene is proportional to the abundance of the RNA derived from that gene in the sequenced sample
  3. Note that IGV already has a list of known genes of most major organisms including Drosophila, which is why you can see the genes in the bottom panel of IGV

Screenshot: zoom in

  1. For the splice function files - for example ‘TopHat for illumina on data 1: splice junctions’ - you’ll need to save this file to your local disk using the disk icon under the details of the file. Then open the saved .bed file directly in IGV using the File>LoadFromFile option from IGV. This is because IGV doesn’t automatically stream BED files from Galaxy

Screenshot: splice junctions file loaded into IGV with the accepted_hits file already loaded.

  1. View differentially expressed genes

The aim of this tutorial is to statistically test differential expression, but first it’s useful to reassure ourselves that the data looks right at this stage by comparing the aligned reads for biological condition 1 (C1) and condition 2 (C2) in the context of the reference genome.

We’ll try an inbuilt Galaxy genome browser for this activity, called Trackster. It’s not as powerful as IGV but is sometimes more convenient.

  1. Click on  ‘TopHat for illumina on data 1: accepted_hits’ (The accepted hits file from first replicate of C1) and then click on the little chart icon in the lower middle. The icon is the ‘View in Trackster’ icon. Then click ‘View in new visualization’
  2. Click + (Add Track) on the top right side of the page and add ‘TopHat for illumina on data 4: accepted_hits’ and ‘ensembl_dm3.chr4.gtf’ (The accepted hits file from first replicate of C2 , and the reference gene list)
  3. In the drop down list in the top middle part of the page, select chr4
  4. You may need to wait a couple of seconds for the data to load
  5. Next to the drop down list, click on the chromosomal position number display and specify the location ‘chr4:325197-341887’

Screenshot

ALTERNATIVELY: Use IGV

  1. Click on ‘TopHat for illumina on data 4: accepted_hits’ (The accepted hits file from first replicate of C2)
  2. Click on ‘display with IGV local’. This time use the ‘local’ link, as you already have an IGV running locally from the last step.
  3. Next to the drop down chromosome list in IGV, click on the chromosomal position number display and specify the location ‘chr4:325197-341887’

The middle gene in this area clearly looks like it has many more reads mapped in condition 2 than condition 1, whereas for the surrounding genes the reads look about the same. The middle gene looks like it is differentially expressed. But, of course, it may be that there are many more reads in the readsets for C1 and C2, and the other genes are underexpressed in condition 2. So we need to statistically normalise the read counts before we can say anything definitive, which we will do in the next step.

Note: You can cross compare other replicates of the conditions as well by adding them to the visualisation

Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec1

OR

Import this history:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec1.tar.gz

Section 2: Test differential expression with Cuffdiff [45 min]

The aim here is to:

  • Generate tables of normalised read counts per gene per sample based on an annotated reference transcriptome
  • Statistically test for significant difference in normalised read counts gene-by-gene and transcript-by-transcript, taking into account inter-sample variance
  • Assign each gene a probability score that it is genuinely differentially expressed between the two conditions and estimate the fold difference in expression.

All these steps are rolled up into a single tool: Cufflinks, which takes the aligned reads from Tophat and generates normalised read counts + a list of differentially expressed genes based on a reference transcriptome - in this case, the curated Ensembl list of D. melanogaster genes from chromosome 4 that we supply as a GTF (Gene Transfer Format) file

More detailed explanation of Cufflinks DGE testing

Note: Before starting the next section, leave the Trackster interface and return to the analysis view of Galaxy by clicking ‘Analyse Data’ at the top of the window.

  1. Examine the reference transcriptome
  1. Display the ensembl_dm3.chr4.gtf reference transcriptome file in Galaxy
  2. The reference transcriptome is essentially a list of chromosomal features which together define genes; each feature is in turn defined by chromosomal start and end point, feature type (CDS, gene, exon etc), and parent gene and transcript.
  1. Importantly, a gene will have many features, but one feature will belong to only one gene.
  1. Learn about the GTF format here

The ensembl_dm3.chr4.gtf file contains ~4900 features which together define the 92 known genes on chromosome 4 of Drosophila melanogaster

Cufflinks uses the reference transcriptome to aggregate read counts per gene, transcript, transcription start site and CoDing Sequence (CDS)

For this tutorial, we’ll only consider differential gene testing, but it is also possible to test for differential expression of transcripts or transcription start sites

  1. Identify differentially expressed genes and transcripts using the Cuffdiff tool

Galaxy tool panel: NGS: RNA Analysis>Cuffdiff

  1. Transcripts: ensembl_dm3.chr4.gtf
  2. Condition 1
  1. Group name: C1
  2. Add replicate-> Add File: ‘Tophat for Illumina data 1.accepted_hits’
  3. Add new replicate-> Add File: ‘Tophat for Illumina data 2.accepted_hits’
  4. Add new replicate-> Add File: ‘Tophat for Illumina data 3.accepted_hits’
  1. Condition 2
  1. Group name: C2
  2. Add replicate-> Add File: ‘Tophat for Illumina data 4.accepted_hits’
  3. Add new replicate-> Add File: ‘Tophat for Illumina data 5.accepted_hits’
  4. Add new replicate-> Add File: ‘Tophat for Illumina data 6.accepted_hits’

Screenshot: loading files in Cuffdiff

  1. Keep other defaults
  2. Execute

Screenshot

Output files generated are in the following categories

  1. Gene and transcript expression levels as tables of normalised read counts (‘FPKM tracking’)
  2. Differential analysis testing on:
  1. Genes
  2. Transcripts
  3. Transcription Start Site (TSS) groups
  4. Splicing: files reporting on splicing
  5. Promoter: differential transcripts via promoter switching
  6. CDS: CoDing Sequences

  1. Examine the tables of normalised gene counts
  1. Open the output files ‘Cuffdiff on data xx, data x, and others: gene FPKM tracking’ by clicking on the eye icon
  2. The file consists of one row for each gene in the reference transcriptome, with columns containing the normalised read counts for each of the two groups of three replicates each
  1. Note that cufflinks gives each gene it’s own ‘tracking_id’, which equates to a gene. Multiple transcription start sites are aggregated under a single tracking_id
  2. A gene encompasses a chromosomal locus which covers all the features that make up that gene (exons, introns, 5’ UTR etc)
  3. Each experimental group has a normalised value of FPKM. This mean ‘Fragments per kilobase per million reads’ and is a method of normalising read counts to account for:
  1. variation in sequencing yield between samples
  2. the different lengths of genes (as longer transcripts generate more reads)

  1. Inspect the gene differential expression testing file
  1. Click on the eye next to ‘Cuffdiff on data xx, data x, and others: gene differential expression testing’ and view the content (screenshot)
  2. The interesting columns are: gene (c3), locus (c4), log2(fold_change) (c10), p_value (c12), q_value (c13) and significant (c14)
  3. Filter based on column 14 (‘significant’) - a binary assessment of q_value>0.05, where q_value is p_value adjusted for multiple testing
  1. Go to Filter and Sort>Filter
  2. Select the ‘Cuffdiff on data....: gene differential expression testing’ dataset
  3. Filter the dataset with the following condition: c14=='yes'
  1. This will keep only those entries that Cufflinks has marked as significantly differentially expressed
  1. Execute (screenshot)
  1. This will produce a sorted gene expression file based on q_value>0.05

Note: You can rename this file (screenshot) by clicking on the pencil icon in front of the file name with ‘Edit Attribute’ caption to Significant_DiffExpGenes (or something similar) to make the filename more meaningful.

  1. Examine the sorted list of differentially expressed genes
  1. Click on Significant_DiffExpGenes and the eye icon next to it with caption ‘Display in data browser’’
  1. As you can see, two genes have been identified:
  • ‘Ank’ gene from chr4:137014-150378
  • ‘CG2177’ from chr4:331557-334534
  1. They are significantly differentially expressed, with q-values 0.00175 for each.
  2. ‘CG2177’ from chr4:331557-334534 was the gene that we intuitively saw to be differentially expressed in the previous section, in the broader region of chr4:325197-341887

What do the genes do? As this is a artificial data set, it doesn’t really matter; the point is to understand the process of differential gene expression testing using RNAseq data.

Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec2

OR

Import this history:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec2.tar.gz

Section 3. Repeat without replicates [20 min]

What happens if we only use one sample from each condition for differential gene expression testing?

  1. Repeat the differential gene expression testing from section 2, but this time only use one replicate from each condition group (C1 and C2).

Galaxy tool panel: NGS->RNA Analysis ->Cuffdiff

  1. Transcripts: ensembl_dm3.chr4.gtf
  2. Add new replicate-> Add File: ‘Tophat for Illumina data 1.accepted_hits’
  3. Add new replicate-> Add File: ‘Tophat for Illumina data 4.accepted_hits’
  4. Library normalization method: ‘classic-fpkm’
  5. Dispersion estimate method: ‘blind’
  6. Execute.

Then filter the generated geneset for significantly differentially expressed genes (Filter with c14=’yes’ as for previous section)

  1. Click on Significant_DiffExpGenes and the eye icon next to it with caption ‘Display in data browser’’

You should get _no_ differentially expressed genes at statistical significance of 0.05

The ‘Ank’ gene, which was found to be significantly expressed in our first analysis, is not identified in this analysis as differentially expressed. The ‘CG1277’ gene has a q_value of 0.091, just above the cutoff of 0.05 for statistical significance, so also is a ‘false negative’ in the sense that it doesn’t reach our definition of significance but we know it is differentially expressed if we have more data as in the previous section. If we ranked the genes by q_value, CG1277 would clearly come out as the _most_ significantly differentially expressed.

There are no false positives; but this can depend on the version of Cuffdiff you are using, or in fact any differential expression analysis tool. For example, the previous version of Cufflinks identified 9 genes that were differentially expressed that were not identified when replicates are used i.e. false positives.

As the algorithms for testing differential expression improve, it is likely that genes may slip into/out of significance - but you will always have more confidence if you have more replicates showing a difference in a genes expression

  1. Repeat this no-replicates analysis, but this time specify a different set of samples:

Galaxy tool panel: NGS->RNA Analysis ->Cuffdiff

  1. Transcripts: ensembl_dm3.chr4.gtf
  2. Condition 1: Add replicate-> ‘Tophat for Illumina data 1.accepted_hits’
  3. Condition 2: Add replicate-> ‘Tophat for Illumina data 5.accepted_hits’
  4. Library normalization method: ‘classic-fpkm’
  5. Dispersion estimate method: ‘blind’
  6. Execute.

You now see ‘CG2177’ appear again in the list as significantly differentially expressed, but not Ank. How do we interpret this? There is more difference in CG1277 expression (after normalisation) between samples 1 and 5 than samples 1 and 4, even though samples 4 and 5 are replicates; enough of a difference, in fact, to get CG1277 over the ‘statistical significance’ line of 0.05 in the case of (sample 1 vs 5) but not in (1 vs 4).

The size of the (normalised) difference in expression of Ank is much smaller, so we need to see it consistently across multiple replicates to be confident it actually exists. One replicate is not enough.

The identification of significantly different genes is based on (at least) the size of the difference in expression and the observation of a difference across multiple replicates. This demonstrates how important it is to have biological replicates in differential gene expression experiments.

  1. Repeat this analysis, specifying groups of two replicates each

What do you get? How many replicates do we need to identify Ank as differentially expressed?

Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec3

OR

Import this history:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec3.tar.gz

Section 4. Extensions [10 min]

The Full Tuxedo Protocol

The full Tuxedo protocol includes two other tools, Cufflinks and Cuffmerge. Cufflinks and Cuffmerge build a new reference transcriptome by identify ALL transcripts and genes in the RNAseq dataset - ie using these tools will allow you to identify new genes and transcripts, and then analyse them for differential expression.

This is critical for organisms in which the transcriptome is not well characterised.

Read more on the full Tuxedo protocol here.

Transcript-level differential expression

One can think of a scenario in an experiment aiming to investigate the differences between two experimental conditions, where a gene had the same number of read counts in the two conditions but these read counts were derived from different transcripts; this gene would not be identified in a differential gene expression test, but would be in a differential transcript expression test.

The choice of what ‘unit of aggregation’ to use in differential expression testing is one that should be made by the biological investigator, and will affect the bioinformatics analysis done (and probably the data generation too).


We can explore the possibility of
new differentially expressed genes in our dataset by following the full Tuxedo protocol:

  1. Import a completed Galaxy history in which the full protocol has been follow: Shared Data>Published Histories>RNAseq_DGE_BASIC_Section4
  2. There are a number of additional files generated by Cufflinks and Cuffmerge, culminating in the creation of a new reference transcriptome which is used by Cuffdiff to test differential expression
  3. The final outcome of this process results in a list of differentially expressed genes which is the same as the alternate protocol: ‘CG2177’ and also ‘Ank’, although not quite at the same significance (Ank is q_value=0.09275)
  1. This is close to the results we obtained in Section 2 of this tutorial and gives us confidence in the results.
  2. The differences in significance probably come down to the definition of the genes, as we are using a newly constructed gene model which may mean some counts are not aggregated into the new Ank gene definition, for example.

The reason that we get generally the same results from the full Tuxedo protocol is, in this instance, a result of D. melanogaster having a well characterized transcriptome AND this being a synthetic dataset generated from known genes, so of course building the transcriptome again is unlikely to add extra information for finding the differentially expressed genes (and may actually lose information - note that our gene model as produced by cuffmerge has 3,183 transcripts identified in chr4, but the Ensembl gene model has 4.903 identified in the same chromosome). This however, is not true for other genomes that are not so well characterized or in cases such as cancer, where the transcriptomes could have changed significantly compared to the reference. In such cases the full Tuxedo protocol should be followed.

Completed Galaxy history for this section (in SharedData>Published Histories): RNAseqDGE_BASIC_Sec4

OR

Import this history:

https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_BASIC/Galaxy-History-RNAseqDGE_BASIC_Sec4.tar.gz

References

Trapnell C, Roberts A, Pachter L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols [serial online]. March 1, 2012;7(3):562-578.