EDGE3 data analysis: Standard Hierarchical Clustering
This is a tutorial designed to get you up to speed on the functionality of EDGE3.
Helpful tip: Links to external websites in this document appear as blue, underlined text (e.g., EDGE3). We encourage you to follow them for more information.
Helpful tip: Until we get a table of contents section, the best way to search for something in this document is to utilize the search function of your browser. This is usually initiated by the Ctrl-F key combination shortcut.
The interface:
The image above represents what you will see after you've logged in and chosen Standard Clustering from the menu option under the Data Analysis subsection. In general, the layout remains the same while doing data analysis. There are three enumerated sections. The first section, 1-Main Menu, is located in the upper left-hand side of the screen. This is where you select the function of EDGE3 you wish to use. The second section, 2-Instructions, is located in the lower left-hand side and is probably where you found this tutorial. The left-hand side of the screen is an accordion pane: by clicking on the little triangular carets you can open and close the EDGE3 menu and the Useful Information/Instructions sections. For an image of the Useful Information/Instructions section, see below this paragraph. Section 3, 3-Working Area, is where you do most of the data generation.

Standard Clustering
To begin with we will give a brief overview of the Standard Clustering interface. The interface is depicted in the image below and sections of interest at the moment are enumerated and labeled. We will go over the details of each of the enumerated sections.
As you can see from the image, there are eight enumerated sections that we will cover in this initial overview. Before we progress too far, note the light bulb icons. These icons provide context-based help for the fields in the query form that they are next to. As you can see, there are other sections that are not enumerated; we will go over them as this tutorial progresses.
The first section, 1-Two algorithms, is where you choose the method of clustering you desire to utilize with your data. There are two methods available, Hierarchical Clustering and K-means Clustering. Click HERE for a general description of the two methods. For the first part of the tutorial we will utilize hierarchical clustering.
The second section, 2-Data Sources, is currently only applicable to arrays generated on the Mouse array platform. This will be available for other platforms in the future. The difference between the two options is that with the Condensed option, if a gene is spotted multiple times on the array, the values of all spots for that gene are averaged into a single number. This cuts down on the number of data points that are created, but some information is lost in the process. The Not Condensed option, when selected, causes the query to take into consideration every gene probe on the array.
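To make the Condensed option concrete, here is a minimal Python sketch of averaging replicate spots per gene. It is an illustration only: the function name, the gene labels, and the values are made up, and the tutorial does not say which quantity EDGE3 actually averages.

```python
from collections import defaultdict
from statistics import mean


def condense(spots):
    """Average all spots that share a gene name into one value per gene."""
    by_gene = defaultdict(list)
    for gene, value in spots:
        by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


# Hypothetical spot values for one array: geneA is spotted three times.
spots = [("geneA", 8.0), ("geneA", 10.0), ("geneA", 12.0), ("geneB", 2.5)]

condensed = condense(spots)
print(condensed)
```

Four spots collapse to two values here, which is the reduction in data points (and the loss of per-spot detail) that the Condensed option trades for.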
The third section, 3-Experiments, is where you are able to select your arrays to be included in the query. As you can see there are four tabs heading this section with each tab representing one of the available organism arrays.
The fourth section, 4-Fold-change Thresholds, allows you to limit your query to only look at genes above or below a desired fold-change or within a range of fold-change values.
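As a rough sketch of what a fold-change threshold does, the Python below keeps only genes that reach an upper or lower threshold. Several things here are assumptions rather than documented EDGE3 behavior: the signed fold-change convention (+2.0 for two-fold up, -2.0 for two-fold down), the rule that a gene is kept if it passes in at least one array, and all names and values.

```python
def filter_by_fold_change(genes, up=2.0, down=-2.0):
    """Keep genes whose fold-change reaches a threshold in at least one array.

    Assumes signed fold-change values: +2.0 is two-fold up, -2.0 two-fold down.
    """
    kept = {}
    for gene, values in genes.items():
        if any(v >= up or v <= down for v in values):
            kept[gene] = values
    return kept


# Hypothetical fold-change values across three arrays.
genes = {
    "geneA": [45.2, 51.0, 48.7],
    "geneB": [-2.4, -1.1, -1.3],
    "geneC": [1.1, -1.0, 1.2],
}

kept = filter_by_fold_change(genes, up=2.0, down=-2.0)
print(sorted(kept))
```

Raising `up` or lowering `down` shrinks the set of returned genes, which is how the example query later in this tutorial limits its results.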
The fifth section, 5-Probe Specific Thresholds, allows you to set the Red and Green Processed Signal cutoffs and the p-Value cutoff. These values are all generated by the Feature Extraction software. In the image above the Red and Green Processed Signals are set to a threshold of 100, meaning that any probe whose signal is not greater than 100 in either channel will be excluded from the query results. The p-Value cutoff is set to 0.01. The p-Value measures how significant a spot's differential expression is: the smaller the value, the stronger the evidence that the spot is differentially expressed. For more information on these values, you are encouraged to check out Agilent's web site to see how they are calculated and defined.
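The probe-specific thresholds can be pictured as a simple filter. The sketch below is one possible reading, not the EDGE3 implementation: it assumes a probe must exceed the signal cutoff in both channels and have a p-Value at or below the cutoff. The `Probe` class, field names, and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    name: str
    red_signal: float    # red channel processed signal
    green_signal: float  # green channel processed signal
    p_value: float


def passes_probe_thresholds(probe, min_red=100.0, min_green=100.0, max_p=0.01):
    """Assumed rule: both signals above their cutoffs and p-value at or below max_p."""
    return (
        probe.red_signal > min_red
        and probe.green_signal > min_green
        and probe.p_value <= max_p
    )


# Hypothetical probes illustrating each way of failing the filter.
probes = [
    Probe("probe_1", 850.0, 120.0, 0.001),  # passes every threshold
    Probe("probe_2", 90.0, 400.0, 0.001),   # red signal too low
    Probe("probe_3", 500.0, 600.0, 0.2),    # p-value too high
]

kept = [p.name for p in probes if passes_probe_thresholds(p)]
print(kept)
```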
The sixth section, 6-Heatmap color options, allows you to modify the color range of the heatmap. The first color listed is utilized for genes that are up-regulated and the second color is utilized for genes that are down-regulated.
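One common way to turn an expression value into a heatmap color is sketched below in Python. This is an assumption about how such a mapping might work, not a description of EDGE3's rendering: the linear scaling, the black midpoint, the clipping limit `max_abs`, and the function name are all illustrative choices.

```python
def heatmap_color(value, max_abs=3.0, up_color=(255, 0, 0), down_color=(0, 255, 0)):
    """Map a signed expression value (for example a log ratio) to an RGB tuple.

    Positive values shade toward up_color, negative values toward down_color,
    and zero maps to black. Values beyond +/- max_abs are clipped.
    """
    intensity = min(abs(value), max_abs) / max_abs
    base = up_color if value >= 0 else down_color
    return tuple(round(channel * intensity) for channel in base)


print(heatmap_color(3.0))   # strongly up-regulated: full first color
print(heatmap_color(-1.5))  # moderately down-regulated: dimmer second color
print(heatmap_color(0.0))   # unchanged: black
```

Swapping `up_color` and `down_color` for other RGB tuples corresponds to choosing a different color pair on the form.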
The seventh section, 7-Image formats, allows you to select the output format of the generated heatmap. The default image type selected is PNG (portable network graphics). Click HERE for more information on the PNG format. The two other formats are SVG and JPG. SVG (scalable vector graphics), unlike JPG and PNG, is not a raster-based image. In general, SVG files are larger than files in the other two formats.
The eighth section, 8-Heat-map interactivity, allows you to select whether or not an image map is included with the image (i.e., for JPG and PNG formats). The image map lets you click on the heat map image to obtain more information on arrays, genes and gene expression. By default, links are included in the SVG file.
Querying the Data
Now that you've had an introduction to the general layout of the Standard Clustering form, we will present an example of Hierarchical clustering. For this query we will leave the settings in Sections 1 and 2 as they are. However, to select the arrays for this query, we are going to have to choose arrays from the Toxicological Experiment in Section 3. To open the experiment, all you have to do is click on the Toxicology Experiment header to reveal the arrays associated with that experiment. The image below shows the results:
For this example, we will be choosing 6 arrays. To select/deselect an array all you have to do is click the checkbox that precedes the array name. The image below depicts the 6 arrays we've selected:
Background: The arrays we've selected are six arrays from a toxicological study. There are 3 Corn Oil-treated biological replicates and 3 TCDD-treated biological replicates. The arrays were generated by labeling the RNA from the treated animals with Cy3 and the RNA of a reference sample with Cy5.
After you have selected your arrays, the next step is to choose the fold-change threshold parameters. This can be done in Section 4. For this example we are going to limit the number of genes that are returned by increasing the fold-change thresholds. We are going to leave the values in the remaining sections unchanged and will proceed to click on the Submit button. Doing so results in the following:

This page represents the results page for a Standard Clustering Query. What is evident immediately is the generated heatmap. Before we describe that in detail, we would like to address the three sections above the heatmap: 1) Hierarchical Clustering Results/Input Parameters, 2) Associated Data and Image Files, and 3) Save/Update Query. The image below depicts the contents of these sections when they are all opened (this is done by clicking on the section headers).
- Hierarchical Clustering Results/Input Parameters: This section tells you how many arrays you selected and the number of genes that met your query parameters. It also echoes a number of your entered parameters so that you can reference them quickly. This is a helpful feature when a query takes a while to complete, allowing your short-term memory to become cluttered with other things.
- Associated Data and Image Files: This section lists data files generated during Hierarchical clustering. As you can see there are two types of files generated, Data Files and Image Files. In this particular example two types of data files have been generated. The first type contains the fold-change values for the genes across the arrays and the second type contains the processed signal values for the Cy3 and Cy5 channels for the genes across the arrays.
- Fold-change table (html) example. This file begins with three annotation columns that precede the data. The first column is the Feature Number on the array. The second column is the gene name. The third column represents the Refseq or GenBank Accession for the gene. Those with valid accessions are in blue and are linked to NCBI. The fourth column is blank and the rest of the columns represent the fold-change values for the gene on that line and the array in that column.
- Fold-change CSV file example. This file is basically the same as the html file except that it is easily imported into spreadsheet programs because it is a comma-separated value (CSV) file. Here is an example of the output:
- Image Files. We've already gone over the different image formats. Even though you select one output image format on the Standard Clustering form, all three types are generated. If you are going to use any of these images in publications, we suggest that you work with the SVG format. This format is easily edited by programs such as Adobe Illustrator and, for those of you who prefer low-cost alternatives, Inkscape. Inkscape has proven to be more than adequate in our hands at editing SVG files generated by EDGE3.
- Save/Update Query: This section allows you to save the query that you generated. Although (currently) it doesn't allow you to save the results set itself, it does allow you to save the query parameters you entered. This feature saves you time with queries that have a large number of arrays, because all you have to do is select your query instead of going through the mind-numbing process of checking boxes. We will go into greater detail later in this tutorial.
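For readers who would rather process the fold-change CSV file in a script than in a spreadsheet, a minimal Python sketch follows. It assumes the CSV mirrors the HTML table described above (feature number, gene name, accession, a blank column, then one fold-change column per array) and that the first row is a header; neither is confirmed here. The sample rows, accessions, and array names are made up and are not EDGE3 output.

```python
import csv
import io

# Made-up sample in the assumed layout; not actual EDGE3 output.
sample = """Feature Number,Gene Name,Accession,,Array 1,Array 2
101,geneA,ACC0001,,2.5,3.1
102,geneB,ACC0002,,-1.8,-2.2
"""


def read_fold_changes(handle):
    """Return {gene name: {array name: fold-change}} from an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader)
    array_names = header[4:]  # data columns follow the blank fourth column
    table = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        gene = row[1]
        table[gene] = {name: float(v) for name, v in zip(array_names, row[4:])}
    return table


table = read_fold_changes(io.StringIO(sample))
print(table["geneA"]["Array 2"])
```

With a downloaded file, the same function could be called on a handle opened with `open(path, newline="")`, where `path` is wherever you saved the file.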
Now that we've gone through that bit of information, the resulting heatmap with dendrograms is displayed below:
The heatmap is the area where the expression values are qualitatively represented by degrees of red and green color. Associated with the heatmap are Genes, labeled on the right side of the heatmap, and Arrays, labeled across the top of the heatmap. There are two dendrograms associated with the heatmap. The dendrograms represent how the hierarchical clustering algorithm organized the genes and the arrays. Those genes or arrays on the same "branch" of the dendrogram were deemed to be the most similar by the hierarchical clustering algorithm. As you branch out further (i.e., away from the heatmap) the distance between the inner branches increases and differences become greater. The "root" of the tree is essentially the point where a bifurcation occurs between the two most dissimilar groups. In the case of the arrays, it is evident that the algorithm considers the three corn oil biological replicates quite different from the three TCDD biological replicates, because they are separated by the first bifurcation of the Array Dendrogram. In the case of the Genes Dendrogram, the algorithm has separated the down-regulated genes and up-regulated genes, as is evidenced by the distinct bifurcation at its root. For the moment, ignore the pink square in this image as we will address that below.
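To illustrate how a dendrogram like this comes about, here is a small, self-contained Python sketch of agglomerative hierarchical clustering. It is not the EDGE3 implementation: the tutorial does not state which distance metric or linkage method EDGE3 uses, so Euclidean distance and average linkage are assumed here, and the array labels and expression values are invented.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, profiles):
    """Average linkage: mean pairwise distance between members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_cluster(profiles):
    """Merge the two closest clusters until one remains; return nested tuples."""
    # Each cluster is (member labels, subtree built so far).
    clusters = [([label], label) for label in profiles]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i][0], clusters[j][0], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]


def leaves(node):
    """List the labels under a branch of the tree."""
    return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])


# Hypothetical profiles (one value per gene) for six arrays.
arrays = {
    "control_1": [0.1, -0.2, 0.0],
    "control_2": [0.0, -0.1, 0.2],
    "control_3": [0.2, 0.0, 0.1],
    "treated_1": [3.1, -2.0, 1.9],
    "treated_2": [2.9, -2.2, 2.1],
    "treated_3": [3.0, -1.9, 2.0],
}

tree = hierarchical_cluster(arrays)
left, right = tree  # the two branches at the root
print(sorted(leaves(left)), sorted(leaves(right)))
```

Because the most similar clusters are merged first, the last merge (the root) joins the two most dissimilar groups, which in this toy data are the control and treated arrays.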
If you remember from our initial query, we included an image map with the selected image output format, PNG. This means that the heatmap we generated contains clickable links. There are three different types of links in this heatmap: 1) Gene Names, 2) Array Names, and 3) Heat map cells.
- Gene Names: Each of the gene names listed on the right side of the heatmap is interactive. As an example, we've clicked on Cyp1a1, which opens the web page in a different tab or window. Here is the image of this:
- As you can see, this brings back a range of information. It tells you the Feature Number, Probe UID and Probe Name, which are Agilent-specific identifiers. In the Probe Sequence and Annotations section, you obtain the sequence of the oligo for this probe as well as various annotation-associated links to external databases to obtain more information on the gene in question.
- Array Names: The array names located at the top of the heatmap are also clickable. When you click on an array, it brings up the following:
- Currently the information is pretty generic, but in the future we will have links to the RNA samples associated with this array as well as the experiment(s) the array is associated with.
- Heatmap cells: A heatmap cell is the square that qualitatively represents the expression of a gene for a particular array. The heatmap cells are essentially a genes x arrays matrix. As an example, please refer back to the image of the heatmap above. The pink square in this heatmap encloses the heatmap cell for gene Rgs16 on the Liver A1 Corn Oil 10ml/kg array. It appears to be expressed differently than in the other two Corn Oil treatments, so we are going to look at the pValue assigned by the Feature Extraction software to see if this is a significant result or if there may have been some issues. To do this, we clicked on the cell and were presented with the following:
- The pValue associated with this feature seems to indicate that the gene is being differentially expressed. We'd have to check out the other two cells for this gene in the context of the other two biological replicates to determine whether or not this may be a case of biological variation.
- In addition to the pValue, there is more information on this page, including the processed signals and fold-change for the cell. Additionally, there is a Probe Sequence and Annotations section (similar to when clicking on a gene name) as well as an All Data Values section where all of the Feature Extraction data generated for this spot on the array is presented. This section was not expanded, because there is quite a bit of data.
This concludes our brief tutorial on Standard Clustering using the Hierarchical clustering algorithm.