Previously, we passed a rank list of differentially expressed genes between platelets of healthy donors and breast cancer patients along with a gene set database to the Gene Set Enrichment Analysis (GSEA) software in hopes of identifying significantly altered pathways. We could have simply ended there and examined our GSEA reports for those pathways in the list that peaked our scientific interest. In many instances, however, the results of GSEA are voluminous in part because of a high degree of redundancy amongst pathways. Can we reduce this redundancy and provide a more meaningful, interactive means to reason over the results?
In this workflow step, we describe Enrichment Map (EM), a visual tool that simplifies the display of a list of gene sets with a large degree of redundancy. By then end of this discussion you should:
EM was originally described by Merico et al. (Merico 2010) as an aid in the interpretation of gene sets emerging from enrichment analyses. The motivation for this tool is the growing number of gene sets and gene annotation detail being made available, which leads to large numbers of gene sets emerging from enrichment analyses. Even with stringent criteria, it is not uncommon for analyses to generate hundreds or thousands of pathways, making results difficult to interpret on whole. To illustrate the value of EM, let us return to our analysis of alterations in pathways of platelets by Best et al..
We used GSEA to distill gene sets enriched in the underlying differences in RNA abundance between platelets from healthy donors (HD) and individuals diagnosed with breast cancer (BrCa). Our rationale for applying enrichment analysis was that we would much rather reason at the pathway-level than the gene-level. Figure 1 shows a section of the GSEA report and Table 1 shows an excerpt of the actual enrichment report for BrCa.
Table 1. Enrichment report for BrCa (na_pos phenotype)
|NAME||SIZE||ES||NES||NOM p-val||FDR q-val||FWER p-val|
|THROMBOXANE A2 RECEPTOR SIGNALING||39||0.84532094||2.175933||0.0||0.0||0.0|
|SIGNALING EVENTS MEDIATED BY FOCAL ADHESION KINASE||48||0.8144367||2.16101||0.0||0.0||0.0|
Notice that even under fairly stringent criteria (i.e. p-value < 1%), there are a still 578 gene sets that are deemed significantly enriched in BrCa platelets and 394 in HD. Furthermore, our GSEA report declares that 8 191 genes were used as input in our ranked list. Have we simply kicked the can down the road and traded a long list of ~8 200 genes for a long list of ~1 000 gene sets?
Ideally, we want a simple way to reduce the size of the data from an enrichment analysis while retaining, to the best of our ability, the amount of meaningful information. To this end, many approaches have set their sights on reducing the amount of redundancy amongst gene sets.
Many enrichment analyses use the Gene Ontology (Ashburner 2000) which is hierarchically organized. In this case, redundancy can be reduced by merging gene sets that are children of a parent cluster with broader scope. These and other similar approaches rely on variations of the same statistical test for enrichment. Importantly, these approaches do not apply to gene sets without a clearly defined hierarchy, for example, transcriptional regulator target genes. What we desire is a versatile method able to manage non-hierarchical gene sets that emerge from analysis methods, regardless of nature of the underlying enrichment test used (Khatri 2012).
EM is a visualization analysis tool that organizes gene sets into an information-rich similarity network. The true power of Enrichment Map is that it is a visual display method that reduces complexity by grouping similar gene sets as defined by the number of overlapping genes. These clusters can then be annotated with a representative label gleaned from the characteristics of the individual gene sets (Figure 2).
Gene sets are represented as nodes whose radius is proportional to the number of genes. Edges indicate nodes with shared genes, where the thickness of the line is proportional to the degree of overlap. Finally, EM can use node color to represent other dimensions of the data, for example, gene sets enriched in different classes.
The approach is modular in that it is compatible with any statistical test or gene set source. Enrichment Map shines when dealing with a collection of gene sets with a large degree of redundancy.
Control Panelis where settings for Cytoscape and each app appear. (C) The
Table Paneldisplays data for each node and edge along with expression heatmaps. (D) The
Results Panelis where settings declared in the
Control Panelcan be fine-tuned. Note that the NetPath IL-5 signalling pathway has been highlighted in the main window (yellow node, top right) and the corresponding gene expression is displayed in the
It should be immediately apparent from Figure 3 that the Enrichment Map app provides several helpful features including:
Below, we describe the process of loading our file dependencies from previous workflow steps (i.e. enrichment reports, expression, phenotypes, ranks) into the Cytoscape desktop software towards generating an Enrichment Map similar to Figure 3. We will then provide some tips on how to navigate, tune and interpret the EM in an effort to facilitate new scientific insight.
The base Cytoscape installation will use a collection of apps to generate the Enrichment Map. Apps are loaded into Cytoscape using the built-in App Manager.
From the Cytoscape menu bar, select
'Apps' --> 'App Manager' to bring up an app search panel (Figure 4).
Install Appstab. Beware that the App Manager search engine will respect spaces in entered text (e.g. 'EnrichmentMap' will be found but not 'Enrichment Map').
Search for the app, highlight the correct search result and click
Install. If the installation was successful, you should be able to select it by name in the
Apps menu bar drop-down. Also, if you bring up the
App Manager you should see it as a listing under the
Currently Installed tab.
Launch the Enrichment Map app by selecting from the menu bar
Apps --> EnrichmentMap --> Create Enrichment Map (Figure 5).
Control Panelon the left is where we will load our data files and adjust the Enrichment Map settings.
Below we provide the dependencies that will be loaded into the Enrichment Map app through the Cytoscape
Control Panel (Figure 5, left). All of these files were either provided of generated in previous workflow steps.
Having obtained the six (6) file dependencies you may now load these into the Cytoscape Enrichment Map app via the
Control Panel (Figure 6).
Control Paneltab for the
Enrichment Map Input. Load in the file dependencies for
Control Panel tab for Enrichment Map (Figure 6), make the following adjustments to the following sections.
BrCa VS. HDto reflect the comparisons we made. This also matches the order in which you loaded files into the
FDR Q-value Cutoffto 1E-04 in
Similarity Cutoff, click the radio for
Jaccard+Overlap Combinedand keep the
You will be able to adjust the cutoff parameters later on in the
Results Panelafter the Enrichment Map is built and displayed
Build to generate the EM displayed in Figure 7. Take some time to examine main window displaying the network. In particular, in the lower right region of the main window there is a bird’s eye view showing the region displayed in the main view. Let us recap what we are seeing in the main view:
Results Panel shows our currently selected parameters that were originally set in the
Table Panel has the
Node Table tab selected by default, listing our (selected) gene sets.
Legendtab. Increase or decrease the stringency for displaying nodes (
Q-value Cutoff) and edges (
Similarity Cutoff) using the sliders. For instance, one can restrict the displayed gene sets to those with very low chance of being a false positive.
Heat Maptab as in Figure 7, you will see a depiction of the normalized gene expression values and the samples coloured by class. If you click on the
Node Table(Figure 8) it should display the information summarized in the graph but displayed in tabular form. Note that if a node in the graph is selected or if there is a search match,only those results will be shown. Click anywhere on the map outside of a node to see the full results.
You can search the Enrichment Map by keyword using the input text box located at the top-right. For example, searching for ‘IL5%NETPATH%IL5’ higlights the corresponding node in yellow and selects it so you can immediately view data in the
It is very easy to get overwhelmed by the amount of options, settings and visual information in an Enrichment Map (Figure 7) so now is a good time to pause to reflect back upon our goal: To reduce the complexity in our long list of enriched gene sets from GSEA while maintaining as much useful information. To this end, note the presence of gene set ‘clusters’ as indicated by a very dense group of edges connecting nodes (e.g. Figure 7, top-left in blue). These are gene sets with many shared genes that likely represent very similar processes. Our aim will be to display such clusters as a single node and annotate it with one label that reflects the overarching theme of its constituents (Figure 9).
Clusters within the Enrichment Map represent similar biological processes and themes. In order to better summarize the Enrichment map we want to be able to annotate each of these clusters with the main theme associated with it. To do this we use the AutoAnnotate app to help us summarize the network and its themes. AutoAnnotate first clusters the network and then uses WordCloud to calculate the most frequent words present in each cluster node labels in efforts to highlight commonalities between the nodes in the cluster.
i. Separating gene sets in each class
In the main window graph, let us spatially separate the gene sets enriched in each class so that when we go to add labels, they are readily distinguishable.
We will select the BrCa group in red and drag the entire group over to the right separate it from the HD group in blue. To do this, Highlight the class by creating a column filter for nodes with a positive NES:
Control Paneltab for
+to add a filter
Column Filterfrom the drop-down
Figure 10 shows the selected nodes in yellow representing the gene sets enriched in BrCa. Drag the entire group to the right of the unselected gene sets.
Selectfor NES above 0.
ii. AutoAnnotate groups
AutoAnnotate first clusters the network and then uses WordCloud to calculate the most frequent words present in each cluster’s node labels (Figure 10).
Apps -> AutoAnnotate -> New Annotation Set...
Create Annotations to start the annotation process. You should see clusters forming and being annotated in the main window (Figure 11).
iii. Collapse groups
Figure 11 shows a pretty busy picture; It is natural to gravitate towards large clusters that appear in the enrichment results, however, in this case, size does not indicate importance or strength but rather the amount of database annotation there exist for a particular pathway or process. Single nodes represent processes that are less well known but no less important than the large clusters. In order to remove the bias introduced by redundant pathway annotations it is good to collapse the network, i.e. create a single group node for every cluster whose name is summary annotation calculated for it, in order to more easily see the overall themes present in the enrichment results (Figure 12).
Control Panelselect the
iv. Expand a group
Let’s reverse the process selectively and drill down into a particular group. Recall our running interest in the IL-5 signal transduction pathway originally curated by NetPath? Well it is hidden inside the cluster labelled ‘pid angiopoietin receptor pathway’. We can recover the view for this gene set (Figure 13).
Control Panelselect the
pid angiopoietin receptor pathway
Our cluster ‘pid angiopoietin receptor pathway’ has now been expanded to show the constituent pathways that include our IL-5 gene set. The edges indicate shared genes, but to get a better idea of the overlap, let’s look at the overlap in all pairs of gene sets in table format (Figure 14).
Table Paneltab for
Edge Tableshows the pair-wise overlap in genes between different gene sets in the cluster (EM1_Overlap_size, EM1_Overlap_genes). Highlighted in the
Table Panelis the overlap between IL-3 and IL-5.
Take for instance the IL-3 and IL-5 signalling pathways: IL-3 consists of 75 genes and IL-5 contains 49 (
Node Table); The gene sets share 33 genes between them (
Cytoscape affords the user a great deal of control over styles and layout. There is only so much that can be automated, so it will be up to you to tweak the look of the Enrichment Map to suit your needs. Please refer to the user manual for full description of capabilities.
So what now? Below, we list some helpful tips to aid you in extracting value from an enrichment analysis.
Check out a case study to see how an Enrichment Map workflow was used to provide the basis for experimentally verifiable, mechanistic insights into cell function
No algorithm or statistic can replace the experience, expertise and critical eye of you the researcher. The best place to start with a completed EM is to examine it for genes, pathways and themes that would be expected a priori. These are unsurprising results that have either been previously reported or which one could have easily guessed before the analysis was even performed. Do the results pass the sanity test?
As an example, direct your attention to the cluster annotated with the label ‘blood coagulation platelet activation’ (Hint: Do a text search). This cluster consists of a hierarchy of three gene sets referencing increasingly specific GO terms (Figure 15):
This is where the rubber meets the road: The enrichment analysis and EM are evidence that helps you to bridge bodies of knowledge and make new, interesting connections. The discoveries that will nourish a steady-stream of publications and earn the envy of peers is largely up to you.
One simple way to get the creative juices flowing is to try and understand whether there are established connections between certain gene sets/groups and your specific biological context of interest. For example, one can ask whether there is an established connections between cytokine signaling pathways (‘IL-3’, ‘IL-5’, ‘IL-6’), platelets and malignancies.
Perhaps we’ve noticed an interesting pathway not previously reported in our context. One way to build confidence in the result is to simply examine visually whether there is a contrast in expression between classes.
Recall that our p-values for differential gene expression do not provide any information about the magnitude of the expression differences.
One approach to examine expression of a gene set is to focus on the ‘leading edge’ - the subset of genes that contribute to the enrichment analysis enrichment score (ES). This leading edge can be easily seen in the
Heat Map tab by selecting
GSEA ranking from the
Sorting drop-down (Figure 16).
Recall that the leading edge data derives from our rank file.
GSEA Rankingoption from the
Sortingdrop-down. Genes of the gene set that contribute most to the enrichment score - the 'leading edge' - are highlighted in yellow.
Coming soon to a web app near you: