Previously, we introduced a study by Best et al. (Best 2015) that examined the feasibility of using platelet transcriptomes to distinguish individuals diagnosed with breast cancer (BrCa) from healthy donors (HD). The workflow step took in RNA-Seq counts and metadata describing the samples (i.e. HD or BrCa) and analyzed the data for gene-wise differential expression (DE). One of the outputs from that step is a list of each RNA species and a respective rank, calculated from the DE test. In brief, the magnitude of rank is proportional to the ‘rareness’ of a difference in RNA counts at least as large as that observed, assuming no association between sample class assignment and RNA count.
Note: For the purposes of this discussion, we will use the terms ‘pathway’ and ‘gene set’ interchangeably. It is more appropriate to use the term ‘pathway’ to refer to gene set components that enact a change or product (e.g. signal transduction, metabolism).
In this section we discuss the use of Gene Set Enrichment Analysis (GSEA) to sift out pathways from the underlying alterations in gene expression (Figure 1). The pathways that emerge from this analysis will be passed on to workflow step 3 (Visualize) where we attempt to simplify their interpretation using the Enrichment Map visualization tool.
By then end of this discussion you should:
A detailed description of GSEA is beyond the scope of this section. For a more technical explanation please refer to our GSEA primer. Recall that, in addition to a rank file, GSEA requires a set of candidate gene sets in the form of a gene set database file (Figure 1). Below we describe a typical gene set database and provide an extremely brief overview for how GSEA operates.
GSEA searches through candidate gene sets to identify those that are enriched within our ranked gene list. ‘Gene sets’ are a set of genes grouped together based upon some known relationship. For example, a gene set could consist of loci assigned to a particular chromosomal band. Alternatively, pathways - those genes whose coordinated activity lead to change or produce a product - can also be represented as a gene set. For example, a gene set for a pathway could be represented by genes involved in DNA damage cell cycle checkpoint.
Concretely, let us consider the IL-5 Signaling Pathway curated by as part of the NetPath database (Figure 2).
The pathway depicted in Figure 2 has a corresponding entry in NetPath that contains a listing of the genes and a brief description (Figure 3).
The group of candidate gene sets under consideration for a given GSEA run can be individually selected to reflect the interest and focus of the study at hand. More often, GSEA is used in an exploratory fashion and so candidate gene sets curated by others are often employed, which typically cover a wide swath.
To make life easier, GSEA has standardized the acceptable formats for defining gene set candidates and these are collectively known as ‘gene set databases’. These are simple tab-delimited files with the extensions that include
.grp. Concretely, let us consider a sample gene set database file developed by Gary Bader’s laboratory that defines human gene sets gathered from several sources including MSigDB; Gene Ontology; Reactome; Panther; NetPath; NCI; and HumanCyc. This database file defines a table, one gene set per row: The first column holds the gene set name, the second is a description and the subsequent fields are the names of genes in the set. Figure 4 shows an exceprt of the database, which has an entry for the NetPath IL-5 pathway.
Once we have gathered our candidate gene set database and the ranked gene list, we are ready to proceed. The details of GSEA are rather technical; Instead, we provide an overview of the algorithm (Box 1).
An ES effectively tells us how many of the candidate set's genes are 'bunched up' at the top (positive rank value) or the bottom (negative rank value) of the ranked gene list. More 'bunching' at the top means that the candidate's genes are more likely to be truly upregulated and vice versa.
This is the fraction of the random gene set ESs that exceed the candidate ES. A smaller p-value corresponds to a increasingly rare observation.
The chance that a gene set will be significant increases with the number of gene sets tested. Thus, corrections must be made to take this into account.
Let us review what we have learned in our lead up to this step: GSEA can be used to identify altered pathways from underlying changes in gene expression; The two inputs to GSEA are a rank gene list generated from our differential expression analysis along with a gene set database composed of candidate gene sets that GSEA will filter.
In this workflow step, we will use GSEA to generate two files that report enriched gene sets in each of the two conditions tested. An enrichment report is a table that lists the statistics describing the extent of enrichment (Table 1).
Table 1. Sample enrichment report
|NAME||…||SIZE||ES||NES||NOM p-val||FDR q-val||…|
|EUKARYOTIC TRANSLATION ELONGATION||…||82||-0.9144994||-3.0786333||0||0||0|
Previously, we compared RNA counts in BrCa relative to HD. Accordingly, GSEA will provide one report for each class, or in GSEA terminology, each ‘phenotype’:
These reports are the dependencies for the next workflow step ‘Visualize’ where we use software to view and interact with these enriched pathways in a manner that is far more interpretable than a list of output.
We are opting to use the desktop version of GSEA. There are some hoops to jump through in order to get it:
Launch the GSEA application. You will see the GSEA logo splash then the application itself (Figure 5).
Steps in GSEA analysis) provides quick access to the most common actions. The main window displays the
Hometab by default. Each control panel action typically opens a new tab in the main window.
This is the output from the previous step.
This is the human gene set database culled from various sources, generated by the Bader lab (
Let us assume you have installed the GSEA software and placed the input files in an
input directory. Then you should now have a directory structure similar to the following.
... |--- input | | | |--- brca_hd_tep_ranks.rnk | |--- Human_GOBP_AllPathways_no_GO_iea_February_01_2017_symbol.gmt ...
Here we will step through four tasks that we will need to perform within the GSEA software that will generate our enrichment reports.
In this part we get our inputs into GSEA. In the
Steps in GSEA analysis panel (Figure 5, left) click the
Load data button which will bring up a panel in the main window (Figure 6).
Browse for files...then
Choosethe ranked gene list
Browse for files...then
Choosethe gene set database file
Load databutton in the control panel
Steps in GSEA analysis.
Wait a few seconds for the files to load into memory. You will receive a pop-up dialog if the file was successfully loaded. You will also see the files in the
Object cachepanel of the
We now tell GSEA what these files actually represent and tailor the GSEA run accordingly. Bring up the GSEA pre-ranked tab by selecting
Tools -> GseaPreranked from the toolbar and fill in the details for the
Basic fields (Figure 7).
Gene sets database: Click the ellipsis and wait a few moments for a dialog to pop up. Navigate to
Gene matrix (local gmx/gmt)(click arrow along top). Select the gene set database that you obtained above (
Collapse dataset to gene symbols: False
Analysis name: Choose a name for this particular run of GSEA
Save results in this folder: Choose one
Toolsin the menu dropdown, then
GseaPrerankedto bring up the
Run GSEA on a Pre-Ranked gene listtab.
Run in the
Run Gsea on a Pre-Ranked gene list tab. The
GSEA reports panel (Figure 5, bottom left) will show the
Name of this run and the
Running while in progress.
Take a look at the directory you set for
Save results in this folder. The default location is
gsea_home in your user space. You should see something like the following:
... |--- input | | | |--- brca_hd_tep_ranks.rnk | |--- Human_GOBP_AllPathways_no_GO_iea_February_01_2017_symbol.gmt | |--- gsea_home | |--- output | |--- feb06 | |--- tep_brca_vs_hd.GseaPreranked.XXXXXXXXXXXXX | |--- index.html |--- pos_snapshot.html |--- neg_snapshot.html |--- gsea_report_for_na_pos_XXXXXXXXXXXXX.xls |--- gsea_report_for_na_neg_XXXXXXXXXXXXX.xls |--- my_analysis.GseaPreranked.XXXXXXXXXXXXX.rpt ... ...
When the GSEA software has completed its analysis, the
Status inside the
GSEA reports panel will update to
Success ... (Figure 5, bottom left). You may click this link to view the HTML report inside a browser (Figure 8). Alternatively, open the
index.html file located in the GSEA results directory in a browser.
na_posphenotype corresponds to the BrCa class and
na_negrefers to the HD class.
You can read the complete guide to Interpreting GSEA Results that includes a description of the GSEA report.
Below we will briefly highlight a few aspects to take special note of.
There are two sections by this name which refers to those gene sets with positive and negative enrichment scores, respectively (Figure 8). Recall that in the step Process Data we assessed gene expression in the BrCa class relative to the HD class. Consequently, in our case, the
na_pos phenotype corresponds to the BrCa class and
na_neg refers to the HD class. We briefly discuss the meaning of two of the entries in this section below.
Snapshot of the enrichment results. Recall from Box 1, Step 2 that during GSEA, an enrichment score (ES) is calculated for each candidate gene set. The snapshots display the raw data that GSEA uses in its calculations: The ES is the peak vertical axis value of this plot. Roughly speaking, the magnitude of the ES is indicative of the chances that it will be deemed significant. If you look inside your GSEA home folder you will see
neg_snapshot.html. For the
na_pos case, click on the link for
Snapshot of enrichment results to bring up a panel of enrichment score plots; There should be an entry for our old friend the IL-5 signal transduction pathway, which you can click to bring up the full report showing the summary, genes and the enrichment plot (Figure 9).
Detailed enrichment results. These are the end-goal for our workflow step! These files provide a summary report of gene sets enriched in this phenotype. You can find these files inside your GSEA results folder named
gsea_report_for_na_neg_XXXXXXXXXXXXX.xls. The rows of the report include information for each enriched gene set (Table 2).
Table 2. Exceprt of enrichment report for BrCa class
|NAME||SIZE||ES||NES||NOM p-val||FDR q-val||FWER p-val|
|THROMBOXANE A2 RECEPTOR SIGNALING||39||0.84532094||2.175933||0.0||0.0||0.0|
|SIGNALING EVENTS MEDIATED BY FOCAL ADHESION KINASE||48||0.8144367||2.16101||0.0||0.0||0.0|
Listed below are the outputs of this step that will be required as input dependencies for the next steps of the workflow.
We have renamed the enrichment report files and provide them below.