
In this section we discuss the use of Gene Set Enrichment Analysis (GSEA) to identify pathways enriched in ranked gene lists, with a particular emphasis on ordering based on a measure of differential gene expression. We aim to convey how the approach works from an intuitive standpoint before diving into a full discussion of the statistical underpinnings. By the end of this discussion you should:

- Understand what you can get out of GSEA
- Be aware of the advantages over previous methods
- Be aware of the statistical basis of the approach

High-throughput approaches for gene expression measurement can generate a tremendous volume of data but can be unwieldy and easily outstrip intuition. The noisy nature of biological processes compounds the difficulty in interpreting outputs of large-scale experiments. Clearly, there is a need for robust approaches that place data in a wider context that is more biologically meaningful to the scientific question at hand.

To this end, approaches collectively termed ‘Overrepresentation Analyses’ (ORA) were developed to take large lists of genes emerging from experimental results and determine whether there was evidence of enrichment for gene sets grouped on the basis of some shared theme (Khatri 2005, Khatri 2012). In simple terms, ORA approaches aim to distill which pathways are present within a list of genes. A popular source of sets is the Gene Ontology (GO) which groups of genes according to various biological processes and molecular functions.

While tremendously useful for interpreting differential expression output, ORA approaches have three major limitations. First, the inclusion criteria for input gene lists are rather arbitrary and typically involve selecting genes that exceed some user-defined statistical cutoff. This risks excluding potentially important genes that for whatever reason fail to reach statistical significance. Second, ORA approaches use gene names but not any of the rich quantitative information associated with gene expression experiments. In this way, equal importance is assigned to each and every gene. Third, many of the ORA procedures use statistical tests that assume independence among genes: Changes in any gene do not affect or are not influenced by any others. Clearly, this is unrealistic for biological systems and has the effect of making ORA procedures more prone to erroneous discoveries or false positives.

Gene Set Enrichment Analysis (GSEA) is a tool that belongs to a class of second-generation pathway analysis approaches referred to as *significance analysis of function and expression (SAFE)* (Barry 2005). These methods are distinguished from their forerunners in that they make use of entire data sets including quantitative gene expression values or their proxies.

Methods that fall under the SAFE framework use a four-step approach to map gene lists onto pathways:

- Calculate a local (gene-level) statistic
- Calculate a global (gene set or pathway-level) statistic
- Determine significance of the global statistic
- Adjust for multiple testing
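The four-step skeleton above can be sketched in code. The following is a minimal illustration in Python, not any particular SAFE implementation: the expression matrix, gene names and labels are hypothetical, and the local statistic (difference in group means) and global statistic (mean of member scores) are simple placeholders for the GSEA-specific versions discussed below; the fourth step (multiple-testing adjustment) is left to the caller.

```python
import random
from statistics import mean

def local_stat(values, labels):
    """Step 1: local (gene-level) statistic - difference in group means."""
    a = [v for v, l in zip(values, labels) if l == "case"]
    b = [v for v, l in zip(values, labels) if l == "control"]
    return mean(a) - mean(b)

def global_stat(local, gene_set):
    """Step 2: global (gene-set-level) statistic - mean of member scores."""
    return mean(local[g] for g in gene_set)

def safe_pipeline(expression, labels, gene_set, n_perm=200, seed=0):
    """Steps 1-3 of a SAFE-style analysis for a single gene set.

    expression: dict mapping gene name -> list of per-sample values.
    Significance (step 3) is assessed by phenotype permutation.
    """
    rng = random.Random(seed)
    local = {g: local_stat(v, labels) for g, v in expression.items()}
    observed = global_stat(local, gene_set)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]  # randomly swap sample labels
        rng.shuffle(shuffled)
        perm_local = {g: local_stat(v, shuffled) for g, v in expression.items()}
        null.append(global_stat(perm_local, gene_set))
    # empirical p-value: fraction of null statistics at least as extreme
    p = sum(abs(s) >= abs(observed) for s in null) / n_perm
    return observed, p
```
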

GSEA was first described by Mootha *et al.* (Mootha 2003) in an attempt to shed light on the mechanistic basis of Type 2 diabetes mellitus. They reasoned that alterations in gene expression associated with a disease can manifest at the level of biological pathways or co-regulated gene sets, rather than individual genes. The lack of power to detect true signals at the gene level may be a consequence of high-throughput expression measurements which involve heterogeneous samples, modest sample sizes and subtle but nevertheless meaningful alterations in expression levels. Ultimately these confound attempts to derive reproducible associations with a biological state of interest.

In their study, Mootha *et al.* employed microarrays to compare gene expression in human subjects with diabetes mellitus (DM2) to controls with normal glucose tolerance (NGT). On one hand, two statistical analysis approaches failed to identify a single differentially expressed gene between these two groups. On the other hand, GSEA identified oxidative phosphorylation (OXPHOS) as the top scoring gene set down-regulated in DM2; the next four top-scoring gene sets largely overlapped with OXPHOS.

Importantly, the prominence of OXPHOS genes provided the necessary clues for the authors to hypothesize that peroxisome proliferator-activated receptor coactivator 1 (PGC-1 encoded by *PPARGC1*) might play a role in DM2. Indeed, follow-up experiments demonstrated that *PPARGC1* expression was lower in DM2 and over-expression in mouse skeletal cells was sufficient to increase OXPHOS expression. Buoyed by the success of GSEA in this case, the authors went on to suggest the particular niche that the approach might occupy.

Single-gene methods are powerful only when the individual gene effect is marked and the variance is small across individuals, which may not be the case in many disease states. Methods like GSEA are complementary to single-gene approaches and provide a framework with which to examine changes operating at a higher level of biological organization. This may be needed if common, complex disorders typically result from modest variation in the expression or activity of multiple members of a pathway. As gene sets are systematically assembled using functional and genomic approaches, methods such as GSEA will be valuable in detecting coordinated variation in gene function that contributes to common human diseases.

Criticisms concerning the original methodology (Damian 2004) were considered in an updated version of GSEA described in detail by Subramanian *et al.* (Subramanian 2005). Below, we provide a description of the approach with particular emphasis on the protocol we recommend for analyzing gene lists ranked according to differential expression.

In this step, we describe a local or gene-level measure that is used to rank genes, in GSEA terminology, a ‘rank metric’. Previously, we described how to obtain and process RNA-seq datasets into a single list of genes ordered according to a function of each gene’s p-value calculated as part of differential expression testing. In this context, a p-value assigned to a gene can be interpreted as the probability of a difference in gene expression between groups at least as extreme as that observed given no inter-group difference. We simply declare this function of p-values the rank metric (Figure 1).

An example of a rank metric is the product of the sign of the ‘direction’ of the expression change (i.e. ‘up-regulation’ is positive and ‘down-regulation’ is negative) and a decreasing function of the p-value, for example $s_j = \operatorname{sign}(\Delta_j) \cdot \left( -\log_{10} p_j \right)$, where $\Delta_j$ is the expression change for gene $j$.

Under this rank metric, up-regulated genes with relatively small p-values appear at the top of the list and down-regulated genes with small p-values at the bottom.
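A minimal sketch of constructing such a ranked list, assuming per-gene p-values and directions are available from an upstream differential expression analysis (the gene names and values below are hypothetical):

```python
import math

def rank_metric(p_value, direction):
    """Signed rank metric: sign of the expression change times -log10(p).
    Up-regulated genes with small p-values get large positive scores;
    down-regulated genes with small p-values get large negative scores."""
    return math.copysign(-math.log10(p_value), direction)

# hypothetical differential expression results: (gene, p-value, direction)
results = [("TP53", 0.001, +1), ("MYC", 0.20, -1),
           ("CDK1", 0.01, -1), ("EGFR", 0.05, +1)]

# sort by the metric, largest first: up-regulated/small-p at the top,
# down-regulated/small-p at the bottom
ranked = sorted(results, key=lambda r: rank_metric(r[1], r[2]), reverse=True)
```
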

The global statistic is at the very heart of GSEA, and the simple algorithm used to calculate it belies a rather profound statistical method for judging each candidate gene set. We will set aside the technical details for a moment to see how GSEA calculates the enrichment score for a given pathway.

Let’s pick up the process following Figure 1 where we have a list of ranked genes. For illustrative purposes, suppose we wish to test the list for enrichment of a cell cycle pathway (Figure 2).

GSEA considers candidate gene sets one at a time. To calculate the enrichment score, GSEA starts at the top of the ranked gene list. If a gene is a member of the candidate gene set then it adds to a running sum, otherwise, it subtracts. This process is repeated for each gene in the ranked list and the enrichment score for that gene set is equal to the largest absolute value that the running sum achieved.

One aspect of this algorithm we side-stepped is the value that gets added or subtracted. In the original version of GSEA (Mootha 2003) the values were chosen specifically such that the sum over all genes would be zero. In Figure 2, this would be equivalent to the running sum meeting the horizontal axis at the end of the list. In this case, the enrichment score is the Kolmogorov-Smirnov (K-S) statistic, which is used to determine whether the enrichment score is statistically significant. The updated procedure described by Subramanian *et al* (Subramanian 2005) uses a ‘weighted’ version of the procedure whereby the increments to the running sum are proportional to the rank metric for that gene. The reasons for these choices are rather technical and we reserve this for those more mathematically inclined in the following section.
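The running-sum calculation can be sketched as follows. This is a simplified teaching version rather than the GSEA implementation: genes in the set add their weighted metric $|s|^p$ normalized by the total hit weight, genes outside subtract $1/(n - n_{hit})$, so the sum returns to zero at the end of the list; setting $p = 0$ recovers the unweighted (classic) form.

```python
def enrichment_score(ranked_genes, metrics, gene_set, p=1.0):
    """Weighted GSEA-style running sum over a ranked gene list.

    Returns the signed running-sum value of largest magnitude.
    p=0 gives every hit equal weight (classic K-S behaviour).
    """
    n = len(ranked_genes)
    n_hit = sum(g in gene_set for g in ranked_genes)
    n_miss = n - n_hit
    # normalizer so the weighted hit increments sum to one
    norm_hit = sum(abs(s) ** p for g, s in zip(ranked_genes, metrics)
                   if g in gene_set)
    running, best = 0.0, 0.0
    for g, s in zip(ranked_genes, metrics):
        if g in gene_set:
            running += abs(s) ** p / norm_hit   # step up on a hit
        else:
            running -= 1.0 / n_miss             # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best
```
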

Consider a single gene set $G_k$ indexed by $k$. The gene set consists of a list of $n_k$ genes $(g_{k,1}, g_{k,2}, \ldots, g_{k,n_k})$, that is, $G_k = \{g_{k,j} : j = 1, \ldots, n_k\}$. Note that each gene in the set must be represented in the ranked list as you will see shortly.

Define the set of genes outside of the set as $\bar{G}_k = \{g_j : g_j \notin G_k\}$. We summarize the relevant notation up to this point:

**Notation**

- Number of genes: $n$
- Gene rank metric: $s_j$ for gene $j = 1, \ldots, n$
- Ranked gene list: $L = \{g_1, g_2, \ldots, g_n\}$, ordered by $s_j$
- Gene set: $G_k$ where $k \in \{1, 2, \ldots, K\}$
- Genes in gene set: $g_{k,j} \in G_k$ where $j \in \{1, \ldots, n_k\}$
- Genes not in gene set: $g_j \in \bar{G}_k$, of which there are $n - n_k$

Define the enrichment score for a given gene set $G_k$ as

$$ ES_k = \sup_{1 \leq i \leq n} \left| P^{hit}_k(i) - P^{miss}_k(i) \right| \tag{1}\label{eq:1} $$

which is the (weighted) Kolmogorov-Smirnov (K-S) statistic.

Here, $\sup$ is the supremum and the index $i$ represents the position or rank in $L$. The $ES_k$ is the largest difference between the $P$, which are the (weighted) empirical cumulative distribution functions

$$ P^{hit}_k(i) = \frac{1}{N_R} \sum_{j=1}^{i} |s_j|^p \cdot \mathbb{1}_{g_j \in G_k}, \qquad N_R = \sum_{j=1}^{n} |s_j|^p \cdot \mathbb{1}_{g_j \in G_k} \tag{2}\label{eq:2} $$

$$ P^{miss}_k(i) = \frac{1}{n - n_k} \sum_{j=1}^{i} \mathbb{1}_{g_j \notin G_k} \tag{3}\label{eq:3} $$

Here, $\mathbb{1}$ is the indicator function for membership in the specified gene set. We will note here that the enrichment score is a function of the gene set size $n_k$, which will come into play when we discuss significance testing below. To get a better feel for what these equations mean, let’s see what happens when we vary the exponent $p$.

Consider the simple case when $p = 0$. Then the contributions from genes in the gene set do not take into account the rank metric in \eqref{eq:2}. In effect, this gives all genes equal weight.

Under these circumstances, the $ES_k$ calculated using \eqref{eq:1} is the ‘classic’ version of the Kolmogorov-Smirnov (K-S) statistic. It is more common to represent each $P_k$ as an empirical cumulative distribution function (ecdf) of an order statistic $X_{(i)}$. Without loss of generality, define the order statistic as the rank or index of each gene in $L$, that is, $X_{(i)} = i$. For ease of notation, we drop the subscript brackets.

Using the order statistic notation, equation \eqref{eq:2} describing the contribution of genes within the gene set can be expressed as

$$ \hat{F}_{n_k}(X) = \frac{1}{n_k} \sum_{g_j \in G_k} \mathbb{1}_{X_j \leq X} $$

In the same way, we can express equation \eqref{eq:3} for the genes outside the gene set as

$$ \hat{F}_{n - n_k}(X) = \frac{1}{n - n_k} \sum_{g_j \notin G_k} \mathbb{1}_{X_j \leq X} $$

Rather than plotting a single running sum (Figure 2) we can plot its constituent ecdfs $\hat{F}_{n_k}$ and $\hat{F}_{n - n_k}$ on the same axes, such as in Figure 3.

Upon close inspection, it is simple to relate the running sum (bottom Figure 2) with the ecdfs (Figure 3): Increases in the running sum correspond to increments in the ecdf of genes within the gene set (Figure 3, red) and likewise, decreases in the running sum correspond to increments in the ecdf for genes outside (blue). Then, the enrichment score - the largest deviation in the running sum - corresponds to the largest vertical distance between ecdf plots (dotted green).
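This correspondence is easy to verify numerically. A small sketch, using a hypothetical ranked gene list and equal weights (the classic case), that computes the largest vertical distance between the two ecdfs:

```python
def ecdf_gap(ranked_genes, gene_set):
    """Largest vertical distance between the ecdf of ranks for genes
    inside the gene set and the ecdf for genes outside it."""
    inside = [i for i, g in enumerate(ranked_genes, 1) if g in gene_set]
    outside = [i for i, g in enumerate(ranked_genes, 1) if g not in gene_set]
    gap = 0.0
    for i in range(1, len(ranked_genes) + 1):
        f_in = sum(x <= i for x in inside) / len(inside)    # ecdf, in set
        f_out = sum(x <= i for x in outside) / len(outside)  # ecdf, outside
        gap = max(gap, abs(f_in - f_out))
    return gap
```

For a list where the gene set occupies the top ranks, the gap reaches its maximum possible value of one, matching the running sum's peak in the unweighted case.
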

We are now ready to handle the most important question that gets to the very validity of GSEA: *When is the $ES_k$ ‘significant’?*

We have learned that the $ES_k$ can be represented as the biggest distance between component ecdfs. To ask whether a given $ES_k$ is significant is equivalent to asking whether the component ecdfs represent different ‘samples’ of the same cdf and whether their differences are attributable to minor sampling errors. For didactic purposes, we’ll first present a simpler case, where we set up a K-S goodness-of-fit test between a single empirical cdf derived from a sample and a theoretical ‘specified’ cdf. We will also define concepts mentioned already in a more rigorous fashion.

Suppose we have an independent and identically distributed sample $X_1, X_2, \ldots, X_n$ with some unknown cdf $F$ and we wish to test whether it is equal to a specified cdf $F_0$. The null hypothesis is

$$ H_0: F(x) = F_0(x) \quad \text{for all } x $$

Concretely, define the empirical cumulative distribution function (ecdf) that is generated from the data

$$ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{X_i \leq x} \tag{7}\label{eq:7} $$

Then the K-S goodness-of-fit test proceeds by assuming that the ecdf in equation \eqref{eq:7} is an estimate of the specified cdf $F_0(x)$.

**Definition** The **Kolmogorov-Smirnov statistic** $D_n$ is the supremum distance between the two functions

$$ D_n = \sup_x \left| \hat{F}_n(x) - F_0(x) \right| $$

So given the null hypothesis that the ecdf is a sample from the specified cdf, we want to know how $D_n$ behaves. In other words, if we calculate a value of $D_n$ then we wish to know whether it is a discrepancy that is worthy of further investigation. It turns out that there is an analytic solution to this question, that is, there is an equation for the probability distribution of $D_n$ so that we can derive a p-value under our null hypothesis. To get to this point, we’ll need a few theorems, which we present without proof.

**Theorem 1** The **Glivenko-Cantelli** theorem:

$$ \sup_x \left| \hat{F}_n(x) - F(x) \right| \rightarrow 0 \ \text{almost surely as } n \rightarrow \infty $$

The significance of this theorem is subtle: The sample points we use to construct our ecdf, along with *all those points in between*, are sure to be within a specified distance of the target cdf as $n \rightarrow \infty$. This is also termed ‘uniform convergence’.

**Theorem 2** The **distribution-free property** states that the distribution of $D_n$ is the same for all continuous underlying distribution functions $F_0$.

The distribution-free property is a key aspect of the K-S test and a pretty powerful result. It says that regardless of whether $F_0$ is normal, uniform, gamma or even some completely unknown distribution, $D_n$ always has the same distribution. This is particularly useful because we won’t have any idea what the distribution of the ecdfs used to construct our GSEA running sum will be.

The Glivenko-Cantelli and the distribution-free properties are nice, but not very useful in practice. Knowing that an ecdf will converge regardless of the form of the distribution is just the start. What we really want to know is *how* the convergence happens, that is, how $D_n$ is distributed. This leads us right into the next theorem.

**Theorem 3** Define $K$ as the Kolmogorov-Smirnov distribution. Then

$$ \sqrt{n}\, D_n \xrightarrow{d} K \quad \text{where} \quad P(K \leq t) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2} \tag{10}\label{eq:10} $$

Theorem 3 is the culmination of an extensive body of knowledge surrounding empirical process theory that, in simple terms, describes the distribution of random walks between two fixed endpoints and bounds (i.e. the interval $[0, 1]$, as these are the cdf bounds), otherwise known as a ‘Brownian Bridge’.

The practical outcome of these theorems is that they give us an equation from which we can calculate the exact probability of our maximum ecdf deviation from a specified cdf. In the K-S hypothesis testing framework, we set an *a priori* significance level $\alpha$ and calculate the probability of our observed $D_n$ or anything more extreme, denoting this the p-value $P$. If $P \leq \alpha$ then this would suggest a discrepancy between the data and the specified cdf, causing us to doubt the validity of $H_0$.
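This p-value calculation can be sketched directly from the tail of the Kolmogorov distribution in Theorem 3: the two-sided tail probability is $2\sum_{i \geq 1} (-1)^{i-1} e^{-2 i^2 t^2}$ with $t = \sqrt{n}\, D_n$. The function below truncates the infinite series to a finite number of terms and clips the result into $[0, 1]$.

```python
import math

def ks_pvalue(d, n, terms=100):
    """Asymptotic K-S p-value for an observed statistic d from a sample
    of size n, via the (truncated) Kolmogorov distribution tail series."""
    t = math.sqrt(n) * d
    tail = 2.0 * sum((-1) ** (i - 1) * math.exp(-2.0 * i * i * t * t)
                     for i in range(1, terms + 1))
    return max(0.0, min(1.0, tail))  # clip numerical over/undershoot
```

As a sanity check, the classic large-sample critical value $\sqrt{n}\, D_n \approx 1.358$ corresponds to a p-value of about 0.05.
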

Recall that by definition, the enrichment score for any given candidate gene set depends upon its size. In this case, the scores are not identically distributed and must be normalized prior to calculating the p-values. We reserve this discussion for the next section when we broach the topic of null distributions.

We are now ready to describe the setup that occurs in GSEA where we compare ecdfs representing the distribution of genes within and outside the gene set. Suppose we have a sample $X_1, \ldots, X_{n_k}$ with cdf $F(x)$ and a second sample $Y_1, \ldots, Y_{n - n_k}$ with cdf $G(x)$. GSEA is effectively a test of the null hypothesis

$$ H_0: F(x) = G(x) $$

The corresponding ecdfs are $\hat{F}_{n_k}(x)$ and $\hat{G}_{n - n_k}(x)$ as before, and the K-S statistic is

$$ D_{n_k,\, n - n_k} = \sup_x \left| \hat{F}_{n_k}(x) - \hat{G}_{n - n_k}(x) \right| $$

and the rest is the same.

Consider the case when $p = 1$, which is the recommended setting in GSEA. Then the global statistic is weighted by the value of the rank metric in \eqref{eq:2}. Effectively, this renders the enrichment score more sensitive to genes at the top and bottom of the gene list compared to the classic K-S case.

The rationale for departing from the classic K-S statistic was two-fold. First, Subramanian *et al.* noticed an unwanted sensitivity of the enrichment score to genes in the middle of the list.

In the original implementation, the running sum statistic used equal weights at every step, which yielded high scores for the sets clustered near the middle of the ranked list…These sets do not represent biologically relevant correlation with the phenotype.

Second, the authors were unsatisfied with the observation that the unweighted method failed to identify well-known gene expression responses, in particular, the p53 transcriptional program in wild-type cells.

In the examples described in the text, and in many other examples not reported, we found that p = 1 (weighting by the correlation) is a very reasonable choice that allows significant genes with less than perfect coherence, i.e. only a subset of genes in the set are coordinately expressed, to score well.

While weighting the enrichment score calculation with the rank metric can increase the power of the approach to detect valid gene sets, it does have several consequences that must be kept in mind.

Recall that the enrichment score is a function of the size of the gene set, $n_k$. This means that enrichment scores must be normalized for gene set size. In the unweighted case, an analytic expression for the normalization factor can be estimated, but when terms are weighted this is no longer the case and, once again, we defer to bootstrap measures to derive it.

Another trade-off incurred by departing from the classic K-S statistic is that we no longer have an analytic expression for the null distribution of the enrichment scores as described by equation \eqref{eq:10}. This motivates the empirical bootstrap procedure for deriving the null distribution discussed in the following section concerning significance testing.

The above discussion motivates the next section on how GSEA generates null distributions for candidate gene set enrichment scores.

To recap, GSEA uses the set of rank metrics for a gene list to calculate a set of enrichment scores for candidate gene sets. The primary issue at this point is which scores are indicative of enrichment? In hypothesis testing jargon, we wish to determine the statistical significance of each global statistic. We accomplish this by deriving a p-value representing the probability of observing a given enrichment score or anything more extreme. To do this, we require some understanding of how statistics are distributed.

From our discussion of the global statistic, using a weighted enrichment score leaves us without an analytic description of their null distribution. That is, weighting the enrichment score with the local statistic deviates from the classic Kolmogorov-Smirnov statistic which would typically follow a K-S-like distribution.

GSEA employs ‘resampling’ or ‘bootstrap’ methods to derive an empirical sample of the null distribution for the enrichment scores of each gene set. The GSEA software provides a choice of two flavours of permutation methods that underlie the null distribution calculations (Figure 4).

The SAFE and GSEA publications describe a ‘phenotype’ permutation approach to sampling the null distribution (Figure 4A). For a given gene set $G_k$, this amounts to randomly swapping sample labels and recalculating a permutation enrichment score $ES_k^{\pi}$. This process is repeated $\Pi$ times to generate $\mathbf{ES}_k^{\pi} = (ES_k^{\pi_1}, ES_k^{\pi_2}, \ldots, ES_k^{\pi_\Pi})$, a vector of points sufficient to depict the underlying distribution.

From an intuitive standpoint, this generates a sample of the enrichment score distribution under the assumption of no particular association between gene rank and phenotype. In other words, we get a sense of how widely enrichment scores vary and how often when the two groups are effectively the same.

From a statistical standpoint, the authors claim that this provides a more accurate description of the null model.

Importantly, the permutation of class labels preserves gene-gene correlations and, thus, provides a more biologically reasonable assessment of significance than would be obtained by permuting genes.

Indeed, variations on GSEA that purport to simplify the methodology rely on an assumption of gene-gene independence (Irizarry 2009). However, empirical comparisons suggest that ignoring these correlations leads to variance inflation - a wider and flatter null distribution - resulting in a much higher risk of false discovery for gene sets (Tamayo 2016).

The workflow we recommend uses as input a ‘pre-ranked’ list of genes ordered by a function of the p-value for differential expression. In the GSEA software this is called ‘GSEAPreranked’ and precludes phenotype permutation.

Rather, a gene set permutation approach is used to generate the null distribution (Figure 4B). For each gene set $G_k$ of size $n_k$, the same number of genes is randomly selected from the ranked list and the corresponding enrichment score $ES_k^{\pi}$ is calculated. This process is repeated $\Pi$ times to generate the sample null distribution consisting of the vector $\mathbf{ES}_k^{\pi}$.
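Gene set permutation and the resulting empirical p-value can be sketched as follows. This is a simplified illustration using an unweighted running-sum score for brevity; the gene names are hypothetical.

```python
import random

def simple_es(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked list."""
    n_hit = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_set_null(ranked_genes, set_size, n_perm=1000, seed=0):
    """Sample the null by scoring randomly drawn gene sets of equal size."""
    rng = random.Random(seed)
    return [simple_es(ranked_genes, set(rng.sample(ranked_genes, set_size)))
            for _ in range(n_perm)]

def empirical_pvalue(observed, null):
    """Fraction of null scores at least as extreme as the observed score."""
    return sum(abs(s) >= abs(observed) for s in null) / len(null)
```

A gene set concentrated at the very top of the list scores the maximum value of one, and random same-size sets rarely come close, yielding a small empirical p-value.
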

The GSEA team recommends using phenotype permutation whenever possible. This preserves the correlation structure between the genes in the dataset. Gene set permutation creates random gene sets and so disrupts the gene-gene correlations in the data. Thus, gene set permutation provides a relatively weaker (less stringent) assessment of significance.

Once an empirical null distribution of $ES^{\pi}$ values is in hand using either of the permutation methods described above, it is straightforward to calculate the p-value for an enrichment score (Figure 4C). By definition, the p-value is the probability of observing a statistic or anything more extreme under the null hypothesis. In practice, we derive an empirical p-value by calculating the fraction of null values greater than or equal to our observed value $ES_k$.

Note that throughout this discussion we have chosen to depict the GSEA global statistic as positive deviations of the running sum (Figure 2). Of course, there is no particular reason why the deviations could not be negative. Indeed a more accurate description of the results of permutation would see a bimodal distribution representing global statistics that lie above and below zero (Figure 5). Thus, it is more accurate to say that the p-values are calculated from the positive or negative region of the empirical null distribution.

You may wish to review our primer on multiple testing, in particular the section on false discovery rates.

When we test a family of hypotheses, the chance of observing a statistic with a small p-value increases. Statistics whose p-values fall below the significance level can then be erroneously classified as discoveries, that is, Type I errors. Multiple testing procedures attempt to quantify and control for these.

In GSEA, the collection of gene sets interrogated against the observed data are a family of hypotheses. The recommended procedure for quantifying Type I errors is the false discovery rate (FDR). The FDR is defined as the expected value of the fraction of rejected null hypotheses that are in fact true. In practice, GSEA establishes this proportion empirically.

In general, given a specified threshold $t$ of the global statistic, the FDR is the number of true null hypotheses with statistics larger than $t$ divided by the total number of hypotheses (true and false nulls) with statistics larger than $t$. For GSEA, $t$ would be some value of the enrichment score and a true null hypothesis would represent a gene set that is in fact not enriched. In practice, we won’t directly observe the latter, so it is estimated from the values of the empirical null that exceed $t$ (Figure 6).

Suppose there are $m$ observed and $m_0$ empirical null distribution data points. If the numbers of observed and null statistics beyond the threshold $t$ are $a$ and $a_0$, respectively, then the empirical false discovery rate is

$$ \widehat{FDR}(t) = \frac{a_0 / m_0}{a / m} \tag{13}\label{eq:13} $$
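This ratio can be computed directly from the two collections of statistics. A minimal sketch (the input values below are made up; in practice the inputs would be normalized scores, as discussed later):

```python
def empirical_fdr(observed, null, t):
    """Empirical FDR at threshold t: the fraction of null statistics
    beyond t divided by the fraction of observed statistics beyond t."""
    frac_null = sum(s >= t for s in null) / len(null)
    frac_obs = sum(s >= t for s in observed) / len(observed)
    if frac_obs == 0:
        return 0.0  # nothing is called a discovery at this threshold
    return min(1.0, frac_null / frac_obs)
```
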

For a nice primer on empirical methods for FDR estimation, we refer the reader to a well-written piece by William S. Noble entitled ‘How does multiple testing correction work?’ (Noble 2009).

At first glance, this appears to be quite a straightforward exercise: We retrieve our sets of null distributions and observed enrichment statistics and calculate the ratio for a given enrichment score threshold. However, there is one glaring problem: The enrichment statistic depends on gene set size which precludes comparison across gene sets.

That the enrichment score depends on gene set size can be seen from the definition in equations \eqref{eq:1} - \eqref{eq:3}. Consequently, unless we are in an unlikely scenario where all the gene sets we test are of equal size, the enrichment scores will not be identically distributed and hence cannot be directly compared. This precludes the calculation of an empirical false discovery rate.

GSEA solves this problem by applying a transformation to calculated enrichment scores such that they lie on a comparable scale. In particular, this normalized enrichment score (NES) is the enrichment score divided by the expected value (i.e. average) of the corresponding null distribution. Concretely, for each gene set $G_k$ we derive an enrichment score $ES_k$ and a corresponding null distribution $\mathbf{ES}_k^{\pi}$ via gene set permutation.

The normalized enrichment score is the score divided by the expected value of the corresponding null

$$ NES_k = \frac{ES_k}{\mathbb{E}\left[ ES_k^{\pi} \right]} $$

In practice, we will partition the null into positive and negative values and use the average of these to normalize the positive and negative enrichment scores, respectively.
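This sign-wise normalization can be sketched as below (the scores are made up): positive scores are divided by the mean of the positive null values, and negative scores by the magnitude of the mean of the negative null values.

```python
from statistics import mean

def normalize_scores(es, null):
    """Divide an enrichment score and its null distribution by the mean
    of the same-signed null values, yielding scores on a comparable scale."""
    pos = [s for s in null if s >= 0]
    neg = [s for s in null if s < 0]

    def nes(s):
        # choose the normalizer matching the sign of the score
        ref = mean(pos) if s >= 0 else abs(mean(neg))
        return s / ref

    return nes(es), [nes(s) for s in null]
```
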

Remember that, in addition to the gene set enrichment scores (Figure 6, blue), we will also need a normalized null distribution for every gene set (Figure 6, red). Each element of a given null distribution will be determined in a similar fashion

$$ NES_k^{\pi} = \frac{ES_k^{\pi}}{\mathbb{E}\left[ ES_k^{\pi} \right]} $$

We can now restate the null hypothesis

$$ H_0: NES_k \sim F_{NES^{\pi}} $$

where $F_{NES^{\pi}}$ is the null distribution of normalized enrichment scores, derived empirically by pooling the $NES_k^{\pi}$ over all gene sets.

Given the observed and null normalized enrichment scores, the FDR can be calculated using an approach similar to that described in the previous section (equation \eqref{eq:13}).