# RNA-Seq to Enrichment Map - R Notebook

Process platelet RNA-Seq data, identify altered pathways then visualize using Enrichment Map

This is an R Notebook companion to the pathway enrichment analysis workflow 'RNA-Seq to Enrichment Map'. This notebook and file dependencies are available for downloaded via the link (top, right).

## A. Overview

Section TODOs

• Install software and package dependencies

This R Notebook documents a comparison mRNA levels between two conditions and uses this information to identify and then visualize pathway-level differences. In particular, you will use convert RNA-Seq count data into a single ranked list where genes are ordered according to their differential expression. Enriched pathways from this list are distilled using Gene Set Enrichment Analysis (GSEA) then visualized as a Cytoscape Enrichment Map.

### Software requirements

We will be using the following software and packages along the way.

#### GSEA

After registering, login then download the GSEA Java Jar file (e.g. gsea-3.0.jar). For the purposes of this notebook, we will assume the GSEA path is /Users/username/bin/gsea-3.0.jar.

#### Cytoscape

After installing Cytoscape you’ll need to load in the Enrichment Map app. Do this through the Apps -> App Manager in the toolbar.

For subsequent clustering and automated labelling of gene set themes, you should consider installing the following apps:

## B. Sample Study

Section To Dos

• Familiarize yourself with study data and related files

In this workflow we will use expression data collected in a study by Myron G. Best and colleagues 1 whose aim was to differentiate blood platelets from healthy donors (HD) to those diagnosed with a breast cancer (BrCa) towards a proof-of-principle for blood-based cancer diagnosis.

### Rationale

The primary physiological role of platelets is to sense and accumulate at the sites of damaged endothelial tissue and initiate a blood clot to mitigate and vessel leakage2. Previous observations are consistent with the notion that platelets have close contact with cancer cells - a process termed ‘education’ - and may play a role in their metastatic potential3. Diversity in tumour-educated platelets (TEP) could be clinically relevant if they enable discrimination between different stages of a malignancy.

### RNA-Seq Data

Best et al. collected blood platelets from 55 HD and from 39 individuals with BrCa and subjected these samples to RNA-sequencing. Herein, we will be restricting our attention to a subset of 5 patient samples from each class. The metadata for the RNA-Seq files is contained in a file named tep_rnaseq_metadata.txt which lists RNA-Seq data filenames (‘id’) and the ‘class’ of each patient sample (Table I).

Table I. Patient TEP RNA-Seq metadata (tep_rnaseq_metadata.txt)

id class
MGH-BrCa-H-74_htsqct.txt BrCa
MGH-BrCa-H-68_htsqct.txt BrCa
MGH-BrCa-H-66_htsqct.txt BrCa
MGH-BrCa-H-59_htsqct.txt BrCa
MGH-BrCa-H-11_htsqct.txt BrCa
HD-5_htsqct.txt HD
HD-4_htsqct.txt HD
HD-3-1_htsqct.txt HD
HD-2-1_htsqct.txt HD
HD-1_htsqct.txt HD

Each RNA-Seq measurement is represented as a tabular text file that, in this instance, contains 57 736 rows: The first column holds an Ensembl ID representing a region of the human genome and the second column a read count for transcripts mapped to that Ensembl ID. All of this is typically taken care of by your sequencing facility. An excerpt of raw data for sample HD-1_htsqct.txt is shown below in Table II.

Table II. Excerpt of RNA-Seq file HD-1_htsqct.txt

ENSG00000000003 0
ENSG00000000005 0
ENSG00000000419 100
ENSG00000000457 6
ENSG00000000460 11
ENSG00000000938 159

## C. Data Pre-Processing

Section To Dos

• Perform RNA-Seq data gene ID mapping
• Merge RNA-Seq data into a single R object

At the end of this section we wish to have the RNA-Seq data into an R data strcuture that we can more easily pipe into our differential gene expression testing procedures. Because of the numerous formats and states that raw RNA-Seq data can present itself, you may require a customized approach.

### Gene identifier mapping

Our pathway enrichment analysis in GSEA identifies gene sets that are enriched in gene expression data. We will be using candidate gene sets named with the HGNC approved symbol. Since our RNA-Seq data use Ensembl identifiers (Table II) we will need to ‘map’ these to their corresponding HGNC symbol. If it is not possible to perform this mapping, we simply omit the gene. Also, beware of duplicates.

### RNA-Seq data merging

We will take the individual RNA-Seq data files and merge them and any associated metadata into a single object, the SummarizedExperiment4.

A SummarizedExperiment stores experimental assays as a rectangular array whose rows correspond to the (genomic) ranges and whose columns correspond to the different samples. The SummarizedExperiment class stores metadata on the rows and columns. Metadata on the samples usually include experimental or observational covariates … Row metadata comprise the start and end coordinates of each feature and the identifier of the containing polymer, for example, the chromosome name.

We offload the above work onto an R function merge_data and its helpers that perform the required ID-mapping and data merging and returns a single SummarizedExperiment instance.

Let’s perform the ID-mapping and data merging.

The result of the merge is a SummarizedExperiment object named brca_hd_tep_se.

The SummarizedExperiment package provides a number of helpful accessor functions to examine the data.

For instance, we can take a peek at the column (sample) metadata using the SummarizedExperiment::colData function.

class
MGH-BrCa-H-74_htsqct BrCa
MGH-BrCa-H-68_htsqct BrCa
MGH-BrCa-H-66_htsqct BrCa
MGH-BrCa-H-59_htsqct BrCa
MGH-BrCa-H-11_htsqct BrCa
HD-5_htsqct HD
HD-4_htsqct HD
HD-3-1_htsqct HD
HD-2-1_htsqct HD
HD-1_htsqct HD

Similarly, we can peek at an excerpt of the row metadata (i.e. GenomicRanges) using the SummarizedExperiment::rowRanges function.

seqnames start end width strand ensembl_gene_id
RNU6-871P chr12 59450673 59450772 100 - ENSG00000251931
MIR626 chr15 41691585 41691678 94 + ENSG00000207766
RNU6-35P chr4 109992325 109992431 107 + ENSG00000207260
MIR5694 chr14 67441855 67441930 76 - ENSG00000265993
RNU6-1157P chr11 118593988 118594093 106 + ENSG00000207185
RNU4-85P chr3 19996803 19996925 123 + ENSG00000201545

Note that we were only able to successfully map 34 996 unique HGNC symbols from the original set of Ensembl IDs.

Finally, we can peek at the assay data using the SummarizedExperiment::assay function.

MGH-BrCa-H-74_htsqct MGH-BrCa-H-68_htsqct HD-5_htsqct HD-4_htsqct
RNU6-871P 0 0 0 0
MIR626 0 0 0 0
RNU6-35P 0 0 0 0
MIR5694 0 0 0 0
RNU6-1157P 0 0 0 0
RNU4-85P 0 0 0 0

## D. Differential Expression Testing

Section To Dos

It will be convenient to generate the ranked list, expression and class files alongside the RNA-Seq data transformations. We will be using the edgeR Bioconductor package.

### Filtering

RNA species with very low mapped read counts in a small number of samples can be highly variable. Consequently, we choose to ignore these in the search for differential expression. Best et al. use the rule of thumb that ‘genes with less than five (non-normalized) read counts in all samples were excluded from analyses’.

Note that our end product is a very convenient class, the edgeR::DGEList.

The main components of an DGEList object are a matrix counts containing the integer counts, a data.frame samples containing information about the samples or libraries, and a optional data.frame genes containing annotation for the genes or genomic features. The data.frame samples contains a column lib.size for the library size or sequencing depth for each sample. If not specified by the user, the library sizes will be computed from the column sums of the counts. For classic edgeR the data.frame samples must also contain a column group, identifying the group membership of each sample.

With this in mind, let’s take a look at the ‘counts’ component.

MGH-BrCa-H-74_htsqct MGH-BrCa-H-68_htsqct MGH-BrCa-H-11_htsqct HD-5_htsqct
MAP4K5 255 520 854 238
SAV1 167 189 323 34
IRF3 5 5 23 22
ZC3H11A 80 145 181 16
NUP88 24 13 49 45
STXBP5-AS1 67 159 90 46

Similarly, we can peek inside the ‘samples’ component.

group lib.size norm.factors
MGH-BrCa-H-74_htsqct BrCa 1010942 1
MGH-BrCa-H-68_htsqct BrCa 1458995 1
MGH-BrCa-H-66_htsqct BrCa 1104454 1
MGH-BrCa-H-59_htsqct BrCa 375353 1
MGH-BrCa-H-11_htsqct BrCa 1567897 1
HD-5_htsqct HD 978816 1
HD-4_htsqct HD 942170 1
HD-3-1_htsqct HD 1006665 1
HD-2-1_htsqct HD 771351 1
HD-1_htsqct HD 1295886 1

### Normalization

RNA for a sample can be sequenced to varying ‘depths’ in that the total number of sequence reads mapped to a gene for an individual sequencing run is not necessarily constant. What most concerns us is not the absolute counts of an individual RNA species coming out of a sequencing run but rather the proportion. In practical terms, we desire a fair-comparison of RNA counts between samples that takes into account variation in depth. Our recommendation is to use a normalization technique called Trimmed mean of M-values (TMM)5 that effectively standardizes counts between distinct sequencing runs by assuming that most genes are not expected to alter their expression.

group lib.size norm.factors
MGH-BrCa-H-74_htsqct BrCa 1010942 0.8467126
MGH-BrCa-H-68_htsqct BrCa 1458995 0.9171140
MGH-BrCa-H-66_htsqct BrCa 1104454 1.0598800
MGH-BrCa-H-59_htsqct BrCa 375353 0.9411707
MGH-BrCa-H-11_htsqct BrCa 1567897 1.1394811
HD-5_htsqct HD 978816 0.8057214
HD-4_htsqct HD 942170 1.0404548
HD-3-1_htsqct HD 1006665 0.9094940
HD-2-1_htsqct HD 771351 1.1216675
HD-1_htsqct HD 1295886 1.3247572

Note that the TMM-normalized DGEList called brca_hd_tep_tmm_normalized_dge has updated column norm.factors in component ‘samples’. This value will be used to determine an ‘effective library size’, leaving the raw counts untouched.

#### Text file for expression dataset (TXT)

At this point we can generate an expression file of normalized RNA counts where row names are gene symbols and column names are sample IDs. We will use this later to be able to display an expression heatmap for any enriched pathway in our Enrichment Map.

We should have a file brca_hd_tep_tmm_normalized_expression.txt containing the following tab-delimited content:

NAME DESCRIPTION MGH-BrCa-H-74_htsqct MGH-BrCa-H-68_htsqct HD-5_htsqct HD-4_htsqct
MAP4K5 MAP4K5 297.905074 388.620956 301.78036 181.57979
SAV1 SAV1 195.098617 141.248771 43.11148 20.40222
IRF3 IRF3 5.841276 3.736740 27.89566 32.64356
ZC3H11A ZC3H11A 93.460415 108.365459 20.28776 73.44800
NUP88 NUP88 28.038125 9.715524 57.05931 71.40778
STXBP5-AS1 STXBP5-AS1 78.273098 118.828331 58.32730 67.32734

### Differential expression testing

In this step we perform a pair-wise comparison of RNA species counts in BrCa samples relative HD samples. The framework used to determine differential RNA expression is a ‘hypothesis-testing’ technique that entails:

• Declare the null hypothesis of no difference in RNA counts for each gene
• Define a null distribution that describes how RNA counts vary under circumstances where there is no association between RNA counts and class
• Calculate the probability (p-value) of observing a difference in RNA species counts at least as extreme as the one observed assuming the null hypothesis/distribution
• Perform multiple-testing correction

The result of these transformations is a topTags object named brca_hd_tep_de_tested_tt. We can peek inside our result to view the list of genes ranked by p-value from differential expression testing.

logFC logCPM PValue FDR
TRIM58 6.577953 7.818481 0 0
ARHGAP45 6.612944 9.047699 0 0
NCK2 5.679970 7.160183 0 0
GAS2L1 5.946095 7.488178 0 0
SPTB 5.334870 7.041291 0 0
ANKRD9 6.913215 6.214325 0 0

#### Rank list file (RNK)

At this stage, we can generate a rank list file where row names are gene symbols and a single column indicates the rank calculated as a function of their p-value. The larger the magnitude of the positive or negative rank, the rarer such an observation would be under the assumption of no association between class and RNA count.

Let’s peek at the top and bottom of our rank list saved in brca_hd_tep.rnk.

gene rank
TRIM58 34.46062
ARHGAP45 32.99957
NCK2 27.51070
GAS2L1 23.96204
SPTB 23.90996
gene rank
MS4A1 -11.50463
CD79A -11.53814
MDM4 -12.93184
LYZ -15.42705
CD74 -16.58382

#### Categorical class file (CLS)

This file will be used in the Enrichment Map so that we can differentiate (i.e. colour) those pathways ‘up-regulated’ in HD versus BrCa samples. The CLS format contains information about the sample classes (aka ‘condition’, ‘phenotype’) and assigns each sample to one class.

The matrix brca_hd_tep_cls has the following format, assuming N samples:

Total samples Total classes 1
# Class Name Class Name
Sample 1 class Sample 2 class Sample 3 class Sample N class
 [,1]                                      [1,] "10 2 1"                                  [2,] "# HD BrCa"                               [3,] "BrCa BrCa BrCa BrCa BrCa HD HD HD HD HD"


## E. Gene Set Enrichment Analysis

Section To Dos

• Obtain gene set database of candidates
• Run GSEA and obtain enriched gene sets

### Gene set database

A gene set database is a text file that describes names of gene sets and one of more gene IDs that are in that gene set. We will be using the Gene Matric Transposed (GMT) format.

For your convenience, gene set database (GMT) files for human, mouse and rat are curated by the Bader lab on a regular basis from various sources. In this instance we provide Human_GOBP_AllPathways_no_GO_iea_August_01_2017_symbol.gmt which is a database of human gene sets that does not include terms inferred from electronic annotations (iea).

### Run GSEA

GSEA comes as stand-alone Java program. Recall that we asked you to register, login and download the Java jar file to a convenient location (e.g. /Users/username/bin/gsea-3.0.jar). We will run GSEA via R with a set of options. Any of the supplied options can be customized.

In our GSEA run, the following relevant options have been specified:

• rpt_label - name of output folder for this run
• out - directory for results
• gmx - path to the gene set definition (gmt) file
• rnk - path to rank list file (RNK)
• nperm - number of permutations to generate null distribution
• set_min (set_max) - limits on number of genes for candidate gene sets
• scoring_scheme - how to calculate contributions of genes to a gene set’s score
• permute - for GSEA preranked you can only permute via gene_sets
• num - number of results to plot output file for
• rnd_seed - random seed to use
• zip_report - zip to output directory
• gui - when running from the commandline this needs to be false
GSEA can take many minutes to complete. Mileage may vary depending on the settings. Our analysis took 5 minutes on a MacBook Air (Early 2015), 2.2 GHz Intel Core i7 (8GB 16600 MHz DDR3).

In this case, we will have a directory tep_BrCa_HD.Preranked.XXXXXXXXXXXXX inside of the directory declared by in gsea_out that contains reports and figures generated during GSEA.

Look for the following files:

• index.html - This is a GSEA report that summarizes the analysis results
• gsea_report_for_na_pos_XXXXXXXXXXXXX.xls - The collection of gene sets enriched in the BrCa class. The ‘pos’ label originates from the way in which we compared the RNA-Seq classes, that is, BrCa relative to HD
• gsea_report_for_na_neg_XXXXXXXXXXXXX.xls - The collection of gene sets enriched in the HD class

Please refer to the GSEA documentation on ‘Interpreting GSEA Results’ for full details.

## F. Enrichment Map

Section To Dos

• Obtain an Enrichment Map of the enriched pathways

Enrichment Map6 is a visualization analysis tool that organizes gene sets into an information-rich similarity network. The true power of Enrichment Map is that it is a visual display method that reduces complexity by grouping similar gene sets as defined by the number of overlapping genes.

Gene sets are represented as nodes whose radius is proportional to the number of genes. Edges indicate nodes with shared genes, where the thickness of the line is proportional to the degree of overlap. Finally, EM can use node color to represent other dimensions of the data, for example, gene sets enriched in different classes.

### Cytoscape

Recall that we asked you to download and install Cytoscape (3.5.1) along with various apps. Open Cytoscape on your computer as this notebook will run commands to it through a custom package cytoscape-automation/r2cytoscape (0.0.3).

### Create Enrichment Map

We’re ready to declare our options for the Enrichment Map Cytoscape app.

Let’s take a peek at the Enrichment Map.

Often times, the complexity of an Enrichment Map can be reduced even further: Clusters of gene sets can be collapsed and annotated with a representative label gleaned from the characteristics of the individual gene sets.

Finally, let’s get a view of our annotated Enrichment Map.

Please refer to the full ‘RNA-Seq to Enrichment Map’ workflow for details.