To make the concepts in our workflow concrete, we use expression data from a study by Myron G. Best and colleagues (Best 2015). Their aim was to differentiate blood platelets of healthy donors from those of patients diagnosed with a malignancy, as a proof-of-principle for blood-based cancer diagnosis.
By the end of this discussion you should (Figure 1):
This workflow is framed using sample data from a study by Best et al., who aimed to provide a proof-of-principle for platelet-based cancer diagnosis. We briefly review this and related material in this section.
Cancer surveillance would be greatly aided by practical, low-cost alternatives to support early detection, diagnosis, and treatment decisions. Ideally these tools would be non-invasive yet retain the sensitivity and accuracy needed to reliably differentiate between normal and pathological states. Blood-based biomarkers have been aggressively pursued as a means to diagnose malignancies. The components of blood that have been examined include both cell-free molecules (e.g. DNA, RNA, proteins) and immune cells (monocytes, platelets) (Figure 2).
Within the marrow, platelets originate as cytoplasmic fragments of megakaryocytes which ‘bud-off’ into the circulation via shear forces generated by circulating blood. Approximately 1 trillion platelets circulate in an adult human at any one time, making them the second most abundant cell type in blood. The primary physiological role of platelets is to sense and accumulate at sites of damaged endothelial tissue and initiate a blood clot to mitigate vessel leakage (Semple 2011). Disruption of the integrity of the endothelium exposes extracellular molecules that signal adhesion of platelets which, in turn, secrete a host of molecules and cytokines that recruit additional platelets to form the initial hemostatic plug.
Platelets are far from simple vessels of biomolecules. An increasing body of research supports an active role for platelets in modulating innate and adaptive immune responses and a direct involvement in pathologies such as sepsis, atherosclerosis and rheumatoid arthritis (Semple 2011). In particular, the immune regulatory role of platelets arises from receptor-mediated interactions with pathogens, neutrophils and antigen-presenting cells.
Although anuclear, platelets have been shown to be neither inert nor homogeneous. At the transcript-level, circulating platelets possess a functional splicing apparatus that is triggered in response to external activation (Denis 2005). At the protein-level, platelets possess a fully functional translation apparatus and their proteome has been described as a ‘fluid’ of components that rapidly alters depending on the conditions (Lindemann 2007). Thus, platelets are subject to many common aspects of gene regulation in order to sense and respond to their environment.
The pathogenesis of cancer requires the cooperation of a host of stromal and immune cells termed the tumour microenvironment. Indeed, a variety of immune cells that normally suppress cancer are co-opted by tumours to enable evasion of immune surveillance in a process termed ‘education’ (Quail and Joyce 2013). One mechanism of heterotypic signalling involves shedding of exosomes by cancer cells containing pro-tumourigenic and pro-metastatic factors.
Platelets have been implicated in aiding and abetting the metastatic potential of cancer cells through a variety of routes (Gay 2011) (Figure 3). First, they aggregate in order to shield cancer cells that have entered the vasculature from immune cell recognition. Second, they facilitate extravasation by enabling cancer cells to arrest and adhere at points in the vasculature. Third, platelets secrete a variety of molecules that support cancer cell survival and promote endothelial permeability, further promoting extravasation.
The close contact between cancer cells and platelets results in the ‘education’ of the platelets. For example, tumour-associated RNA (e.g. EGFRvIII in brain and PCA3 in prostate) could be detected in platelets (Nilsson 2011), which is consistent with observations that tumour exosomes could be taken up by platelets.
Platelets have intimate contact with cancer cells, take up exosome-derived RNA, demonstrate differential splicing in response to their environment and are abundant in blood. Together, this evidence supports the notion that they may possess a large degree of heterogeneity. Such diversity could be clinically relevant if it enables discrimination between different stages of a malignancy. Best et al. set out to determine just how much diagnostic ‘information’ is contained in platelet transcriptomes.
Best et al. prospectively collected blood platelets from 55 healthy donors (HD) and from 189 treated and untreated patients with cancers at varying stages (Table 1). In particular, 39 of these samples were from patients with breast cancer (BrCa), which will be the focus of our workflow.
Table 1. Summary of Patient Characteristics Adapted from Best et al.
Figure 4 depicts the sample collection and processing scheme. For each patient, approximately 100-500 picograms of total platelet RNA, equivalent to the content of less than a drop of blood, was extracted for sequencing.
Since Best et al. were interested in the discriminatory capacity of transcriptomes, they initially filtered RNA species for those that were intron-spanning and had sufficiently high expression counts (>5) to reduce the amount of noise.
A reduced set of 5 003 protein-coding and non-coding RNAs (excluding Y chromosome and mitochondrial genes) was used in a pair-wise comparison of expression between HD and pan-cancer samples. Across all cancers the authors identified 1 453 and 793 RNAs with increased and decreased representation, respectively (Figure 4D). These differentially expressed genes were sufficient to discriminate HD and cancer-derived platelets (Figure 4E).
Is the information in platelet RNA sufficient to discriminate between healthy donors and those with breast cancer? To determine this, the authors first performed a clustering analysis to extract a subset of RNA species (n = 192) with discriminatory power, then fed these genes into a machine learning algorithm trained to assign the correct category for each sample. In this case, the authors reported a 100% test accuracy in discriminating platelets derived from healthy donors versus breast cancer patients (Figure 5).
That such a plentiful and accessible material could provide the basis of an astonishingly accurate classification scheme is an exciting achievement and lends support for blood-based cancer diagnostics. We can repurpose the same RNA measurements to probe deeper into the biology of the platelets themselves.
Let us revisit two points raised earlier:
While a list of gene expression differences with exquisite discriminatory power for breast cancer diagnosis is useful per se, we can go a step further: What biological processes distinguish platelets from healthy and diseased individuals? In other words, we wish to better understand how those differences in RNA species might underlie various pathways inside a cell. What does the transcriptomic data really mean?
In this workflow step (Box 2), we will walk through the transformation of RNA sequencing data counts generated by Best et al. for platelets from 16 HD and 16 BrCa patients into three output files that are dependencies for later steps:
In this workflow, we will be focusing on a subset of 16 BrCa and 16 HD samples. Table 2 shows an excerpt of RNA-Seq output for one biological sample: the first column indicates the gene identifier (Ensembl) and the second column indicates the mapped sequence read count for that RNA species.
Table 2. Sample high-throughput RNA sequencing counts
We also provide a tab-delimited metadata file (tep_phenotypes.txt) that contains the name (id) of each RNA-Seq file and its corresponding class (Table 3).
This metadata file is something that would not be provided by a sequencing facility but would be simple to create in a text editor or Excel.
Table 3. Contents of metadata file
With our RNA-Seq count files and metadata in hand, the true work involved at this stage is assigning a rank to each RNA species based on some measure of differential RNA expression. Rather than provide a detailed discussion of the concerns surrounding differential expression testing, we provide a thumbnail sketch of the tasks involved in achieving this goal.
We refer the reader to our primer on RNA sequencing analysis for a detailed description of the theory underlying the processing steps described here.
Getting the data into the format that is useful for downstream analysis is an important but often under-appreciated aspect of computational biology research. In this case, there are three tasks that we must accomplish with our data and metadata.
First, we must integrate or ‘merge’ the 32 RNA-Seq files together into a single table, because a single table is more easily loaded into RNA-Seq analysis software packages. Sequencing facilities typically provide an individual file for each RNA-Seq sample similar to that in Table 2, but mileage may vary depending on your particular facility.
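The merge step can be sketched as follows. This is a minimal illustration rather than the workflow's own code: the sample names and counts are made up, the files are held in memory as strings, and treating a gene missing from one file as a zero count is our assumption.

```python
import csv
import io

# Hypothetical per-sample count files (gene identifier, read count), held in
# memory here; real files would be opened from disk instead.
sample_files = {
    "HD_1": "ENSG00000141510\t12\nENSG00000012048\t0\n",
    "BrCa_1": "ENSG00000141510\t30\nENSG00000139618\t7\n",
}


def merge_counts(files):
    """Merge two-column count files into one gene-by-sample table."""
    merged = {}  # gene identifier -> {sample id: count}
    for sample_id, text in files.items():
        for gene, count in csv.reader(io.StringIO(text), delimiter="\t"):
            merged.setdefault(gene, {})[sample_id] = int(count)
    samples = list(files)
    # Assumption: a gene absent from a sample's file is given a count of zero.
    table = {
        gene: [per_sample.get(s, 0) for s in samples]
        for gene, per_sample in sorted(merged.items())
    }
    return samples, table


samples, table = merge_counts(sample_files)
print("gene", *samples, sep="\t")
for gene, counts in table.items():
    print(gene, *counts, sep="\t")
```

Each row of the merged table holds one gene and each column one sample, which is the layout that the later filtering and normalization steps operate on.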
Second, we must perform gene ‘ID mapping’. This entails translating the names of genes/RNA species provided within the raw RNA-Seq files into a desired namespace. This is necessary because the ‘enrichment’ software that distills pathways from gene expression must be able to match RNA counts for a gene with the genes that constitute candidate pathways.
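As a rough sketch of ID mapping, the snippet below re-keys a count table from Ensembl gene identifiers to gene symbols. The three-entry lookup is typed by hand purely for illustration, the function name is hypothetical, and summing counts when two identifiers share a symbol is only one of several reasonable policies.

```python
# Hypothetical lookup from Ensembl gene identifiers to gene symbols. A real
# table would come from an annotation resource rather than being typed by hand.
ensembl_to_symbol = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
    "ENSG00000139618": "BRCA2",
}


def map_ids(counts_by_gene, lookup):
    """Re-key a {gene id: [counts per sample]} table into the target namespace."""
    mapped, unmapped = {}, []
    for gene_id, counts in counts_by_gene.items():
        symbol = lookup.get(gene_id)
        if symbol is None:
            unmapped.append(gene_id)  # no counterpart in the target namespace
        elif symbol in mapped:
            # Two identifiers share one symbol: sum their counts (an assumption).
            mapped[symbol] = [a + b for a, b in zip(mapped[symbol], counts)]
        else:
            mapped[symbol] = list(counts)
    return mapped, unmapped


counts = {
    "ENSG00000141510": [12, 30],
    "ENSG00000139618": [0, 7],
    "ENSG00000999999": [3, 1],  # made-up identifier with no mapping
}
mapped, unmapped = map_ids(counts, ensembl_to_symbol)
print(mapped)
print(unmapped)
```

Returning the unmapped identifiers alongside the mapped table makes it easy to check how many RNA species were lost in translation before moving on.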
Third, our metadata file is sufficient to generate our phenotype file (Table 4), which declares the number of samples and classes (row 1), names the classes (row 2) and then declares the class to which each sample belongs (row 3). Since our metadata contains the name (id) of the sample and the class, this is a simple task.
Table 4. Phenotype output (.cls file)
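The three rows described above can be generated from the metadata with a few lines of code. The sketch below assumes the categorical CLS layout used by GSEA, in which row 1 ends with a constant 1 and row 2 begins with a `#`; the sample names and the `make_cls` helper are hypothetical.

```python
def make_cls(classes_by_sample):
    """Build the three-row phenotype text from ordered (sample id, class) pairs."""
    labels = [cls for _, cls in classes_by_sample]
    class_names = list(dict.fromkeys(labels))  # unique, in order of appearance
    # Row 1: number of samples, number of classes, then a constant 1
    # (assumed GSEA CLS convention).
    row1 = f"{len(labels)} {len(class_names)} 1"
    # Row 2: class names, introduced by '#'.
    row2 = "# " + " ".join(class_names)
    # Row 3: the class of each sample, in the same order as the expression columns.
    row3 = " ".join(labels)
    return "\n".join([row1, row2, row3]) + "\n"


# Hypothetical metadata rows: (id, class).
metadata = [("sample_1", "BrCa"), ("sample_2", "HD"), ("sample_3", "BrCa")]
cls_text = make_cls(metadata)
print(cls_text, end="")
```

The order of labels in row 3 must match the order of sample columns in the expression file, so it is safest to derive both from the same metadata listing.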
Biological processes are inherently noisy (Raser 2005) and the same goes for gene expression. Some of this gene expression noise arises from the stochastic nature of biochemical reactions, which is rather pronounced when dealing with small numbers of molecules.
In practical terms, RNA species with very low mapped read counts in a small number of samples can be highly variable. Consequently, we choose to ignore these in the search for differential expression. Best et al. use the rule of thumb that ‘genes with less than five (non-normalized) read counts in all samples were excluded from analyses’.
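The rule of thumb quoted above translates into a short filter: a gene is excluded only when its raw count is below five in every sample. The gene names and counts below are invented for illustration, and `filter_low_counts` is a hypothetical helper.

```python
def filter_low_counts(counts_by_gene, min_count=5):
    """Drop genes whose raw count is below min_count in every sample."""
    return {
        gene: counts
        for gene, counts in counts_by_gene.items()
        if any(c >= min_count for c in counts)
    }


# Made-up raw counts for three genes across three samples.
raw = {
    "GENE_A": [0, 4, 2],       # below 5 everywhere: excluded
    "GENE_B": [0, 5, 1],       # reaches 5 in one sample: kept
    "GENE_C": [100, 200, 50],  # well expressed: kept
}
kept = filter_low_counts(raw)
print(sorted(kept))
```

Note that the filter is applied to non-normalized counts, as in the quoted rule, so it belongs before any adjustment for sequencing depth.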
RNA for a sample can be sequenced to varying ‘depths’. This means that the total number of sequence reads mapped in an individual sequencing run is not necessarily constant from one sample to the next. The reason for this lies in the nature of next-generation sequencing technologies. Nevertheless, what most concerns us is not the absolute count of an individual RNA species coming out of a sequencing run but rather its proportion of the total. In practical terms, we desire a fair comparison of RNA counts between samples that takes into account variation in depth.
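To illustrate why proportions matter, the sketch below rescales each sample to counts per million mapped reads. This is plain library-size scaling, the simplest way to account for depth, and more refined normalization schemes exist; the two-gene table and the function name are made up for the example.

```python
def counts_per_million(counts_by_gene):
    """Scale each sample's counts by its total depth, expressed per million reads."""
    n_samples = len(next(iter(counts_by_gene.values())))
    # Depth of a sample = sum of its counts over all genes (assumed non-zero).
    depths = [
        sum(counts[i] for counts in counts_by_gene.values())
        for i in range(n_samples)
    ]
    return {
        gene: [c * 1e6 / d for c, d in zip(counts, depths)]
        for gene, counts in counts_by_gene.items()
    }


# The second sample was sequenced twice as deeply as the first, so its raw
# counts are doubled even though the proportions are identical.
raw = {"GENE_A": [10, 20], "GENE_B": [90, 180]}
cpm = counts_per_million(raw)
print(cpm)
```

After scaling, both samples report the same value for each gene, which shows that the apparent two-fold difference in raw counts was entirely an effect of depth.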
Over the years, several approaches have been proposed to account for varying depth in RNA-Seq outputs (Oshlack 2010). Our recommendation is to use a normalization technique called Trimmed mean of M-values (TMM; Robinson & Oshlack 2010) that effectively standardizes counts between distinct sequencing runs by assuming that most genes are not expected to alter their expression.
At this stage, we can generate an expression file of normalized RNA counts where row names are gene symbols and column names are sample IDs (Table 5).
Table 5. Expression output (counts per million mapped reads)
In this stage we perform a pair-wise comparison of RNA species counts in BrCa samples relative to HD samples. The framework used to determine differential RNA expression is a ‘hypothesis-testing’ technique that entails the following:
At this stage, we can generate a rank file where row names are gene symbols and a single column indicates each gene's rank, calculated as a function of its p-value. The larger the magnitude of the positive or negative rank, the rarer such an observation would be under the assumption of no association between class and RNA count (Table 6).
Table 6. Rank output
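One common way to turn a p-value into a signed rank is to take its negative base-10 logarithm and attach the direction of the expression change. The sketch below assumes that convention; the exact function used to build the rank file is not stated here, and the gene names, fold changes and p-values are invented.

```python
import math


def rank_score(log_fold_change, p_value):
    """Signed rank: direction from the fold change, magnitude from -log10(p)."""
    # Assumes 0 < p_value <= 1; a p-value of exactly zero would need special handling.
    sign = 1.0 if log_fold_change >= 0 else -1.0
    return sign * -math.log10(p_value)


# Hypothetical test results: (gene symbol, log fold change BrCa vs HD, p-value).
results = [
    ("GENE_A", 2.1, 1e-6),
    ("GENE_B", -1.5, 1e-3),
    ("GENE_C", 0.2, 0.5),
]
ranks = {gene: rank_score(lfc, p) for gene, lfc, p in results}
ordered = sorted(ranks, key=ranks.get, reverse=True)
for gene in ordered:
    print(gene, round(ranks[gene], 3), sep="\t")
```

Under this scheme, genes with strong evidence of higher counts in BrCa sit at the top of the list, genes with strong evidence of lower counts sit at the bottom, and genes with little evidence either way fall near zero.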
Listed below are the outputs of this step that will be required as input dependencies for the next steps of the workflow.