An interactive Gene Set Enrichment Interface: The Gene Set Matrix.
The first of the two questions was more important to us. We therefore used a recently described algorithm (GSA), which has improved sensitivity and specificity over "classical" GSEA for assessing the differences. We also used a more stringent significance criterion (False Discovery Rate [FDR] of < 10%, compared with the suggested FDR < 25% cutoff in the GSEA software tool).
We compared each computationally defined cluster with every other cluster. The results can be plotted as a matrix of pairwise values, where each entry gives the number of gene sets found to be significantly differentially regulated between two sample clusters.

The comparisons have two aspects:

First, a number of gene sets were identified as up-regulated in the upper sample cluster compared with the lower sample cluster. Second, a number of gene sets were found to be enriched in the lower sample cluster compared with the upper sample cluster.
We then use the Gene Set Analysis Matrix (GSA Matrix) as an interface that enables easy access to the Gene Sets found to be enriched between two clusters.
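The pairwise matrix described above can be illustrated with a minimal Python sketch. It is not the implementation behind the interface: the `Result` record, the function names, the cluster labels and the example gene set names and FDR values are all hypothetical, chosen only to show how directional counts at the FDR < 10% cutoff could be tabulated from per-comparison enrichment results.

```python
from collections import namedtuple

# Hypothetical record for one gene set in one pairwise comparison.
# `up_in` names the cluster in which the set is enriched; `fdr` is the
# false discovery rate reported by the enrichment method.
Result = namedtuple("Result", ["gene_set", "up_in", "fdr"])

FDR_CUTOFF = 0.10  # the FDR < 10% criterion described in the text


def build_gene_set_matrix(clusters, comparisons, fdr_cutoff=FDR_CUTOFF):
    """Count significant gene sets for every ordered pair of clusters.

    matrix[a][b] is the number of gene sets enriched in cluster `a`
    relative to cluster `b`. Diagonal entries are None.
    """
    matrix = {a: {b: (None if a == b else 0) for b in clusters} for a in clusters}
    for (a, b), results in comparisons.items():
        for r in results:
            if r.fdr >= fdr_cutoff:
                continue  # not significant at the chosen cutoff
            if r.up_in not in (a, b):
                raise ValueError(f"{r.gene_set}: {r.up_in} is not in pair {(a, b)}")
            other = b if r.up_in == a else a
            matrix[r.up_in][other] += 1
    return matrix


def enriched_sets(comparisons, a, b, fdr_cutoff=FDR_CUTOFF):
    """Names of gene sets enriched in `a` versus `b`, as a tile click might list them."""
    results = comparisons.get((a, b)) or comparisons.get((b, a)) or []
    return sorted(r.gene_set for r in results if r.up_in == a and r.fdr < fdr_cutoff)


# Made-up example data for three clusters (illustration only).
clusters = ["C1", "C2", "C3"]
comparisons = {
    ("C1", "C2"): [
        Result("KREBS_CYCLE", "C1", 0.02),
        Result("SET_B", "C2", 0.08),
        Result("SET_C", "C2", 0.30),  # fails the FDR cutoff
    ],
    ("C1", "C3"): [Result("SET_D", "C3", 0.05)],
    ("C2", "C3"): [],
}

matrix = build_gene_set_matrix(clusters, comparisons)
for row in clusters:
    print(row, [matrix[row][col] for col in clusters])
```

Each unordered pair of clusters yields two counts, one per direction, which corresponds to the two blocks shown at each intersection of the matrix.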
In this part of the study we compared each computationally defined NMF cluster with every other cluster. This approach addresses two questions:

1. Are there actual biologically meaningful differences between the computationally defined sNMF clusters?
This question is answered by the number of gene sets enriched in one sample cluster versus the other.

2. What is the biological basis of these differences?
This question is answered by the identity and biological implication of each differentially enriched Gene Set, which can tell us which predefined molecular pathways or downstream transcription factor targets are enriched in one cluster versus the other.

In this part of the study we defined "biologically meaningful differences" as a significant differential regulation ("enrichment") of Gene Sets. Gene Sets are either based on experimental evidence or were curated to represent "textbook knowledge" (e.g. the Krebs cycle). Gene Sets are a powerful tool for analyzing microarray datasets with externally validated biological knowledge, which avoids circular reasoning in our study.
Gene sets were taken from two public databases, MSigDB2 and the Stanford Synthetic Gene Collection (~3000 Gene Sets altogether).

The light grey block at the upper intersection of two clusters represents the number of gene sets up-regulated in the upper cluster; the light grey block at the lower intersection of the two clusters represents the number of gene sets up-regulated in the lower cluster compared with the upper cluster.
Figure 1 illustrates this display method. Clicking on a comparison tile returns a detailed list of the pathways enriched in the upper and the lower cluster on the diagonal. The first column in each comparison block lists the Gene Sets enriched in the upper cluster, and the second column lists the Gene Sets enriched in the lower cluster. Clicking on a cluster on the diagonal returns a web page that displays a TreeMap representation of the sparse NMF cluster consensus summary statistics values.
Figure 1: Gene Set Matrix Explanation (panels A, B and C)