Every microarray experiment faces a problem often called the “curse of dimensionality”. In brief, each expression measurement in a given sample can formally be regarded as one dimension, so the data live in a space with as many dimensions as there are measurements. In our specific case, this translates into a ~23,000-dimensional space.

At the same time, however, another general rule applies, sometimes referred to as “concentration of measure”, which could loosely be formulated as “there are only a few things that really matter” (to a particular biological question, one may add).
Put simply, the above-mentioned dimensions can be condensed into a few components that contain meaningful information with regard to a specific biological problem. This dimension reduction can be achieved by various methods; our choice was non-negative matrix factorization (NMF).

Without going too much into the details, we would like to highlight some features and concepts that render this algorithm, in our opinion, an appropriate tool for our study and discuss the “readouts” (summary statistics, charts and graphs) that communicate the findings to the reader.

NMF has the ability to detect and weigh parts within a given dataset (and to “formulate” these as a W-matrix). Multiplying the W-matrix with a second, also inferred matrix (the H-matrix) approximately reconstructs the original dataset. This parts-based decomposition of the “big” picture is an important difference from other dimension-reduction algorithms.

The fact that parts-based decomposition is possible in NMF allows for subsystems to be the driving force for classification of samples within the consensus clustering framework. In the context of microarray studies, subsystems or dimensions are sometimes also referred to as “metagenes”.
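To make the W-matrix and H-matrix less abstract, here is a minimal sketch of an NMF decomposition on a toy dataset. It uses the multiplicative update rules described by Lee & Seung for the squared-error objective, which is not necessarily the exact variant used in our study. The toy values, the function name `nmf` and its parameters (`k`, `n_iter`, `seed`, `eps`) are assumptions made for illustration only.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (genes x samples) as W @ H.

    W (genes x k) holds the "parts" (metagenes); H (k x samples) holds
    how strongly each sample uses each part. Hypothetical helper.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = V.shape
    # Random, strictly positive starting point
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data (made up): 6 genes x 4 samples built from two hidden "programs"
programs = np.array([[5, 0], [4, 0], [3, 0], [0, 6], [0, 5], [1, 1]], dtype=float)
usage = np.array([[1, 1, 0, 0.2], [0, 0.1, 1, 1]], dtype=float)
V = programs @ usage

W, H = nmf(V, k=2)
# Relative reconstruction error of W @ H against the original data
error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", error)
```

The columns of `W` play the role of the metagenes, and each column of `H` describes one sample in terms of those metagenes. A different `seed` can yield a different, equally valid decomposition.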

How could this rather abstract concept be explained with a real-life-like metaphor?

Assume that we have different types of “personal computers”, which can run “software”, such as Word, Excel, games such as Pac Man, an email client and a web browser.

The computer “code” executed on these machines could be compared to mRNA transcripts, which are “uploaded” into the computer’s “RAM” to be accessible and worked with by the “user” (the “cell type” in this picture). With transcriptional profiling, we are now studying globally what is present in the “RAM”, without really knowing exactly what the “code” actually means, or what the “user” is up to.

Assume also that there are different types of “users” who frequently run the same “software” (e.g. a virus scanner or a web browser) but, most importantly, also different types of “software”. “Software” running in a cell could be transcriptional patterns that encode functions of specific cell types, such as pluripotency or neuronal signal transmission. Another term for this “software” concept could be “dimensions” in the above-mentioned machine-learning context.
In our “computer” metaphor, “software” could be games or office programs, which can be related to the “user’s identity” (office worker, Pac Man competition specialist) - or the cell’s function.

Additionally, these programs are not exclusive to a certain type of “user”; a computer game aficionado might still run Word, because she has to write texts for her class assignments. One should also keep in mind that the same underlying “software code” or “commands” can be responsible for different types of “programs”; the same “commands” may be invoked for a computer game as well as for an office application, and only the number of command calls may differ from one application to the other.

To translate this picture into our real situation: the cell (“user”) runs transcriptional “programs” that enable it to function as a neuron or a fibroblast. There is considerable overlap among the commonly used “commands” (mRNA). The diverse kinds of “software” (transcriptional profiles consisting of coordinated “commands”) may co-exist in the same “computer”.

To add even more complexity: one does not know a priori what specific “software” (transcriptional profile) is executed on a particular “computer” or even what the true identity of the “user” (cell preparation) is.
The only thing we can approximate is which “commands” (mRNA) are invoked and how many times a single command is called upon, relative to other “users” (cell preparations).
Our ultimate goal is to identify the “user type” (cell type) by patterns of “commands” (mRNA) that are shared among similar “user types” and that also differentiate one “user type” (e.g. fibroblasts) from other “user types” (e.g. neurons). Since our means of characterizing a “user type” a priori are limited, we also want to find out which and how many “user types” (cell types) can be reliably discerned from each other based on the “command” structure alone.

As mentioned above, NMF has the ability to analyse the subsystems (“software”) by matrix decomposition and to discern transcriptional patterns (“the programs consisting of coordinately invoked commands”) from each other. The “concentration of measure” in the datasets is the feature that enables this ability. In microarray studies, this is achieved by collecting as comprehensive a set of “commands” as possible (mRNA transcripts from almost every known gene) in every given cell preparation. If “there are only a few things that really matter” and the importance of single “commands” cannot be known a priori, one has to measure as many different “commands” as possible in each sample. This also explains why only high-throughput methods (such as microarray studies), and not marker-based approaches, can achieve such a focus on the dimensions “that really matter” without preconceptions or assumptions that might skew the analysis.

Would a single “command” (gene) with a present/not present pattern be able to discern one “user” (cell type) from another?

This is highly unlikely, since very few genes are both known and expressed exclusively in a single cell type, so that they could be used to differentiate this cell type from every other cell preparation. This is a strong argument against “marker gene approaches” in our “prospective” study with its very broad context, for which we collected every stem cell preparation we could get hold of from collaborators.

Finally, no dimension reduction method will give black-and-white answers (“This sample is a neuron!” or “This is a fibroblast!” or “This is a stem cell!”).

As discussed above, dimension reduction methods such as NMF may point towards a few “programs” that can discern one group of samples from another, but one has to keep in mind that NMF is not restricted to a single transcription pattern that has been deemed “important” a priori.

The second aspect of the unbiased nature of this approach is that NMF won’t group “users” according to common preconceptions about what might be an “appropriate” sample group (like all hESC together) and what might not (e.g. samples grouped in non-intuitive ways).
Since many matrix decomposition algorithms start their computations from randomly seeded points, they will not necessarily find the same “important” dimensions over and over again. Since we have to assume that different cell types (“users”) are running several overlapping “programs”, which at the same time may discern different cell types from each other, we will get a somewhat different dimensionality “snapshot” each time NMF decomposes the data matrix into a W- and an H-matrix. The additional information derived from this shifting focus in each run of NMF is put to use by the consensus clustering framework, which attaches a confidence measure to cell type groupings based on many randomly sampled dimensions.

Moreover, these methods are blind to the meaning of the “commands”; they only analyze structures contained in the data (mainly co-expression patterns). At the current stage of biological knowledge this is an advantage, because we cannot claim to know the full meaning of any of the mRNA “commands” in all contexts.

Finally, a voting mechanism is employed to condense multiple NMF runs (300 runs in our manuscript) into sample clusters. Like every other voting mechanism, consensus clustering uses a kind of “majority vote” that interprets a distribution of “votes” to decide to which sample cluster a specific sample most likely belongs. This can result in a clear-cut decision, but it can also result in an ambiguous majority vote that needs further inspection and consideration.

Consensus clustering mainly focuses on evaluating the stability of discovered clusters with a subsampling approach. The “basic assumption of this method is intuitively simple: if the data represent a sample of items drawn from distinct sub-populations, and if we were to observe a different sample drawn from the same sub-populations, the induced cluster composition and number should not be radically different. Therefore, the more the attained clusters are robust to sampling variability, the more we can be confident that these clusters represent real structure.” (Monti et al. 2003)

If multiple dimension searches consistently group the same samples together, the resulting sample cluster (in our methodology, a first step towards cell type identification) achieves a high cluster consensus score and its members achieve high sample consensus scores, which makes it very likely that the cluster represents a stable sample group.
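The voting idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not the pipeline used in our manuscript: it varies only the random starting point of NMF (no subsampling), assigns each sample to the metagene with the largest H-matrix entry after rescaling the W-matrix columns to unit length, and uses 20 runs instead of 300. The toy data and the helper names `nmf`, `consensus_matrix` and `cluster_consensus` are hypothetical.

```python
import numpy as np

def nmf(V, k, seed, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF; returns W with unit-length columns and rescaled H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so that H entries are comparable across metagenes (assumption)
    norms = np.linalg.norm(W, axis=0) + eps
    return W / norms, H * norms[:, None]

def consensus_matrix(V, k, n_runs=20):
    """Fraction of runs in which each pair of samples lands in the same cluster."""
    n_samples = V.shape[1]
    consensus = np.zeros((n_samples, n_samples))
    for seed in range(n_runs):
        _, H = nmf(V, k, seed=seed)
        labels = H.argmax(axis=0)  # each sample "votes" for its dominant metagene
        consensus += labels[:, None] == labels[None, :]
    return consensus / n_runs

def cluster_consensus(C, members):
    """Mean pairwise consensus among the members of one cluster."""
    sub = C[np.ix_(members, members)]
    off_diagonal = sub[~np.eye(len(members), dtype=bool)]
    return off_diagonal.mean()

# Toy data (made up): 20 genes x 6 samples, two groups of three samples each
rng = np.random.default_rng(42)
group_a = np.r_[np.full(10, 8.0), np.full(10, 1.0)]
group_b = np.r_[np.full(10, 1.0), np.full(10, 8.0)]
V = np.column_stack([group_a] * 3 + [group_b] * 3) + rng.random((20, 6))

C = consensus_matrix(V, k=2)
print(np.round(C, 2))
print("cluster consensus, samples 0-2:", cluster_consensus(C, [0, 1, 2]))
```

Entries of `C` close to 1 mean that two samples were grouped together in almost every run, entries close to 0 that they almost never were. Intermediate values correspond to the ambiguous votes that need further inspection.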

"The model is diagrammed as a network depicting how the visible variables v1,...,vn in the bottom layer of nodes are generated from the hidden variables h1,..., hr in the top layer of nodes. According to the model, the visible variables vi are generated from a probability distribution with mean glyphaWia ha. In the network diagram, the influence of h a on vi is represented by a connection with strength Wia. In the application to facial images, the visible variables are the image pixels, whereas the hidden variables contain the parts-based encoding. For fixed a, the connection strengths W1a,...,Wna constitute a specific basis image (right middle) which is combined with other basis images to represent a whole facial image (right bottom)." Lee & Seung 1999 Nature 401, 788-791
Can Machines Learn?

- grasping Biology in high-dimensional Spaces
A microarray experiment with up to 150 million single data points on a single chip
The consensus matrix is a commonly used display for showing pairwise similarities and dissimilarities between items. The examples are from Wikipedia [1] and show the same data in three different ways: A is the 2D distribution of objects on a piece of paper, B lists the measured distances between these objects, and C plots the distances in the style of a consensus matrix, with dark colors representing smaller distances and lighter colors representing larger distances. The display in C shows, for example, that object “a” is remote from all other objects, but that “b” and “d” are the closest to “a” among these objects.
[1] http://en.wikipedia.org/wiki/Distance_matrix
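As a small illustration of how such a pairwise matrix is derived from object positions, the sketch below computes Euclidean distances between three points. The coordinates and object labels are made up for this example and are not the values from the Wikipedia figure.

```python
import numpy as np

# Hypothetical 2D positions of three objects (not the Wikipedia data)
labels = ["a", "b", "c"]
points = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]])

# Pairwise differences via broadcasting, then Euclidean length of each difference
diff = points[:, None, :] - points[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Print the symmetric distance matrix row by row
for name, row in zip(labels, D):
    print(name, np.round(row, 2))
```

The matrix is symmetric with zeros on the diagonal. Plotting it with a color scale gives the kind of display described above.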