
Model selection by the cophenetic coefficient was proposed by Monti et al. 2003 and used by Brunet et al. 2004 for NMF in the consensus clustering framework.
Although it does not provide "binary" decisions on the overall clustering result (e.g. correct vs. false), the cophenetic coefficient suggests which models may be better than others. In this case, k4 and k12 both appear to be reasonable overall clustering results. k12 was chosen for further downstream analysis, since the resulting clusters resembled our current concepts of cell class identities more closely than the more comprehensive k4 clusters, although it will be important to understand on what basis these four sample clusters have been aggregated by sNMF.
More technically, the cophenetic correlation coefficient can be defined for our specific case as follows:
The cophenetic correlation coefficient is the Pearson correlation coefficient between pairwise distances of a set of objects and their cophenetic distances, which are derived from a hierarchical clustering. The cophenetic distance of two objects is defined as the intergroup dissimilarity at which the two observations are first combined into a single cluster[1,2].
A high cophenetic correlation coefficient indicates that the clustering dendrogram reflects the original distances well. In our setting, this implies that segregating the data into k groups is well supported by the co-occurrence data of the consensus clustering.
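As an illustration of the definition above, the following is a minimal sketch of how the coefficient could be computed from a consensus matrix with SciPy. Several details are assumptions of this sketch and are not taken from the analysis described here: the function name cophenetic_coefficient is hypothetical, co-occurrence frequencies are converted to distances as 1 - consensus, average linkage is used for the hierarchical clustering, and the two small matrices are made-up toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform


def cophenetic_coefficient(consensus):
    """Cophenetic correlation of a symmetric consensus (co-occurrence) matrix.

    Hypothetical helper for illustration only.
    """
    consensus = np.asarray(consensus, dtype=float)
    # Assumption: co-occurrence frequency 1 means "always clustered together",
    # so 1 - consensus serves as the pairwise distance.
    distance = 1.0 - consensus
    np.fill_diagonal(distance, 0.0)
    # Condensed vector of pairwise distances, the form SciPy expects.
    condensed = squareform(distance, checks=False)
    # Assumption: average linkage for the hierarchical clustering.
    tree = linkage(condensed, method="average")
    # With the original distances supplied, cophenet returns the Pearson
    # correlation and the cophenetic distances.
    coefficient, _ = cophenet(tree, condensed)
    return coefficient


# Toy data: two perfectly separated groups of two samples each.
clean = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Toy data: the same grouping, but with less stable co-occurrence.
fuzzy = np.array([
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.5, 0.1],
    [0.2, 0.5, 1.0, 0.8],
    [0.4, 0.1, 0.8, 1.0],
])

print(cophenetic_coefficient(clean))
print(cophenetic_coefficient(fuzzy))
```

For the perfectly separated toy matrix the dendrogram reproduces the distances exactly, so the coefficient is 1 (up to floating-point rounding), while the less stable matrix yields a lower value. Comparing such values across different k is the kind of ranking described above.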
[1] Farris, J. S. On the Cophenetic Correlation Coefficient. Systematic Zoology, 1969, 18, 279-285.
[2] R Development Core Team. R: A Language and Environment for Statistical Computing. 2007. Help files.