Analysts doing cluster analysis sometimes want the data to tell them the optimum number
of clusters. Common "stopping rules" use the Calinski-Harabasz pseudo-F statistic
and Duda-Hart indices, which are based on squared Euclidean distances between cases.
Cluster analysis operates on a pairwise matrix of distances between the objects to be clustered,
which is usually created from the observed variables. However, approaches such as expert
judgement or algorithmic pattern-recognition (as used for instance in sequence analysis)
often output matrices of pairwise similarity or difference whose relationship to the
observed variables is much less direct. Built-in Stata utilities allow calculation of the CH
and DH indices when cluster analysis starts from variables, but not with cluster analysis
that starts from a pairwise distance matrix (unless the distances are squared Euclidean distances
defined on variables which are still available). In this note I present two small Stata
utilities that will calculate the CH and DH statistics from the distance matrix, if the distances
are squared Euclidean. If the distances have another metric, these utilities can be
seen as calculating a pseudo-CH pseudo-F or pseudo-DH statistic, potentially extending
their use to new applications.
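
To illustrate why the distance matrix alone is sufficient in the squared Euclidean case, the following is a minimal sketch in Python. It is not the Stata utilities presented in the note. It relies on the standard identity that the sum of squared deviations from a group centroid equals the sum of the pairwise squared Euclidean distances within the group divided by the group size. The function name `pseudo_f_from_distances` and its arguments are hypothetical, chosen for illustration only.

```python
def pseudo_f_from_distances(d, labels):
    """Calinski-Harabasz pseudo-F from a matrix of squared Euclidean distances.

    d      -- symmetric n x n list of lists of squared Euclidean distances
    labels -- cluster membership of each of the n cases
    (Hypothetical helper for illustration, not the note's Stata code.)
    """
    n = len(labels)
    groups = sorted(set(labels))
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of cases")

    # Total sum of squares: sum of d over unordered pairs, divided by n.
    total_ss = sum(d[i][j] for i in range(n) for j in range(i + 1, n)) / n

    # Within-cluster sum of squares: the same identity applied per cluster.
    within_ss = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        pair_sum = sum(
            d[i][j] for a, i in enumerate(members) for j in members[a + 1:]
        )
        within_ss += pair_sum / len(members)

    between_ss = total_ss - within_ss
    return (between_ss / (k - 1)) / (within_ss / (n - k))


# Four cases on a line, forming two tight and well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
d = [[(a - b) ** 2 for b in points] for a in points]
print(pseudo_f_from_distances(d, [0, 0, 1, 1]))  # 200.0
```

With squared Euclidean input this reproduces the value computed from the variables themselves. Passing a matrix based on any other metric to the same function yields the pseudo-CH quantity described above.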
Publication
University of Limerick Department of Sociology Working Paper Series; WP2016-01