Publication

Cluster analysis stopping rules in Stata

Date
2016
Abstract
Analysts doing cluster analysis sometimes want the data to tell them the optimum number of clusters. Common "stopping rules" use the Calinski-Harabasz pseudo-F statistic and the Duda-Hart indices, which are based on squared Euclidean distances between cases. Cluster analysis operates on a matrix of pairwise distances between the objects to be clustered, and these distances are usually calculated from the observed variables. However, approaches such as expert judgement or algorithmic pattern recognition (as used, for instance, in sequence analysis) often output matrices of pairwise similarity or difference whose relationship to the observed variables is much less direct. Built-in Stata utilities allow calculation of the CH and DH indices when cluster analysis starts from variables, but not when it starts from a pairwise distance matrix (unless the distances are squared Euclidean distances defined on variables that are still available). In this note I present two small Stata utilities that calculate the CH and DH statistics from the distance matrix, provided the distances are squared Euclidean. If the distances have another metric, these utilities can be seen as calculating a pseudo-CH pseudo-F or pseudo-DH statistic, potentially extending their use to new applications.
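To make concrete why a distance matrix is enough, the following is a minimal Python sketch of the underlying arithmetic. It is not the Stata utilities described in the note. It relies on the standard identity that, for squared Euclidean distances, the sum of squared deviations of a group about its centroid equals the sum of its pairwise distances divided by the group size. The function names (`group_sum_of_squares`, `ch_pseudo_f`, `dh_ratio`) and the toy data are assumptions made for illustration only.

```python
def group_sum_of_squares(dist, members):
    """Sum of squares about the centroid, computed from pairwise distances.

    Assumes dist[i][j] holds the squared Euclidean distance between
    cases i and j; under that assumption the result equals the usual
    within-group sum of squares.
    """
    total = 0.0
    for pos, i in enumerate(members):
        for j in members[pos + 1:]:
            total += dist[i][j]
    return total / len(members)


def ch_pseudo_f(dist, labels):
    """Calinski-Harabasz pseudo-F for a given cluster assignment."""
    n = len(labels)
    groups = {}
    for case, label in enumerate(labels):
        groups.setdefault(label, []).append(case)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than cases")
    total_ss = group_sum_of_squares(dist, list(range(n)))
    within_ss = sum(group_sum_of_squares(dist, m) for m in groups.values())
    between_ss = total_ss - within_ss
    # Undefined (division by zero) if every cluster has zero internal distance.
    return (between_ss / (k - 1)) / (within_ss / (n - k))


def dh_ratio(dist, group_a, group_b):
    """Duda-Hart Je(2)/Je(1) for splitting one cluster into two subgroups."""
    je1 = group_sum_of_squares(dist, list(group_a) + list(group_b))
    je2 = group_sum_of_squares(dist, list(group_a)) + group_sum_of_squares(dist, list(group_b))
    return je2 / je1


# Hypothetical toy data: four cases on a line, forming two tight pairs.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[(p - q) ** 2 for q in points] for p in points]
labels = [0, 0, 1, 1]

print(ch_pseudo_f(dist, labels))       # 200.0
print(dh_ratio(dist, [0, 1], [2, 3]))  # 1/101, about 0.0099
```

If `dist` holds some other dissimilarity, the same functions still run, but the results are only analogues of the CH and DH statistics, in the sense the abstract describes.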
Description
non-peer-reviewed
Publisher
Department of Sociology, University of Limerick
Citation
University of Limerick Department of Sociology Working Paper Series;WP2016-01