Abstract
A number of statistical models for forming and evaluating clusters are reviewed. Hierarchical algorithms are evaluated by their ability to discover high density regions in a population, and complete linkage hopelessly fails; the others don't do too well either. Single linkage is at least of mathematical interest because it is related to the minimum spanning tree and percolation. Mixture methods are examined, related to k-means, and the failure of likelihood tests for the number of components is noted. The DIP test for estimating the number of modes in a univariate population measures the distance between the empirical distribution function and the closest unimodal distribution function (or k-modal distribution function when testing for k modes). Its properties are examined and multivariate extensions are proposed. Ultrametric and evolutionary distances on trees are considered briefly.
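The link between single linkage and the minimum spanning tree noted in the abstract can be illustrated with a minimal sketch. It assumes Euclidean distances and uses Kruskal's algorithm: tree edges are added in order of increasing length, and stopping when k components remain gives the single linkage partition into k clusters (equivalently, the full tree with its k-1 longest edges removed). The function name `single_linkage_via_mst` and the sample points are hypothetical and are not taken from the article.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage_via_mst(points, k):
    """Return cluster labels for a single linkage partition into k clusters.

    Illustrative sketch only: runs Kruskal's minimum spanning tree
    construction and stops once k connected components remain.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge belongs to the minimum spanning tree
            parent[ri] = rj
            components -= 1

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
    print(single_linkage_via_mst(sample, 3))
```

The sketch sorts all pairwise distances, so it suits only small samples. It is meant to show the correspondence, not to serve as an efficient implementation.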
Additional information
Research supported by the National Science Foundation Grant No. MCS-8102280.
Cite this article
Hartigan, J.A. Statistical theory in clustering. Journal of Classification 2, 63–76 (1985). https://doi.org/10.1007/BF01908064