Statistical theory in clustering

Journal of Classification

Abstract

A number of statistical models for forming and evaluating clusters are reviewed. Hierarchical algorithms are evaluated by their ability to discover high density regions in a population, and complete linkage hopelessly fails; the others don't do too well either. Single linkage is at least of mathematical interest because it is related to the minimum spanning tree and percolation. Mixture methods are examined, related to k-means, and the failure of likelihood tests for the number of components is noted. The DIP test for estimating the number of modes in a univariate population measures the distance between the empirical distribution function and the closest unimodal distribution function (or k-modal distribution function when testing for k modes). Its properties are examined and multivariate extensions are proposed. Ultrametric and evolutionary distances on trees are considered briefly.
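The relation between single linkage and the minimum spanning tree noted in the abstract can be illustrated with a minimal sketch. It is not taken from the article: the function name `single_linkage_merges`, the Euclidean distance, and the sample points are assumptions chosen for illustration. The sketch runs Kruskal's algorithm on the complete graph of pairwise distances; each edge it accepts is a minimum spanning tree edge, and the accepted edge lengths, in order, are the heights at which single linkage merges clusters.

```python
from itertools import combinations


def single_linkage_merges(points):
    """Return (height, i, j) for each single-linkage merge.

    Hypothetical helper: Kruskal's algorithm over all pairwise
    Euclidean distances. Every accepted edge joins two previously
    separate clusters, so it is both an MST edge and a merge.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one root per cluster

    def find(i):
        # Follow parent links to the cluster root, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # All pairwise edges, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge connects two different clusters
            parent[ri] = rj
            merges.append((d, i, j))
    return merges


# Assumed toy data: a tight group of three points and a distant pair.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 0.0), (11.0, 0.0)]
merges = single_linkage_merges(points)
heights = [d for d, _, _ in merges]
print(heights)
```

With n points the procedure always accepts n - 1 edges. In this toy data the last merge height is much larger than the others, which is the gap between the two groups.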

Additional information

Research supported by the National Science Foundation Grant No. MCS-8102280.

About this article

Cite this article

Hartigan, J.A. Statistical theory in clustering. Journal of Classification 2, 63–76 (1985). https://doi.org/10.1007/BF01908064

  • DOI: https://doi.org/10.1007/BF01908064
