Practical SAHN Clustering for Very Large Data Sets and Expensive Distance Metrics
DOI:
https://doi.org/10.7155/jgaa.00338Abstract
Sequential agglomerative hierarchical non-overlapping (SAHN) clustering techniques belong to the classical clustering methods applied heavily in many application domains, e.g., in cheminformatics. Asymptotically optimal SAHN clustering algorithms are known for arbitrary dissimilarity measures, but their quadratic time and space complexity even in the best case still limits the applicability to small data sets. We present a new pivot based heuristic SAHN clustering algorithm exploiting the properties of metric distance measures in order to obtain a bestcase runtime of O(nlogn) for the input size n. Our approach requires only linear space and supports median and centroid linkage. It is especially suitable for expensive distance measures, as it needs only a linear number of exact distance computations. This aspect is demonstrated in our extensive experimental evaluation, where we apply our algorithm to large graph databases in combination with computationally demanding graph distance metrics. We compare our approach to exact state-of-the-art SAHN algorithms in terms of quality and runtime on real-world and synthetic instances including vector and graph data. The evaluations show a subquadratic runtime in practice and a very low memory footprint. Our approach yields high-quality clusterings and is able to rediscover planted cluster structures in synthetic data sets.Downloads
Download data is not yet available.
Downloads
Published
2014-12-01
How to Cite
Kriege, N., Mutzel, P., & Schäfer, T. (2014). Practical SAHN Clustering for Very Large Data Sets and Expensive Distance Metrics. Journal of Graph Algorithms and Applications, 18(4), 577–602. https://doi.org/10.7155/jgaa.00338
Issue
Section
Articles
Categories
License
Copyright (c) 2014 Nils Kriege, Petra Mutzel, Till Schäfer
This work is licensed under a Creative Commons Attribution 4.0 International License.