Article

Hierarchical Clustering with Contiguity Constraint in R

Journal

JOURNAL OF STATISTICAL SOFTWARE
Volume 103, Issue 7, Pages 1-26

Publisher

JOURNAL STATISTICAL SOFTWARE
DOI: 10.18637/jss.v103.i07

Keywords

R; hclust; constrained clustering; space; chronological clustering; Lance and Williams algorithm

Funding

  1. Natural Sciences and Engineering Research Council of Canada (NSERC) [7738]


This article presents a new implementation of hierarchical clustering for the R language that allows the application of spatial or temporal contiguity constraints. The implementation is efficient, but limited by input/output access when dealing with large problems.
This article presents a new implementation of hierarchical clustering for the R language that allows one to apply spatial or temporal contiguity constraints during the clustering process. The need for a contiguity constraint arises, for instance, when one wants to partition a map into domains with similar physical conditions, identify discontinuities in time series, or group regional administrative units with respect to their performance. To increase computational efficiency, we programmed the core functions in plain C. The result is a new R function, constr.hclust, which is distributed in package adespatial. The program implements the general agglomerative hierarchical clustering algorithm described by Lance and Williams (1966; 1967), with the particularity that only clusters that are contiguous in geographic space or along time are allowed to fuse at any given step. Information about spatial contiguity is provided by a connection network among sites, with edges describing the links between connected sites. Clustering with a temporal contiguity constraint is also known as chronological clustering; information on temporal contiguity can be provided implicitly as the rank positions of observations in the time series. The implementation was modeled on that of the hierarchical clustering function hclust of the standard R package stats (R Core Team 2022). We transcribed that function from Fortran to C and added the functionality to apply constraints when running the function. The implementation is efficient; it is limited mainly by input/output access, as massive amounts of memory are potentially needed to store copies of the dissimilarity matrix and update its elements when analyzing large problems. We also provide R code for plotting results for different numbers of clusters.
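
The constrained agglomeration described in the abstract can be illustrated with a minimal sketch in Python. This is not the constr.hclust code from adespatial, nor a transcription of hclust; it is an independent illustration under stated assumptions. The function names (lw_coefficients, constrained_agglomeration) and the data are hypothetical, only three classical Lance and Williams coefficient sets (single, complete, average) are included, and the dissimilarities are kept in a plain dictionary for readability rather than efficiency. A chronological constraint is expressed by linking each observation to its successor in the series.

```python
import itertools


def lw_coefficients(method, ni, nj, nk):
    """Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma).

    Hypothetical helper: only three classical linkage methods are sketched.
    nk is unused by these methods but kept for the general signature.
    """
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":  # UPGMA
        return ni / (ni + nj), nj / (ni + nj), 0.0, 0.0
    raise ValueError(f"unsupported method: {method}")


def constrained_agglomeration(dist, links, method="average"):
    """Fuse clusters step by step, allowing only contiguous clusters to fuse.

    dist  : n x n symmetric matrix (list of lists) of dissimilarities
    links : iterable of (i, j) index pairs, the edges of the contiguity network
    Returns a list of (members_a, members_b, fusion_level) tuples.
    """
    n = len(dist)
    # Working dissimilarities between active clusters, keyed by unordered pair.
    d = {frozenset(p): float(dist[p[0]][p[1]])
         for p in itertools.combinations(range(n), 2)}
    size = {i: 1 for i in range(n)}
    members = {i: (i,) for i in range(n)}
    adjacent = {frozenset(e) for e in links if e[0] != e[1]}
    merges = []
    next_id = n
    while len(size) > 1 and adjacent:
        # Candidate fusions are restricted to contiguous pairs of clusters.
        pair = min(adjacent, key=lambda p: (d[p], sorted(p)))
        i, j = sorted(pair)
        dij = d[pair]
        merges.append((members[i], members[j], dij))
        new = next_id
        next_id += 1
        # Lance-Williams update of the dissimilarities to the new cluster.
        for k in size:
            if k in (i, j):
                continue
            dik = d[frozenset((i, k))]
            djk = d[frozenset((j, k))]
            ai, aj, b, g = lw_coefficients(method, size[i], size[j], size[k])
            d[frozenset((new, k))] = (ai * dik + aj * djk + b * dij
                                      + g * abs(dik - djk))
        # The new cluster inherits every link held by either fused cluster.
        adjacent = {frozenset(new if c in (i, j) else c for c in p)
                    for p in adjacent if p != pair}
        size[new] = size[i] + size[j]
        members[new] = tuple(sorted(members[i] + members[j]))
        del size[i], size[j], members[i], members[j]
    return merges


# Chronological clustering of a short, made-up series: each observation is
# linked only to its successor, so clusters are runs of consecutive positions.
values = [0, 1, 10, 12, 2]
dist = [[abs(a - b) for b in values] for a in values]
chain = [(t, t + 1) for t in range(len(values) - 1)]
for a, b, level in constrained_agglomeration(dist, chain):
    print(a, b, level)
```

In this toy series the last observation has a value close to the first two, but under the chronological constraint it can only fuse with the neighbouring cluster in time. Passing all pairs of indices as links removes the constraint and gives ordinary agglomerative clustering for comparison. Fusion levels produced under a constraint are not guaranteed to increase monotonically.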
