An obvious way of clustering larger datasets is to try to extend existing methods so that they can cope with a larger number of objects. The focus here is on clustering a large number of objects rather than a small number of objects in high dimensions.
The CLARA (Clustering Large Applications) algorithm is fully described in chapter 3 of Kaufman and Rousseeuw (1990). Compared to other partitioning methods such as PAM (Partitioning Around Medoids), it can deal with much larger datasets. Internally, this is achieved by drawing sub-datasets of fixed size (sampsize), so that the time and storage requirements become linear in n, the number of observations, rather than quadratic.
Each sub-dataset is partitioned into k clusters using the same algorithm as in pam.
Once k representative objects (medoids) have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid.
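The procedure above can be sketched in a few lines. The following is a minimal, illustrative Python version of the CLARA idea (repeated sub-sampling, a simplified swap-based PAM on each sub-dataset, then assignment of the whole dataset to the nearest medoid); the function names and the greatly simplified PAM step are assumptions for illustration, not the actual code of the R packages.

```python
import random

def manhattan(a, b):
    # Manhattan distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def pam(points, k, dist, iters=10):
    """Very simplified k-medoids on a small sample: start from random
    medoids, then repeatedly swap a medoid with a sample point whenever
    the swap lowers the total dissimilarity."""
    medoids = random.sample(points, k)

    def cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    best = cost(medoids)
    for _ in range(iters):
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                trial = medoids[:i] + [p] + medoids[i + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best = trial, c
                    improved = True
        if not improved:
            break
    return medoids

def clara(data, k, dist, samples=5, sampsize=40, seed=0):
    random.seed(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(samples):                       # several independent sub-datasets
        sub = random.sample(data, min(sampsize, len(data)))
        medoids = pam(sub, k, dist)                # PAM runs on the sub-dataset only
        # candidate medoids are evaluated on the ENTIRE dataset
        total = sum(min(dist(p, m) for m in medoids) for p in data)
        if total < best_cost:
            best_medoids, best_cost = medoids, total
    # assign every observation of the full dataset to its nearest medoid
    labels = [min(range(k), key=lambda j: dist(p, best_medoids[j])) for p in data]
    return best_medoids, labels
```

Because each sub-dataset has fixed size, the expensive PAM step never sees more than sampsize points; only the cheap assignment step touches all n observations, which is what makes the cost linear in n.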
The R package ‘cluster’ (https://cran.r-project.org/web/packages/cluster/cluster.pdf)
computes a “clara” object, a list representing a clustering of the data into k clusters.
The distance metrics currently available for computing dissimilarities between observations are “euclidean” and “manhattan”.
The Euclidean distance is the square root of the sum of squared differences, while the Manhattan distance is the sum of absolute differences.
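For concreteness, the two formulas can be written out directly; this is a plain Python illustration of the definitions, not the package's internal code.

```python
import math

def euclidean(a, b):
    # square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```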
The R package ‘ClusterR’ (https://cran.r-project.org/web/packages/ClusterR/ClusterR.pdf)
provides the Clara_Medoids function, which supports a wider range of distance metrics for computing dissimilarities between observations: “euclidean”, “manhattan”, “chebyshev”,
“canberra”, “braycurtis”, “pearson_correlation”, “simple_matching_coefficient”,
“minkowski”, “hamming”, “jaccard_coefficient”, “Rao_coefficient” and “mahalanobis”.
It also accepts a threads argument specifying the number of cores to run in parallel; OpenMP is used to parallelize the independent sample draws.
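The reason this parallelization is straightforward is that each sample draw is independent of the others: every draw samples, clusters, and scores on its own, and only a final reduction picks the best result. The toy Python sketch below illustrates that structure with a thread pool over 1-D data, using random candidate medoids as a stand-in for the real PAM step; ClusterR's actual parallelism happens in C++ via OpenMP, and everything named here is an assumption for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def one_draw(args):
    """One independent CLARA-style draw: sample a sub-dataset, pick
    candidate medoids (random here, as a stand-in for PAM), and score
    them against the entire dataset."""
    data, k, sampsize, seed = args
    rng = random.Random(seed)                 # per-draw RNG keeps draws independent
    sub = rng.sample(data, sampsize)
    medoids = rng.sample(sub, k)
    cost = sum(min(abs(p - m) for m in medoids) for p in data)  # 1-D manhattan
    return cost, medoids

def parallel_clara(data, k, samples=8, sampsize=20, workers=4):
    tasks = [(data, k, sampsize, s) for s in range(samples)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(one_draw, tasks))  # draws run concurrently
    return min(results)                          # best (cost, medoids) across draws
```

Because no draw reads another draw's state, the work splits cleanly across cores; only the final min over the per-draw results is sequential.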