An obvious way of clustering larger datasets is to extend existing methods so that they can cope with a larger number of objects. The focus here is on clustering a large number of objects rather than a small number of objects in high dimensions.

The CLARA (Clustering for Large Applications) algorithm is fully described in chapter 3 of Kaufman and Rousseeuw (1990). Compared to other partitioning methods such as PAM (Partitioning Around Medoids), it can deal with much larger datasets. Internally, this is achieved by considering sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in n rather than quadratic.
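To make the linear-versus-quadratic point concrete, here is a rough back-of-the-envelope count of dissimilarity evaluations. It is only a sketch: the function names and the values chosen for n, k, sampsize and the number of samples are illustrative assumptions, not figures from Kaufman and Rousseeuw or from any package.

```python
def full_matrix_dissimilarities(n):
    """Pairwise dissimilarities needed for a full n x n matrix (upper triangle)."""
    return n * (n - 1) // 2


def clara_dissimilarities(n, k, sampsize, samples):
    """Rough count for a CLARA-style run (illustrative assumption):
    per sample, one sampsize x sampsize matrix plus n * k distances
    to assign every object to its nearest medoid."""
    per_sample = sampsize * (sampsize - 1) // 2 + n * k
    return samples * per_sample


n = 100_000
print(full_matrix_dissimilarities(n))                          # 4999950000
print(clara_dissimilarities(n, k=5, sampsize=50, samples=5))   # 2506125
```

With sampsize held fixed, doubling n roughly doubles the second count, whereas the first count roughly quadruples.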

Each sub-dataset is partitioned into k clusters using the same algorithm as in pam.

Once k representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid.
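The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the code used by any R package: the helper names are made up for this example, the PAM step here starts from random medoids and only performs greedy swaps (the real algorithm has a more careful build phase), and the best sample is chosen by total dissimilarity over the whole dataset.

```python
import random


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids, dist):
    """Sum of each point's dissimilarity to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam_swap(sample, k, dist, rng):
    """Simplified PAM (assumption): random start, then greedy swaps
    until no single medoid/non-medoid swap lowers the cost."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(sample, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def clara(points, k, sampsize, samples, dist=manhattan, seed=0):
    """Draw several fixed-size samples, cluster each one, and keep the
    medoids that give the lowest cost on the entire dataset."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(samples):
        sample = rng.sample(points, min(sampsize, len(points)))
        medoids = pam_swap(sample, k, dist, rng)
        cost = total_cost(points, medoids, dist)  # evaluated on all points
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every observation of the full dataset to its nearest medoid.
    labels = [min(range(k), key=lambda j: dist(p, best_medoids[j]))
              for p in points]
    return best_medoids, labels


# Two well-separated 5 x 5 grids of points.
grid_a = [(x, y) for x in range(5) for y in range(5)]
grid_b = [(x + 100, y + 100) for x in range(5) for y in range(5)]
medoids, labels = clara(grid_a + grid_b, k=2, sampsize=30, samples=3)
```

Only the sampled points ever enter the expensive swap search; the full dataset is touched just once per sample, for the final nearest-medoid assignment.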

The R package ‘cluster’ (https://cran.r-project.org/web/packages/cluster/cluster.pdf) provides the clara function, which computes a “clara” object, a list representing a clustering of the data into k clusters.

The currently available distance metrics for calculating dissimilarities between observations are “euclidean” and “manhattan”.

Euclidean distances are the square root of the sum of squared differences, and Manhattan distances are the sum of absolute differences.
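As a quick illustration of the two definitions, here is a minimal Python sketch. The function names are chosen for this example only and are not part of the ‘cluster’ package.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```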

The R package ‘ClusterR’ (https://cran.r-project.org/web/packages/ClusterR/ClusterR.pdf) has the Clara_Medoids function, which supports a wider range of distance metrics for calculating dissimilarities between observations: “euclidean”, “manhattan”, “chebyshev”, “canberra”, “braycurtis”, “pearson_correlation”, “simple_matching_coefficient”, “minkowski”, “hamming”, “jaccard_coefficient”, “Rao_coefficient” and “mahalanobis”.

It also has a threads argument specifying the number of cores to run in parallel. OpenMP is used to parallelize the different sample draws.
