Mahalanobis Clustering

A new clustering algorithm, Mahalanobis clustering, is proposed as an improvement on traditional K-means clustering.

To identify clusters in a data set mathematically, it is usually necessary to first define a measure of similarity or proximity, which establishes a rule for assigning patterns to the domain of a particular cluster center. As is to be expected, the measure of similarity is problem dependent. The most popular similarity measure is the Euclidean distance: the smaller the distance, the greater the similarity. With Euclidean distance as the measure of similarity, hyperspherical clusters of equal size are usually detected. This measure is useless or even undesirable when clusters tend to develop along principal axes. For hyperellipsoidal clusters, the Mahalanobis distance is one of the popular choices. One of the major difficulties with using the Mahalanobis distance as a similarity measure is that the inverse of the sample covariance matrix has to be recomputed every time a pattern changes its cluster domain, which is computationally expensive.
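For reference, the two distances can be written side by side. The symbols are assumptions chosen here for illustration and do not come from the original text: x is a pattern, z is a cluster center, and Σ is the sample covariance matrix of that cluster.

```latex
d_E(x, z) = \sqrt{(x - z)^{\top} (x - z)}
\qquad
d_M(x, z) = \sqrt{(x - z)^{\top} \Sigma^{-1} (x - z)}
```

The only difference is the factor Σ^{-1}, which is why the inverse covariance has to be refreshed whenever the membership of a cluster changes.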

The choice between Mahalanobis and Euclidean distance in k-means is really a choice between using the full covariance of your clusters and ignoring it. When you use Euclidean distance, you assume that the clusters have identity covariances. In 2D, this means that your clusters are circular. If the covariances of the natural groupings in your data are not identity matrices (in 2D, if the clusters are elliptical), then Mahalanobis distance will model the data much better than Euclidean distance.
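A minimal sketch of this difference in 2D, using only the Python standard library. The cluster centers, covariance matrices, and test point are hypothetical values made up for illustration, and the function names are not from any library.

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two 2D points."""
    return math.hypot(x[0] - y[0], x[1] - y[1])

def mahalanobis_2d(x, center, cov):
    """Mahalanobis distance from x to center under a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out by hand.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - center[0], x[1] - center[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)

# Hypothetical clusters: A is stretched along the x axis, B is small and round.
center_a, cov_a = (0.0, 0.0), ((9.0, 0.0), (0.0, 0.25))
center_b, cov_b = (4.0, 2.0), ((0.25, 0.0), (0.0, 0.25))
identity = ((1.0, 0.0), (0.0, 1.0))

point = (3.0, 0.5)

# With an identity covariance, Mahalanobis distance reduces to Euclidean distance.
print(mahalanobis_2d(point, center_a, identity), euclidean(point, center_a))

# Euclidean distance puts the point closer to B; Mahalanobis distance,
# which accounts for A's elongated shape, puts it closer to A.
print("Euclidean:  ", euclidean(point, center_a), euclidean(point, center_b))
print("Mahalanobis:", mahalanobis_2d(point, center_a, cov_a),
      mahalanobis_2d(point, center_b, cov_b))
```

In this made-up configuration the two measures assign the same point to different clusters, which is the practical consequence of assuming identity covariances.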

Usage example: How can we get information on credit ratings?
The first step in this process is to make the data comparable. This consists of two parts: first we normalize each feature among its peers, and then we use a statistical distance metric instead of a raw distance metric. A good example is the Mahalanobis distance.

The Mahalanobis distance
How can we compare a client who has three children with a client who recharges their phone for 40 rupees twice a month, or another client who is a woman? Each of these data points is as different as the feature it quantifies. Professor P. C. Mahalanobis introduced a scale-invariant distance metric that allows data sets to be compared by looking at the statistical variance of the data. By removing the scale of the data and comparing values based on their statistical properties, such as a datum's offset from the mean and the variance of the population, the Mahalanobis distance allows us to gauge binary features, discrete features, and continuous features against each other. This lets us compare apples and oranges!
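The scale invariance can be checked with a small sketch. The client records below are invented for illustration (number of children, monthly phone recharge in rupees), the helper names are hypothetical, and only two features are used so that the 2x2 matrix inverse can be written by hand.

```python
import math

def sample_covariance_2d(rows):
    """Sample covariance matrix of 2D rows (n - 1 denominator)."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mean_x) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - mean_y) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mean_x) * (r[1] - mean_y) for r in rows) / (n - 1)
    return ((sxx, sxy), (sxy, syy))

def mahalanobis_between(x, y, cov):
    """Mahalanobis distance between two 2D points under covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - y[0], x[1] - y[1]
    # (dx, dy) * inverse(cov) * (dx, dy)^T with the 2x2 inverse written out.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

def euclidean(x, y):
    return math.hypot(x[0] - y[0], x[1] - y[1])

# Hypothetical clients: (number of children, monthly recharge in rupees).
clients_rupees = [(0, 80.0), (1, 40.0), (3, 80.0), (2, 120.0), (4, 100.0)]
# The same clients with the recharge expressed in paise (1 rupee = 100 paise).
clients_paise = [(kids, amount * 100.0) for kids, amount in clients_rupees]

first, second = 2, 1  # indices of the two clients being compared

d_rupees = mahalanobis_between(clients_rupees[first], clients_rupees[second],
                               sample_covariance_2d(clients_rupees))
d_paise = mahalanobis_between(clients_paise[first], clients_paise[second],
                              sample_covariance_2d(clients_paise))
euclid_rupees = euclidean(clients_rupees[first], clients_rupees[second])
euclid_paise = euclidean(clients_paise[first], clients_paise[second])

# The Mahalanobis distance is unchanged by the change of unit (up to rounding),
# while the Euclidean distance is dominated by whichever feature has the
# largest numeric scale.
print("Mahalanobis:", d_rupees, d_paise)
print("Euclidean:  ", euclid_rupees, euclid_paise)
```

Changing the unit of one feature leaves the Mahalanobis distance between the two clients the same, whereas the Euclidean distance grows roughly with the rescaled feature.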
