### A very good article that explains k-Means clustering:

http://msdn.microsoft.com/en-us/magazine/jj891054.aspx

Consider the problem of identifying abnormal data items in a very large data set, for example, identifying potentially fraudulent credit-card transactions, risky loan applications and so on. One approach to detecting abnormal data is to group the data items into similar clusters and then seek data items within each cluster that are different in some sense from other data items within the cluster.

Data clustering is closely related to and sometimes confused with data classification. Clustering is an unsupervised technique that groups data items together without any foreknowledge of what those groups might be. Clustering is typically an exploratory process. Classification, in contrast, is a supervised technique that requires the specification of known groups in training data, after which each data tuple is placed into one of these groups. Classification is typically used for prediction purposes.

In clustering terminology, data items are sometimes called tuples. For example, a tuple might represent a person and have two numeric attribute values, a height in inches and a weight in pounds. One limitation of the k-means algorithm is that it applies only when the data tuples are completely numeric.

Although there are advanced clustering techniques that can suggest the optimal number of clusters to use, in general data clustering is an exploratory process and the best number of clusters to use is typically found through trial and error.

One common method of choosing the appropriate cluster solution is to compare the sum of squared error (SSE) for a number of cluster solutions. SSE is defined as the sum of the squared distance between each member of a cluster and its cluster centroid. Thus, SSE can be seen as a global measure of error. In general, as the number of clusters increases, the SSE should decrease because clusters are, by definition, smaller. A plot of the SSE against a series of sequential cluster levels can provide a useful graphical way to choose an appropriate cluster level. An appropriate cluster solution could be defined as the solution at which the reduction in SSE slows dramatically.
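The SSE definition above can be sketched in a few lines of Python. This is a minimal illustration, not the article's demo code; the helper names (`squared_dist`, `sse`) and the sample (height, weight) tuples are assumptions made for the example.

```python
def squared_dist(a, b):
    # Squared Euclidean distance between two numeric tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(clusters, centroids):
    # Sum of squared distances from each member to its cluster's centroid;
    # a global measure of clustering error
    total = 0.0
    for cluster, centroid in zip(clusters, centroids):
        for tup in cluster:
            total += squared_dist(tup, centroid)
    return total

# Two tiny clusters of (height, weight) tuples with their centroids
clusters = [[(65.0, 150.0), (67.0, 160.0)], [(73.0, 200.0)]]
centroids = [(66.0, 155.0), (73.0, 200.0)]
print(sse(clusters, centroids))  # 52.0
```

To apply the elbow method described above, you would compute this SSE for each candidate number of clusters and look for the point where the decrease slows sharply.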

The central concept in the k-means algorithm is the centroid. In data clustering, the centroid of a set of data tuples is the one tuple that's most representative of the group. Which tuple is most representative? One approach is to compute a mathematical average (mean) tuple, and then select as the centroid the tuple that is closest to that average tuple, where closeness is most commonly measured by Euclidean distance.
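This centroid-selection idea can be sketched as follows. The function names and the sample cluster of (height, weight) tuples are illustrative assumptions, not code from the article.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two numeric tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_tuple(cluster):
    # Component-wise average of all tuples in the cluster
    n = len(cluster)
    return tuple(sum(t[i] for t in cluster) / n for i in range(len(cluster[0])))

def centroid(cluster):
    # The actual data tuple closest to the mean tuple
    m = mean_tuple(cluster)
    return min(cluster, key=lambda t: euclidean(t, m))

cluster = [(65.0, 150.0), (67.0, 160.0), (72.0, 190.0)]
print(centroid(cluster))  # (67.0, 160.0) -- closest to the mean (68.0, ~166.7)
```

Note that the centroid is an actual member of the cluster, unlike the mean tuple, which generally is not.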

To summarize, each cluster has a centroid, which is the most representative tuple in the cluster. Centroid values are computed by finding the one tuple in each cluster that's closest to the average tuple (the mean) of that cluster. Each data tuple is assigned to the cluster whose centroid is closest to the tuple. Possibly abnormal tuples are outliers, the tuples that are farthest (as measured by Euclidean distance) from their cluster centroid (the most representative tuple).
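The outlier-detection step described above can be sketched like this, assuming a cluster and its centroid are already known. The function name and sample data are hypothetical.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two numeric tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_abnormal(cluster, centroid):
    # The tuple farthest from the centroid is the best outlier candidate
    return max(cluster, key=lambda t: euclidean(t, centroid))

cluster = [(65.0, 150.0), (67.0, 160.0), (60.0, 230.0)]
centroid = (67.0, 160.0)
print(most_abnormal(cluster, centroid))  # (60.0, 230.0)
```

In a fraud-detection scenario, each cluster's farthest members would be flagged for closer inspection rather than automatically labeled abnormal.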

You might want to consider alternative ways to define distance. A very common option is to use the sum of the absolute values of the differences between each component. Because Euclidean distance squares differences, larger differences are weighted much more heavily than smaller differences.
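The difference between the two distance measures is easy to see in code. This is a small illustrative sketch; the sample points are made up.

```python
import math

def euclidean(a, b):
    # Squares each component difference, so large differences dominate
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute component differences; large differences
    # are not amplified by squaring
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = (65.0, 150.0), (68.0, 154.0)
print(euclidean(a, b))  # 5.0 (a 3-4-5 triangle)
print(manhattan(a, b))  # 7.0
```

Swapping `euclidean` for `manhattan` in the clustering code changes how strongly a single large attribute difference influences cluster assignment.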

Another important factor related to the choice of distance function in the k-means clustering algorithm is data normalization. The demo program uses raw, un-normalized data. If a person tuple consists of a height and a weight, weights are typically values such as 160.0 while heights are typically values like 67.0, so differences in weights have much more influence on the distance than differences in heights. In many situations, in addition to exploring clustering on raw data, it's useful to normalize the raw data before clustering. There are many ways to normalize data. A common technique is to compute the mean (m) and standard deviation (sd) for each attribute, then for each attribute value (v) compute a normalized value nv = (v - m) / sd.
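The nv = (v - m) / sd normalization described above can be sketched as follows; the function name and sample height values are assumptions for the example.

```python
import math

def normalize(values):
    # Gaussian (z-score) normalization: nv = (v - m) / sd,
    # using the population standard deviation
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

heights = [64.0, 66.0, 68.0, 70.0]
print(normalize(heights))  # roughly [-1.34, -0.45, 0.45, 1.34]
```

After this transformation every attribute has mean 0 and standard deviation 1, so height and weight contribute comparably to the distance calculation.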