K-Means clustering is one of the most popular unsupervised machine learning algorithms. It is a clustering algorithm: its purpose is to group unlabeled data by shared qualities and characteristics.

## Use Cases

K-Means clustering is primarily used to find groups within a sea of data that has not been explicitly labeled. Some example use cases for this type of model are:

- Identify people based on interests
- Define personas based on activity monitoring
- Group inventory by sales activity or manufacturing metrics
- Anomaly detection
- Detect bots

These five use cases only scratch the surface; the possibilities are nearly endless.

## How Does It Work?

In order to label the data, the algorithm starts with randomly selected centroids. These are the starting points for each of the *K* clusters. From there, an iterative process calculates and optimizes the positions of the centroids. This iterative process continues either until the centroids stop changing, i.e. the clustering has converged, or until a predefined number of iterations has been reached.
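The iterative process above can be sketched in a few lines of NumPy (a minimal illustration, not scikit-learn's actual implementation):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means: random initial centroids, then repeat
    assign-to-nearest-centroid and recompute-centroids until stable."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: no more change in the centroids
        centroids = new_centroids
    return centroids, labels

# Two obvious blobs, one around (0, 0) and one around (10, 10)
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)]) + \
    np.random.default_rng(1).normal(0, 0.5, (40, 2))
centroids, labels = kmeans_sketch(X, k=2)
```

With two well-separated blobs like this, the two centroids settle near the blob centers regardless of where they start.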

## Choosing K

The K in K-means represents the number of clusters to separate the unlabeled data into.

In order to optimize the value of K, you need to run the algorithm for a range of different K values and compare the results. There is no method for determining the exact value of K in advance; it has to be estimated from the data.

A popular metric for comparing results of different K values is the *mean distance between data points and their cluster centroid*. If you subsequently plot the values of *mean distance to the centroid as a function of K*, you will find what people refer to as an “elbow point”.

Past the elbow point, increasing the number of clusters yields diminishing returns. This method only gives a rough estimate, and there are other methods out there, including cross-validation, to estimate K, but I have found that this method provides accurate results.
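Among those other methods, the silhouette score is a common complement to the elbow plot. A quick sketch, using scikit-learn's bundled copy of the wine data so the snippet is self-contained (the walkthrough below reads from a `wines.csv` file instead):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# scikit-learn ships the same Italian wine data used below
X = MinMaxScaler().fit_transform(load_wine().data)

# Higher silhouette (closer to 1) means tighter, better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```

Unlike the elbow plot, the silhouette score gives a single number per K, so you can simply pick the K with the highest score.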

## Example: Classifying Wine

In this example, we will be using a wine dataset containing chemical analysis of Italian wines to cluster different types of wines.

I like to use a Jupyter notebook when working with Python, but feel free to use whichever IDE or text editor you are most comfortable in.

### Loading the Dataset

The first step is to load the dataset. For this step we will be using Pandas.

```python
import pandas as pd

# Load Data
data = pd.read_csv("wines.csv")

# Define the columns we care about.
COLS = ['Alcohol', 'Malic_Acid', 'Ash', 'Ash_Alcanity', 'Magnesium',
        'Total_Phenols', 'Flavanoids', 'Nonflavanoid_Phenols',
        'Proanthocyanins', 'Color_Intensity', 'Hue', 'OD280', 'Proline']
```

If you now look at the first 5 rows of the dataset with `data.head()`, you should see one row per wine and one column for each of the 13 chemical measurements.
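If you don't have the `wines.csv` file handy, scikit-learn bundles the same wine data, and you can preview it the same way (the column names differ slightly from the CSV):

```python
import pandas as pd
from sklearn.datasets import load_wine

# Stand-in for wines.csv: scikit-learn's bundled copy of the wine data
wine = load_wine()
data = pd.DataFrame(wine.data, columns=wine.feature_names)
print(data.head())  # first 5 rows: one wine per row, 13 measurements each
```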

Before we dive into the modeling, let’s import Matplotlib and take a look at a couple diagrams of the data.

```python
import matplotlib.pyplot as plt

# Histogram
data.hist(column=COLS, bins=20, figsize=(11, 11))
plt.show()
```

```python
# Correlation Heatmap (based on example at https://seaborn.pydata.org)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Get Correlations
corr = data[COLS].corr()

# We only want to show the lower triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
```

We can tell from the correlation matrix that there are some relatively moderate correlations across our dataset, but nothing to worry about.

### Data Preparation

Because we are using 13 different features with vastly different scales to cluster our Italian wines, we need to normalize our data. If we were to skip this step, our clusters would be pulled in the direction of features with larger values. In other words, normalization makes sure all the features are measured on the same scale.

```python
# Normalizing the data
data_norm = (data - data.mean()) / (data.max() - data.min())
```

### Modeling

Now it is time to start the clustering. To start off, we will simply choose a K value of 2.

```python
from sklearn.cluster import KMeans

# Cluster the data
km = KMeans(n_clusters=2, random_state=0)
km.fit(data_norm)

# Label the clusters in the original dataset
labels = km.labels_
data['clusters'] = labels

# Print the mean statistics for the 2 clusters
print(data.groupby('clusters')[COLS].mean())
```

### How Many Clusters?

Now let's see if we can find the optimal K value by running the above model for a range of different K values and plotting the *mean distance between data points and their cluster centroid* against the number of clusters.

Computing these distortions is not built into scikit-learn's KMeans, so we will import *cdist* from *scipy*.

```python
from scipy.spatial.distance import cdist

# Optimizing K
kms = []
distortions = []
K = range(1, 11)
for k in K:
    # Model
    kms.append(KMeans(n_clusters=k, random_state=0).fit(data_norm))

    # Calculate Distortions: mean distance to the nearest centroid
    distortions.append(
        sum(np.min(cdist(data_norm, kms[-1].cluster_centers_,
                         'euclidean'), axis=1)) / data.shape[0])

# Plotting the Elbow Method
fig, ax = plt.subplots(figsize=(10, 7))
plt.plot(K, distortions)
plt.xlabel('K values')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
```

The elbow method confirms that we already selected the optimal number of clusters: two.

### Results

The mean-statistics table printed above displays the mean values of our 13 features for each of the two clusters.

Additionally, we can output scatterplots for each pair of variables in a seaborn *pairplot*. This is color-coded by our two clusters and gives you another view of how these clusters divide our data.

```python
# Pairwise Scatterplot
g = sns.pairplot(data[COLS + ['clusters']], vars=COLS,
                 hue="clusters", palette="husl")

# Hide the redundant upper triangle of plots
for i, j in zip(*np.triu_indices_from(g.axes, 1)):
    g.axes[i, j].set_visible(False)
```

## Summary

In this blog post we learned what K-Means clustering is, briefly how it works, and we learned how we can use it on multiple features to classify fine Italian wines.

I hope you enjoyed this little walkthrough. Feel free to leave a comment if you have any questions or concerns.
