K-Means clustering is one of the most popular unsupervised machine learning algorithms. It is a clustering algorithm, meaning its purpose is to group unlabeled data by shared qualities and characteristics.

Unlabeled Clustering Data

Use Cases

K-Means clustering is primarily used to find groups within a sea of data that has not been explicitly labeled. Some example use cases for this type of model are:

  • Identify people based on interests
  • Define personas based on activity monitoring
  • Group inventory by sales activity or manufacturing metrics
  • Anomaly detection
  • Detect bots

These are only five examples; the possibilities are endless.

How Does It Work?

In order to label the data, the algorithm starts with randomly selected centroids. These are the starting points for each of the K clusters. From there, an iterative process calculates and optimizes the positions of the centroids. This process continues until the centroids no longer change, meaning the clustering has converged, or until a predefined number of iterations has been reached.
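To make that loop concrete, here is a minimal NumPy sketch of the idea. It is not the scikit-learn implementation we will use later, the function and variable names (kmeans_sketch, points, k) are purely illustrative, and it does not handle edge cases such as empty clusters:

import numpy as np

def kmeans_sketch(points, k, max_iters=100, seed=0):
    """Bare-bones K-Means: random starting centroids, then iterate."""
    rng = np.random.default_rng(seed)

    # 1. Start from k randomly chosen data points as centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iters):
        # 2. Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # 3. Move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])

        # 4. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids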

Choosing K

The K in K-means represents the number of clusters to separate the unlabeled data into.

In order to optimize the value of K, you need to run the algorithm for a range of different K values and compare the results. There is no single method for determining the exact value K should take.

A popular metric for comparing results of different K values is the mean distance between data points and their cluster centroid. If you subsequently plot the values of mean distance to the centroid as a function of K, you will find what people refer to as an “elbow point”.

Past the elbow point, increasing the number of clusters yields diminishing returns. This method only gives a rough estimate, and there are other approaches, including cross-validation, for estimating K, but I have found that the elbow method works well in practice.

Example: Classifying Wine

In this example, we will be using a wine dataset containing chemical analysis of Italian wines to cluster different types of wines.

I like to use a Jupyter notebook when working with Python, but feel free to use whichever IDE or text editor you are most comfortable with.

Loading the Dataset

The first step is to load the dataset. For this step we will be using Pandas.

import pandas as pd

# Load Data
data = pd.read_csv("wines.csv")

# Define the columns we care about.
COLS = ['Alcohol', 'Malic_Acid', 'Ash', 'Ash_Alcanity',
        'Magnesium','Total_Phenols', 'Flavanoids',
        'Nonflavanoid_Phenols', 'Proanthocyanins',
        'Color_Intensity', 'Hue', 'OD280', 'Proline']

If you now look at the first 5 rows of the dataset, you should see similar output:

Output from data.head()
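Since the screenshot of that output is not reproduced here, you can generate it yourself with a quick call to head():

# First five rows of the dataset
data.head()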

Before we dive into the modeling, let’s import Matplotlib and take a look at a couple of diagrams of the data.

import matplotlib.pyplot as plt

# Histogram of each feature
data.hist(column=COLS, bins=20, figsize=(11,11))
Histogram of the features in the dataset.
# Correlation Heatmap (based on example at https://seaborn.pydata.org)
import seaborn as sns
import numpy as np

# Get Correlations
corr = data[COLS].corr()

# We only want to show the lower triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

We can tell from the correlation matrix that there are some relatively moderate correlations across our dataset, but nothing to worry about.

Data Preparation

Because we are using 13 different features with vastly different scales to cluster our Italian wines, we need to normalize the data. If we skipped this step, the clusters would be pulled toward the features with the largest values. In other words, normalizing ensures that all the features are measured on the same scale.

# Normalizing the feature columns (mean normalization)
data_norm = (data[COLS] - data[COLS].mean()) / (data[COLS].max() - data[COLS].min())
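As an aside, the formula above is mean normalization (centering on the mean and dividing by the range). If you prefer scikit-learn’s preprocessing utilities, a similar, though not identical, effect can be had with MinMaxScaler; this is just a sketch of that alternative:

from sklearn.preprocessing import MinMaxScaler

# Alternative: scale each feature to the [0, 1] range instead
scaler = MinMaxScaler()
data_norm_alt = pd.DataFrame(scaler.fit_transform(data[COLS]), columns=COLS)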

Modeling

Now it is time to start the clustering. To start off, we will simply choose a K value of 2.

from sklearn.cluster import KMeans

# Cluster the data
km = KMeans(n_clusters=2, random_state=0)
km.fit(data_norm)

# Label the clusters in the original dataset
labels = km.labels_
data['clusters'] = labels

# Print the mean statistics for the 2 clusters
print(data.groupby('clusters')[COLS].mean())
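As a quick sanity check (not shown in the original output), you can also count how many wines fall into each cluster:

# Number of wines in each cluster
print(data['clusters'].value_counts())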

How Many Clusters?

Now let’s see if we can find the optimal K value by running the above model for a range of different K values and plotting the mean distance between data points and their cluster centroid against the number of clusters.

Computing these distortions is not built into scikit-learn’s KMeans, so we will import cdist from SciPy.

from scipy.spatial.distance import cdist

# Optimizing K
kms = []
distortions = []

K = range(1, 11)
for k in K:
    # Fit a model for each candidate number of clusters
    km_k = KMeans(n_clusters=k, random_state=0).fit(data_norm)
    kms.append(km_k)

    # Distortion: mean distance from each point to its nearest centroid
    distortions.append(
        sum(np.min(cdist(data_norm,
                         km_k.cluster_centers_,
                         'euclidean'), axis=1)) / data_norm.shape[0])

# Plotting the Elbow Method
fig, ax = plt.subplots(figsize=(10,7))
plt.plot(K, distortions)
plt.xlabel('K values')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Elbow method plotted results.

The elbow method shows that we already selected the optimal number of clusters: two.

Results

Here is the same table as shown above. It displays the mean values for our 13 features by each of the two clusters.

Mean results of our 2 clusters (same as pictured above).

Additionally, we can output scatterplots for each pair of variables in a Seaborn pairplot. It is color-coded by our two clusters and gives you another view of how these clusters divide our data.

# Pairwise Scatterplot
g = sns.pairplot(data[COLS + ['clusters']], 
                 vars=COLS,
                 hue="clusters", 
                 palette="husl")

# Hide the redundant upper triangle of the grid
for i, j in zip(*np.triu_indices_from(g.axes, 1)):
    g.axes[i, j].set_visible(False)

Summary

In this blog post we learned what K-Means clustering is, briefly how it works, and how we can use it on multiple features to cluster fine Italian wines.

I hope you enjoyed this little walkthrough. Feel free to leave a comment if you have any questions or concerns.

I have half a decade of experience working with data science and data engineering in a variety of fields, both professionally and in academia. I have demonstrated advanced skills in developing machine learning algorithms, econometric models, intuitive visualizations, and reporting dashboards in order to communicate data and technical terminology in an easy-to-understand manner for clients of varying backgrounds.
