Let's dive into the basics of unsupervised Machine Learning algorithms! In this week's BaseCamp tutorial, we will show you the (probably) most common clustering algorithm: KMeans.

We will be working with the famous packages pandas, numpy, and sklearn. Let's start by loading them:

import pandas as pd
import numpy as np


We will be working with NBA data, more specifically the average statistics of all players from the regular season 2016/2017. We will use these data to separate players into different clusters/groups based on the players' statistics and we will see if some interesting groups of players are created. You can download the dataset here. The data is in the JSON format, therefore we will use the package JSON to load the file into Python and parse it:

import json

data_path = "path to your file"
with open(data_path + 'players_average_2017.json') as json_data:
    data = json.load(json_data)

# Parsing
columns = data["resultSets"][0]["headers"]
df = pd.DataFrame(data["resultSets"][0]["rowSet"], columns=columns)

Looking at the column names, we can see that there are plenty of columns with "RANK" in them. These contain just the ordering of players for specific statistics, and we won't use them for clustering. The code below drops every column whose name contains the substring "_RANK".

# we will drop the columns with _RANK in their names
to_drop = []
for cl in columns:
    if cl.find("_RANK") != -1:
        to_drop.append(cl)
df.drop(to_drop, axis=1, inplace=True)

Let's start with some simple descriptive statistics of our dataset:

# How many players appeared in at least 1 game during the season?
print(len(df.PLAYER_ID))
# How many points were scored on average by 1 player in 1 game?
print(df.PTS.mean())
# The average above is not correct, we have to use a weighted average:
# players who played more games should have a bigger impact on the average
print((df.PTS * df.GP).sum() / df.GP.sum())
# Descriptive statistics for important variables
df.drop(["PLAYER_ID", "TEAM_ID"], axis=1).describe()

In this phase, we should remember or write down the values of descriptive statistics, so we can compare them later with specific clusters of players.


As usual, we have to transform all categorical variables into numeric ones. Furthermore, there is one very important step in data preparation for the KMeans algorithm: we need to make sure that all the variables we use are on the same scale. Therefore, we will apply one of the scalers from sklearn to our dataset:
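
To see what "same scale" means concretely: MinMax scaling maps each column to the [0, 1] range via (x - min) / (max - min). Here is a minimal sketch on made-up toy numbers (two columns on very different scales, loosely standing in for games played vs. points per game):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two columns on very different scales
X = np.array([[82.0, 4.1],
              [41.0, 10.3],
              [10.0, 30.1]])

scaled = MinMaxScaler().fit_transform(X)
print(scaled)

# The same result from the formula (x - min) / (max - min), column-wise
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(scaled, manual))  # True
```

After scaling, both columns span exactly [0, 1], so neither dominates the distance computations inside KMeans.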

train = df.drop(["PLAYER_ID", "TEAM_ID"], axis=1)

# For scaling
from sklearn.preprocessing import MinMaxScaler
# For transformation into numeric categories
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train["TEAM_ABBREVIATION"] = le.fit_transform(train["TEAM_ABBREVIATION"])

We can check if we have any missing values in our data. If we don't, we can directly scale our variables:

# Scaling
sc = MinMaxScaler()
train_sc = sc.fit_transform(train)

For more information about LabelEncoder and MinMaxScaler, check the scikit-learn preprocessing documentation.

K-Means Interface:

You can explore the KMeans here. The constructor of the KMeans class returns an estimator with the fit() method that enables you to perform clustering. This process is consistent with other sklearn algorithms we have explored in previous tutorials.
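
Before running it on our dataset, here is a minimal, self-contained sketch of that interface on hypothetical toy data: construct the estimator, call fit(), then read the fitted attributes (n_init is set explicitly only to keep the sketch stable across sklearn versions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two obvious groups of points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

km = KMeans(n_clusters=2, random_state=0, n_init=10)
km.fit(X)                        # the estimator interface: construct, then fit

print(km.labels_)                # cluster assignment for each training point
print(km.cluster_centers_)       # coordinates of the 2 centroids
print(km.predict([[4.9, 5.0]]))  # assign a new point to its nearest centroid
```

The fitted attributes (with a trailing underscore, sklearn's convention) are what we will read back after clustering the NBA data.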

K-Means Parameters:

Using the above link, we can see that there are a few parameters which control the K-Means algorithm. We will look at one parameter specifically: the number of clusters used in the algorithm. The number of clusters needs to be chosen based on the domain knowledge of the data.

from sklearn.cluster import KMeans

# Run K-means using 4 cluster centers on the scaled data
kmeans_4 = KMeans(n_clusters=4, random_state=0).fit(train_sc)

# Run K-means using 5 cluster centers on the scaled data
kmeans_5 = KMeans(n_clusters=5, random_state=0).fit(train_sc)

# Run K-means using 6 cluster centers on the scaled data
kmeans_6 = KMeans(n_clusters=6, random_state=0).fit(train_sc)

# Run K-means using 7 cluster centers on the scaled data
kmeans_7 = KMeans(n_clusters=7, random_state=0).fit(train_sc)


In some cases, we don't know up front what the best number of clusters is. Therefore, it is very important to run the algorithm several times with different numbers of clusters and evaluate each run separately. To do this, we are going to use two metrics:

  • Inertia
  • Silhouette score
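
One common way to use the first metric is the so-called "elbow" method (not part of the original workflow above): compute the Inertia for a range of cluster counts and look for the point where the curve flattens. A minimal sketch on synthetic data (the blob centers and sizes here are made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: four well-separated blobs of 50 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5), (5, 0)]])

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 1))
```

Inertia always decreases as k grows, so we don't look for its minimum; we look for the k after which adding more clusters stops paying off (here, the drop from 3 to 4 clusters is much larger than the drop from 4 to 5).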


Inertia is a metric used to estimate how close together the data points within a cluster are. It is calculated as the sum of squared distances from each point to its closest centroid, i.e., its assigned cluster center. The intuition behind Inertia is that lower Inertia is better, as it means closely related points form a cluster. scikit-learn computes Inertia automatically when fitting and stores it in the inertia_ attribute:
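
To make the definition concrete, here is a small pure-numpy sketch (the toy points and centroids are chosen by hand) that computes Inertia exactly as described, as the sum of squared distances from each point to its nearest centroid:

```python
import numpy as np

# Hypothetical toy data: six points forming two tight groups,
# and two hand-picked cluster centers
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids = np.array([[0.33, 0.33], [10.33, 10.33]])

# Squared distance from every point to every centroid, shape (6, 2)
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point contributes the squared distance to its *closest* centroid
inertia = sq_dists.min(axis=1).sum()
print(inertia)
```

This is the same quantity KMeans stores in its inertia_ attribute after fitting, with the fitted cluster centers in place of our hand-picked ones.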

print("Inertia for KMeans with 4 clusters = %f" % kmeans_4.inertia_)
print("Inertia for KMeans with 5 clusters = %f" % kmeans_5.inertia_)
print("Inertia for KMeans with 6 clusters = %f" % kmeans_6.inertia_)
print("Inertia for KMeans with 7 clusters = %f" % kmeans_7.inertia_)

Silhouette Score:

The Silhouette Score measures how well separated the resulting clusters are. A higher Silhouette Score is better, as it means that we don't have too many overlapping clusters. The Silhouette Score can be computed using sklearn.metrics.silhouette_score from scikit-learn and takes values between -1 and 1: -1 means that observations are assigned to the wrong cluster, 0 means that clusters overlap, and 1 means that observations are well matched to their own cluster:
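
As a sanity check on that interpretation, here is a self-contained pure-numpy sketch (toy points and labels are invented for illustration) that computes the silhouette coefficient straight from its definition:

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient from the definition:
    s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance to
    the other points in i's own cluster and b_i is the mean distance to
    the nearest other cluster."""
    scores = []
    for i, p in enumerate(points):
        same = points[labels == labels[i]]
        # mean distance to the other members of i's own cluster
        a = np.linalg.norm(same - p, axis=1).sum() / (len(same) - 1)
        # mean distance to the nearest other cluster
        b = min(np.linalg.norm(points[labels == lb] - p, axis=1).mean()
                for lb in set(labels) if lb != labels[i])
        scores.append((b - a) / max(a, b))
    return np.mean(scores)

pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)

# Two tight, well-separated clusters -> score close to 1
print(silhouette(pts, np.array([0, 0, 1, 1])))

# The same points with deliberately wrong labels -> negative score
print(silhouette(pts, np.array([0, 1, 0, 1])))
```

In practice we just call sklearn's silhouette_score, as below, but the tiny example shows why good assignments push the score toward 1 and bad ones below 0.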

from sklearn.metrics import silhouette_score

def get_silhouette_score(data, model):
    cluster_labels = model.fit_predict(data)
    score = silhouette_score(data, cluster_labels)
    return score
print(get_silhouette_score(train_sc, kmeans_4))
print(get_silhouette_score(train_sc, kmeans_5))
print(get_silhouette_score(train_sc, kmeans_6))
print(get_silhouette_score(train_sc, kmeans_7))


Using the evaluation methods above, we can see that each method gives a different result. Using Inertia, the solution with 7 clusters was evaluated as the best; using the Silhouette Score, it was the solution with 4 clusters. We should check the number of players per cluster to find out whether we have clusters that are either too big or too small:

# Adding labels to our dataframe
df["labels_4"] = kmeans_4.labels_
df["labels_7"] = kmeans_7.labels_

# Number of players per cluster
print(df["labels_4"].value_counts())
print(df["labels_7"].value_counts())

There is one cluster in both solutions where the performance metrics of the players are much higher than average: cluster number 2 for K-means with 4 clusters, and number 3 for the solution with 7 clusters. The sizes of these clusters are 76 and 37, respectively. Below you can see the list of players from these two clusters. Be aware that your numbers can be slightly different each time you fit the model:

Andrew Wiggins
Anthony Davis
Blake Griffin
Carmelo Anthony
Chris Paul
Damian Lillard
DeMar DeRozan
DeMarcus Cousins
Eric Bledsoe
Giannis Antetokounmpo
Goran Dragic
Gordon Hayward
Isaiah Thomas
James Harden
Jimmy Butler
Joel Embiid
John Wall
Karl-Anthony Towns
Kawhi Leonard
Kemba Walker
Kevin Durant
Kevin Love
Kyle Lowry
Kyrie Irving
LeBron James
Marc Gasol
Mike Conley
Paul George
Paul Millsap
Rudy Gay
Russell Westbrook
Stephen Curry

These players are clearly the superstars of the league, with the biggest impact on a game and the most touches per game. In the 4-cluster solution, there are also 40 additional players who aren't on the same superstar level.


The 7-cluster solution seems to be better, because there are definitely more than four different player types that should be split into separate groups. However, the list above still consists of players at every position (as long as they are superstars of the league). Therefore, you can try to go even further and run the clustering with more clusters to see whether these guys get split into more groups, or what else happens to them. Let us know how it went for you.

You can also check our online bootcamp for more Data Science education.