In this week's tutorial, we will show you how to approach classification problems with simple Machine Learning algorithms in Python, using pandas and sklearn. These two packages are the state of the art for data wrangling and Machine Learning in Python.
We will use a dataset of 79885 reported traffic stops in a large Southeastern city in the United States in 2016. Using a classification algorithm, we will try to predict whether or not the police officers searched the vehicle (the SrchVhcl variable). The dataset can be downloaded at https://drive.google.com/file/d/0Bz9_0VdXvv9ba1NYTGcyZGNiT1U/view?usp=sharing
import pandas as pd
Using the pandas function read_csv, we can load the data directly into a pandas DataFrame:
data_path = ""
df = pd.read_csv(data_path + "trafficstop.csv", sep=",")
df.head(2)
Description: Records from all 79885 reported traffic stops for a large Southeastern city in 2016.
- Month 1-12 (Month)
- Reason for Stop (RsnStop)
- Officer Race (OffRace)
- Officer Male (OffMale)
- Officer Years Service (OffYrsSrv)
- Driver Race (DrvRace)
- Driver Hispanic (DrvHisp)
- Driver Male (DrvMale)
- Driver Age (DrvAge)
- Search Vehicle (SrchVhcl)
- Result of Stop (RsltStop)
Reasons for Stop:
- 5=Safe Movement
Result of Stop:
To check the types of the variables, we read the dtypes attribute of the DataFrame object:
df.dtypes
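As a self-contained illustration of what dtypes returns, here is a minimal sketch on a made-up frame. The frame `toy` and its values are hypothetical; only the column names are borrowed from the dataset description. It also shows one possible way to confirm that every column is numeric.

```python
import pandas as pd

# Hypothetical two-row frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Month": [1, 2],
    "DrvAge": [34, 51],
    "SrchVhcl": [0, 1],
})

# dtypes is an attribute (no parentheses): one dtype per column.
column_types = toy.dtypes
print(column_types)

# True only if every column holds a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)
print(all_numeric)
```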
Definition of target variable:
target = df.SrchVhcl.values
# Checking the distribution of the target variable.
df.SrchVhcl.value_counts()
There is a huge difference between the number of events and non-events, which can cause issues in modeling. We will address this imbalance later.
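To put a number on such an imbalance, it can help to look at shares rather than raw counts. The sketch below uses a made-up target column (`toy_target` is hypothetical, not the real data) and the `normalize=True` option of value_counts.

```python
import pandas as pd

# Hypothetical target: 1 = vehicle searched (event), 0 = not searched.
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="SrchVhcl")

# normalize=True returns the share of each class instead of the raw count.
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```

On the real data, the same call on df.SrchVhcl would show how rare the events are.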
The dataset contains only numeric variables, which is good for modeling, but many of them are categorical attributes (the numbers indicate categories).
In the following part, we will identify those and create dummy variables from them. First, we will drop the target variable from the training set, along with the variable "Result of Stop" (RsltStop). The result of a stop is usually determined after the vehicle was searched, so we should not use it to predict the search.
train_set = df.drop(["SrchVhcl", "RsltStop"], axis=1)

# Treat every column with fewer than 15 distinct values as categorical.
discrete = []
for feat in train_set.columns:
    if len(train_set[feat].value_counts()) < 15:
        discrete.append(feat)
discrete
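The same selection can also be written without an explicit loop. The following is only a sketch on a made-up frame: `toy`, `max_levels` and `discrete_toy` are hypothetical names, and the threshold of 3 is chosen to fit the tiny example (the tutorial uses 15).

```python
import pandas as pd

# Hypothetical frame: one binary flag and one continuous-looking column.
toy = pd.DataFrame({
    "OffMale": [0, 1, 1, 0],
    "DrvAge": [19, 23, 45, 67],
})

max_levels = 3  # hypothetical threshold for this toy example

# nunique() counts the distinct non-missing values in each column.
unique_counts = toy.nunique()
discrete_toy = unique_counts[unique_counts < max_levels].index.tolist()
print(discrete_toy)
```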
We will change them into dummy variables:
# checking the missing values
train_set.count()

train_set_tr = pd.get_dummies(train_set, columns=discrete)
train_set_tr.head(2)
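To see what get_dummies does to a single categorical column, consider this small sketch. The frame `toy` and its values are made up; only the column names come from the dataset description.

```python
import pandas as pd

# Hypothetical frame: RsnStop is a category code, DrvAge is a true number.
toy = pd.DataFrame({
    "RsnStop": [1, 5, 5],
    "DrvAge": [22, 40, 63],
})

# Each distinct value of RsnStop becomes its own indicator column;
# columns not listed in `columns` are passed through unchanged.
toy_dummies = pd.get_dummies(toy, columns=["RsnStop"])
print(toy_dummies)
```

Depending on the pandas version, the indicator columns hold either 0/1 integers or booleans.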
Now that we have prepared our training set, we can create a Machine Learning model. Because the target variable is categorical, we will use a classifier.
import numpy as np

y = target
# Converting the features (target already dropped) to a NumPy array.
# as_matrix() was removed in pandas 1.0; to_numpy(dtype=float) also
# turns boolean dummy columns into numbers.
X = train_set_tr.to_numpy(dtype=float)

# Splitting the dataset into training and test set.
# The test set is used to evaluate the model on data it was not trained on.
from sklearn.model_selection import train_test_split
X_train_temp, X_test, y_train_temp, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# ======================= keeping the same amount of events and non-events in the training set
# samples of events and non-events
y_events = y_train_temp[y_train_temp == 1]
X_events = X_train_temp[y_train_temp == 1, :]
y_non_events = y_train_temp[y_train_temp == 0]
X_non_events = X_train_temp[y_train_temp == 0, :]

print("Number of events in the trainset is", len(y_events))
print("Number of non-events in the trainset is", len(y_non_events))

# select a random sample of non-events (randint draws row indices with replacement)
random_non_events_index = np.random.randint(X_non_events.shape[0], size=len(y_events))
X_non_events_sample = X_non_events[random_non_events_index, :]
y_non_events_sample = y_non_events[random_non_events_index]

print("Number of obs in the X_non_events_sample is", len(X_non_events_sample))
print("Number of obs in the y_non_events_sample is", len(y_non_events_sample))

# merging events and non-events together into the final trainset with a 50-50 ratio
X_train = np.concatenate((X_non_events_sample, X_events), axis=0)
y_train = np.concatenate((y_non_events_sample, y_events), axis=0)

# !!! IT IS IMPORTANT THAT WE KEEP THE REAL RATIO OF EVENTS AND NON-EVENTS IN THE TEST SET !!!
print(len(X_train))
print(len(X_test))
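Note that np.random.randint draws indices with replacement, so the same non-event row can end up in the sample more than once. If that is not wanted, one possible alternative is to sample without replacement. The sketch below is not the tutorial's method; it runs on made-up arrays, and all names in it (`X_toy`, `y_toy`, `rng`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 10 rows, 2 features, 3 events (label 1).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])

event_idx = np.flatnonzero(y_toy == 1)
non_event_idx = np.flatnonzero(y_toy == 0)

# replace=False guarantees that each non-event row is picked at most once.
sampled_non_event_idx = rng.choice(non_event_idx, size=len(event_idx), replace=False)

# Balanced 50-50 training set built from the selected row indices.
balanced_idx = np.concatenate([sampled_non_event_idx, event_idx])
X_balanced = X_toy[balanced_idx]
y_balanced = y_toy[balanced_idx]
print(len(y_balanced), int(y_balanced.sum()))
```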
# import RandomForest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, min_samples_split=2, max_depth=3, n_jobs=-1)
# Fit it to the training data
rf.fit(X_train, y_train)
There are plenty of metrics that we can use to evaluate classification models. For this tutorial, we will use a simple confusion matrix.
# Simple evaluation using a confusion matrix
test = pd.DataFrame([rf.predict(X_test), y_test]).transpose()
test.columns = ["prediction", "real"]
test["count"] = 1
test.groupby(["prediction", "real"]).count()
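sklearn also ships a ready-made helper for this, sklearn.metrics.confusion_matrix. The sketch below applies it to made-up label arrays (`y_real` and `y_pred` are hypothetical, not model output) to show how the four cells are laid out.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical real labels and predictions, for illustration only.
y_real = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0])

# Rows are the real classes, columns the predicted classes (labels sorted: 0, 1).
cm = confusion_matrix(y_real, y_pred)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

# Share of real events that the predictions caught.
recall = tp / (tp + fn)
print(cm)
print(recall)
```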
In our predictions, the ratio of searches is approximately three times higher than in the overall sample (it may vary with each fit of the model). There is still room for improvement.
You can tune the feature transformation and selection, try different modeling techniques, and play around with the cut-off (0.5 by default). Let us know what score you achieved!
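One way to play with the cut-off is to take the class probabilities from predict_proba and apply your own threshold. The sketch below is only an illustration: it uses synthetic data from make_classification, and the model `rf_toy` and the cut-off value of 0.7 are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the traffic stop dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

rf_toy = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
rf_toy.fit(X_toy, y_toy)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba_event = rf_toy.predict_proba(X_toy)[:, 1]

cut_off = 0.7  # hypothetical value; raise it to predict fewer events
pred_strict = (proba_event >= cut_off).astype(int)
pred_default = (proba_event >= 0.5).astype(int)

print(pred_default.sum(), pred_strict.sum())
```

A higher cut-off can never flag more observations as events than a lower one, so it trades missed events for fewer false alarms.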
The last thing we will check in this tutorial is feature importance. What are the features that have the biggest influence on our target variable?
# Feature names from the DataFrame
feats = list(train_set_tr.columns)
# Values for feature importances
imp = list(rf.feature_importances_)
# Transforming importances into a pandas DataFrame
pd.DataFrame([feats, imp]).transpose().sort_values(1, ascending=False).head(5)
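An alternative layout is a pandas Series indexed by feature name, which keeps the importances numeric and sorts by value directly. This is a sketch on synthetic data; `feature_names`, `rf_toy` and `importances` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with four hypothetical feature names.
feature_names = ["f0", "f1", "f2", "f3"]
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

rf_toy = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
rf_toy.fit(X_toy, y_toy)

# One importance per feature, sorted from most to least important.
importances = pd.Series(rf_toy.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```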