In this week's tutorial, we will show you how to approach classification problems with simple Machine Learning algorithms in Python, using pandas and sklearn. These two packages are state-of-the-art tools for data wrangling and Machine Learning in Python.


Problem

We will use a dataset with 79,885 reported traffic stops for a large Southeastern city in the United States in 2016. Using a classification algorithm, we will try to predict whether the officer searched the vehicle or not. The dataset can be downloaded at https://drive.google.com/file/d/0Bz9_0VdXvv9ba1NYTGcyZGNiT1U/view?usp=sharing

import pandas as pd


LOADING DATA

  • Using pandas' read_csv, we can load the data directly into a pandas DataFrame:
data_path = ""
df = pd.read_csv(data_path + "trafficstop.csv", sep=",")
df.head(2)

Description: Records from all 79,885 reported traffic stops for a large Southeastern city in 2016.

VARIABLES/NAMES

  • Month 1-12 (Month)
  • Reason for Stop (RsnStop)
  • Officer Race (OffRace)
  • Officer Male (OffMale)
  • Officer Years Service (OffYrsSrv)
  • Driver Race (DrvRace)
  • Driver Hispanic (DrvHisp)
  • Driver Male (DrvMale)
  • Driver Age (DrvAge)
  • Search Vehicle (SrchVhcl)
  • Result of Stop (RsltStop)

    Reasons for Stop:

  • 1=CheckPoint
  • 2=DWI
  • 3=Investigation
  • 4=Other
  • 5=Safe Movement
  • 6=SeatBelt
  • 7=Speeding
  • 8=StopLight/Sign
  • 9=VehicleMovement
  • 10=VehicleRegistry

    Officer race:

  • Blank=NotSpecified
  • 1=AmericanIndian/NativeAlaska/Hawaii
  • 2=Asian/PacificIslander
  • 3=Black/AfricanAmerican
  • 4=Hispanic/Latino
  • 5=White

    Driver Race:

  • 1=Asian
  • 2=Black
  • 3=NativeAmerican
  • 4=Other/Unknown
  • 5=White

    Result of Stop:

  • 1=NoActionTaken
  • 2=VerbalWarning
  • 3=WrittenWarning
  • 4=Citation
  • 5=Arrest

To check the types of the variables, we call the dtypes attribute of the DataFrame object:

df.dtypes


DATA PROCESSING

Definition of target variable:

target = df.SrchVhcl.values

# checking distribution of target variable.
df.SrchVhcl.value_counts()

We have a huge imbalance between the number of events (searches) and non-events, which can cause issues during modeling; we will address this later.
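A quick way to quantify this imbalance is to compute the event rate directly (a minimal sketch; since SrchVhcl is coded 0/1, its mean is the share of stops with a search):

# share of stops in which the vehicle was searched
event_rate = df.SrchVhcl.mean()
print("Event rate: {:.2%}".format(event_rate))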

Only numeric variables are in the dataset, which is good for modeling, but many of them are actually categorical attributes (the numbers indicate categories).

In the following part, we will identify those and create dummy variables from them. First, we will drop the target variable from the train set, as well as the variable "result of stop" (RsltStop): the result of a stop usually happens after the vehicle was searched, so we shouldn't use it to predict the search.

train_set = df.drop(["SrchVhcl","RsltStop"], axis = 1)

discrete = []
for feat in train_set.columns:
    if len(train_set[feat].value_counts()) < 15:
        discrete.append(feat)
discrete

We will change them into dummy variables:

# checking for missing values (count() returns non-null counts per column)
train_set.count()
# one-hot encode the categorical columns
train_set_tr = pd.get_dummies(train_set, columns=discrete)
train_set_tr.head(2)
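If the count() check above reveals columns with missing values, one simple option (our own suggestion; the original tutorial does not impute anything) is to fill numeric columns with their median before calling get_dummies:

# hypothetical imputation step: fill missing numeric values with the column median
for col in train_set.select_dtypes(include="number").columns:
    train_set[col] = train_set[col].fillna(train_set[col].median())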


MODELING

After we have prepared our training set, we can create a Machine Learning model. In this case, we will use a classifier, since we are predicting a categorical target variable.

import numpy as np

y = target
# Converting the train set into a NumPy matrix (the target was already dropped earlier)
X = train_set_tr.values

# Splitting the dataset into training and test set
# The test set is used to evaluate the model on data it was not trained on.
from sklearn.model_selection import train_test_split
X_train_temp, X_test, y_train_temp, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
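Note: since searches are rare, you can optionally pass stratify=y so the test set keeps exactly the original class ratio (an optional tweak, not part of the original tutorial):

# stratified variant of the same split
X_train_temp, X_test, y_train_temp, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)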


#======================= keeping the same amount of events and non-events in my training set
# samples of events and non-events
y_events = y_train_temp[y_train_temp == 1]
X_events = X_train_temp[y_train_temp == 1, :]
y_non_events = y_train_temp[y_train_temp == 0]
X_non_events = X_train_temp[y_train_temp == 0, :]
print("Number of events in the train set is", len(y_events))
print("Number of non-events in the train set is", len(y_non_events))
# select a random sample from the non-events, the same size as the events
random_non_events_index = np.random.randint(X_non_events.shape[0], size=len(y_events))
X_non_events_sample = X_non_events[random_non_events_index, :]
y_non_events_sample = y_non_events[random_non_events_index]

print("Number of obs in X_non_events_sample is", len(X_non_events_sample))
print("Number of obs in y_non_events_sample is", len(y_non_events_sample))

#merging events and non-events together into the final train set with a 50-50 ratio
X_train = np.concatenate((X_non_events_sample, X_events), axis=0)
y_train = np.concatenate((y_non_events_sample, y_events), axis=0)

#!!! IT IS IMPORTANT THAT WE KEEP THE REAL RATIO OF EVENTS AND NON-EVENTS IN THE TEST SET !!!
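# A quick sanity check (our added sketch, not in the original tutorial):
# the test set was split off before undersampling, so it keeps the
# original event ratio, while the balanced train set is now roughly 50-50.
print("Event rate in y_test:", y_test.mean())
print("Event rate in y_train:", y_train.mean())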


print(len(X_train))
print(len(X_test))
# import RandomForest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, min_samples_split=2, max_depth=3, n_jobs=-1)
# Fit it to the training data
rf.fit(X_train, y_train)


EVALUATION

There are plenty of metrics we can use to evaluate classification models. For this tutorial, we will use a simple confusion matrix.

# Simple evaluation using a confusion matrix built with pandas
test = pd.DataFrame([rf.predict(X_test), y_test]).transpose()
test.columns = ["prediction", "real"]
test["count"] = 1

# count the observations in each (prediction, real) combination
test.groupby(["prediction", "real"]).count()
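The same matrix, plus precision and recall per class, can also be produced with sklearn's built-in metrics (a short sketch using sklearn.metrics, which the tutorial above does not use):

from sklearn.metrics import confusion_matrix, classification_report

y_pred = rf.predict(X_test)
# rows correspond to the real classes, columns to the predicted classes
print(confusion_matrix(y_test, y_pred))
# precision, recall and f1-score for each class
print(classification_report(y_test, y_pred))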


Summary

In our predictions, the ratio of searches is approximately three times higher than in the overall sample (it may vary with each fit of the model). There is still room for improvement.

You can tune the feature transformation and selection, try different modeling techniques, and play around with the cut-off (by default, it is 0.5). Let us know what score you achieved!
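For example, to move the cut-off you can score with predict_proba and threshold the probabilities yourself (a minimal sketch; the value 0.7 is purely illustrative):

# probability of the positive class (vehicle searched)
proba = rf.predict_proba(X_test)[:, 1]
# apply a custom cut-off instead of the default 0.5
threshold = 0.7
y_pred_custom = (proba >= threshold).astype(int)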


FEATURE IMPORTANCE

The last thing we will check in this tutorial is feature importance. What are the features that have the biggest influence on our target variable?

# Feature names from the DataFrame
feats = list(train_set_tr.columns)
# Importance values from the fitted model
imp = list(rf.feature_importances_)

# transforming the importances into a pandas DataFrame, sorted from most to least important
pd.DataFrame([feats, imp]).transpose().sort_values(1, ascending=False).head(5)
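An equivalent, slightly more readable variant (our own preference, not in the original) is a pandas Series indexed by feature name:

# importances as a named Series, sorted from most to least important
importances = pd.Series(rf.feature_importances_, index=train_set_tr.columns)
importances.sort_values(ascending=False).head(5)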