This week we will show you how to get started with Machine Learning in Python on a simple regression exercise using pandas and sklearn. These two packages are the de facto standard tools for data wrangling and Machine Learning in Python.

Problem:

We will use a UFO dataset, which consists of 80,332 UFO sightings. Using a regression model we will try to predict the duration of the sightings. The dataset can be downloaded here: https://drive.google.com/file/d/0B2gZvn36c5CmRTJpS3pkUllmX1U/view?usp=sharing

import pandas as pd

LOADING DATA:

  • Using the Pandas read_csv function we can load the data directly into a Pandas data frame
  • We drop the first column, which is an empty column in the .csv file; it is loaded automatically with the name "Unnamed: 0"
data_path = ""
df = pd.read_csv(data_path + "UFO_data.csv", sep=";").drop("Unnamed: 0", axis = 1)
df.head(2)

To check the types of the variables we have, we look at the dtypes attribute of our data frame object:

df.dtypes


DATA PROCESSING

Definition of target variable

target = df.duration_seconds.values

Sklearn doesn't work with object variables, so we need to transform them into numbers. First of all we are going to identify all numeric variables:

numeric = []
for feat in df.columns:
    try:
        df[feat] = df[feat].astype(float)
        numeric.append(feat)
    except ValueError:
        print "Variable %s is object or date" % feat
numeric

There are a lot of things you can do with categorical variables. One of the basic transformations, which is used very often, is to convert them into dummy variables. For this, we can use the Pandas function get_dummies().

# we select some of the categorical variables (those which have a small number of categories)
dummy_df = pd.get_dummies(df[["ast_orbiting_body","precipType"]])

There are plenty of other options for handling categorical variables or text, for example the predefined transformers in sklearn, such as

LabelEncoder:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html 

or OneHotEncoder:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder 

Or more complicated transformations like Word2Vec and Doc2Vec, which turn text into numbers.
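
As a minimal sketch (not part of the pipeline above), this is how one of these transformers could be applied to a single categorical column, reusing the "precipType" column from the get_dummies example:

# encode a single categorical column as integers with LabelEncoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# cast to string first so missing values don't break the encoder
precip_encoded = le.fit_transform(df["precipType"].astype(str))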
 

CREATING A TRAINING SET

# let's merge the two datasets we have on their index values
train_set = df[numeric].merge(dummy_df, left_index = True, right_index = True)
train_set.head()


REPLACEMENT OF MISSING VALUES

Sklearn doesn't work with missing values, so we need to remove or replace them. We remove columns which have more than 50% of their values missing:

train_set = train_set.loc[:, (train_set.count() / float(len(train_set)) > 0.5)]

We will be using the non-linear RandomForest model, so we will replace missing values with -1. This way the model can treat missingness as a separate value. You have to be careful with this though, and shouldn't do it if you plan to use, for example, linear regression.

train_set = train_set.fillna(-1)
train_set.head()
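
If you plan to use a linear model instead, a more common choice is to impute missing values with a column statistic such as the mean. A minimal sketch using sklearn's SimpleImputer (an alternative, not used in the rest of this tutorial, which would be applied instead of the -1 replacement above):

# alternative for linear models: replace missing values with the column mean
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
train_set_mean_imputed = imputer.fit_transform(train_set)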


MODELING

After we have prepared our training set, we can create a Machine Learning model. In this case, we will use a regression model because our target variable is continuous:

y = target
# Dropping the dependent (target) variable from the train set
X = train_set.drop("duration_seconds", axis = 1).values

# Splitting the dataset into training and test set
# Test set is used to evaluate model on the data it was not trained on.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(len(X_train))
print(len(X_test))
# import RandomForest
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, min_samples_split=2, max_depth=10, n_jobs=-1)
# Fit it to the training data
rf.fit(X_train, y_train)


EVALUATION

There are plenty of metrics which we can use to evaluate regression models. For this tutorial we will use simple RMSE:
https://en.wikipedia.org/wiki/Root-mean-square_deviation

import numpy as np
# Simple Evaluation using RMSE
print "RMSE on trainset is ", np.sqrt(np.sum((rf.predict(X_train) - y_train)*(rf.predict(X_train) - y_train))/len(y_train))
print "RMSE on testset is ", np.sqrt(np.sum((rf.predict(X_test) - y_test)*(rf.predict(X_test) - y_test))/len(y_test))


FEATURE IMPORTANCE

The last thing we will check in this tutorial is feature importance. What are the features that have the biggest influence on our target variable?

# Feature names from dataframe
feats = list(train_set.drop("duration_seconds", axis = 1).columns)
# Values for feature importances
imp = list(rf.feature_importances_)

# transforming importances into the pandas DataFrame
pd.DataFrame({"feature": feats, "importance": imp}).sort_values("importance", ascending = False)


CONCLUSION

We hope that this tutorial gives you a basic introduction and overview of how to approach a regression problem in Machine Learning. Like in the previous blog posts, the approach is focused on simplicity, so there is still room for improvement. You can tune the feature transformations, try feature selection, and experiment with different modeling techniques. Let us know what score you achieved!
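
For example, the RandomForest hyperparameters we picked by hand above could be tuned with sklearn's GridSearchCV. A minimal sketch (the parameter grid is only an illustrative guess):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
}
grid = GridSearchCV(RandomForestRegressor(n_jobs=-1), param_grid,
                    scoring="neg_mean_squared_error", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)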