This week we will show you how to get started with Machine Learning in Python through a simple regression exercise using pandas and sklearn. These two packages are the de facto standard for data wrangling and Machine Learning in Python.


We will use a UFO dataset, which consists of 80,332 UFO sightings. Using a regression model, we will try to predict the duration of the sightings. The dataset can be downloaded here:

import pandas as pd


  • Using pandas' read_csv, we can load the data directly into a pandas data frame.
  • We drop the first column, which is empty in the .csv file and is loaded automatically under the name "Unnamed: 0".
data_path = ""
df = pd.read_csv(data_path + "UFO_data.csv", sep=";").drop("Unnamed: 0", axis=1)

To check the types of the variables we have, we can look at the dtypes attribute of our data frame object:

df.dtypes

Definition of the target variable

target = df.duration_seconds.values

sklearn doesn't work with object (string) variables, so we need to transform them into numbers. First of all, we are going to identify all numeric variables:

numeric = []
for feat in df.columns:
    try:
        df[feat] = df[feat].astype(float)
        numeric.append(feat)
    except ValueError:
        print("Variable %s is object or date" % feat)

There are many things you can do with categorical variables. One basic and very common transformation is converting them into dummy variables. For this, we can use the pandas function get_dummies().

# we select some of the categorical variables (those with a small number of categories)
dummy_df = pd.get_dummies(df[["ast_orbiting_body","precipType"]])

There are plenty more options for handling categorical variables or text: for example, predefined transformers in sklearn such as OneHotEncoder, or more complicated transformations like Word2Vec and Doc2Vec, which turn text into numbers.
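To give an idea of what OneHotEncoder does, here is a minimal sketch on toy data (not the UFO dataset; the `colors` array is just an illustrative stand-in):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy single-column input: each row is one categorical observation
colors = np.array([["red"], ["green"], ["red"], ["blue"]])

enc = OneHotEncoder()                        # returns a sparse matrix by default
encoded = enc.fit_transform(colors).toarray()

print(encoded.shape)   # (4, 3): one column per distinct category
```

The categories are sorted alphabetically (blue, green, red), so the first row, "red", becomes [0, 0, 1]. The result is equivalent to what get_dummies() produces, but OneHotEncoder fits into sklearn pipelines and can be reused on new data.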


# let's merge the two datasets we have on the index values
train_set = df[numeric].merge(dummy_df, left_index=True, right_index=True)


sklearn doesn't work with missing values either, so we need to remove or replace them. We remove columns that have more than 50% of their values missing:

train_set = train_set.loc[:, train_set.count() / float(len(train_set)) > 0.5]

We will be using the non-linear classifier RandomForest so we will replace missing values with -1. This way we know that it is a separate value. You have to be careful with this though and shouldn't do this if you plan to use, for example, linear regression.

train_set = train_set.fillna(-1)


After we have prepared our training set, we can create a Machine Learning model. In this case, we will use a regressor because our target variable is continuous:

y = target
# Dropping the dependent variable from the train set
X = train_set.drop("duration_seconds", axis=1).values

# Splitting the dataset into training and test set
# Test set is used to evaluate model on the data it was not trained on.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(len(X_train))
print(len(X_test))
# import RandomForest
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, min_samples_split=2, max_depth=10, n_jobs=-1)
# Fit it to the training data
rf.fit(X_train, y_train)


There are plenty of metrics we can use to evaluate regression models. For this tutorial, we will use a simple RMSE:

import numpy as np
# Simple Evaluation using RMSE
print("RMSE on trainset is", np.sqrt(np.mean((rf.predict(X_train) - y_train) ** 2)))
print("RMSE on testset is", np.sqrt(np.mean((rf.predict(X_test) - y_test) ** 2)))
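The same RMSE can also be computed with sklearn's built-in mean_squared_error. A minimal sketch on toy arrays (standing in for y_test and the model's predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# toy stand-ins for the true targets and the model predictions
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

# mean_squared_error gives MSE; take the square root for RMSE
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)   # ~1.291
```

Using the library metric instead of a hand-rolled formula avoids subtle bugs and makes it easy to swap in other metrics such as mean_absolute_error.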


The last thing we will check in this tutorial is feature importance. What are the features that have the biggest influence on our target variable?

# Feature names from dataframe
feats = list(train_set.drop("duration_seconds", axis = 1).columns)
# Values for feature importances
imp = list(rf.feature_importances_)

# transforming importances into the pandas DataFrame
pd.DataFrame({"feature": feats, "importance": imp}).sort_values("importance", ascending=False)


We hope this tutorial gives you a basic introduction to, and overview of, how to approach a regression problem in Machine Learning. As in the previous blog posts, the approach is focused on simplicity, so there is still room for improvement: you can tune feature transformation and feature selection, and try different modeling techniques. Let us know what score you achieved!