In this tutorial, I will introduce you to the basics of working with time series in Python. We will use the packages pandas, statsmodels (for some hypothesis testing) and matplotlib (for visualizations).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# use this so plots will appear directly in the Jupyter notebook
%matplotlib inline

# set the default size of the plots in your notebook
rcParams['figure.figsize'] = 15, 6
WHAT DO WE NEED TIME SERIES FOR?
There are many reasons why time series are important. For example, a lot of prediction problems involve a time component. We can also use time series as features in supervised learning.
The main difference between time series problems and regular Machine Learning problems is that the observations in a time series are time dependent. Therefore the basic assumption of linear regression, that the observations are independent, isn't valid in this case.
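To make the time dependence concrete, here is a minimal sketch that compares the lag-1 autocorrelation of independent observations with that of a random walk. Both series are synthetic and generated only for illustration (the seed, the length and the variable names are assumptions, not part of the car sales data).

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the car sales series)
rng = np.random.default_rng(0)

# Independent observations: each value is drawn on its own
white_noise = pd.Series(rng.normal(size=500))

# Time-dependent observations: each value builds on the previous one
random_walk = white_noise.cumsum()

# Correlation between each observation and the one right before it
print("white noise, lag-1 autocorrelation:", white_noise.autocorr(lag=1))
print("random walk, lag-1 autocorrelation:", random_walk.autocorr(lag=1))
```

The autocorrelation of the independent series should be close to zero, while the random walk is strongly correlated with its own past, which is exactly what violates the independence assumption.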
You can download the time series from here.
data_path = "path_to_your_file"
data = pd.read_csv(data_path + 'monthly-car-sales-in-quebec-1960.csv', sep=";")
data.head()
We need to drop the last row because it is not a valid observation. We will also simplify the name of our variable because it's too long and has spaces in the name:
# drop the last row, which is not a valid observation
data.drop(data.index[-1], inplace=True)
print(data.dtypes)
# rename the columns to short names without spaces
data.columns = ["Month", "car_sales"]
In order to treat the data as a time series, we have to transform it into a pandas Series and use the column with dates as an index:
ts = pd.Series(data["car_sales"].values, index=pd.to_datetime(data.Month))
ts.head()
Now we can select parts of the time series using the dates in the index:
# specify the entire range:
ts['1960-01-01':'1960-05-01']
# leave one side of ':' empty to start from the beginning (or go to the end) of the series:
ts[:'1960-05-01']
# select all entries from one year:
ts['1960']
To plot the time series we can use a simple command from matplotlib.
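A minimal sketch of such a plotting command is shown below. The helper name `plot_series`, the axis labels and the six placeholder values are assumptions made so the example runs on its own; they are not the real car sales data.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_series(series, title="Monthly car sales"):
    """Plot a pandas Series with a DatetimeIndex as a simple line chart."""
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.plot(series.index, series.values)
    ax.set_title(title)
    ax.set_xlabel("Month")
    ax.set_ylabel("Car sales")
    return ax

# Placeholder values for illustration only (hypothetical, not the real data)
example = pd.Series(
    [10, 12, 9, 14, 15, 13],
    index=pd.date_range("1960-01-01", periods=6, freq="MS"),
)
ax = plot_series(example)
```

With the series from this tutorial, the call would simply be `plot_series(ts)`; `plt.plot(ts)` should give a similar picture in a single line.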
A time series consists of four main components:
- Level: The baseline value for the series if it were a straight line
- Trend: The optional and often linear increasing or decreasing behavior of the series over time
- Seasonality: The optional repeating patterns or cycles of behavior over time
- Noise: The variability in the observations that cannot be explained by Trend and Seasonality
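The four components above can be sketched by building a synthetic series as their sum. This assumes an additive model, and every number here (level of 100, slope of 1.5, seasonal amplitude of 10, noise scale of 2) is made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series, four years long (illustrative values only)
rng = np.random.default_rng(0)
n_months = 48
t = np.arange(n_months)
index = pd.date_range("2000-01-01", periods=n_months, freq="MS")

level_part = np.full(n_months, 100.0)                # constant baseline
trend_part = 1.5 * t                                 # linear increase over time
seasonal_part = 10.0 * np.sin(2 * np.pi * t / 12)    # pattern repeating every 12 months
noise_part = rng.normal(0.0, 2.0, size=n_months)     # unexplained variability

# Additive model: the observed series is the sum of the four parts
synthetic = pd.Series(level_part + trend_part + seasonal_part + noise_part, index=index)
print(synthetic.head())
```

A decomposition works in the opposite direction: it starts from the observed series and tries to estimate these parts.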
We can use a function from the statsmodels package to decompose our time series into the parts mentioned above:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(ts)
trend = decomposition.trend
seasonal = decomposition.seasonal
noise = decomposition.resid

plt.figure(figsize=(15, 15))
plt.subplot(411)
plt.plot(ts, label='Original TS')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(noise, label='Noise')
plt.legend(loc='best')
Stationarity is one of the most important properties of a time series. It means that the statistical properties of the series, such as its mean and variance, do not change over time. In practice, we check this by looking at the rolling average and the rolling standard deviation: at any point in time, we take the average and the standard deviation of the last year, i.e. the last 12 months. The length of this window can differ from use case to use case.
HOW TO CHECK STATIONARITY OF A TIME SERIES?
There are two main ways to check the stationarity of a time series:
- Visualization Of Rolling Statistics: We can plot the moving average and the moving standard deviation and see whether they vary with time.
- Hypothesis Testing - Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. The null hypothesis is that the time series is non-stationary. The test results consist of a test statistic and critical values for different significance levels. If the test statistic is less than the critical value, we can reject the null hypothesis and say that the series is stationary.
from statsmodels.tsa.stattools import adfuller

# Determine rolling statistics
rol_mean = ts.rolling(window=12).mean()
rol_std = ts.rolling(window=12).std()

# Plot rolling statistics
orig = plt.plot(ts, color='blue', label='Original')
mean = plt.plot(rol_mean, color='red', label='Rolling Mean')
std = plt.plot(rol_std, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Moving mean & std')
plt.show(block=False)
# Perform Dickey-Fuller test
test = adfuller(ts, autolag='AIC')
print(test)

# make the results of the test more readable
dfoutput = pd.Series(test[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
# the critical values are stored as a dictionary in the fifth element of the result
for key, value in test[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
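The decision rule of the Dickey-Fuller test described above can be written as a small helper. This is only a sketch: the function name `interpret_adf` is hypothetical, and the test statistic and critical values in the example are made-up numbers, not results from the car sales data.

```python
def interpret_adf(test_statistic, critical_values):
    """Return, for each significance level, whether the null hypothesis
    of non-stationarity can be rejected (test statistic below critical value)."""
    return {
        level: test_statistic < critical_value
        for level, critical_value in critical_values.items()
    }

# Made-up numbers for illustration only
example_critical_values = {"1%": -3.5, "5%": -2.9, "10%": -2.6}
decision = interpret_adf(-3.1, example_critical_values)
print(decision)  # {'1%': False, '5%': True, '10%': True}
```

In this made-up example, the null hypothesis is rejected at the 5% and 10% levels but not at the 1% level. With the result of `adfuller` from the code above, the call would be `interpret_adf(test[0], test[4])`.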
HOW DO YOU MAKE A TIME SERIES STATIONARY?
When dealing with time series, stationarity of the series is often an assumption that has to be made. However, most time series in practice are not stationary, so we need to figure out how to transform them.
In theory, there are two main reasons behind non-stationarity of a time series:
- Trend: A change in the mean of the time series over time. For example, as we have seen in this case, the number of cars sold was growing over time.
- Seasonality: Variations that occur regularly at specific times. In our case, people tend to buy cars in specific months every year.
This leads us to the following underlying assumption: By extracting the estimated trend and seasonality from the non-stationary time series we can transform it to a stationary one.
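The idea can be sketched on a synthetic series where trend and seasonality are known to be present. This is a simplified illustration, not the method used by `seasonal_decompose`: the trend is estimated with a centered 12-month rolling mean and the seasonality with the average of each calendar month. All numbers and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: level + linear trend + yearly pattern + noise
rng = np.random.default_rng(42)
n_months = 120
t = np.arange(n_months)
index = pd.date_range("2000-01-01", periods=n_months, freq="MS")
series = pd.Series(
    100 + 2.0 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, n_months),
    index=index,
)

# Estimate the trend with a centered 12-month rolling mean
trend_est = series.rolling(window=12, center=True).mean()

# Remove the trend; the window edges have no estimate, so drop them
detrended = (series - trend_est).dropna()

# Estimate the seasonality as the mean of each calendar month
seasonal_est = detrended.groupby(detrended.index.month).transform("mean")

# What remains should fluctuate around zero with a stable spread
residual = detrended - seasonal_est
print("std of original series:", series.std())
print("std of residual:", residual.std())
```

The residual should vary far less than the original series, which is what we expect once the trend and the seasonal pattern have been removed.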
We will use the trend and seasonal components from the decomposition of the time series performed earlier:
ts_moving_avg_diff = ts - trend - seasonal
ts_moving_avg_diff.head(12)
# the trend has no estimate at the edges of the series, so drop the missing values
ts_moving_avg_diff.dropna(inplace=True)

# compute rolling mean and rolling standard deviation of the new time series
rol_mean = ts_moving_avg_diff.rolling(window=12).mean()
rol_std = ts_moving_avg_diff.rolling(window=12).std()

# plot the new time series to check the stationarity
orig = plt.plot(ts_moving_avg_diff, color='blue', label='Original')
mean = plt.plot(rol_mean, color='red', label='Rolling Mean')
std = plt.plot(rol_std, color='black', label='Rolling Std')
We can also check the stationarity once more with the Dickey-Fuller test:
# Perform Dickey-Fuller test
test = adfuller(ts_moving_avg_diff, autolag='AIC')
print(test)

dfoutput = pd.Series(test[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
# the critical values are stored as a dictionary in the fifth element of the result
for key, value in test[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
Now we can see that the time series is stationary.
This was a basic overview of how to work with time series in Python. You can try to download different time series from here and play around with those. In the future, we will add a tutorial on how to forecast a time series and how to do some more complex stuff. In the meantime, you can check out our courses for more Data Science education.