In part 2 of this video series, learn how to build an ARIMA time series model using Python’s statsmodels package and predict or forecast N timestamps ahead into the future. Now that we have differenced our data to make it more stationary, we need to determine the Autoregressive (AR) and Moving Average (MA) terms for our model. To do this, we look at the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots. This series is intended for **intermediate** and **advanced** users. If you are looking to further your knowledge in data science, why not check out our data science bootcamp.

Hi, welcome back to this Data Science Dojo video tutorial series on time series.

In part one, we left off at differencing our data to make it more stationary,

as this is a requirement of many time series models. In part two, we’ll

take our differenced data, start modeling on it, and forecast into the future.

So, what we need to do now is look at the autocorrelation function and

partial autocorrelation function plots, or ACF and PACF for short.

So these plots help determine the number of autoregressive terms and moving

average terms in an autoregressive moving average model, or to spot seasonality

or periodic trends. So I’ll explain what I mean by autoregressive and moving

average. So an autoregressive model basically forecasts the next timestamp’s

value by regressing over the previous values, and a moving average model forecasts

the next timestamp’s value from a moving average of the previous forecast errors. So the autoregressive

integrated moving average (ARIMA) model, which is the one we’re going to

use, is useful for non-stationary data as it allows us to difference the data, plus

its seasonal variant has an additional seasonal differencing parameter for seasonal non-stationary data.

So first let’s produce these plots and then I’ll explain how to interpret them.

So, the first plot we’re going to produce is going to be the ACF plot,

and we’re going to produce a PACF plot as well.

Okay, let’s have a look at these.

Okay, so the ACF and the PACF plots include a 95% confidence interval band.

So anything outside this kind of shaded band here

is a statistically significant correlation. So if we see a significant spike at lag X

in the ACF, that helps us determine the number of moving average terms, and if we

see a significant spike at lag X in the PACF, that helps us determine the number

of autoregressive terms. So here in the ACF plot we see a spike at about one here.

So that will, in turn, help us determine the number of moving average terms, and

if we look at the PACF, we can see two major spikes here, so one at about

five, and one I think at about thirteen. So that will help us determine the

number of AR terms. For now we’re just going to go ahead with a model that only

includes about five AR terms and see how that goes.

So, now that we have looked at

our ACF and PACF plots, we can now build our ARIMA model, which takes into account

the number of terms that we need to use. And just keep in mind this model’s also

going to infer the frequency, so we need to make sure there are no gaps between our date

times before we start modeling.

Okay, so let’s call this ARMA1model.

And I’m going to apply our ARIMA model.

And we’re gonna give it our data.

And the order of terms is gonna be our ARMA terms and differencing.

So, first we’ll put in number of AR terms here.

Two rounds of differences, or two sets of differences. And one MA term here.

And I’m going to put an option here, and specify transparams as False.

If you set it as True, it ensures that things are kept stationary, but

you’ll see why I have to set this as false later on in the video tutorial

series, when we talk about issues with our model.

And we’re going to print the summary of our model, so we can get a few details

of the model here, so let’s do that.

I’ll explain how to interpret the summary as well.

Okay, let’s go ahead and run this.

So we’ve had a look at our autocorrelation and partial

autocorrelation and now we’ve built our model.

Alright, so this shows us a summary of our model here.

We want to probably look at the p-values for the coefficients

of our terms here, so our AR terms and our MA terms here.

So looking at this is useful because if the p-value for, say, an AR or an MA

coefficient is greater than 0.05, which is our significance level,

a kind of cut-off mark to determine whether it’s significant or not,

then we can say it’s probably not a significant enough term to keep in the model.

So looking at this, we might want to remodel and include only this

AR or MA term here, as the other ones might not be necessary.

But for the purpose of demonstration, let’s go ahead, and then we’ll

discuss issues with our model later on.

The next step is, we want to predict the next 5 hours, or the next 5 timestamps ahead,

which is our test holdout set.

So I’ll comment these out so they’re not too much of a distraction.

And we’ll give it our model.

And we use the predict function here.

And I’m going to give it the time stamps from the last time stamp, which was basically 6:00 p.m.

on the 9th of February 2019. So I’m going to take the time stamps into the future

from the last time stamp, which is from 7:00 p.m. to 11:00 p.m., or the five

time stamps ahead, so let’s do this.

I’m also going to set typ to ‘levels’, and you’ll see later on why we need to specify that.

I’m also going to print these predictions, obviously.

Alright, so here are our forecasts, or our predictions, for the next five hours ahead.

We can kind of see it going in this sort of downward trajectory here, so it

predicts that sentiment is likely to turn in a kind of bad direction.

But what we need to keep in mind is, with time series we need to

back-transform our differenced predicted values to the scale of our

original actual values. This is done automatically when predicting, since

we specified typ as ‘levels’ here.

We kind of wanted to predict on the

original scale, not on the differenced kind of scale.

Nevertheless, we’re going to demonstrate how to de-transform, say, two rounds of differencing

using cumulative sums, when you’ve been given the original data. So the first step in that

is we want to basically get the second round of differences back to the first

round of differences, and then take that differenced data and get it back to

the original. So it’s kind of a two-step process.

So let’s go ahead and demonstrate this.

So, as I said, we want to get our second round of differences

back to the first round. So I’ll just call this undiff1.

Take our second round of differences.

And we’re going to fill in any missing values just so they don’t cause us any problems.

And the next step, we want to get that

undifferenced data back to the original. So this is undiff2.

Once again, fill in any missing values.
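The two-step cumulative-sum de-transformation described above can be sketched like this. The values are made up, and the seeding trick assumes the first real value of each differenced series is available, as it is when you hold the original data:

```python
import pandas as pd

# Made-up stand-in for the original series
original = pd.Series([0.10, 0.25, 0.15, 0.30, 0.20, 0.35])
diff1 = original.diff()   # first round of differencing
diff2 = diff1.diff()      # second round of differencing

# Step 1: second round of differences back to the first round.
# Seed the cumulative sum with the first real value of diff1;
# the leading NaN is skipped by cumsum, so it causes no problems.
undiff1 = diff2.copy()
undiff1.iloc[1] = diff1.iloc[1]
undiff1 = undiff1.cumsum()

# Step 2: first round of differences back to the original scale.
# Seed with the first value of the original series.
undiff2 = undiff1.copy()
undiff2.iloc[0] = original.iloc[0]
undiff2 = undiff2.cumsum()

# Up to floating-point noise, we recover the original series
print(undiff2.round(6).equals(original.round(6)))  # → True
```

The final comparison after rounding to six decimal places is exactly the sanity check performed next.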

Okay, now we can compare these.

There are going to be small differences between our original data and our undifferenced data,

but we’re going to round to six places after the decimal point.

I mean, our values only go to six places after the decimal point anyway,

so they’re not very big differences to care about; they’re essentially the same

when we round to six places past the decimal point, so let’s have a look at this.

And we’ll just look at our original data first.

Rounded to about six places after the decimal point.

We want to see if it’s equal to our undifferenced data,

also rounded to six places after the decimal point.

And just for our own sanity check, we can look at the first few of the original values

and compare them with the undifferenced values to see if they’re on par.

Let’s have a look at this.

Okay, cool. So, it’s come back as true, as in there are no differences,

or no real differences, between them. So our undifferenced data and our

original values are on par. And you can have your own kind of sanity check here

to make sure, say, the first few examples are definitely the same.

Now that we have modeled the data and made our predictions, we’ll compare our

predictions against the actual values in part three.

Thanks for watching. If you found this video tutorial useful, give us a like. Otherwise, you can check out our

other videos at tutorials.datasciencedojo.com

**Watch Part 1 Here:**

Read and Transform your data: Time Series in Python

**Watch Part 3 Here:**

Mean Absolute Error for Forecast Evaluation: Time Series in Python

**Code, R & Python Script Repository**

**Packages Used:**

pandas

matplotlib

StatsModels

statistics

**More Data Science Material:**

[Video] Getting started with Python and R for Data Science

[Video] Web scraping in Python and Beautiful Soup

[Blog] Supercharge your Python Plots with Zero Extra Code


Thank you so much! 🙂 It is a very nice explanation.

Thanks! Glad you found it useful

Thank you Rebecca. Great videos indeed. One observation: for some reason I was expecting you to use hourly_sentiment_series_diff2 instead of hourly_sentiment_series in ARMA1model (line 38, video 2). Or the previous differencing was only to come up with correct orders for the model? Sergey

Thanks! Within ARIMA() you specify the order of number of AR, number of differences, and number of MA. So the order in this example is order=(5,2,1), with 2 sets of differences on the data. We use hourly_sentiment_series_diff2 for plotting to see if this helped make the data more stationary. We first see if 1 set of differences is enough or not before applying differences again. If you need to do this 2 times then you set this to 2 in the order(). If 1 set of diffs does a fairly good job, then set it to 1 in order(). This will difference the data the same way we differenced it for the purpose of plotting and checking if how it looks.