Part 1 : **Stationarity and ARMA** (<- you are here)

Part 2 : Non-stationarity and Seasonality - ARIMA and Seasonal ARIMA

# In the beginning ...

Time-series analysis comes up frequently in practical data analysis, and the skill can help any developer solve many real problems.

For example, almost all marketing or social analysis involves some kind of seasonality. Even your application's access logs are likely to show patterns such as "peak times" in access frequency.

Time-series analysis rests on a mathematical background. Throughout this post, however, I focus on explaining why things are needed and on building your intuition quickly. (For many results, I state only the conclusion without proof.) I also avoid difficult terminology where possible, but note that some basic mathematical knowledge is needed for this reading.

Before starting, a few notes.

First, time-series analysis has several approaches, such as spectrum analysis (Fourier transform), exponential smoothing, and so on. Here we discuss **ARIMA**, a mathematical model that represents many aspects of real time-series. (The famous GARCH model is also based on ARIMA.) We use the "forecast" package in R for ARIMA analysis.

Second, there is far too much background to cover at once. In this post I therefore focus on the fundamentals and basic ideas of ARIMA under the assumption of a stationary process, and in the next post we will discuss more practical topics, including non-stationarity.

Note : In this post we ignore the meaning of the time index, such as "day of week" or "quarter". (We simply use 1, 2, 3, ... as the time axis.) If you need to handle such time signatures, the timekit package can help you analyze time-series.

# ARMA Modeling

As I mentioned above, here we assume the stationarity of time-series data.

"Stationarity" means: the mean (expected value) is constant and does not depend on time, and the autocovariance depends only on the time difference, not on t itself. That is, a stationary process {y_t} satisfies the following restrictions:

E[y_t] = μ (constant for all t)

Cov(y_t, y_{t-k}) = γ_k (depends only on the lag k)

(One important note: real time-series data may be measured only once, so the observed values are fixed. But here we consider a mathematical model to fit our time-series data, and therefore we suppose that y_t is not a set of fixed values but a sequence of random variables in a probability space.)

Note : Strictly speaking, the above definition is called a "weak-sense stationary process". This definition is used in practical statistical analysis more often than "strict-sense stationarity". When we use the term "stationarity" throughout these posts, it means weak-sense stationarity.

See "Stationary process (Wikipedia)".

Intuitively, stationarity implies that the stochastic behavior of the values is the same at every time t. As a brief example, a process whose values keep increasing as time passes is not a stationary process, because its mean is not constant and depends on t. (In such a case, the differences between consecutive values might follow a stationary model; we discuss this case in the next post.)
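As a small simulated illustration (my own, not from the post's data): a random walk is non-stationary because its behavior changes with t, while its first differences are exactly the white noise and are therefore stationary.

```r
# A random walk y_t = y_{t-1} + e_t is non-stationary,
# but its first differences y_t - y_{t-1} = e_t are stationary.
set.seed(1)
e <- rnorm(1000)
walk <- cumsum(e)    # random walk: spread grows with t
diffs <- diff(walk)  # first differences: recover the white noise

# The walk's spread differs greatly between halves of the series,
# while the differences behave the same everywhere.
var(walk[1:500]); var(walk[501:1000])
var(diffs[1:500]); var(diffs[501:1000])
```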

Now let us build a mathematical model to fit our time-series data under the assumption of stationarity, using stock price data as an example. (Real stock prices are generally non-stationary, but here we suppose the series is stationary.)

As you know, a stock price is influenced by the previous day's price: if the close price on day t - 1 is y dollars, the price on day t starts from y dollars.

That is, we can assume the price is determined by the following model, where c and φ1 are constants and ε_t is randomly distributed noise called "white noise":

y_t = c + φ1 y_{t-1} + ε_t

(For the white noise ε_t, the mean is 0 and the autocovariance at every non-zero lag is 0. To make things easy, it is okay to suppose ε_t ~ N(0, σ²).)

Note : If φ1 is negative, y_t and y_{t-1} are negatively correlated. (When y_{t-1} is large, y_t tends to be negatively large.) In that case, the graph will look jagged.

This model is called **AR (Autoregressive)**, and AR(p) is defined in general as follows. (The stock model above is therefore AR(1).)

y_t = c + φ1 y_{t-1} + φ2 y_{t-2} + ... + φp y_{t-p} + ε_t

The AR model can represent many aspects of cyclic behavior in stationary series.

For example, the following R program plots an AR(2) model with φ1 = 1.8 and φ2 = -0.9. As you can see, it draws a line with clear cycles.

```
# Simulate AR(2): y_t = cons + phi1*y_{t-1} + phi2*y_{t-2} + e_t
data_count <- 300
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10  # white noise with sd = 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes = c(1, 2),
  Values = c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
plot(y)
```

Note : You can simply use `arima.sim()` (in the standard "stats" package) for ARIMA (including ARMA) simulation, but here we build the series by hand for your understanding.

Remember that not every AR expression yields a stationary process; an AR model is stationary only if certain conditions hold.

In AR(1), the process is stationary if and only if |φ1| < 1. In AR(2), it is stationary if and only if φ1 + φ2 < 1, φ2 - φ1 < 1, and |φ2| < 1.
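These conditions are special cases of a general criterion: an AR(p) process is stationary when all roots of its characteristic polynomial 1 - φ1 z - ... - φp z^p lie outside the unit circle. A minimal sketch (the helper name `ar_is_stationary` is mine, not from any package):

```r
# General stationarity check for AR(p): all roots of the characteristic
# polynomial 1 - phi1*z - ... - phip*z^p must lie outside the unit circle.
ar_is_stationary <- function(phi) {
  roots <- polyroot(c(1, -phi))
  all(Mod(roots) > 1)
}

ar_is_stationary(c(0.5))        # AR(1) with |phi1| < 1  -> TRUE
ar_is_stationary(c(1.1))        # AR(1) with |phi1| > 1  -> FALSE
ar_is_stationary(c(1.8, -0.9))  # the AR(2) used in this post -> TRUE
ar_is_stationary(c(1.8, 0.9))   # violates the AR(2) conditions -> FALSE
```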

Now let's consider some other model.

For a stock price, the value might also be affected by the difference between the open and close prices on the previous day. If there was a rapid decline in the stock price on day t, the price might rebound on day t + 1.

In such a case, we can use the following model, called **MA (Moving Average)**:

y_t = μ + ε_t + θ1 ε_{t-1}

The constant part μ (the mean) plays the same role as c in AR, but MA depends on the previous noise ε_{t-1} instead of the previous value y_{t-1}.

As with AR, the general MA(q) expression is given as follows:

y_t = μ + ε_t + θ1 ε_{t-1} + θ2 ε_{t-2} + ... + θq ε_{t-q}

It is known that MA(q) is always a stationary process.

Let's see how the graph changes with different θ values.

As you can see below, the series transitions smoothly when the θ values are positive, and becomes jagged when a θ is negative.

Normal Distribution (no MA: θ1 = θ2 = θ3 = 0)

MA(3) with θ1 = θ2 = θ3 = 1

MA(2) with θ1 = -1, θ2 = 1

```
data_count <- 300
mu <- 70
# sample1 : no MA (white noise around the mean)
#theta1 <- 0
#theta2 <- 0
#theta3 <- 0
# sample2 : MA(3) with all thetas = 1
theta1 <- 1
theta2 <- 1
theta3 <- 1
# sample3 : MA(2) with theta1 = -1, theta2 = 1
#theta1 <- -1
#theta2 <- 1
#theta3 <- 0
e <- rnorm(data_count) * 10
y1 <- mu + e[1]
y2 <- mu + e[2] + (theta1 * e[1])
y3 <- mu + e[3] + (theta1 * e[2]) + (theta2 * e[1])
y <- data.frame(
  Indexes = c(1, 2, 3),
  Values = c(y1, y2, y3))
for (i in 4:data_count) {
  yi <- mu + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2]) + (theta3 * e[i-3])
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
plot(y, type="l")
```

Finally, the combined model is called the **ARMA** model, and it is given as follows:

y_t = c + φ1 y_{t-1} + ... + φp y_{t-p} + ε_t + θ1 ε_{t-1} + ... + θq ε_{t-q}

As you can see, the former part is AR(p) and the latter is MA(q).

# Estimate - Choose your model and parameters

Now we consider the prediction of time-series data. First we select (identify) the appropriate model and parameters (p, q, φ_i, θ_i, etc.).

When you select the model (AR, MA, ARMA) and orders (p, q), you should focus on the characteristics of each model. Each model has different characteristics in its autocorrelation (ACF) and partial autocorrelation (PACF).

The autocorrelation (Corr) is the autocovariance divided by the variance, as follows. Dividing by the variance makes it independent of the unit (scale) of the values:

Corr(k) = Cov(y_t, y_{t-k}) / Var(y_t) = γ_k / γ_0

As noted above, Corr depends only on the lag k (not on t), because under stationarity Cov depends only on k. That is, the autocorrelation measures the effect (correlation) of the lag k.
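As a sketch of this definition (my own illustration, using `arima.sim()` to generate a sample and the same biased estimator that `acf()` uses internally):

```r
# Sample autocorrelation at lag k from the definition:
# Corr(k) = Cov(y_t, y_{t-k}) / Var(y_t)
set.seed(1)
y <- arima.sim(model = list(ar = 0.7), n = 5000)

acf_manual <- function(y, k) {
  n <- length(y)
  m <- mean(y)
  # acf() uses the biased estimator (divide by n), so we do the same
  cov_k <- sum((y[(k + 1):n] - m) * (y[1:(n - k)] - m)) / n
  var_0 <- sum((y - m)^2) / n
  cov_k / var_0
}

acf_manual(y, 1)                          # for AR(1), close to phi1 = 0.7
acf(y, lag.max = 5, plot = FALSE)$acf[2]  # the same value from acf()
```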

The partial autocorrelation, on the other hand, measures the effect of the lag k after eliminating the effects of lags 1, 2, ..., k-1. It is obtained by solving linear equations over the covariances. (This requires matrix calculations, and we do not describe the exact definition of the partial autocorrelation in this post.)
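One way to build intuition without the matrix definition (a rough sketch, my own illustration): the sample PACF at lag 2 is approximately the coefficient of y_{t-2} when y_t is regressed on both y_{t-1} and y_{t-2}, since the regression "eliminates" the lag-1 effect.

```r
# Rough intuition for the PACF: the partial autocorrelation at lag 2 is
# approximately the coefficient of y_{t-2} when regressing y_t on both
# y_{t-1} and y_{t-2}.
set.seed(1)
y <- arima.sim(model = list(ar = c(1.2, -0.5)), n = 5000)
n <- length(y)
d <- data.frame(y0 = y[3:n], y1 = y[2:(n - 1)], y2 = y[1:(n - 2)])
fit <- lm(y0 ~ y1 + y2, data = d)

coef(fit)["y2"]                            # approximately the PACF at lag 2
pacf(y, lag.max = 2, plot = FALSE)$acf[2]  # for AR(2), both are near phi2 = -0.5
```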

The AR(p), MA(q), and ARMA(p, q) models have the following different characteristics in their autocorrelation and partial autocorrelation.

For example, if the partial autocorrelation of the samples equals 0 for every lag >= 3, there is a high probability that the model is AR(2).

- AR(p) : As the lag grows, the autocorrelation decreases exponentially (but stays non-zero). When the lag is larger than p, the partial autocorrelation equals 0.
- MA(q) : When the lag is larger than q, the autocorrelation equals 0. As the lag grows, the partial autocorrelation decreases exponentially (but stays non-zero).
- ARMA(p, q) : As the lag grows, both the autocorrelation and partial autocorrelation decrease exponentially (but stay non-zero).

| | autocorrelation (ACF) | partial autocorrelation (PACF) |
|---|---|---|
| AR(p) | decreases exponentially | 0 if lag >= p + 1 |
| MA(q) | 0 if lag >= q + 1 | decreases exponentially |
| ARMA(p, q) | decreases exponentially | decreases exponentially |

Note : You may wonder why the partial autocorrelation of MA is not 0, even though Corr is 0 when the lag >= q + 1. The reason lies in the difference between the definitions of autocorrelation and partial autocorrelation, and it is not so important here.

The following R sample plots the autocorrelation and partial autocorrelation. (The x axis is the lag, and the y axis is the correlation value.)

```
library(forecast)
# create time-series data with AR(2)
data_count <- 10000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes = c(1, 2),
  Values = c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
# Autocorrelation plotting
acf(
  y$Values,
  lag.max = 80,
  type = c("correlation"),
  plot = T)
# Partial Autocorrelation plotting
pacf(
  y$Values,
  lag.max = 80,
  plot = T)
```

Autocorrelation sample of AR(2)

Partial Autocorrelation sample of AR(2)

```
# create time-series data with MA(2)
data_count <- 10000
mu <- 70
theta1 <- 1
theta2 <- -1
e <- rnorm(data_count) * 10
y1 <- mu + e[1]
y2 <- mu + e[2] + (theta1 * e[1])
y <- data.frame(
  Indexes = c(1, 2),
  Values = c(y1, y2))
for (i in 3:data_count) {
  yi <- mu + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2])
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
# Autocorrelation plotting
acf(
  y$Values,
  lag.max = 80,
  type = c("correlation"),
  plot = T)
# Partial Autocorrelation plotting
pacf(
  y$Values,
  lag.max = 80,
  plot = T)
```

Autocorrelation sample of MA(2)

Partial Autocorrelation sample of MA(2)

Note : As you can see, it is hard to distinguish whether a value equals 0 or is decreasing exponentially, because the sample values are only approximations.

In real estimation, AIC or BIC (information criteria) is often used to validate the choice of model (the number of parameters).
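For example (a minimal sketch using the base `stats::arima()` and `AIC()`, my own illustration), you can compare candidate orders on simulated AR(2) data:

```r
# Compare candidate orders by AIC with the base stats::arima().
# The data is simulated from AR(2), so order (2,0,0) should score at or
# near the minimum (AIC can occasionally prefer a slightly larger model).
set.seed(1)
y <- arima.sim(model = list(ar = c(1.8, -0.9)), n = 1000, sd = 10)
orders <- list(c(1, 0, 0), c(2, 0, 0), c(2, 0, 1), c(3, 0, 0))
aics <- sapply(orders, function(ord) AIC(arima(y, order = ord)))
names(aics) <- sapply(orders, paste, collapse = ",")
aics
```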

To estimate the parameter values, you can use estimation techniques such as the method of moments, least squares, or maximum likelihood. For example, given the samples y_1, ..., y_n, you can determine the parameter values (φ_i, θ_i, σ², etc.) which maximize the likelihood of the observed series. (We assume the noise ε_t follows the normal distribution.)
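A brief sketch of maximum-likelihood fitting with the base `stats::arima()` (my own illustration, on simulated AR(2) data with known true parameters):

```r
# Maximum-likelihood estimation with the base stats::arima().
# The data is simulated from AR(2) with phi1 = 1.8, phi2 = -0.9.
set.seed(1)
y <- arima.sim(model = list(ar = c(1.8, -0.9)), n = 2000, sd = 10)
fit <- arima(y, order = c(2, 0, 0), method = "ML")
coef(fit)  # ar1 and ar2 should come out close to 1.8 and -0.9
```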

After you determine the model and parameters, you can examine the sample (the given real data) with the usual techniques of statistical sample surveys.

With R, you can run this analysis easily using `auto.arima()` in the "forecast" package. The following is a brief example, in which we create data with AR(2) (φ1 = 1.8, φ2 = -0.9) and analyze it with `auto.arima()`.

As you can see when you run it, we get a good result: the estimated coefficients come out close to the true φ1 and φ2. (Here ARIMA(p, 0, q) means ARMA(p, q).)

```
library(forecast)
# create time-series data with ARMA(2, 0)
data_count <- 1000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes = c(1, 2),
  Values = c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
# analyze
test <- auto.arima(y$Values, seasonal = F)
summary(test)
```

Let's consider another example with R.

The following data follows a `sin()` curve (with some noise). It would be easy to analyze with a spectrum technique, but here we represent it with a similar stationary model using `auto.arima()`.

```
library(forecast)
data_count <- 200
y <- data.frame(
  Indexes = c(1:data_count),
  Noise = rnorm(data_count))
y$Values <- 150 + (100 * sin(pi * y$Indexes / 10)) + (5 * y$Noise)
plot(x = y$Indexes, y = y$Values)
test <- auto.arima(y$Values, seasonal = F)
summary(test)
```

The following is an ARMA simulation with the estimated parameters. (As described above, you can use `arima.sim()` for simulation, but here we build the series by hand for your understanding.)

As you can see when you run it, the simulated model closely fits the cycle of the given (real) data.

```
library(forecast)
data_count <- 200
phi1 <- 1.1690
phi2 <- 0.3636
phi3 <- -0.6958
# intercept derived from the estimated mean (149.9833)
cons <- 149.9833 * abs(1 - phi1)
e <- rnorm(data_count, sd = sqrt(66.94))
y1 <- cons + (phi1 * cons) + e[1]
y2 <- cons + (phi1 * y1) + e[2]
y3 <- cons + (phi1 * y2) + (phi2 * y1) + e[3]
y <- data.frame(
  Indexes = c(1, 2, 3),
  Values = c(y1, y2, y3))
for (i in 4:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + (phi3 * y$Values[i-3]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
plot(y, type="l")
```

These examples are primitive, so now we change the original data as follows.

This data is much more affected by the white noise, compared with the `sin()` cycle.

```
library(forecast)
y <- data.frame(
  Indexes = c(1:2000),
  Noise = rnorm(2000))
y$Values <- 20 + (10 * sin(y$Indexes / 30)) + (5 * y$Noise)
plot(x = y$Indexes, y = y$Values)
test <- auto.arima(y$Values, seasonal = F)
summary(test)
```

The following is the result of analyzing this data with `auto.arima()`.

When you simulate with the resulting model, you will find differences (errors) between the original data and the simulated one. The noise affects the result of the estimation, so you must account for the probability of errors in real practical analysis.

```
library(forecast)
data_count <- 2000
phi1 <- 0.9865
theta1 <- -0.8565
theta2 <- -0.0053
theta3 <- 0.0187
theta4 <- -0.0188
theta5 <- 0.0864
# intercept derived from the estimated mean (20.4084)
cons <- 20.4084 * abs(1 - phi1)
e <- rnorm(data_count, sd = sqrt(28.21))
y1 <- cons + (phi1 * cons) + e[1]
y2 <- cons + (phi1 * y1) + e[2] + (theta1 * e[1])
y3 <- cons + (phi1 * y2) + e[3] + (theta1 * e[2]) + (theta2 * e[1])
y4 <- cons + (phi1 * y3) + e[4] + (theta1 * e[3]) + (theta2 * e[2]) + (theta3 * e[1])
y5 <- cons + (phi1 * y4) + e[5] + (theta1 * e[4]) + (theta2 * e[3]) + (theta3 * e[2]) + (theta4 * e[1])
y <- data.frame(
  Indexes = c(1, 2, 3, 4, 5),
  Values = c(y1, y2, y3, y4, y5))
for (i in 6:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2]) + (theta3 * e[i-3]) + (theta4 * e[i-4]) + (theta5 * e[i-5])
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
plot(y, type="l")
```

In these examples I used data with trivial `sin()` cycles for your understanding, but remember that our goal is to build the model of maximum likelihood given the data, not to reproduce a similar-looking graph.

Even with the correct model, a simulated graph can look vastly different from the original, because the model includes stochastic values, and early random draws cause subsequent differences. The important thing is "modeling", not "drawing".

When you want to take known periodic cycles, trend, or noise into account in your analysis, you can mix in an approach that extracts the moving trend (with the `ma()` function) or similar, and then analyze only the trend or the remainder. For example, the following original data (see the "data" section) has pairs of cycles with a small peak and a large peak repeating, and also a large trend wave rising and falling. (Here we have decomposed it with the `stl()` function as follows.)
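As a stand-in illustration (the post's original data set is not included here), the built-in monthly `co2` series shows how `stl()` splits a series into trend, seasonal, and remainder components:

```r
# Decompose a seasonal series with stl(). The post's original data set is
# not available here, so the built-in monthly co2 series is used instead.
dec <- stl(co2, s.window = "periodic")
head(dec$time.series)  # columns: seasonal, trend, remainder
plot(dec)
# By construction, remainder = data - seasonal - trend; the remainder is
# the part you might then model with ARMA.
```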

# Predict the future

Once the model and parameters are determined, you can predict future values with the generated model.

The predicted value is called the "conditional expected value" or "conditional mean", because we predict future values with a distribution conditioned on the given values y_1, ..., y_n (the values in the past). (We assume the noise ε_t follows the normal distribution.)

In contrast, the prediction approach that shows an interval of possible values is called an "interval forecast" - for example, showing the 95% prediction interval.

This possibility zone (whose variance is the mean squared error (MSE)) gets larger as the time goes further into the future. With an interval forecast, you can see this widening visually.

In some computing processes, the probability distribution is not calculated algebraically; instead, a sufficiently large number of values are generated from the normal distribution and the distribution is approximated, because of the computing cost. This kind of simulation-based approach is often used in real computer systems. (Compare gradient descent, Newton's method, Monte Carlo methods, etc.)
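A minimal sketch of this simulation-based idea (my own illustration, not the internal implementation of the forecast package): fit an AR(2), simulate many future paths from the fitted model, and take empirical quantiles at each horizon.

```r
# Monte Carlo sketch of an interval forecast: simulate many future paths
# from a fitted AR(2) and take empirical quantiles at each horizon.
set.seed(1)
y <- arima.sim(model = list(ar = c(1.8, -0.9)), n = 1000, sd = 10)
fit <- arima(y, order = c(2, 0, 0))
phi <- coef(fit)[c("ar1", "ar2")]
mu  <- coef(fit)[["intercept"]]  # arima() reports the mean as "intercept"
sdv <- sqrt(fit$sigma2)
h <- 50
npaths <- 2000
paths <- replicate(npaths, {
  z <- c(tail(as.numeric(y), 2), numeric(h))  # seed with last 2 observations
  for (i in 3:(h + 2)) {
    z[i] <- mu + phi[[1]] * (z[i - 1] - mu) +
      phi[[2]] * (z[i - 2] - mu) + rnorm(1, sd = sdv)
  }
  z[-(1:2)]
})
lower95 <- apply(paths, 1, quantile, 0.025)
upper95 <- apply(paths, 1, quantile, 0.975)
# The interval widens as the horizon grows
(upper95[h] - lower95[h]) > (upper95[1] - lower95[1])
```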

The following is an R example of this prediction (forecast) with the primitive AR(2) model.

The line is the conditional expected value, the inner range is the 85% interval forecast, and the outer range is the 95% interval forecast.

```
library("forecast")
# create data with AR(2)
data_count <- 1000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes = c(1, 2),
  Values = c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes = c(i), Values = c(yi)))
}
# estimate the model
test <- auto.arima(y$Values, seasonal = F)
summary(test)
# plot with prediction
pred <- forecast(
  test,
  level = c(0.85, 0.95),
  h = 100)
plot(pred)
```

As you can see in the graph, the conditional expected value of AR converges to the expected value (the mean) of the model itself. (And the MSE converges to the variance of the model.)

Therefore it is meaningless to forecast too far into the future, because the prediction simply becomes the expected value (the mean) of the model.
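This convergence can be checked directly by iterating the conditional-mean recursion (with the noise replaced by its expected value 0); with the cons, φ1, and φ2 used throughout this post, the model mean is cons / (1 - φ1 - φ2) = 700:

```r
# The conditional mean of the AR(2) used in this post converges to
# mu = cons / (1 - phi1 - phi2) = 70 / 0.1 = 700 as the horizon grows.
cons <- 70; phi1 <- 1.8; phi2 <- -0.9
pred <- c(10, 50)  # any two starting values
for (i in 3:500) {
  pred[i] <- cons + phi1 * pred[i - 1] + phi2 * pred[i - 2]
}
tail(pred, 1)  # approaches 700
```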

In the next post we will discuss more advanced topics in time-series analysis, including non-stationarity.