Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
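
The noise draws above are not seeded, so the exact numbers shown in the outputs below will differ from run to run. If you want a reproducible version of this example, one option (a small sketch, not part of the original setup) is to draw the noise from a seeded generator:

rng = np.random.default_rng(12345)  # any fixed seed makes the noise reproducible
y = y_true + sig * rng.normal(size=nsample)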

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     843.9
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           3.32e-40
Time:                        22:28:59   Log-Likelihood:                -1.5410
No. Observations:                  50   AIC:                             11.08
Df Residuals:                      46   BIC:                             18.73
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9179      0.089     55.459      0.000       4.739       5.096
x1             0.5080      0.014     37.145      0.000       0.480       0.536
x2             0.5086      0.054      9.460      0.000       0.400       0.617
x3            -0.0208      0.001    -17.330      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        1.964   Durbin-Watson:                   2.164
Prob(Omnibus):                  0.374   Jarque-Bera (JB):                1.099
Skew:                           0.260   Prob(JB):                        0.577
Kurtosis:                       3.507   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
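
The quantities in the summary are also available directly as attributes of the results object; for example (a quick sketch using standard OLS results attributes):

print(olsres.params)    # estimated coefficients
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared of the fit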

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.39762671  4.88832201  5.33891602  5.7216905   6.01893057  6.22583523
  6.35130613  6.4164851   6.45128055  6.48945364  6.56307142  6.69723827
  6.90597136  7.18989764  7.53615086  7.92048542  8.31125991  8.67463564
  8.98013629  9.20565507  9.34108678  9.38998811  9.36899324  9.30508092
  9.23114192  9.1805752   9.18180052  9.25358916  9.40197988  9.61928674
  9.88536188 10.17090577 10.44228104 10.667039   10.81925084 10.88376771
 10.8587096  10.75577425 10.59831519 10.41750488 10.24721313 10.11844239
 10.05423415 10.06588328 10.15108229 10.29430049 10.46933561 10.64361875
 10.7835654  10.86009323]
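
For the estimation sample, predicting with the original design matrix reproduces the fitted values stored on the results object, so a quick sanity check (a small sketch) is:

print(np.allclose(ypred, olsres.fittedvalues))  # should print True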

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.83941059 10.68419484 10.41439134 10.07545305  9.72721209  9.42923069
  9.22621832  9.13708527  9.15031269  9.22677285]
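
If you also want uncertainty around the out-of-sample predictions, get_prediction returns an object from which confidence and prediction intervals can be extracted; a minimal sketch (column names as produced by summary_frame):

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # mean, mean_ci_*, obs_ci_* columns
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])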

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fd8db0cea90>
[Figure: the data, the true curve, and the OLS in-sample and out-of-sample predictions]
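
Building on the get_prediction sketch above, the 95% prediction interval for the out-of-sample points could be added to the same axes, for example:

ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"], alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")
fig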

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2
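
To see exactly which columns the formula generates, you can build the design matrix directly with patsy (the formula engine statsmodels uses here; this sketch assumes patsy is importable):

import patsy

dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1 - 5) ** 2)", data, return_type="dataframe")
print(dm.columns.tolist())  # expect ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']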

[9]:
res.params
[9]:
Intercept           4.917870
x1                  0.508005
np.sin(x1)          0.508600
I((x1 - 5) ** 2)   -0.020810
dtype: float64

Now we only have to pass the single variable x1 and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.839411
1    10.684195
2    10.414391
3    10.075453
4     9.727212
5     9.429231
6     9.226218
7     9.137085
8     9.150313
9     9.226773
dtype: float64
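
The interval machinery from get_prediction works with the formula interface as well; passing the new exog as a dict should give the out-of-sample mean together with its confidence and prediction intervals (a sketch, assuming the same summary_frame API as above):

pred_new = res.get_prediction(exog=dict(x1=x1n))
print(pred_new.summary_frame(alpha=0.05))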