Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
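
Note that no random seed is fixed above, so the simulated y (and the printed numbers below) will vary slightly from run to run. A minimal sketch to make the simulation reproducible (the seed value is arbitrary):

np.random.seed(12345)  # fix the RNG state before drawing the noise
y = y_true + sig * np.random.normal(size=nsample)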

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.987
Method:                 Least Squares   F-statistic:                     1271.
Date:                Sat, 02 Mar 2019   Prob (F-statistic):           3.09e-44
Time:                        17:10:00   Log-Likelihood:                 9.1679
No. Observations:                  50   AIC:                            -10.34
Df Residuals:                      46   BIC:                            -2.688
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0138      0.072     70.045      0.000       4.870       5.158
x1             0.5052      0.011     45.768      0.000       0.483       0.527
x2             0.4799      0.043     11.058      0.000       0.393       0.567
x3            -0.0209      0.001    -21.514      0.000      -0.023      -0.019
==============================================================================
Omnibus:                        3.703   Durbin-Watson:                   2.155
Prob(Omnibus):                  0.157   Jarque-Bera (JB):                2.741
Skew:                           0.545   Prob(JB):                        0.254
Kurtosis:                       3.356   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
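
The estimates and their uncertainty can also be pulled out programmatically rather than read off the summary table, using standard attributes of the results object; a quick sketch:

print(olsres.params)      # point estimates, in the same order as the columns of X
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # 95% confidence intervals by default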

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.49246711  4.97080638  5.4109031   5.78660445  6.08119601  6.29014792
  6.42185912  6.49627741  6.5416221   6.58974765  6.67091008  6.80879598
  7.01663087  7.2950063   7.63178266  8.00408365  8.38205483  8.7337686
  9.03046984  9.25130023  9.38672531  9.44010087  9.42712167  9.37324281
  9.30949751  9.26739849  9.27376045  9.34629458  9.49069856  9.69972053
  9.95435052 10.22694371 10.48576244 10.70019059 10.84576373 10.90818854
 10.8856912  10.78930878 10.64107592 10.47040458 10.30925171 10.18686889
 10.12499629 10.1342909  10.21257555 10.34519667 10.50743178 10.66855136
 10.7968676  10.86494065]
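
For in-sample values, predict(X) is equivalent to the fitted values already stored on the results object; a quick check:

print(np.allclose(ypred, olsres.fittedvalues))  # True: identical in-sample predictions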

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.83981009 10.68713502 10.42573439 10.09849414  9.7618673   9.47205226
  9.27123349  9.17725326  9.17924328  9.24028561]
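
Point predictions can be supplemented with uncertainty estimates. In recent statsmodels releases, get_prediction returns standard errors plus confidence intervals for the mean and prediction intervals for new observations; a sketch:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])  # 95% prediction intervals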

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
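
Optionally, a dashed vertical line added in the same cell (before the legend call) makes the boundary between the estimation sample and the out-of-sample region explicit:

ax.axvline(x1[-1], color="k", linestyle="--")  # end of the estimation sample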

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic from using **2.
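
An equivalent spelling that sidesteps the formula operator entirely, relying only on the fact that patsy evaluates Python function calls verbatim (as with np.sin above), is:

res_alt = ols("y ~ x1 + np.sin(x1) + np.power(x1 - 5, 2)", data=data).fit()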

In [8]:
res.params
Out[8]:
Intercept           5.013794
x1                  0.505249
np.sin(x1)          0.479876
I((x1 - 5) ** 2)   -0.020853
dtype: float64
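
The formula-based fit is the same model as the array-based one above, so the estimates should agree to numerical precision; a quick sanity check:

print(np.allclose(res.params.values, olsres.params))  # same coefficients, different interface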

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.839810
1    10.687135
2    10.425734
3    10.098494
4     9.761867
5     9.472052
6     9.271233
7     9.177253
8     9.179243
9     9.240286
dtype: float64
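
predict also accepts a pandas DataFrame for exog; a minimal sketch (pandas is already a statsmodels dependency):

import pandas as pd

newdata = pd.DataFrame({"x1": x1n})
print(res.predict(newdata))  # same out-of-sample predictions as above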