Chapter 2 Predictive Inference
In labor economics, an important question is what determines the wage of workers. This is a causal question, but we begin by investigating it from a predictive perspective.
In the following wage example, \(Y\) is the hourly wage of a worker and \(X\) is a vector of the worker's characteristics, e.g., education, experience, gender. The two main questions here are:
How to use job-relevant characteristics, such as education and experience, to best predict wages?
What is the difference in predicted wages between men and women with the same job-relevant characteristics?
In this lab, we focus on the prediction question first.
2.1 Data
The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural, or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below \(\$3\).
The variable of interest \(Y\) is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size \(n = 5150\).
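The released file already contains the constructed wage and lwage variables, but the construction itself is simple. As a minimal sketch (the raw column names annual_earnings, weeks_worked, and hours_per_week are hypothetical and for illustration only):
Python code
import numpy as np
import pandas as pd

# Hypothetical raw inputs; the actual data set already ships with wage and lwage.
raw = pd.DataFrame({"annual_earnings": [52000.0, 31200.0],
                    "weeks_worked": [52, 50],
                    "hours_per_week": [40, 40]})

# hourly wage = annual earnings / (weeks worked * usual weekly hours)
raw["wage"] = raw["annual_earnings"] / (raw["weeks_worked"] * raw["hours_per_week"])
raw["lwage"] = np.log(raw["wage"])  # log hourly wage, the outcome used below
print(raw[["wage", "lwage"]])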
2.2 Data Analysis
2.2.1 R and Python code
- Import relevant packages
R code
library(dplyr)
library(kableExtra)
library(reticulate) # to run python
Python code
import pandas as pd
import numpy as np
import pyreadr
- We start by loading the data set.
R code
# to import RData file
load("./data/wage2015_subsample_inference.Rdata")
# to get data dimensions
dim(data)
## [1] 5150 20
Python code
rdata_read = pyreadr.read_r("./data/wage2015_subsample_inference.Rdata")
data = rdata_read['data']
data.shape
## (5150, 20)
The dimensions are 5150 rows and 20 columns.
Let’s have a look at the structure of the data.
R code
# Collect each variable's class into a data frame
table0 <- data.frame(lapply(data, class)) %>%
  tidyr::gather("Variable", "Type")

# Table presentation
table0 %>%
  kable("markdown", caption = "Type of the Variables")
Variable | Type |
---|---|
wage | numeric |
lwage | numeric |
sex | numeric |
shs | numeric |
hsg | numeric |
scl | numeric |
clg | numeric |
ad | numeric |
mw | numeric |
so | numeric |
we | numeric |
ne | numeric |
exp1 | numeric |
exp2 | numeric |
exp3 | numeric |
exp4 | numeric |
occ | factor |
occ2 | factor |
ind | factor |
ind2 | factor |
Python code
data.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 5150 entries, 10 to 32643
## Data columns (total 20 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 wage 5150 non-null float64
## 1 lwage 5150 non-null float64
## 2 sex 5150 non-null float64
## 3 shs 5150 non-null float64
## 4 hsg 5150 non-null float64
## 5 scl 5150 non-null float64
## 6 clg 5150 non-null float64
## 7 ad 5150 non-null float64
## 8 mw 5150 non-null float64
## 9 so 5150 non-null float64
## 10 we 5150 non-null float64
## 11 ne 5150 non-null float64
## 12 exp1 5150 non-null float64
## 13 exp2 5150 non-null float64
## 14 exp3 5150 non-null float64
## 15 exp4 5150 non-null float64
## 16 occ 5150 non-null category
## 17 occ2 5150 non-null category
## 18 ind 5150 non-null category
## 19 ind2 5150 non-null category
## dtypes: category(4), float64(16)
## memory usage: 736.3+ KB
data.describe()
## wage lwage ... exp3 exp4
## count 5150.000000 5150.000000 ... 5150.000000 5150.000000
## mean 23.410410 2.970787 ... 8.235867 25.118038
## std 21.003016 0.570385 ... 14.488962 53.530225
## min 3.021978 1.105912 ... 0.000000 0.000000
## 25% 13.461538 2.599837 ... 0.125000 0.062500
## 50% 19.230769 2.956512 ... 1.000000 1.000000
## 75% 27.777778 3.324236 ... 9.261000 19.448100
## max 528.845673 6.270697 ... 103.823000 487.968100
##
## [8 rows x 16 columns]
- Construct the outcome and regressor variables
We construct the outcome variable \(Y\) (the log hourly wage) and the matrix \(Z\), which includes the characteristics of workers that are given in the data.
R code
# Calculate the log wage
Y <- log(data$wage)
# Number of observations
n <- length(Y)
# Regressors: all variables except wage and lwage
Z <- data[, !(colnames(data) %in% c("wage", "lwage"))]
# Number of regressors
p <- dim(Z)[2]
Number of observations: 5150
Number of raw regressors: 18
Python code
# Calculate the log wage
Y = np.log(data['wage'])
# Number of observations
n = len(Y)
# Regressors: all variables except wage and lwage
Z = data.loc[:, ~data.columns.isin(['wage', 'lwage', 'Unnamed: 0'])]
# Number of regressors
p = Z.shape[1]
print("Number of observations:", n, '\n')
## Number of observations: 5150
print("Number of raw regressors:", p)
## Number of raw regressors: 18
- For the outcome variable (log) wage and a subset of the raw regressors, we calculate the empirical means to get familiar with the data.
R code
# Select the variables
Z_subset <- data[, c("lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1")]
# Create a table of sample means
table1 <- data.frame(as.numeric(lapply(Z_subset, mean))) %>%
  mutate(Variables = c("Log Wage","Sex","Some High School","High School Graduate","Some College","College Graduate","Advanced Degree","Midwest","South","West","Northeast","Experience")) %>%
  rename(`Sample Mean` = `as.numeric.lapply.Z_subset..mean..`) %>%
  select(2, 1)

# Markdown table
table1 %>%
  kable("markdown", caption = "Descriptive Statistics")
Variables | Sample Mean |
---|---|
Log Wage | 2.9707867 |
Sex | 0.4444660 |
Some High School | 0.0233010 |
High School Graduate | 0.2438835 |
Some College | 0.2780583 |
College Graduate | 0.3176699 |
Advanced Degree | 0.1370874 |
Midwest | 0.2596117 |
South | 0.2965049 |
West | 0.2161165 |
Northeast | 0.2277670 |
Experience | 13.7605825 |
Python code
Z_subset = data.loc[:, data.columns.isin(["lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"])]
table = Z_subset.mean(axis=0)
table
## lwage 2.970787
## sex 0.444466
## shs 0.023301
## hsg 0.243883
## scl 0.278058
## clg 0.317670
## ad 0.137087
## mw 0.259612
## so 0.296505
## we 0.216117
## ne 0.227767
## exp1 13.760583
## dtype: float64
table = pd.DataFrame(data=table, columns={"Sample mean": "0"})
# table.index
index1 = list(table.index)
index2 = ["Log Wage", "Sex", "Some High School", "High School Graduate",
          "Some College", "College Graduate", "Advanced Degree", "Midwest",
          "South", "West", "Northeast", "Experience"]
table = table.rename(index=dict(zip(index1, index2)))
E.g., the share of female workers in our sample is ~44% (\(sex=1\) if female).
Alternatively, we can also print the table as LaTeX.
R code
print(table1, type="latex")
## Variables Sample Mean
## 1 Log Wage 2.97078670
## 2 Sex 0.44446602
## 3 Some High School 0.02330097
## 4 High School Graduate 0.24388350
## 5 Some College 0.27805825
## 6 College Graduate 0.31766990
## 7 Advanced Degree 0.13708738
## 8 Midwest 0.25961165
## 9 South 0.29650485
## 10 West 0.21611650
## 11 Northeast 0.22776699
## 12 Experience 13.76058252
Python code
print(table.to_latex())
## \begin{tabular}{lr}
## \toprule
## {} & Sample mean \\
## \midrule
## Log Wage & 2.970787 \\
## Sex & 0.444466 \\
## Some High School & 0.023301 \\
## High School Graduate & 0.243883 \\
## Some College & 0.278058 \\
## College Graduate & 0.317670 \\
## Advanced Degree & 0.137087 \\
## Midwest & 0.259612 \\
## South & 0.296505 \\
## West & 0.216117 \\
## Northeast & 0.227767 \\
## Experience & 13.760583 \\
## \bottomrule
## \end{tabular}
2.3 Prediction Question
Now, we will construct a prediction rule for (log) hourly wage \(Y\), which depends linearly on job-relevant characteristics \(X\):
\[ Y = \beta'X + \epsilon. \]
Our goals are
Predict wages using various characteristics of workers.
Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample \(R^2\), and the out-of-sample \(MSE\) and \(R^2\) (the adjustment formulas are sketched below).
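For reference, both adjusted measures penalize the number of regressors \(p\) relative to the sample size \(n\). With sample measures \(MSE_{sample}\) and \(R^2_{sample}\), the adjustments used in the code below are
\[ MSE_{adjusted} = \frac{n}{n-p}\, MSE_{sample}, \qquad R^2_{adjusted} = 1 - (1 - R^2_{sample})\,\frac{n-1}{n-p-1}, \]
up to slightly different degrees-of-freedom conventions across implementations (e.g., whether the intercept is counted in \(p\)).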
We employ two different specifications for prediction:
Basic Model: \(X\) consists of a set of raw regressors (e.g. gender, experience, education indicators, occupation and industry indicators, regional indicators).
Flexible Model: \(X\) consists of all raw regressors from the basic model plus transformations of experience (e.g., \(exp2\) and \(exp3\)) and two-way interactions of a polynomial in experience with the other regressors. An example of a regressor created through a two-way interaction is experience times the indicator of having a college degree.
Using the flexible model enables us to approximate the real relationship with a more complex regression function and therefore to reduce bias: it increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but yield models that are harder to interpret.
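To see where the size of the flexible specification comes from, the following minimal sketch expands the flexible formula's right-hand side into its design matrix with patsy, the formula machinery that statsmodels uses internally (it assumes data has been loaded as above):
Python code
import patsy

# Right-hand side of the flexible specification: dummies plus interactions of
# an experience polynomial with education, occupation, industry and region.
flex_rhs = ('sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we '
            '+ (exp1 + exp2 + exp3 + exp4)*(shs + hsg + scl + clg + occ2 + ind2 + mw + so + we)')
X_flex = patsy.dmatrix(flex_rhs, data=data)
print(X_flex.shape)  # expect (5150, 246): far wider than the basic model's 51 columns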
Now, let us fit both models to our data by running ordinary least squares (OLS):
2.3.1 Basic Model:
R code
basic <- lwage ~ (sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2)
regbasic <- lm(basic, data = data)
summary(regbasic) # estimated coefficients
##
## Call:
## lm(formula = basic, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2479 -0.2885 -0.0035 0.2724 3.6529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7222354 0.0803414 46.330 < 2e-16 ***
## sex -0.0728575 0.0150269 -4.848 1.28e-06 ***
## exp1 0.0085677 0.0006537 13.106 < 2e-16 ***
## shs -0.5927984 0.0505549 -11.726 < 2e-16 ***
## hsg -0.5043375 0.0270767 -18.626 < 2e-16 ***
## scl -0.4119936 0.0252036 -16.347 < 2e-16 ***
## clg -0.1822160 0.0229524 -7.939 2.49e-15 ***
## mw -0.0275413 0.0193301 -1.425 0.154280
## so -0.0344538 0.0187063 -1.842 0.065558 .
## we 0.0172492 0.0200860 0.859 0.390510
## occ22 -0.0764717 0.0342039 -2.236 0.025411 *
## occ23 -0.0346777 0.0387595 -0.895 0.370995
## occ24 -0.0962017 0.0519073 -1.853 0.063892 .
## occ25 -0.1879150 0.0603999 -3.111 0.001874 **
## occ26 -0.4149333 0.0502176 -8.263 < 2e-16 ***
## occ27 -0.0459867 0.0565054 -0.814 0.415771
## occ28 -0.3778470 0.0439290 -8.601 < 2e-16 ***
## occ29 -0.2157519 0.0461229 -4.678 2.98e-06 ***
## occ210 -0.0106235 0.0396274 -0.268 0.788645
## occ211 -0.4558342 0.0594409 -7.669 2.07e-14 ***
## occ212 -0.3075889 0.0555146 -5.541 3.16e-08 ***
## occ213 -0.3614403 0.0455401 -7.937 2.53e-15 ***
## occ214 -0.4994955 0.0506204 -9.867 < 2e-16 ***
## occ215 -0.4644817 0.0517634 -8.973 < 2e-16 ***
## occ216 -0.2337150 0.0324348 -7.206 6.62e-13 ***
## occ217 -0.4125884 0.0279079 -14.784 < 2e-16 ***
## occ218 -0.3404183 0.1966277 -1.731 0.083462 .
## occ219 -0.2414797 0.0494794 -4.880 1.09e-06 ***
## occ220 -0.2126282 0.0408854 -5.201 2.06e-07 ***
## occ221 -0.2884133 0.0380839 -7.573 4.30e-14 ***
## occ222 -0.4223936 0.0414626 -10.187 < 2e-16 ***
## ind23 -0.1168365 0.0983990 -1.187 0.235135
## ind24 -0.2444926 0.0772658 -3.164 0.001564 **
## ind25 -0.2735325 0.0810190 -3.376 0.000741 ***
## ind26 -0.2493683 0.0781049 -3.193 0.001418 **
## ind27 -0.1395884 0.0931442 -1.499 0.134032
## ind28 -0.2429480 0.0940642 -2.583 0.009828 **
## ind29 -0.3874847 0.0765762 -5.060 4.34e-07 ***
## ind210 -0.1938509 0.0842585 -2.301 0.021451 *
## ind211 -0.1690628 0.0823701 -2.052 0.040174 *
## ind212 -0.0774358 0.0789759 -0.980 0.326887
## ind213 -0.1726041 0.0901297 -1.915 0.055540 .
## ind214 -0.1870052 0.0768288 -2.434 0.014965 *
## ind215 -0.3253637 0.1489158 -2.185 0.028943 *
## ind216 -0.3153990 0.0815927 -3.866 0.000112 ***
## ind217 -0.3044052 0.0806806 -3.773 0.000163 ***
## ind218 -0.3353864 0.0777377 -4.314 1.63e-05 ***
## ind219 -0.3741207 0.0879131 -4.256 2.12e-05 ***
## ind220 -0.5519322 0.0816545 -6.759 1.54e-11 ***
## ind221 -0.3166788 0.0802596 -3.946 8.06e-05 ***
## ind222 -0.1189713 0.0791489 -1.503 0.132866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4761 on 5099 degrees of freedom
## Multiple R-squared: 0.31, Adjusted R-squared: 0.3033
## F-statistic: 45.83 on 50 and 5099 DF, p-value: < 2.2e-16
cat( "Number of regressors in the basic model:",length(regbasic$coef), '\n')
## Number of regressors in the basic model: 51
Python code
import statsmodels.api as sm
import statsmodels.formula.api as smf
basic = 'lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2'
basic_results = smf.ols(basic, data=data).fit()
print(basic_results.summary())  # estimated coefficients
## OLS Regression Results
## ==============================================================================
## Dep. Variable: lwage R-squared: 0.310
## Model: OLS Adj. R-squared: 0.303
## Method: Least Squares F-statistic: 45.83
## Date: Wed, 24 Nov 2021 Prob (F-statistic): 0.00
## Time: 12:19:42 Log-Likelihood: -3459.9
## No. Observations: 5150 AIC: 7022.
## Df Residuals: 5099 BIC: 7356.
## Df Model: 50
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 3.5284 0.054 65.317 0.000 3.422 3.634
## occ2[T.10] -0.0106 0.040 -0.268 0.789 -0.088 0.067
## occ2[T.11] -0.4558 0.059 -7.669 0.000 -0.572 -0.339
## occ2[T.12] -0.3076 0.056 -5.541 0.000 -0.416 -0.199
## occ2[T.13] -0.3614 0.046 -7.937 0.000 -0.451 -0.272
## occ2[T.14] -0.4995 0.051 -9.867 0.000 -0.599 -0.400
## occ2[T.15] -0.4645 0.052 -8.973 0.000 -0.566 -0.363
## occ2[T.16] -0.2337 0.032 -7.206 0.000 -0.297 -0.170
## occ2[T.17] -0.4126 0.028 -14.784 0.000 -0.467 -0.358
## occ2[T.18] -0.3404 0.197 -1.731 0.083 -0.726 0.045
## occ2[T.19] -0.2415 0.049 -4.880 0.000 -0.338 -0.144
## occ2[T.2] -0.0765 0.034 -2.236 0.025 -0.144 -0.009
## occ2[T.20] -0.2126 0.041 -5.201 0.000 -0.293 -0.132
## occ2[T.21] -0.2884 0.038 -7.573 0.000 -0.363 -0.214
## occ2[T.22] -0.4224 0.041 -10.187 0.000 -0.504 -0.341
## occ2[T.3] -0.0347 0.039 -0.895 0.371 -0.111 0.041
## occ2[T.4] -0.0962 0.052 -1.853 0.064 -0.198 0.006
## occ2[T.5] -0.1879 0.060 -3.111 0.002 -0.306 -0.070
## occ2[T.6] -0.4149 0.050 -8.263 0.000 -0.513 -0.316
## occ2[T.7] -0.0460 0.057 -0.814 0.416 -0.157 0.065
## occ2[T.8] -0.3778 0.044 -8.601 0.000 -0.464 -0.292
## occ2[T.9] -0.2158 0.046 -4.678 0.000 -0.306 -0.125
## ind2[T.11] 0.0248 0.058 0.427 0.669 -0.089 0.139
## ind2[T.12] 0.1164 0.053 2.201 0.028 0.013 0.220
## ind2[T.13] 0.0212 0.068 0.312 0.755 -0.112 0.155
## ind2[T.14] 0.0068 0.050 0.136 0.892 -0.092 0.106
## ind2[T.15] -0.1315 0.137 -0.959 0.338 -0.400 0.137
## ind2[T.16] -0.1215 0.056 -2.177 0.029 -0.231 -0.012
## ind2[T.17] -0.1106 0.056 -1.987 0.047 -0.220 -0.001
## ind2[T.18] -0.1415 0.051 -2.774 0.006 -0.242 -0.042
## ind2[T.19] -0.1803 0.065 -2.761 0.006 -0.308 -0.052
## ind2[T.2] 0.1939 0.084 2.301 0.021 0.029 0.359
## ind2[T.20] -0.3581 0.057 -6.332 0.000 -0.469 -0.247
## ind2[T.21] -0.1228 0.055 -2.239 0.025 -0.230 -0.015
## ind2[T.22] 0.0749 0.053 1.407 0.160 -0.029 0.179
## ind2[T.3] 0.0770 0.079 0.972 0.331 -0.078 0.232
## ind2[T.4] -0.0506 0.058 -0.874 0.382 -0.164 0.063
## ind2[T.5] -0.0797 0.056 -1.432 0.152 -0.189 0.029
## ind2[T.6] -0.0555 0.052 -1.072 0.284 -0.157 0.046
## ind2[T.7] 0.0543 0.072 0.756 0.450 -0.086 0.195
## ind2[T.8] -0.0491 0.072 -0.679 0.497 -0.191 0.093
## ind2[T.9] -0.1936 0.048 -4.017 0.000 -0.288 -0.099
## sex -0.0729 0.015 -4.848 0.000 -0.102 -0.043
## exp1 0.0086 0.001 13.106 0.000 0.007 0.010
## shs -0.5928 0.051 -11.726 0.000 -0.692 -0.494
## hsg -0.5043 0.027 -18.626 0.000 -0.557 -0.451
## scl -0.4120 0.025 -16.347 0.000 -0.461 -0.363
## clg -0.1822 0.023 -7.939 0.000 -0.227 -0.137
## mw -0.0275 0.019 -1.425 0.154 -0.065 0.010
## so -0.0345 0.019 -1.842 0.066 -0.071 0.002
## we 0.0172 0.020 0.859 0.391 -0.022 0.057
## ==============================================================================
## Omnibus: 437.645 Durbin-Watson: 1.885
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1862.313
## Skew: 0.322 Prob(JB): 0.00
## Kurtosis: 5.875 Cond. No. 541.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print( "Number of regressors in the basic model:",len(basic_results.params), '\n') # number of regressors in the Basic Model
## Number of regressors in the basic model: 51
The number of regressors in the basic model is 51.
2.3.2 Flexible Model:
R code
flex <- lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we +
  (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + occ2 + ind2 + mw + so + we)
regflex <- lm(flex, data = data)
summary(regflex) # estimated coefficients
##
## Call:
## lm(formula = flex, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.2782 -0.0041 0.2733 3.4934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8602606 0.4286188 9.006 < 2e-16 ***
## sex -0.0695532 0.0152180 -4.570 4.99e-06 ***
## shs -0.1233089 0.9068325 -0.136 0.891845
## hsg -0.5289024 0.1977559 -2.675 0.007508 **
## scl -0.2920581 0.1260155 -2.318 0.020510 *
## clg -0.0411641 0.0703862 -0.585 0.558688
## occ22 0.1613397 0.1297243 1.244 0.213665
## occ23 0.2101514 0.1686774 1.246 0.212869
## occ24 0.0708570 0.1837167 0.386 0.699746
## occ25 -0.3960076 0.1885398 -2.100 0.035745 *
## occ26 -0.2310611 0.1869662 -1.236 0.216576
## occ27 0.3147249 0.1941519 1.621 0.105077
## occ28 -0.1875417 0.1692988 -1.108 0.268022
## occ29 -0.3390270 0.1672301 -2.027 0.042685 *
## occ210 0.0209545 0.1564982 0.134 0.893490
## occ211 -0.6424177 0.3090899 -2.078 0.037723 *
## occ212 -0.0674774 0.2520486 -0.268 0.788929
## occ213 -0.2329781 0.2315379 -1.006 0.314359
## occ214 0.2562009 0.3226729 0.794 0.427236
## occ215 -0.1938585 0.2595082 -0.747 0.455086
## occ216 -0.0551256 0.1470658 -0.375 0.707798
## occ217 -0.4156093 0.1361144 -3.053 0.002275 **
## occ218 -0.4822168 1.0443540 -0.462 0.644290
## occ219 -0.2579412 0.3325215 -0.776 0.437956
## occ220 -0.3010203 0.2341022 -1.286 0.198556
## occ221 -0.4271811 0.2206486 -1.936 0.052922 .
## occ222 -0.8694527 0.2975222 -2.922 0.003490 **
## ind23 -1.2473654 0.6454941 -1.932 0.053365 .
## ind24 -0.0948281 0.4636021 -0.205 0.837935
## ind25 -0.5293860 0.4345990 -1.218 0.223244
## ind26 -0.6221688 0.4347226 -1.431 0.152441
## ind27 -0.5047497 0.5024770 -1.005 0.315176
## ind28 -0.7295442 0.4674008 -1.561 0.118623
## ind29 -0.8025334 0.4252462 -1.887 0.059190 .
## ind210 -0.5805840 0.4808776 -1.207 0.227358
## ind211 -0.9852350 0.4481566 -2.198 0.027966 *
## ind212 -0.7375777 0.4243260 -1.738 0.082232 .
## ind213 -1.0183283 0.4826544 -2.110 0.034922 *
## ind214 -0.5860174 0.4159033 -1.409 0.158892
## ind215 -0.3801359 0.5908517 -0.643 0.520014
## ind216 -0.5703905 0.4386579 -1.300 0.193556
## ind217 -0.8201843 0.4259846 -1.925 0.054239 .
## ind218 -0.7613604 0.4238287 -1.796 0.072495 .
## ind219 -0.8812815 0.4565671 -1.930 0.053635 .
## ind220 -0.9099021 0.4484198 -2.029 0.042499 *
## ind221 -0.7586534 0.4405801 -1.722 0.085143 .
## ind222 -0.4040775 0.4328735 -0.933 0.350620
## mw 0.1106834 0.0814463 1.359 0.174218
## so 0.0224244 0.0743855 0.301 0.763075
## we -0.0215659 0.0841591 -0.256 0.797767
## exp1 -0.0677247 0.1519756 -0.446 0.655885
## exp2 1.6362944 1.6909253 0.968 0.333246
## exp3 -0.9154735 0.6880249 -1.331 0.183388
## exp4 0.1429357 0.0907569 1.575 0.115337
## shs:exp1 -0.1919981 0.1955408 -0.982 0.326206
## hsg:exp1 -0.0173433 0.0572279 -0.303 0.761859
## scl:exp1 -0.0664505 0.0433730 -1.532 0.125570
## clg:exp1 -0.0550346 0.0310279 -1.774 0.076172 .
## occ22:exp1 -0.0736239 0.0501108 -1.469 0.141837
## occ23:exp1 -0.0714859 0.0637688 -1.121 0.262336
## occ24:exp1 -0.0723997 0.0747715 -0.968 0.332953
## occ25:exp1 0.0946732 0.0794005 1.192 0.233182
## occ26:exp1 -0.0348928 0.0712136 -0.490 0.624175
## occ27:exp1 -0.2279338 0.0784860 -2.904 0.003699 **
## occ28:exp1 -0.0727459 0.0645883 -1.126 0.260094
## occ29:exp1 0.0274143 0.0669517 0.409 0.682217
## occ210:exp1 0.0075628 0.0581715 0.130 0.896564
## occ211:exp1 0.1014221 0.1005094 1.009 0.312986
## occ212:exp1 -0.0862744 0.0874768 -0.986 0.324057
## occ213:exp1 0.0067149 0.0761825 0.088 0.929768
## occ214:exp1 -0.1369153 0.0974458 -1.405 0.160073
## occ215:exp1 -0.0400425 0.0898931 -0.445 0.656017
## occ216:exp1 -0.0539314 0.0520926 -1.035 0.300580
## occ217:exp1 0.0147277 0.0467903 0.315 0.752958
## occ218:exp1 0.1074099 0.4718440 0.228 0.819937
## occ219:exp1 0.0047165 0.1060745 0.044 0.964536
## occ220:exp1 0.0243156 0.0743274 0.327 0.743575
## occ221:exp1 0.0791776 0.0696947 1.136 0.255985
## occ222:exp1 0.1093246 0.0880828 1.241 0.214607
## ind23:exp1 0.4758891 0.2227484 2.136 0.032693 *
## ind24:exp1 0.0147304 0.1571102 0.094 0.925305
## ind25:exp1 0.1256987 0.1531626 0.821 0.411864
## ind26:exp1 0.1540275 0.1524289 1.010 0.312312
## ind27:exp1 0.1029245 0.1786939 0.576 0.564654
## ind28:exp1 0.2357669 0.1689203 1.396 0.162859
## ind29:exp1 0.1359079 0.1489486 0.912 0.361578
## ind210:exp1 0.1512578 0.1644341 0.920 0.357687
## ind211:exp1 0.3174885 0.1590023 1.997 0.045907 *
## ind212:exp1 0.2591089 0.1510588 1.715 0.086356 .
## ind213:exp1 0.3396094 0.1669241 2.035 0.041954 *
## ind214:exp1 0.1441411 0.1477994 0.975 0.329485
## ind215:exp1 -0.0568181 0.2349853 -0.242 0.808950
## ind216:exp1 0.0847295 0.1550425 0.546 0.584753
## ind217:exp1 0.1728867 0.1513280 1.142 0.253317
## ind218:exp1 0.1565399 0.1494171 1.048 0.294842
## ind219:exp1 0.1516103 0.1620851 0.935 0.349641
## ind220:exp1 0.1326629 0.1566883 0.847 0.397222
## ind221:exp1 0.2190905 0.1555052 1.409 0.158930
## ind222:exp1 0.1145814 0.1523427 0.752 0.452010
## mw:exp1 -0.0279931 0.0296572 -0.944 0.345274
## so:exp1 -0.0099678 0.0266868 -0.374 0.708786
## we:exp1 0.0063077 0.0301417 0.209 0.834248
## shs:exp2 1.9005060 1.4502480 1.310 0.190098
## hsg:exp2 0.1171642 0.5509729 0.213 0.831609
## scl:exp2 0.6217923 0.4629986 1.343 0.179344
## clg:exp2 0.4096746 0.3802171 1.077 0.281321
## occ22:exp2 0.6632173 0.5523220 1.201 0.229895
## occ23:exp2 0.6415456 0.7102783 0.903 0.366448
## occ24:exp2 0.9748422 0.8655351 1.126 0.260099
## occ25:exp2 -0.9778823 0.9737990 -1.004 0.315335
## occ26:exp2 0.1050860 0.8002267 0.131 0.895527
## occ27:exp2 3.1407119 0.9389423 3.345 0.000829 ***
## occ28:exp2 0.6710877 0.7192077 0.933 0.350818
## occ29:exp2 0.0231977 0.7629142 0.030 0.975744
## occ210:exp2 -0.2692292 0.6405270 -0.420 0.674267
## occ211:exp2 -1.0816539 1.0057575 -1.075 0.282221
## occ212:exp2 0.8323737 0.9341245 0.891 0.372933
## occ213:exp2 -0.2209813 0.7728463 -0.286 0.774942
## occ214:exp2 0.7511163 0.9272548 0.810 0.417955
## occ215:exp2 -0.0326858 0.9409116 -0.035 0.972290
## occ216:exp2 0.3635814 0.5509550 0.660 0.509342
## occ217:exp2 -0.2659285 0.4861131 -0.547 0.584369
## occ218:exp2 -2.5608762 5.1700911 -0.495 0.620393
## occ219:exp2 -0.1291756 1.0616901 -0.122 0.903165
## occ220:exp2 -0.3323297 0.7229071 -0.460 0.645743
## occ221:exp2 -0.9099997 0.6854114 -1.328 0.184349
## occ222:exp2 -0.8550536 0.8279414 -1.033 0.301773
## ind23:exp2 -5.9368948 2.4067939 -2.467 0.013670 *
## ind24:exp2 -1.1053411 1.7101982 -0.646 0.518100
## ind25:exp2 -2.0149181 1.6919190 -1.191 0.233748
## ind26:exp2 -2.2277748 1.6816902 -1.325 0.185325
## ind27:exp2 -1.4648099 2.0137888 -0.727 0.467022
## ind28:exp2 -2.9479949 1.8595425 -1.585 0.112955
## ind29:exp2 -1.7796219 1.6471248 -1.080 0.279999
## ind210:exp2 -2.1973300 1.7738638 -1.239 0.215507
## ind211:exp2 -3.8776807 1.7637372 -2.199 0.027956 *
## ind212:exp2 -3.1690425 1.6819362 -1.884 0.059602 .
## ind213:exp2 -3.9651983 1.8130709 -2.187 0.028789 *
## ind214:exp2 -2.0783289 1.6490355 -1.260 0.207610
## ind215:exp2 0.1911692 2.6075396 0.073 0.941559
## ind216:exp2 -1.3265850 1.7185648 -0.772 0.440202
## ind217:exp2 -2.2002873 1.6837183 -1.307 0.191341
## ind218:exp2 -2.2006232 1.6566630 -1.328 0.184125
## ind219:exp2 -1.9308536 1.7876673 -1.080 0.280152
## ind220:exp2 -1.9467267 1.7244008 -1.129 0.258983
## ind221:exp2 -3.1127363 1.7237908 -1.806 0.071019 .
## ind222:exp2 -1.8578340 1.6849542 -1.103 0.270254
## mw:exp2 0.2005611 0.3172911 0.632 0.527348
## so:exp2 0.0544354 0.2815662 0.193 0.846708
## we:exp2 0.0012717 0.3207873 0.004 0.996837
## shs:exp3 -0.6721239 0.4426627 -1.518 0.128987
## hsg:exp3 -0.0179937 0.2083176 -0.086 0.931171
## scl:exp3 -0.1997877 0.1855189 -1.077 0.281572
## clg:exp3 -0.1025230 0.1643648 -0.624 0.532819
## occ22:exp3 -0.2039403 0.2211386 -0.922 0.356455
## occ23:exp3 -0.2369620 0.2870372 -0.826 0.409103
## occ24:exp3 -0.4366958 0.3520168 -1.241 0.214830
## occ25:exp3 0.3885298 0.4118861 0.943 0.345577
## occ26:exp3 0.0484737 0.3293525 0.147 0.882997
## occ27:exp3 -1.3949288 0.4050109 -3.444 0.000578 ***
## occ28:exp3 -0.2053899 0.2895727 -0.709 0.478181
## occ29:exp3 -0.0909660 0.3143348 -0.289 0.772293
## occ210:exp3 0.1854753 0.2575565 0.720 0.471477
## occ211:exp3 0.3931553 0.3817758 1.030 0.303152
## occ212:exp3 -0.2202559 0.3660206 -0.602 0.547363
## occ213:exp3 0.0950356 0.2904370 0.327 0.743519
## occ214:exp3 -0.1443933 0.3341622 -0.432 0.665684
## occ215:exp3 0.1477077 0.3645191 0.405 0.685339
## occ216:exp3 -0.0378548 0.2151288 -0.176 0.860330
## occ217:exp3 0.1510497 0.1878081 0.804 0.421276
## occ218:exp3 1.4084443 1.8852467 0.747 0.455047
## occ219:exp3 0.0923425 0.4042308 0.228 0.819314
## occ220:exp3 0.1806994 0.2652079 0.681 0.495682
## occ221:exp3 0.3779083 0.2553031 1.480 0.138875
## occ222:exp3 0.2855058 0.2984206 0.957 0.338754
## ind23:exp3 2.6665808 0.9807497 2.719 0.006573 **
## ind24:exp3 0.7298431 0.6879811 1.061 0.288811
## ind25:exp3 0.9942250 0.6842435 1.453 0.146280
## ind26:exp3 1.0641428 0.6800948 1.565 0.117718
## ind27:exp3 0.7089089 0.8337963 0.850 0.395245
## ind28:exp3 1.2340948 0.7483474 1.649 0.099193 .
## ind29:exp3 0.8287315 0.6675904 1.241 0.214526
## ind210:exp3 1.0448162 0.7066717 1.479 0.139337
## ind211:exp3 1.6877578 0.7162155 2.356 0.018487 *
## ind212:exp3 1.3734455 0.6835570 2.009 0.044564 *
## ind213:exp3 1.6376669 0.7259301 2.256 0.024117 *
## ind214:exp3 1.0162910 0.6714525 1.514 0.130199
## ind215:exp3 0.1879483 1.0299675 0.182 0.855214
## ind216:exp3 0.6889680 0.6968028 0.989 0.322831
## ind217:exp3 1.0085540 0.6836992 1.475 0.140238
## ind218:exp3 1.0605598 0.6725232 1.577 0.114863
## ind219:exp3 0.8959865 0.7225602 1.240 0.215029
## ind220:exp3 0.9768944 0.6955822 1.404 0.160255
## ind221:exp3 1.4415215 0.6996480 2.060 0.039418 *
## ind222:exp3 0.9687884 0.6828498 1.419 0.156037
## mw:exp3 -0.0625771 0.1241291 -0.504 0.614194
## so:exp3 -0.0115842 0.1084217 -0.107 0.914917
## we:exp3 -0.0124875 0.1251376 -0.100 0.920515
## shs:exp4 0.0777418 0.0475427 1.635 0.102071
## hsg:exp4 0.0004913 0.0265964 0.018 0.985264
## scl:exp4 0.0210760 0.0245289 0.859 0.390256
## clg:exp4 0.0078695 0.0227528 0.346 0.729457
## occ22:exp4 0.0176389 0.0289257 0.610 0.542021
## occ23:exp4 0.0303057 0.0376552 0.805 0.420962
## occ24:exp4 0.0584146 0.0457704 1.276 0.201927
## occ25:exp4 -0.0515181 0.0549489 -0.938 0.348514
## occ26:exp4 -0.0170182 0.0440847 -0.386 0.699488
## occ27:exp4 0.1905353 0.0558757 3.410 0.000655 ***
## occ28:exp4 0.0196522 0.0379084 0.518 0.604195
## occ29:exp4 0.0190014 0.0421099 0.451 0.651841
## occ210:exp4 -0.0333347 0.0338825 -0.984 0.325246
## occ211:exp4 -0.0465914 0.0479018 -0.973 0.330778
## occ212:exp4 0.0110212 0.0470536 0.234 0.814820
## occ213:exp4 -0.0136895 0.0358988 -0.381 0.702970
## occ214:exp4 0.0055582 0.0400331 0.139 0.889581
## occ215:exp4 -0.0327444 0.0462379 -0.708 0.478872
## occ216:exp4 -0.0089706 0.0275729 -0.325 0.744937
## occ217:exp4 -0.0256735 0.0239306 -1.073 0.283400
## occ218:exp4 -0.2121372 0.2204003 -0.963 0.335841
## occ219:exp4 -0.0169398 0.0513428 -0.330 0.741463
## occ220:exp4 -0.0296125 0.0323353 -0.916 0.359819
## occ221:exp4 -0.0524577 0.0317251 -1.654 0.098291 .
## occ222:exp4 -0.0350646 0.0360687 -0.972 0.331018
## ind23:exp4 -0.3851791 0.1329065 -2.898 0.003771 **
## ind24:exp4 -0.1209478 0.0899580 -1.344 0.178852
## ind25:exp4 -0.1441045 0.0897994 -1.605 0.108616
## ind26:exp4 -0.1526110 0.0892689 -1.710 0.087410 .
## ind27:exp4 -0.1001993 0.1119398 -0.895 0.370768
## ind28:exp4 -0.1609664 0.0979780 -1.643 0.100471
## ind29:exp4 -0.1178080 0.0877821 -1.342 0.179642
## ind210:exp4 -0.1482842 0.0918416 -1.615 0.106469
## ind211:exp4 -0.2322961 0.0944506 -2.459 0.013949 *
## ind212:exp4 -0.1872911 0.0899985 -2.081 0.037481 *
## ind213:exp4 -0.2155617 0.0946011 -2.279 0.022731 *
## ind214:exp4 -0.1483524 0.0884992 -1.676 0.093740 .
## ind215:exp4 -0.0532195 0.1313815 -0.405 0.685439
## ind216:exp4 -0.1044336 0.0916252 -1.140 0.254429
## ind217:exp4 -0.1427349 0.0899315 -1.587 0.112543
## ind218:exp4 -0.1546248 0.0885883 -1.745 0.080973 .
## ind219:exp4 -0.1269592 0.0948784 -1.338 0.180918
## ind220:exp4 -0.1468554 0.0911188 -1.612 0.107094
## ind221:exp4 -0.2032619 0.0920972 -2.207 0.027358 *
## ind222:exp4 -0.1480951 0.0897937 -1.649 0.099154 .
## mw:exp4 0.0062439 0.0158699 0.393 0.694007
## so:exp4 0.0003145 0.0136275 0.023 0.981591
## we:exp4 0.0017685 0.0159602 0.111 0.911776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4708 on 4904 degrees of freedom
## Multiple R-squared: 0.3511, Adjusted R-squared: 0.3187
## F-statistic: 10.83 on 245 and 4904 DF, p-value: < 2.2e-16
cat( "Number of regressors in the flexible model:",length(regflex$coef))
## Number of regressors in the flexible model: 246
Python code
flex = 'lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
flex_results_0 = smf.ols(flex, data=data)
flex_results = smf.ols(flex, data=data).fit()
print(flex_results.summary())  # estimated coefficients
## OLS Regression Results
## ==============================================================================
## Dep. Variable: lwage R-squared: 0.351
## Model: OLS Adj. R-squared: 0.319
## Method: Least Squares F-statistic: 10.83
## Date: Wed, 24 Nov 2021 Prob (F-statistic): 2.69e-305
## Time: 12:19:43 Log-Likelihood: -3301.9
## No. Observations: 5150 AIC: 7096.
## Df Residuals: 4904 BIC: 8706.
## Df Model: 245
## Covariance Type: nonrobust
## ===================================================================================
## coef std err t P>|t| [0.025 0.975]
## -----------------------------------------------------------------------------------
## Intercept 3.2797 0.284 11.540 0.000 2.723 3.837
## occ2[T.10] 0.0210 0.156 0.134 0.893 -0.286 0.328
## occ2[T.11] -0.6424 0.309 -2.078 0.038 -1.248 -0.036
## occ2[T.12] -0.0675 0.252 -0.268 0.789 -0.562 0.427
## occ2[T.13] -0.2330 0.232 -1.006 0.314 -0.687 0.221
## occ2[T.14] 0.2562 0.323 0.794 0.427 -0.376 0.889
## occ2[T.15] -0.1939 0.260 -0.747 0.455 -0.703 0.315
## occ2[T.16] -0.0551 0.147 -0.375 0.708 -0.343 0.233
## occ2[T.17] -0.4156 0.136 -3.053 0.002 -0.682 -0.149
## occ2[T.18] -0.4822 1.044 -0.462 0.644 -2.530 1.565
## occ2[T.19] -0.2579 0.333 -0.776 0.438 -0.910 0.394
## occ2[T.2] 0.1613 0.130 1.244 0.214 -0.093 0.416
## occ2[T.20] -0.3010 0.234 -1.286 0.199 -0.760 0.158
## occ2[T.21] -0.4272 0.221 -1.936 0.053 -0.860 0.005
## occ2[T.22] -0.8695 0.298 -2.922 0.003 -1.453 -0.286
## occ2[T.3] 0.2102 0.169 1.246 0.213 -0.121 0.541
## occ2[T.4] 0.0709 0.184 0.386 0.700 -0.289 0.431
## occ2[T.5] -0.3960 0.189 -2.100 0.036 -0.766 -0.026
## occ2[T.6] -0.2311 0.187 -1.236 0.217 -0.598 0.135
## occ2[T.7] 0.3147 0.194 1.621 0.105 -0.066 0.695
## occ2[T.8] -0.1875 0.169 -1.108 0.268 -0.519 0.144
## occ2[T.9] -0.3390 0.167 -2.027 0.043 -0.667 -0.011
## ind2[T.11] -0.4047 0.314 -1.288 0.198 -1.021 0.211
## ind2[T.12] -0.1570 0.279 -0.562 0.574 -0.705 0.391
## ind2[T.13] -0.4377 0.362 -1.210 0.226 -1.147 0.271
## ind2[T.14] -0.0054 0.270 -0.020 0.984 -0.535 0.524
## ind2[T.15] 0.2004 0.501 0.400 0.689 -0.781 1.182
## ind2[T.16] 0.0102 0.300 0.034 0.973 -0.577 0.598
## ind2[T.17] -0.2396 0.285 -0.841 0.401 -0.798 0.319
## ind2[T.18] -0.1808 0.278 -0.649 0.516 -0.727 0.365
## ind2[T.19] -0.3007 0.327 -0.921 0.357 -0.941 0.339
## ind2[T.2] 0.5806 0.481 1.207 0.227 -0.362 1.523
## ind2[T.20] -0.3293 0.313 -1.052 0.293 -0.943 0.284
## ind2[T.21] -0.1781 0.304 -0.586 0.558 -0.773 0.417
## ind2[T.22] 0.1765 0.296 0.596 0.551 -0.404 0.757
## ind2[T.3] -0.6668 0.561 -1.189 0.234 -1.766 0.433
## ind2[T.4] 0.4858 0.350 1.389 0.165 -0.200 1.171
## ind2[T.5] 0.0512 0.298 0.172 0.863 -0.532 0.635
## ind2[T.6] -0.0416 0.299 -0.139 0.889 -0.628 0.545
## ind2[T.7] 0.0758 0.387 0.196 0.845 -0.683 0.835
## ind2[T.8] -0.1490 0.337 -0.441 0.659 -0.811 0.513
## ind2[T.9] -0.2219 0.275 -0.808 0.419 -0.760 0.316
## sex -0.0696 0.015 -4.570 0.000 -0.099 -0.040
## shs -0.1233 0.907 -0.136 0.892 -1.901 1.654
## hsg -0.5289 0.198 -2.675 0.008 -0.917 -0.141
## scl -0.2921 0.126 -2.318 0.021 -0.539 -0.045
## clg -0.0412 0.070 -0.585 0.559 -0.179 0.097
## mw 0.1107 0.081 1.359 0.174 -0.049 0.270
## so 0.0224 0.074 0.301 0.763 -0.123 0.168
## we -0.0216 0.084 -0.256 0.798 -0.187 0.143
## exp1 0.0835 0.094 0.884 0.377 -0.102 0.269
## exp1:occ2[T.10] 0.0076 0.058 0.130 0.897 -0.106 0.122
## exp1:occ2[T.11] 0.1014 0.101 1.009 0.313 -0.096 0.298
## exp1:occ2[T.12] -0.0863 0.087 -0.986 0.324 -0.258 0.085
## exp1:occ2[T.13] 0.0067 0.076 0.088 0.930 -0.143 0.156
## exp1:occ2[T.14] -0.1369 0.097 -1.405 0.160 -0.328 0.054
## exp1:occ2[T.15] -0.0400 0.090 -0.445 0.656 -0.216 0.136
## exp1:occ2[T.16] -0.0539 0.052 -1.035 0.301 -0.156 0.048
## exp1:occ2[T.17] 0.0147 0.047 0.315 0.753 -0.077 0.106
## exp1:occ2[T.18] 0.1074 0.472 0.228 0.820 -0.818 1.032
## exp1:occ2[T.19] 0.0047 0.106 0.044 0.965 -0.203 0.213
## exp1:occ2[T.2] -0.0736 0.050 -1.469 0.142 -0.172 0.025
## exp1:occ2[T.20] 0.0243 0.074 0.327 0.744 -0.121 0.170
## exp1:occ2[T.21] 0.0792 0.070 1.136 0.256 -0.057 0.216
## exp1:occ2[T.22] 0.1093 0.088 1.241 0.215 -0.063 0.282
## exp1:occ2[T.3] -0.0715 0.064 -1.121 0.262 -0.197 0.054
## exp1:occ2[T.4] -0.0724 0.075 -0.968 0.333 -0.219 0.074
## exp1:occ2[T.5] 0.0947 0.079 1.192 0.233 -0.061 0.250
## exp1:occ2[T.6] -0.0349 0.071 -0.490 0.624 -0.175 0.105
## exp1:occ2[T.7] -0.2279 0.078 -2.904 0.004 -0.382 -0.074
## exp1:occ2[T.8] -0.0727 0.065 -1.126 0.260 -0.199 0.054
## exp1:occ2[T.9] 0.0274 0.067 0.409 0.682 -0.104 0.159
## exp1:ind2[T.11] 0.1662 0.106 1.570 0.116 -0.041 0.374
## exp1:ind2[T.12] 0.1079 0.093 1.156 0.248 -0.075 0.291
## exp1:ind2[T.13] 0.1884 0.117 1.605 0.109 -0.042 0.418
## exp1:ind2[T.14] -0.0071 0.089 -0.080 0.936 -0.182 0.168
## exp1:ind2[T.15] -0.2081 0.203 -1.024 0.306 -0.607 0.190
## exp1:ind2[T.16] -0.0665 0.099 -0.671 0.502 -0.261 0.128
## exp1:ind2[T.17] 0.0216 0.095 0.229 0.819 -0.164 0.207
## exp1:ind2[T.18] 0.0053 0.091 0.058 0.954 -0.172 0.183
## exp1:ind2[T.19] 0.0004 0.111 0.003 0.997 -0.217 0.217
## exp1:ind2[T.2] -0.1513 0.164 -0.920 0.358 -0.474 0.171
## exp1:ind2[T.20] -0.0186 0.102 -0.182 0.855 -0.219 0.182
## exp1:ind2[T.21] 0.0678 0.101 0.673 0.501 -0.130 0.265
## exp1:ind2[T.22] -0.0367 0.096 -0.380 0.704 -0.226 0.152
## exp1:ind2[T.3] 0.3246 0.188 1.722 0.085 -0.045 0.694
## exp1:ind2[T.4] -0.1365 0.114 -1.194 0.233 -0.361 0.088
## exp1:ind2[T.5] -0.0256 0.096 -0.265 0.791 -0.215 0.163
## exp1:ind2[T.6] 0.0028 0.096 0.029 0.977 -0.185 0.191
## exp1:ind2[T.7] -0.0483 0.133 -0.364 0.716 -0.309 0.212
## exp1:ind2[T.8] 0.0845 0.118 0.715 0.475 -0.147 0.316
## exp1:ind2[T.9] -0.0153 0.087 -0.176 0.860 -0.186 0.156
## exp2 -0.5610 0.949 -0.591 0.554 -2.421 1.299
## exp2:occ2[T.10] -0.2692 0.641 -0.420 0.674 -1.525 0.986
## exp2:occ2[T.11] -1.0817 1.006 -1.075 0.282 -3.053 0.890
## exp2:occ2[T.12] 0.8324 0.934 0.891 0.373 -0.999 2.664
## exp2:occ2[T.13] -0.2210 0.773 -0.286 0.775 -1.736 1.294
## exp2:occ2[T.14] 0.7511 0.927 0.810 0.418 -1.067 2.569
## exp2:occ2[T.15] -0.0327 0.941 -0.035 0.972 -1.877 1.812
## exp2:occ2[T.16] 0.3636 0.551 0.660 0.509 -0.717 1.444
## exp2:occ2[T.17] -0.2659 0.486 -0.547 0.584 -1.219 0.687
## exp2:occ2[T.18] -2.5609 5.170 -0.495 0.620 -12.697 7.575
## exp2:occ2[T.19] -0.1292 1.062 -0.122 0.903 -2.211 1.952
## exp2:occ2[T.2] 0.6632 0.552 1.201 0.230 -0.420 1.746
## exp2:occ2[T.20] -0.3323 0.723 -0.460 0.646 -1.750 1.085
## exp2:occ2[T.21] -0.9100 0.685 -1.328 0.184 -2.254 0.434
## exp2:occ2[T.22] -0.8551 0.828 -1.033 0.302 -2.478 0.768
## exp2:occ2[T.3] 0.6415 0.710 0.903 0.366 -0.751 2.034
## exp2:occ2[T.4] 0.9748 0.866 1.126 0.260 -0.722 2.672
## exp2:occ2[T.5] -0.9779 0.974 -1.004 0.315 -2.887 0.931
## exp2:occ2[T.6] 0.1051 0.800 0.131 0.896 -1.464 1.674
## exp2:occ2[T.7] 3.1407 0.939 3.345 0.001 1.300 4.981
## exp2:occ2[T.8] 0.6711 0.719 0.933 0.351 -0.739 2.081
## exp2:occ2[T.9] 0.0232 0.763 0.030 0.976 -1.472 1.519
## exp2:ind2[T.11] -1.6804 1.080 -1.555 0.120 -3.798 0.438
## exp2:ind2[T.12] -0.9717 0.935 -1.039 0.299 -2.804 0.861
## exp2:ind2[T.13] -1.7679 1.156 -1.529 0.126 -4.035 0.499
## exp2:ind2[T.14] 0.1190 0.888 0.134 0.893 -1.622 1.860
## exp2:ind2[T.15] 2.3885 2.202 1.085 0.278 -1.928 6.705
## exp2:ind2[T.16] 0.8707 0.990 0.879 0.379 -1.070 2.812
## exp2:ind2[T.17] -0.0030 0.948 -0.003 0.998 -1.862 1.856
## exp2:ind2[T.18] -0.0033 0.889 -0.004 0.997 -1.746 1.740
## exp2:ind2[T.19] 0.2665 1.119 0.238 0.812 -1.927 2.460
## exp2:ind2[T.2] 2.1973 1.774 1.239 0.216 -1.280 5.675
## exp2:ind2[T.20] 0.2506 1.010 0.248 0.804 -1.729 2.231
## exp2:ind2[T.21] -0.9154 1.015 -0.902 0.367 -2.904 1.074
## exp2:ind2[T.22] 0.3395 0.949 0.358 0.721 -1.522 2.201
## exp2:ind2[T.3] -3.7396 1.959 -1.908 0.056 -7.581 0.102
## exp2:ind2[T.4] 1.0920 1.143 0.955 0.339 -1.149 3.333
## exp2:ind2[T.5] 0.1824 0.939 0.194 0.846 -1.658 2.023
## exp2:ind2[T.6] -0.0304 0.930 -0.033 0.974 -1.853 1.792
## exp2:ind2[T.7] 0.7325 1.440 0.509 0.611 -2.090 3.555
## exp2:ind2[T.8] -0.7507 1.203 -0.624 0.533 -3.108 1.607
## exp2:ind2[T.9] 0.4177 0.839 0.498 0.619 -1.227 2.062
## exp3 0.1293 0.357 0.363 0.717 -0.570 0.828
## exp3:occ2[T.10] 0.1855 0.258 0.720 0.471 -0.319 0.690
## exp3:occ2[T.11] 0.3932 0.382 1.030 0.303 -0.355 1.142
## exp3:occ2[T.12] -0.2203 0.366 -0.602 0.547 -0.938 0.497
## exp3:occ2[T.13] 0.0950 0.290 0.327 0.744 -0.474 0.664
## exp3:occ2[T.14] -0.1444 0.334 -0.432 0.666 -0.800 0.511
## exp3:occ2[T.15] 0.1477 0.365 0.405 0.685 -0.567 0.862
## exp3:occ2[T.16] -0.0379 0.215 -0.176 0.860 -0.460 0.384
## exp3:occ2[T.17] 0.1510 0.188 0.804 0.421 -0.217 0.519
## exp3:occ2[T.18] 1.4084 1.885 0.747 0.455 -2.287 5.104
## exp3:occ2[T.19] 0.0923 0.404 0.228 0.819 -0.700 0.885
## exp3:occ2[T.2] -0.2039 0.221 -0.922 0.356 -0.637 0.230
## exp3:occ2[T.20] 0.1807 0.265 0.681 0.496 -0.339 0.701
## exp3:occ2[T.21] 0.3779 0.255 1.480 0.139 -0.123 0.878
## exp3:occ2[T.22] 0.2855 0.298 0.957 0.339 -0.300 0.871
## exp3:occ2[T.3] -0.2370 0.287 -0.826 0.409 -0.800 0.326
## exp3:occ2[T.4] -0.4367 0.352 -1.241 0.215 -1.127 0.253
## exp3:occ2[T.5] 0.3885 0.412 0.943 0.346 -0.419 1.196
## exp3:occ2[T.6] 0.0485 0.329 0.147 0.883 -0.597 0.694
## exp3:occ2[T.7] -1.3949 0.405 -3.444 0.001 -2.189 -0.601
## exp3:occ2[T.8] -0.2054 0.290 -0.709 0.478 -0.773 0.362
## exp3:occ2[T.9] -0.0910 0.314 -0.289 0.772 -0.707 0.525
## exp3:ind2[T.11] 0.6429 0.411 1.564 0.118 -0.163 1.449
## exp3:ind2[T.12] 0.3286 0.347 0.947 0.344 -0.351 1.009
## exp3:ind2[T.13] 0.5929 0.426 1.392 0.164 -0.242 1.428
## exp3:ind2[T.14] -0.0285 0.328 -0.087 0.931 -0.672 0.615
## exp3:ind2[T.15] -0.8569 0.844 -1.016 0.310 -2.511 0.797
## exp3:ind2[T.16] -0.3558 0.368 -0.968 0.333 -1.077 0.365
## exp3:ind2[T.17] -0.0363 0.352 -0.103 0.918 -0.727 0.654
## exp3:ind2[T.18] 0.0157 0.325 0.048 0.961 -0.621 0.653
## exp3:ind2[T.19] -0.1488 0.421 -0.354 0.724 -0.974 0.676
## exp3:ind2[T.2] -1.0448 0.707 -1.479 0.139 -2.430 0.341
## exp3:ind2[T.20] -0.0679 0.370 -0.183 0.855 -0.794 0.658
## exp3:ind2[T.21] 0.3967 0.381 1.041 0.298 -0.351 1.144
## exp3:ind2[T.22] -0.0760 0.349 -0.218 0.827 -0.760 0.608
## exp3:ind2[T.3] 1.6218 0.785 2.067 0.039 0.084 3.160
## exp3:ind2[T.4] -0.3150 0.429 -0.735 0.463 -1.155 0.526
## exp3:ind2[T.5] -0.0506 0.340 -0.149 0.882 -0.717 0.616
## exp3:ind2[T.6] 0.0193 0.336 0.058 0.954 -0.639 0.678
## exp3:ind2[T.7] -0.3359 0.586 -0.573 0.567 -1.485 0.813
## exp3:ind2[T.8] 0.1893 0.452 0.419 0.675 -0.696 1.075
## exp3:ind2[T.9] -0.2161 0.300 -0.721 0.471 -0.804 0.372
## exp4 -0.0053 0.045 -0.120 0.904 -0.093 0.082
## exp4:occ2[T.10] -0.0333 0.034 -0.984 0.325 -0.100 0.033
## exp4:occ2[T.11] -0.0466 0.048 -0.973 0.331 -0.141 0.047
## exp4:occ2[T.12] 0.0110 0.047 0.234 0.815 -0.081 0.103
## exp4:occ2[T.13] -0.0137 0.036 -0.381 0.703 -0.084 0.057
## exp4:occ2[T.14] 0.0056 0.040 0.139 0.890 -0.073 0.084
## exp4:occ2[T.15] -0.0327 0.046 -0.708 0.479 -0.123 0.058
## exp4:occ2[T.16] -0.0090 0.028 -0.325 0.745 -0.063 0.045
## exp4:occ2[T.17] -0.0257 0.024 -1.073 0.283 -0.073 0.021
## exp4:occ2[T.18] -0.2121 0.220 -0.963 0.336 -0.644 0.220
## exp4:occ2[T.19] -0.0169 0.051 -0.330 0.741 -0.118 0.084
## exp4:occ2[T.2] 0.0176 0.029 0.610 0.542 -0.039 0.074
## exp4:occ2[T.20] -0.0296 0.032 -0.916 0.360 -0.093 0.034
## exp4:occ2[T.21] -0.0525 0.032 -1.654 0.098 -0.115 0.010
## exp4:occ2[T.22] -0.0351 0.036 -0.972 0.331 -0.106 0.036
## exp4:occ2[T.3] 0.0303 0.038 0.805 0.421 -0.044 0.104
## exp4:occ2[T.4] 0.0584 0.046 1.276 0.202 -0.031 0.148
## exp4:occ2[T.5] -0.0515 0.055 -0.938 0.349 -0.159 0.056
## exp4:occ2[T.6] -0.0170 0.044 -0.386 0.699 -0.103 0.069
## exp4:occ2[T.7] 0.1905 0.056 3.410 0.001 0.081 0.300
## exp4:occ2[T.8] 0.0197 0.038 0.518 0.604 -0.055 0.094
## exp4:occ2[T.9] 0.0190 0.042 0.451 0.652 -0.064 0.102
## exp4:ind2[T.11] -0.0840 0.052 -1.619 0.106 -0.186 0.018
## exp4:ind2[T.12] -0.0390 0.042 -0.918 0.359 -0.122 0.044
## exp4:ind2[T.13] -0.0673 0.052 -1.297 0.195 -0.169 0.034
## exp4:ind2[T.14] -6.822e-05 0.040 -0.002 0.999 -0.079 0.078
## exp4:ind2[T.15] 0.0951 0.104 0.911 0.362 -0.110 0.300
## exp4:ind2[T.16] 0.0439 0.045 0.970 0.332 -0.045 0.132
## exp4:ind2[T.17] 0.0055 0.043 0.129 0.898 -0.079 0.090
## exp4:ind2[T.18] -0.0063 0.039 -0.161 0.872 -0.084 0.071
## exp4:ind2[T.19] 0.0213 0.052 0.407 0.684 -0.081 0.124
## exp4:ind2[T.2] 0.1483 0.092 1.615 0.106 -0.032 0.328
## exp4:ind2[T.20] 0.0014 0.045 0.032 0.975 -0.087 0.089
## exp4:ind2[T.21] -0.0550 0.047 -1.160 0.246 -0.148 0.038
## exp4:ind2[T.22] 0.0002 0.042 0.004 0.996 -0.083 0.083
## exp4:ind2[T.3] -0.2369 0.107 -2.220 0.026 -0.446 -0.028
## exp4:ind2[T.4] 0.0273 0.053 0.511 0.609 -0.077 0.132
## exp4:ind2[T.5] 0.0042 0.041 0.103 0.918 -0.076 0.084
## exp4:ind2[T.6] -0.0043 0.040 -0.108 0.914 -0.083 0.075
## exp4:ind2[T.7] 0.0481 0.078 0.613 0.540 -0.106 0.202
## exp4:ind2[T.8] -0.0127 0.056 -0.226 0.821 -0.123 0.097
## exp4:ind2[T.9] 0.0305 0.035 0.861 0.389 -0.039 0.100
## exp1:shs -0.1920 0.196 -0.982 0.326 -0.575 0.191
## exp1:hsg -0.0173 0.057 -0.303 0.762 -0.130 0.095
## exp1:scl -0.0665 0.043 -1.532 0.126 -0.151 0.019
## exp1:clg -0.0550 0.031 -1.774 0.076 -0.116 0.006
## exp1:mw -0.0280 0.030 -0.944 0.345 -0.086 0.030
## exp1:so -0.0100 0.027 -0.374 0.709 -0.062 0.042
## exp1:we 0.0063 0.030 0.209 0.834 -0.053 0.065
## exp2:shs 1.9005 1.450 1.310 0.190 -0.943 4.744
## exp2:hsg 0.1172 0.551 0.213 0.832 -0.963 1.197
## exp2:scl 0.6218 0.463 1.343 0.179 -0.286 1.529
## exp2:clg 0.4097 0.380 1.077 0.281 -0.336 1.155
## exp2:mw 0.2006 0.317 0.632 0.527 -0.421 0.823
## exp2:so 0.0544 0.282 0.193 0.847 -0.498 0.606
## exp2:we 0.0013 0.321 0.004 0.997 -0.628 0.630
## exp3:shs -0.6721 0.443 -1.518 0.129 -1.540 0.196
## exp3:hsg -0.0180 0.208 -0.086 0.931 -0.426 0.390
## exp3:scl -0.1998 0.186 -1.077 0.282 -0.563 0.164
## exp3:clg -0.1025 0.164 -0.624 0.533 -0.425 0.220
## exp3:mw -0.0626 0.124 -0.504 0.614 -0.306 0.181
## exp3:so -0.0116 0.108 -0.107 0.915 -0.224 0.201
## exp3:we -0.0125 0.125 -0.100 0.921 -0.258 0.233
## exp4:shs 0.0777 0.048 1.635 0.102 -0.015 0.171
## exp4:hsg 0.0005 0.027 0.018 0.985 -0.052 0.053
## exp4:scl 0.0211 0.025 0.859 0.390 -0.027 0.069
## exp4:clg 0.0079 0.023 0.346 0.729 -0.037 0.052
## exp4:mw 0.0062 0.016 0.393 0.694 -0.025 0.037
## exp4:so 0.0003 0.014 0.023 0.982 -0.026 0.027
## exp4:we 0.0018 0.016 0.111 0.912 -0.030 0.033
## ==============================================================================
## Omnibus: 395.012 Durbin-Watson: 1.898
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1529.250
## Skew: 0.303 Prob(JB): 0.00
## Kurtosis: 5.600 Cond. No. 6.87e+04
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.87e+04. This might indicate that there are
## strong multicollinearity or other numerical problems.
print( "Number of regressors in the basic model:",len(flex_results.params), '\n')
## Number of regressors in the basic model: 246
The number of regressors in the flexible model is 246.
2.3.3 Lasso Model:
First, we import the essential libraries.
R code
library(hdm)
Python code
from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
Then, we estimate the lasso model.
R code
flex <- lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we +
  (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + occ2 + ind2 + mw + so + we)
lassoreg <- rlasso(flex, data = data)
sumlasso <- summary(lassoreg)
##
## Call:
## rlasso.formula(formula = flex, data = data)
##
## Post-Lasso Estimation: TRUE
##
## Total number of variables: 245
## Number of selected variables: 24
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03159 -0.29132 -0.01137 0.28472 3.63651
##
## Estimate
## (Intercept) 3.137
## sex 0.000
## shs -0.480
## hsg -0.404
## scl -0.306
## clg 0.000
## occ22 0.041
## occ23 0.109
## occ24 0.000
## occ25 0.000
## occ26 -0.327
## occ27 0.000
## occ28 -0.257
## occ29 0.000
## occ210 0.038
## occ211 -0.398
## occ212 0.000
## occ213 -0.131
## occ214 -0.189
## occ215 -0.387
## occ216 0.000
## occ217 -0.280
## occ218 0.000
## occ219 0.000
## occ220 0.000
## occ221 0.000
## occ222 -0.209
## ind23 0.000
## ind24 0.000
## ind25 0.000
## ind26 0.000
## ind27 0.000
## ind28 0.000
## ind29 -0.169
## ind210 0.000
## ind211 0.000
## ind212 0.193
## ind213 0.000
## ind214 0.076
## ind215 0.000
## ind216 0.000
## ind217 0.000
## ind218 0.000
## ind219 0.000
## ind220 -0.225
## ind221 0.000
## ind222 0.140
## mw 0.000
## so 0.000
## we 0.000
## exp1 0.009
## exp2 0.000
## exp3 0.000
## exp4 0.000
## shs:exp1 0.000
## hsg:exp1 0.000
## scl:exp1 0.000
## clg:exp1 0.000
## occ22:exp1 0.000
## occ23:exp1 0.000
## occ24:exp1 0.000
## occ25:exp1 0.000
## occ26:exp1 0.000
## occ27:exp1 0.000
## occ28:exp1 0.000
## occ29:exp1 0.000
## occ210:exp1 0.003
## occ211:exp1 0.000
## occ212:exp1 0.000
## occ213:exp1 -0.011
## occ214:exp1 -0.009
## occ215:exp1 0.000
## occ216:exp1 0.000
## occ217:exp1 0.000
## occ218:exp1 0.000
## occ219:exp1 0.000
## occ220:exp1 0.000
## occ221:exp1 0.000
## occ222:exp1 0.000
## ind23:exp1 0.000
## ind24:exp1 0.000
## ind25:exp1 0.000
## ind26:exp1 0.000
## ind27:exp1 0.000
## ind28:exp1 0.000
## ind29:exp1 0.000
## ind210:exp1 0.000
## ind211:exp1 0.000
## ind212:exp1 0.000
## ind213:exp1 0.000
## ind214:exp1 0.004
## ind215:exp1 0.000
## ind216:exp1 0.000
## ind217:exp1 0.000
## ind218:exp1 0.000
## ind219:exp1 0.000
## ind220:exp1 0.000
## ind221:exp1 0.000
## ind222:exp1 0.000
## mw:exp1 0.000
## so:exp1 0.000
## we:exp1 0.000
## shs:exp2 0.000
## hsg:exp2 0.000
## scl:exp2 0.000
## clg:exp2 0.000
## occ22:exp2 0.000
## occ23:exp2 0.000
## occ24:exp2 0.000
## occ25:exp2 0.000
## occ26:exp2 0.000
## occ27:exp2 0.000
## occ28:exp2 0.000
## occ29:exp2 0.000
## occ210:exp2 0.000
## occ211:exp2 0.000
## occ212:exp2 0.000
## occ213:exp2 0.000
## occ214:exp2 0.000
## occ215:exp2 0.000
## occ216:exp2 0.000
## occ217:exp2 0.000
## occ218:exp2 0.000
## occ219:exp2 0.000
## occ220:exp2 0.000
## occ221:exp2 0.000
## occ222:exp2 0.000
## ind23:exp2 0.000
## ind24:exp2 0.000
## ind25:exp2 0.000
## ind26:exp2 0.000
## ind27:exp2 0.000
## ind28:exp2 0.000
## ind29:exp2 0.000
## ind210:exp2 0.000
## ind211:exp2 0.000
## ind212:exp2 0.000
## ind213:exp2 0.000
## ind214:exp2 0.000
## ind215:exp2 0.000
## ind216:exp2 0.000
## ind217:exp2 0.000
## ind218:exp2 0.000
## ind219:exp2 0.000
## ind220:exp2 0.000
## ind221:exp2 0.000
## ind222:exp2 0.000
## mw:exp2 0.000
## so:exp2 0.000
## we:exp2 0.000
## shs:exp3 0.000
## hsg:exp3 0.000
## scl:exp3 0.000
## clg:exp3 0.000
## occ22:exp3 0.000
## occ23:exp3 0.000
## occ24:exp3 0.000
## occ25:exp3 0.000
## occ26:exp3 0.000
## occ27:exp3 0.000
## occ28:exp3 0.000
## occ29:exp3 0.000
## occ210:exp3 0.000
## occ211:exp3 0.000
## occ212:exp3 0.000
## occ213:exp3 0.000
## occ214:exp3 0.000
## occ215:exp3 0.000
## occ216:exp3 0.000
## occ217:exp3 0.000
## occ218:exp3 0.000
## occ219:exp3 0.000
## occ220:exp3 0.000
## occ221:exp3 0.000
## occ222:exp3 0.000
## ind23:exp3 0.000
## ind24:exp3 0.000
## ind25:exp3 0.000
## ind26:exp3 0.000
## ind27:exp3 0.000
## ind28:exp3 0.000
## ind29:exp3 0.000
## ind210:exp3 0.000
## ind211:exp3 0.000
## ind212:exp3 0.000
## ind213:exp3 0.000
## ind214:exp3 0.000
## ind215:exp3 0.000
## ind216:exp3 0.000
## ind217:exp3 0.000
## ind218:exp3 0.000
## ind219:exp3 0.000
## ind220:exp3 0.000
## ind221:exp3 0.000
## ind222:exp3 0.000
## mw:exp3 0.000
## so:exp3 0.000
## we:exp3 0.000
## shs:exp4 0.000
## hsg:exp4 0.000
## scl:exp4 0.000
## clg:exp4 0.000
## occ22:exp4 0.000
## occ23:exp4 0.000
## occ24:exp4 0.000
## occ25:exp4 0.000
## occ26:exp4 0.000
## occ27:exp4 0.000
## occ28:exp4 0.000
## occ29:exp4 0.000
## occ210:exp4 0.000
## occ211:exp4 0.000
## occ212:exp4 0.000
## occ213:exp4 0.000
## occ214:exp4 0.000
## occ215:exp4 0.000
## occ216:exp4 0.000
## occ217:exp4 0.000
## occ218:exp4 0.000
## occ219:exp4 0.000
## occ220:exp4 0.000
## occ221:exp4 0.000
## occ222:exp4 0.000
## ind23:exp4 0.000
## ind24:exp4 0.000
## ind25:exp4 0.000
## ind26:exp4 0.000
## ind27:exp4 0.000
## ind28:exp4 0.000
## ind29:exp4 0.000
## ind210:exp4 0.000
## ind211:exp4 0.000
## ind212:exp4 0.000
## ind213:exp4 0.000
## ind214:exp4 0.000
## ind215:exp4 0.000
## ind216:exp4 0.000
## ind217:exp4 0.000
## ind218:exp4 0.000
## ind219:exp4 0.000
## ind220:exp4 0.000
## ind221:exp4 0.000
## ind222:exp4 0.000
## mw:exp4 0.000
## so:exp4 0.000
## we:exp4 0.000
##
## Residual standard error: 0.4847
## Multiple R-squared: 0.2779
## Adjusted R-squared: 0.2745
## Joint significance test:
## the sup score statistic for joint significance test is 89.09 with a p-value of 0.008
Python code
# Get exogenous variables from the flexible model
X = flex_results_0.exog
X.shape
## (5150, 246)

# Set endogenous variable
lwage = data["lwage"]
lwage.shape
## (5150,)

# Set penalty value
alpha = 0.1
# reg = linear_model.Lasso(alpha=0.1/np.log(len(lwage)))
reg = linear_model.Lasso(alpha=alpha)

# LASSO regression for the flexible model
reg.fit(X, lwage)
## Lasso(alpha=0.1)
lwage_lasso_fitted = reg.fit(X, lwage).predict(X)

# coefficients
# reg.coef_
print('Lasso Regression: R^2 score', reg.score(X, lwage))
## Lasso Regression: R^2 score 0.16047849625520638
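The penalty level \(\alpha = 0.1\) above is an ad-hoc choice, and unlike hdm's rlasso, sklearn's Lasso has no built-in theory-driven penalty. A common data-driven alternative is to select \(\alpha\) by cross-validation with LassoCV (already imported above); a minimal sketch, with an arbitrary choice of 5 folds:
Python code
# Select the penalty by 5-fold cross-validation instead of fixing alpha = 0.1.
# In practice one would typically standardize the columns of X first.
reg_cv = LassoCV(cv=5, random_state=0).fit(X, lwage)
print("CV-selected penalty:", reg_cv.alpha_)
print("Lasso (CV) R^2 score:", reg_cv.score(X, lwage))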
Now, we can evaluate the in-sample performance of all three models based on the (adjusted) \(R^2_{sample}\) and the (adjusted) \(MSE_{sample}\):
- R-Squared \((R^2)\)
R code
# Summaries of the basic and flexible models
sumbasic <- summary(regbasic)
sumflex <- summary(regflex)

# R-squared from the basic, flexible and lasso models
R2.1 <- sumbasic$r.squared
R2.adj1 <- sumbasic$adj.r.squared

R2.2 <- sumflex$r.squared
R2.adj2 <- sumflex$adj.r.squared

R2.L <- sumlasso$r.squared
R2.adjL <- sumlasso$adj.r.squared
- R-squared for the basic model: 0.3100465.
- Adjusted R-squared for the basic model: 0.3032809.
- R-squared for the flexible model: 0.3511099.
- Adjusted R-squared for the flexible model: 0.3186919.
- R-squared for the lasso with flexible model: 0.2778653.
- Adjusted R-squared for the lasso with flexible model: 0.2744836.
Python code
# Assess the in-sample predictive performance
R2_1 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1, "\n")
## R-squared for the basic model: 0.31004650692219504
R2_adj1 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1, "\n")
## adjusted R-squared for the basic model: 0.3032809304064292
R2_2 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2, "\n")
## R-squared for the flexible model: 0.3511098950617233
R2_adj2 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2, "\n")
## adjusted R-squared for the flexible model: 0.31869185352218865
R2_L = reg.score(flex_results_0.exog, lwage)
print("R-squared for LASSO: ", R2_L, "\n")
## R-squared for LASSO: 0.16047849625520638
R2_adjL = 1 - (1-R2_L)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)
print("adjusted R-squared for LASSO: ", R2_adjL, "\n")
## adjusted R-squared for LASSO: 0.11835687889415825
- Mean Squared Error \(MSE\)
R code
# MSE and adjusted MSE for the basic model
MSE1 <- mean(sumbasic$res^2)
p1 <- sumbasic$df[1] # number of regressors
MSE.adj1 <- (n / (n - p1)) * MSE1

# MSE and adjusted MSE for the flexible model
MSE2 <- mean(sumflex$res^2)
p2 <- sumflex$df[1]
MSE.adj2 <- (n / (n - p2)) * MSE2

# MSE and adjusted MSE for the lasso model
MSEL <- mean(sumlasso$res^2)
pL <- length(sumlasso$coef)
MSE.adjL <- (n / (n - pL)) * MSEL
- MSE for the basic model: 0.2244251
- Adjusted MSE for the basic model: 0.2266697
- MSE for the flexible model: 0.2110681
- Adjusted MSE for the flexible model: 0.221656
- MSE for the lasso flexible model: 0.2348928
- Adjusted MSE for the lasso flexible model: 0.2466758
Python code
# calculating the MSE
MSE1 = np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1, "\n")
## MSE for the basic model: 0.22442505581164396
p1 = len(basic_results.params) # number of regressors
n = len(lwage)
MSE_adj1 = (n/(n-p1))*MSE1
print("adjusted MSE for the basic model: ", MSE_adj1, "\n")
## adjusted MSE for the basic model: 0.2266697465051905
MSE2 = np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2, "\n")
## MSE for the flexible model: 0.21106813644318217
p2 = len(flex_results.params) # number of regressors
n = len(lwage)
MSE_adj2 = (n/(n-p2))*MSE2
print("adjusted MSE for the flexible model: ", MSE_adj2, "\n")
## adjusted MSE for the flexible model: 0.2216559752614984
MSEL = mean_squared_error(lwage, lwage_lasso_fitted)
print("MSE for the LASSO model: ", MSEL, "\n")
## MSE for the LASSO model: 0.2730758844230591
pL = reg.coef_.shape[0] # number of regressors
n = len(lwage)
MSE_adjL = (n/(n-pL))*MSEL
print("adjusted MSE for LASSO model: ", MSE_adjL, "\n")
## adjusted MSE for LASSO model: 0.28677422609680964
LaTeX presentation
R code
<- c("Basic reg","Flexible reg","Lasso reg")
Models <- c(p1,p2,pL)
p <- c(R2.1,R2.2,R2.L)
R_2 <- c(MSE1,MSE2,MSEL)
MSE <- c(R2.adj1,R2.adj2,R2.adjL)
R_2_adj <- c(MSE.adj1,MSE.adj2,MSE.adjL) MSE_adj
data.frame(Models,p,R_2,MSE,R_2_adj,MSE_adj) %>%
kable("markdown",caption = "Descriptive Statistics")
Models | p | R_2 | MSE | R_2_adj | MSE_adj |
---|---|---|---|---|---|
Basic reg | 51 | 0.3100465 | 0.2244251 | 0.3032809 | 0.2266697 |
Flexible reg | 246 | 0.3511099 | 0.2110681 | 0.3186919 | 0.2216560 |
Lasso reg | 246 | 0.2778653 | 0.2348928 | 0.2744836 | 0.2466758 |
Python code
# import array_to_latex as a2l
table = np.zeros((3, 5))
table[0, 0:5] = [p1, R2_1, MSE1, R2_adj1, MSE_adj1]
table[1, 0:5] = [p2, R2_2, MSE2, R2_adj2, MSE_adj2]
table[2, 0:5] = [pL, R2_L, MSEL, R2_adjL, MSE_adjL]
table
## array([[5.10000000e+01, 3.10046507e-01, 2.24425056e-01, 3.03280930e-01,
## 2.26669747e-01],
## [2.46000000e+02, 3.51109895e-01, 2.11068136e-01, 3.18691854e-01,
## 2.21655975e-01],
## [2.46000000e+02, 1.60478496e-01, 2.73075884e-01, 1.18356879e-01,
## 2.86774226e-01]])
table = pd.DataFrame(table, columns=["p", "$R^2_{sample}$", "$MSE_{sample}$", "$R^2_{adjusted}$", "$MSE_{adjusted}$"], index=["basic reg", "flexible reg", "lasso flex"])
table
## p $R^2_{sample}$ ... $R^2_{adjusted}$ $MSE_{adjusted}$
## basic reg 51.0 0.310047 ... 0.303281 0.226670
## flexible reg 246.0 0.351110 ... 0.318692 0.221656
## lasso flex 246.0 0.160478 ... 0.118357 0.286774
##
## [3 rows x 5 columns]
Considering all measures above, the flexible model performs slightly better than the basic model. Note, however, that all of these are in-sample measures: with a large number of regressors they can be overly optimistic because of overfitting. One procedure to circumvent this issue is data splitting, which is described and applied in the following.
2.4 Data Splitting
Measure the prediction quality of the two models via data splitting:
Randomly split the data into one training sample and one testing sample. Here we use a simple random split; stratified splitting is a more sophisticated alternative.
Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
Use the testing sample for evaluation. Predict the \(wage\) of every observation in the testing sample based on the estimated parameters in the training sample.
Calculate the Mean Squared Prediction Error \(MSE_{test}\) based on the testing sample for both prediction models (a compact preview of these steps is sketched below).
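Before walking through the R and Python implementations, here is a minimal Python preview of the four steps for the basic model, using sklearn's train_test_split as an alternative to the manual splits below (it reuses data, basic, smf and np from above; note that with rare occupation or industry categories a random split can occasionally leave a test-only level that makes the prediction step fail):
Python code
from sklearn.model_selection import train_test_split

# Step 1: random 4/5 - 1/5 split
train_df, test_df = train_test_split(data, test_size=0.2, random_state=0)

# Step 2: estimate the basic model on the training sample
fit_train = smf.ols(basic, data=train_df).fit()

# Step 3: predict log wages for the testing sample
yhat_test = fit_train.predict(test_df)

# Step 4: out-of-sample MSE and R^2
y_test = test_df["lwage"]
MSE_test = np.mean((y_test - yhat_test)**2)
R2_test = 1 - MSE_test / np.var(y_test)
print("Test MSE:", MSE_test, "  Test R^2:", R2_test)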
R code
# to make the results replicable (generating random numbers)
set.seed(1)
# draw (4/5)*n random numbers from 1 to n without replacing them
random_2 <- sample(1:n, floor(n*4/5))
# training sample
train <- data[random_2,]
# testing sample
test <- data[-random_2,]
Python code
# Import relevant packages for splitting data
import random
import math
# Set Seed
# to make the results replicable (generating random numbers)
np.random.seed(0)
random = np.random.randint(0, n, size=math.floor(n))
data["random"] = random
random # the array does not change
## array([2732, 2607, 1653, ..., 4184, 2349, 3462])
data_2 = data.sort_values(by=['random'])
data_2.head()
## wage lwage sex shs hsg ... occ occ2 ind ind2 random
## rownames ...
## 2223 26.442308 3.274965 1.0 0.0 1.0 ... 340 1 8660 20 0
## 3467 19.230769 2.956512 0.0 0.0 0.0 ... 9620 22 1870 5 0
## 13501 48.076923 3.872802 1.0 0.0 0.0 ... 3060 10 8190 18 0
## 15588 12.019231 2.486508 0.0 0.0 1.0 ... 6440 19 770 4 2
## 16049 39.903846 3.686473 1.0 0.0 0.0 ... 1820 5 7860 17 2
##
## [5 rows x 21 columns]
# Create training and testing sample
train = data_2[ : math.floor(n*4/5)] # training sample
test = data_2[ math.floor(n*4/5) : ] # testing sample
print(train.shape)
## (4120, 21)
print(test.shape)
## (1030, 21)
The training sample has 4120 rows and the testing sample has 1030 rows; the 21st column in the Python data frames is the auxiliary random variable created for the split.
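As an aside, a simple random split can also be produced directly with scikit-learn's `train_test_split`, which permutes the rows without replacement (a minimal sketch, assuming scikit-learn is installed; `train_alt` and `test_alt` are illustrative names):

# a simple 80/20 random split of the rows; random_state fixes the seed
from sklearn.model_selection import train_test_split

train_alt, test_alt = train_test_split(data, train_size=0.8, random_state=0)
print(train_alt.shape, test_alt.shape)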
We estimate the parameters using the training data set.
R code
# basic model
# estimating the parameters in the training sample
regbasic <- lm(basic, data=train)
regbasic
##
## Call:
## lm(formula = basic, data = train)
##
## Coefficients:
## (Intercept) sex exp1 shs hsg scl
## 3.641716 -0.059065 0.008635 -0.570657 -0.508393 -0.405932
## clg mw so we occ22 occ23
## -0.178154 -0.044486 -0.051111 0.004004 -0.083481 -0.036320
## occ24 occ25 occ26 occ27 occ28 occ29
## -0.091457 -0.126383 -0.416548 -0.046615 -0.385957 -0.220534
## occ210 occ211 occ212 occ213 occ214 occ215
## -0.030423 -0.460487 -0.317680 -0.375180 -0.465495 -0.494731
## occ216 occ217 occ218 occ219 occ220 occ221
## -0.212026 -0.413355 -0.329054 -0.263839 -0.241109 -0.293004
## occ222 ind23 ind24 ind25 ind26 ind27
## -0.410747 -0.075902 -0.121251 -0.206172 -0.151398 -0.081742
## ind28 ind29 ind210 ind211 ind212 ind213
## -0.186304 -0.304987 -0.126226 -0.090402 0.028445 -0.116162
## ind214 ind215 ind216 ind217 ind218 ind219
## -0.105521 -0.227273 -0.223604 -0.208887 -0.256808 -0.266724
## ind220 ind221 ind222
## -0.459034 -0.226775 -0.047166
Python code
# basic model
# estimating the parameters in the training sample
basic_results = smf.ols(basic , data=train).fit()
print(basic_results.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: lwage R-squared: 0.316
## Model: OLS Adj. R-squared: 0.308
## Method: Least Squares F-statistic: 37.65
## Date: Wed, 24 Nov 2021 Prob (F-statistic): 4.85e-293
## Time: 12:19:56 Log-Likelihood: -2784.1
## No. Observations: 4120 AIC: 5670.
## Df Residuals: 4069 BIC: 5993.
## Df Model: 50
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 3.5365 0.061 58.134 0.000 3.417 3.656
## occ2[T.10] -0.0054 0.045 -0.120 0.904 -0.094 0.083
## occ2[T.11] -0.4594 0.067 -6.893 0.000 -0.590 -0.329
## occ2[T.12] -0.3300 0.061 -5.365 0.000 -0.451 -0.209
## occ2[T.13] -0.3767 0.050 -7.544 0.000 -0.475 -0.279
## occ2[T.14] -0.5026 0.056 -8.947 0.000 -0.613 -0.392
## occ2[T.15] -0.4511 0.059 -7.586 0.000 -0.568 -0.335
## occ2[T.16] -0.2482 0.036 -6.818 0.000 -0.320 -0.177
## occ2[T.17] -0.4286 0.031 -13.624 0.000 -0.490 -0.367
## occ2[T.18] -0.2957 0.216 -1.367 0.172 -0.720 0.128
## occ2[T.19] -0.2354 0.056 -4.191 0.000 -0.345 -0.125
## occ2[T.2] -0.0771 0.038 -2.029 0.043 -0.152 -0.003
## occ2[T.20] -0.2158 0.046 -4.669 0.000 -0.306 -0.125
## occ2[T.21] -0.3029 0.042 -7.171 0.000 -0.386 -0.220
## occ2[T.22] -0.4385 0.047 -9.385 0.000 -0.530 -0.347
## occ2[T.3] -0.0054 0.044 -0.121 0.904 -0.092 0.082
## occ2[T.4] -0.0867 0.061 -1.431 0.152 -0.206 0.032
## occ2[T.5] -0.2064 0.072 -2.866 0.004 -0.348 -0.065
## occ2[T.6] -0.4175 0.057 -7.317 0.000 -0.529 -0.306
## occ2[T.7] -0.0111 0.063 -0.177 0.860 -0.134 0.112
## occ2[T.8] -0.3633 0.049 -7.380 0.000 -0.460 -0.267
## occ2[T.9] -0.1928 0.052 -3.743 0.000 -0.294 -0.092
## ind2[T.11] 0.0622 0.066 0.937 0.349 -0.068 0.192
## ind2[T.12] 0.1328 0.060 2.220 0.026 0.016 0.250
## ind2[T.13] 0.0492 0.078 0.629 0.529 -0.104 0.203
## ind2[T.14] 0.0062 0.057 0.108 0.914 -0.106 0.118
## ind2[T.15] -0.1137 0.150 -0.759 0.448 -0.407 0.180
## ind2[T.16] -0.1072 0.063 -1.690 0.091 -0.232 0.017
## ind2[T.17] -0.1034 0.063 -1.640 0.101 -0.227 0.020
## ind2[T.18] -0.1331 0.058 -2.298 0.022 -0.247 -0.020
## ind2[T.19] -0.1591 0.073 -2.190 0.029 -0.301 -0.017
## ind2[T.2] 0.2190 0.097 2.248 0.025 0.028 0.410
## ind2[T.20] -0.3512 0.064 -5.518 0.000 -0.476 -0.226
## ind2[T.21] -0.0824 0.062 -1.325 0.185 -0.204 0.040
## ind2[T.22] 0.0795 0.060 1.321 0.186 -0.038 0.197
## ind2[T.3] 0.0533 0.088 0.603 0.547 -0.120 0.227
## ind2[T.4] -0.0416 0.065 -0.636 0.525 -0.170 0.087
## ind2[T.5] -0.0628 0.063 -1.004 0.315 -0.185 0.060
## ind2[T.6] -0.0394 0.059 -0.673 0.501 -0.154 0.075
## ind2[T.7] 0.0058 0.080 0.073 0.942 -0.152 0.163
## ind2[T.8] -0.0610 0.081 -0.754 0.451 -0.220 0.098
## ind2[T.9] -0.1683 0.055 -3.086 0.002 -0.275 -0.061
## sex -0.0763 0.017 -4.521 0.000 -0.109 -0.043
## exp1 0.0087 0.001 11.758 0.000 0.007 0.010
## shs -0.5928 0.057 -10.436 0.000 -0.704 -0.481
## hsg -0.5213 0.030 -17.127 0.000 -0.581 -0.462
## scl -0.4215 0.028 -14.848 0.000 -0.477 -0.366
## clg -0.1974 0.026 -7.655 0.000 -0.248 -0.147
## mw -0.0233 0.022 -1.075 0.283 -0.066 0.019
## so -0.0428 0.021 -2.048 0.041 -0.084 -0.002
## we -5.145e-05 0.023 -0.002 0.998 -0.044 0.044
## ==============================================================================
## Omnibus: 358.629 Durbin-Watson: 1.946
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1439.044
## Skew: 0.355 Prob(JB): 0.00
## Kurtosis: 5.807 Cond. No. 543.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Then we predict the outcome for the testing sample using the parameters estimated in the training sample.
R code
# calculating the out-of-sample MSE
trainregbasic <- predict(regbasic, newdata=test)
trainregbasic
##       12       44       71       84      129      149      191      221
## 3.395851 2.734493 3.151807 2.506919 2.368192 3.173461 3.065637 3.043757
##      248      264      281      368      464      467      496      540
## 2.699975 2.398635 3.080034 2.542424 2.700733 2.973891 2.940246 2.796404
## ...
##    32561    32564    32571    32591    32599    32626
## 3.481612 2.347673 2.475821 2.982382 2.551125 3.497855
## (output truncated: 1030 fitted values in total)
= test["lwage"].values
lwage_test = sm.add_constant(test) #add constant
test
= basic_results.predict(test) # predict out of sample
lwage_pred print(lwage_pred)
## rownames
## 29749 2.454760
## 32504 2.729422
## 4239 3.374858
## 985 3.451121
## 8477 2.883054
## ...
## 27533 3.039693
## 7218 2.669400
## 7204 3.271324
## 1380 2.943550
## 10451 3.462293
## Length: 1030, dtype: float64
Finally, we test the predictions.
R code
y.test <- log(test$wage)
MSE.test1 <- sum((y.test-trainregbasic)^2)/length(y.test)
R2.test1 <- 1- MSE.test1/var(y.test)
- Test MSE for the basic model: 0.1971044.
- Test R2 for the basic model: 0.3279559.
Python code
MSE_test1 = np.sum((lwage_test-lwage_pred)**2)/len(lwage_test)
R2_test1 = 1 - MSE_test1/np.var(lwage_test)

print("Test MSE for the basic model: ", MSE_test1, " ")
## Test MSE for the basic model: 0.21963534669163987
print("Test R2 for the basic model: ", R2_test1)
## Test R2 for the basic model: 0.27498431184537286
In the basic model, \(MSE_{test}\) is quite close to \(MSE_{sample}\).
R code
# flexible model
# estimating the parameters in the training sample
regflex <- lm(flex, data=train)

# calculating the out-of-sample MSE
trainregflex <- predict(regflex, newdata=test)

y.test <- log(test$wage)
MSE.test2 <- sum((y.test-trainregflex)^2)/length(y.test)
R2.test2 <- 1- MSE.test2/var(y.test)
- Test MSE for the flexible model: 0.2064107.
- Test R2 for the flexible model: 0.2962252.
Python code
# Flexible model
# estimating the parameters in the training sample
flex_results = smf.ols(flex , data=train).fit()

# calculating the out-of-sample MSE
lwage_flex_pred = flex_results.predict(test) # predict out of sample
lwage_test = test["lwage"].values

MSE_test2 = np.sum((lwage_test-lwage_flex_pred)**2)/len(lwage_test)
R2_test2 = 1 - MSE_test2/np.var(lwage_test)
print("Test MSE for the flexible model: ", MSE_test2, " ")
## Test MSE for the flexible model: 0.2332944574254981
print("Test R2 for the flexible model: ", R2_test2)
## Test R2 for the flexible model: 0.22989562408423558
In the flexible model, the discrepancy between \(MSE_{test}\) and \(MSE_{sample}\) is not large.
It is worth noticing that \(MSE_{test}\) varies across different data splits. Hence, it is a good idea to average the out-of-sample \(MSE\) over several data splits to obtain more reliable results, as sketched below.
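A minimal sketch of such averaging via \(K\)-fold cross-validation, assuming a numeric design matrix `X` and outcome `y` are available as numpy arrays (the names `X`, `y`, and `cv_mse` are illustrative, not from the text):

# average the out-of-sample MSE over K folds (K-fold cross-validation)
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def cv_mse(X, y, n_splits=5, seed=0):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mses = []
    for train_idx, test_idx in kf.split(X):
        # fit on the training folds, evaluate on the held-out fold
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        mses.append(mean_squared_error(y[test_idx], fit.predict(X[test_idx])))
    return np.mean(mses)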
Nevertheless, we observe that, based on the out-of-sample \(MSE\), the basic model estimated by OLS performs about as well as (or slightly better than) the flexible model.
Next, let us use lasso regression in the flexible model instead of OLS regression. Lasso (least absolute shrinkage and selection operator) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors \(p\) is relatively large in relation to \(n\).
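Concretely, lasso minimizes the least-squares criterion plus an \(\ell_1\) penalty on the coefficients, which shrinks some of them exactly to zero:

\[
\hat{\beta}_{lasso} = \arg\min_{\beta}\; \frac{1}{n}\sum_{i=1}^{n}\big(Y_i - X_i'\beta\big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|,
\]

where the penalty level \(\lambda\) governs the degree of shrinkage.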
Note that the out-of-sample \(MSE\) on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to OLS regression.
R code
# flexible model using lasso
# estimating the parameters
reglasso <- rlasso(flex, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglasso <- predict(reglasso, newdata=test)
MSE.lasso <- sum((y.test-trainreglasso)^2)/length(y.test)
R2.lasso <- 1- MSE.lasso/var(y.test)
- Test \(MSE\) for the lasso flexible model: 0.212698.
- Test \(R^2\) for the lasso flexible model: 0.2747882.
Python code
# flexible model using lasso
# get exogenous variables from training data used in flex model
flex_results_0 = smf.ols(flex , data=train)
X_train = flex_results_0.exog
print(X_train.shape)
## (4120, 246)
# Get endogenous variable
lwage_train = train["lwage"]
print(lwage_train.shape)
## (4120,)
# calculating the out-of-sample MSE
# note: the lasso must be fit on the design matrix X_train, not on the raw
# data frame; we build the test design matrix from the same formula
# (this assumes every factor level in test also appears in train)
alpha = 0.1
reg = linear_model.Lasso(alpha=alpha)
reg.fit(X_train, lwage_train)

X_test = smf.ols(flex, data=test).exog
lwage_lasso_fitted = reg.predict(X_test)

MSE_lasso = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso = 1 - MSE_lasso/np.var(lwage_test)

print("Test MSE for the lasso flexible model: ", MSE_lasso, " ")
print("Test R2 for the lasso flexible model: ", R2_lasso)
Finally, let us summarize the results:
R code
Models <- c("Basic regression","Flexible regression","Lasso regression")
R_2_TEST <- c(R2.test1,R2.test2,R2.lasso)
MSE_TEST <- c(MSE.test1,MSE.test2,MSE.lasso)

data.frame(Models,R_2_TEST,MSE_TEST) %>%
  kable("markdown",caption = "Out-of-Sample Performance")
Python code
# Package for latex table
# import array_to_latex as a2l

# summary table of test-sample performance
table2 = np.zeros((3, 2))
table2[0,0] = MSE_test1
table2[1,0] = MSE_test2
table2[2,0] = MSE_lasso
table2[0,1] = R2_test1
table2[1,1] = R2_test2
table2[2,1] = R2_lasso

table2 = pd.DataFrame(table2, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table2
# print(table2.to_latex()) # optional LaTeX export