Chapter 2 Predictive Inference
In labor economics, an important question is what determines the wage of workers. This is a causal question, but we begin by investigating it from a predictive perspective.
In the following wage example, \(Y\) is the hourly wage of a worker and \(X\) is a vector of the worker's characteristics, e.g., education, experience, gender. The two main questions here are:
How to use job-relevant characteristics, such as education and experience, to best predict wages?
What is the difference in predicted wages between men and women with the same job-relevant characteristics?
In this lab, we focus on the prediction question first.
2.1 Data
The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural, or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below \(\$3\).
The variable of interest \(Y\) is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size \(n = 5150\).
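The released file already contains the constructed wage and lwage variables, but the construction itself is simple. As a minimal sketch (the raw column names annual_earnings, weeks_worked, and hours_per_week are hypothetical and for illustration only):
Python code
import numpy as np
import pandas as pd

# Hypothetical raw inputs; the actual data set already ships with wage and lwage.
raw = pd.DataFrame({"annual_earnings": [52000.0, 31200.0],
                    "weeks_worked": [52, 50],
                    "hours_per_week": [40, 40]})

# hourly wage = annual earnings / (weeks worked * usual weekly hours)
raw["wage"] = raw["annual_earnings"] / (raw["weeks_worked"] * raw["hours_per_week"])
raw["lwage"] = np.log(raw["wage"])  # log hourly wage, the outcome used below
print(raw[["wage", "lwage"]])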
2.2 Data Analysis
2.2.1 R and Python code
- Import relevant packages
R code
library(dplyr)
library(kableExtra)
library(reticulate) # to run python
Python code
import pandas as pd
import numpy as np
import pyreadr
- We start by loading the data set.
R code
# to import RData file
load("./data/wage2015_subsample_inference.Rdata")
# to get data dimensions
dim(data)
## [1] 5150 20
Python code
rdata_read = pyreadr.read_r("./data/wage2015_subsample_inference.Rdata")
data = rdata_read['data']
data.shape
## (5150, 20)
The dimensions are 5150 rows and 20 columns.
Let’s have a look at the structure of the data.
R code
# Collect each variable's class into a data frame
table0 <- data.frame(lapply(data, class)) %>%
  tidyr::gather("Variable", "Type")

# Table presentation
table0 %>%
  kable("markdown", caption = "Type of the Variables")
Variable | Type |
---|---|
wage | numeric |
lwage | numeric |
sex | numeric |
shs | numeric |
hsg | numeric |
scl | numeric |
clg | numeric |
ad | numeric |
mw | numeric |
so | numeric |
we | numeric |
ne | numeric |
exp1 | numeric |
exp2 | numeric |
exp3 | numeric |
exp4 | numeric |
occ | factor |
occ2 | factor |
ind | factor |
ind2 | factor |
Python code
data.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 5150 entries, 10 to 32643
## Data columns (total 20 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 wage 5150 non-null float64
## 1 lwage 5150 non-null float64
## 2 sex 5150 non-null float64
## 3 shs 5150 non-null float64
## 4 hsg 5150 non-null float64
## 5 scl 5150 non-null float64
## 6 clg 5150 non-null float64
## 7 ad 5150 non-null float64
## 8 mw 5150 non-null float64
## 9 so 5150 non-null float64
## 10 we 5150 non-null float64
## 11 ne 5150 non-null float64
## 12 exp1 5150 non-null float64
## 13 exp2 5150 non-null float64
## 14 exp3 5150 non-null float64
## 15 exp4 5150 non-null float64
## 16 occ 5150 non-null category
## 17 occ2 5150 non-null category
## 18 ind 5150 non-null category
## 19 ind2 5150 non-null category
## dtypes: category(4), float64(16)
## memory usage: 736.3+ KB
data.describe()
## wage lwage ... exp3 exp4
## count 5150.000000 5150.000000 ... 5150.000000 5150.000000
## mean 23.410410 2.970787 ... 8.235867 25.118038
## std 21.003016 0.570385 ... 14.488962 53.530225
## min 3.021978 1.105912 ... 0.000000 0.000000
## 25% 13.461538 2.599837 ... 0.125000 0.062500
## 50% 19.230769 2.956512 ... 1.000000 1.000000
## 75% 27.777778 3.324236 ... 9.261000 19.448100
## max 528.845673 6.270697 ... 103.823000 487.968100
##
## [8 rows x 16 columns]
- Construct the outcome and regressor variables
We construct the outcome variable \(Y\) (the log hourly wage) and the matrix \(Z\), which includes the characteristics of workers that are given in the data.
R code
# Calculate the log wage
Y <- log(data$wage)
# Number of observations
n <- length(Y)
# Regressors: all variables except wage and lwage
Z <- data[, !(colnames(data) %in% c("wage", "lwage"))]
# Number of regressors
p <- dim(Z)[2]
Number of observations: 5150
Number of raw regressors: 18
Python code
# Calculate the log wage
Y = np.log(data['wage'])
# Number of observations
n = len(Y)
# Regressors: all variables except wage and lwage
Z = data.loc[:, ~data.columns.isin(['wage', 'lwage', 'Unnamed: 0'])]
# Number of regressors
p = Z.shape[1]
print("Number of observations:", n, '\n')
## Number of observations: 5150
print("Number of raw regressors:", p)
## Number of raw regressors: 18
- For the outcome variable (log) wage and a subset of the raw regressors, we calculate the empirical means to get familiar with the data.
R code
# Select the variables
Z_subset <- data[, c("lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1")]
# Create a table of sample means
table1 <- data.frame(as.numeric(lapply(Z_subset, mean))) %>%
  mutate(Variables = c("Log Wage","Sex","Some High School","High School Graduate","Some College","College Graduate","Advanced Degree","Midwest","South","West","Northeast","Experience")) %>%
  rename(`Sample Mean` = `as.numeric.lapply.Z_subset..mean..`) %>%
  select(2, 1)

# Markdown table
table1 %>%
  kable("markdown", caption = "Descriptive Statistics")
Variables | Sample Mean |
---|---|
Log Wage | 2.9707867 |
Sex | 0.4444660 |
Some High School | 0.0233010 |
High School Graduate | 0.2438835 |
Some College | 0.2780583 |
College Graduate | 0.3176699 |
Advanced Degree | 0.1370874 |
Midwest | 0.2596117 |
South | 0.2965049 |
West | 0.2161165 |
Northeast | 0.2277670 |
Experience | 13.7605825 |
Python code
Z_subset = data.loc[:, data.columns.isin(["lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"])]
table = Z_subset.mean(axis=0)
table
## lwage 2.970787
## sex 0.444466
## shs 0.023301
## hsg 0.243883
## scl 0.278058
## clg 0.317670
## ad 0.137087
## mw 0.259612
## so 0.296505
## we 0.216117
## ne 0.227767
## exp1 13.760583
## dtype: float64
table = pd.DataFrame(data=table, columns={"Sample mean": "0"})
# table.index
index1 = list(table.index)
index2 = ["Log Wage", "Sex", "Some High School", "High School Graduate",
          "Some College", "College Graduate", "Advanced Degree", "Midwest",
          "South", "West", "Northeast", "Experience"]
table = table.rename(index=dict(zip(index1, index2)))
E.g., the share of female workers in our sample is ~44% (\(sex=1\) if female).
Alternatively, we can also print the table as LaTeX.
R code
print(table1, type="latex")
## Variables Sample Mean
## 1 Log Wage 2.97078670
## 2 Sex 0.44446602
## 3 Some High School 0.02330097
## 4 High School Graduate 0.24388350
## 5 Some College 0.27805825
## 6 College Graduate 0.31766990
## 7 Advanced Degree 0.13708738
## 8 Midwest 0.25961165
## 9 South 0.29650485
## 10 West 0.21611650
## 11 Northeast 0.22776699
## 12 Experience 13.76058252
Python code
print(table.to_latex())
## \begin{tabular}{lr}
## \toprule
## {} & Sample mean \\
## \midrule
## Log Wage & 2.970787 \\
## Sex & 0.444466 \\
## Some High School & 0.023301 \\
## High School Graduate & 0.243883 \\
## Some College & 0.278058 \\
## College Graduate & 0.317670 \\
## Advanced Degree & 0.137087 \\
## Midwest & 0.259612 \\
## South & 0.296505 \\
## West & 0.216117 \\
## Northeast & 0.227767 \\
## Experience & 13.760583 \\
## \bottomrule
## \end{tabular}
2.3 Prediction Question
Now, we will construct a prediction rule for (log) hourly wage \(Y\), which depends linearly on job-relevant characteristics \(X\):
\[ Y = \beta'X + \epsilon. \]
Our goals are
Predict wages using various characteristics of workers.
Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample \(R^2\), and the out-of-sample \(MSE\) and \(R^2\) (the adjustment formulas are sketched below).
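For reference, both adjusted measures penalize the number of regressors \(p\) relative to the sample size \(n\). With sample measures \(MSE_{sample}\) and \(R^2_{sample}\), the adjustments used in the code below are
\[ MSE_{adjusted} = \frac{n}{n-p}\, MSE_{sample}, \qquad R^2_{adjusted} = 1 - (1 - R^2_{sample})\,\frac{n-1}{n-p-1}, \]
up to slightly different degrees-of-freedom conventions across implementations (e.g., whether the intercept is counted in \(p\)).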
We employ two different specifications for prediction:
Basic Model: \(X\) consists of a set of raw regressors (e.g. gender, experience, education indicators, occupation and industry indicators, regional indicators).
Flexible Model: \(X\) consists of all raw regressors from the basic model plus transformations of experience (e.g., \(exp2\) and \(exp3\)) and two-way interactions of a polynomial in experience with the other regressors. An example of a regressor created through a two-way interaction is experience times the indicator of having a college degree.
Using the flexible model enables us to approximate the real relationship with a more complex regression function and therefore to reduce bias: it increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but yield models that are harder to interpret.
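To see where the size of the flexible specification comes from, the following minimal sketch expands the flexible formula's right-hand side into its design matrix with patsy, the formula machinery that statsmodels uses internally (it assumes data has been loaded as above):
Python code
import patsy

# Right-hand side of the flexible specification: dummies plus interactions of
# an experience polynomial with education, occupation, industry and region.
flex_rhs = ('sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we '
            '+ (exp1 + exp2 + exp3 + exp4)*(shs + hsg + scl + clg + occ2 + ind2 + mw + so + we)')
X_flex = patsy.dmatrix(flex_rhs, data=data)
print(X_flex.shape)  # expect (5150, 246): far wider than the basic model's 51 columns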
Now, let us fit both models to our data by running ordinary least squares (OLS):
2.3.1 Basic Model:
R code
basic <- lwage ~ (sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2)
regbasic <- lm(basic, data = data)
summary(regbasic) # estimated coefficients
##
## Call:
## lm(formula = basic, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2479 -0.2885 -0.0035 0.2724 3.6529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7222354 0.0803414 46.330 < 2e-16 ***
## sex -0.0728575 0.0150269 -4.848 1.28e-06 ***
## exp1 0.0085677 0.0006537 13.106 < 2e-16 ***
## shs -0.5927984 0.0505549 -11.726 < 2e-16 ***
## hsg -0.5043375 0.0270767 -18.626 < 2e-16 ***
## scl -0.4119936 0.0252036 -16.347 < 2e-16 ***
## clg -0.1822160 0.0229524 -7.939 2.49e-15 ***
## mw -0.0275413 0.0193301 -1.425 0.154280
## so -0.0344538 0.0187063 -1.842 0.065558 .
## we 0.0172492 0.0200860 0.859 0.390510
## occ22 -0.0764717 0.0342039 -2.236 0.025411 *
## occ23 -0.0346777 0.0387595 -0.895 0.370995
## occ24 -0.0962017 0.0519073 -1.853 0.063892 .
## occ25 -0.1879150 0.0603999 -3.111 0.001874 **
## occ26 -0.4149333 0.0502176 -8.263 < 2e-16 ***
## occ27 -0.0459867 0.0565054 -0.814 0.415771
## occ28 -0.3778470 0.0439290 -8.601 < 2e-16 ***
## occ29 -0.2157519 0.0461229 -4.678 2.98e-06 ***
## occ210 -0.0106235 0.0396274 -0.268 0.788645
## occ211 -0.4558342 0.0594409 -7.669 2.07e-14 ***
## occ212 -0.3075889 0.0555146 -5.541 3.16e-08 ***
## occ213 -0.3614403 0.0455401 -7.937 2.53e-15 ***
## occ214 -0.4994955 0.0506204 -9.867 < 2e-16 ***
## occ215 -0.4644817 0.0517634 -8.973 < 2e-16 ***
## occ216 -0.2337150 0.0324348 -7.206 6.62e-13 ***
## occ217 -0.4125884 0.0279079 -14.784 < 2e-16 ***
## occ218 -0.3404183 0.1966277 -1.731 0.083462 .
## occ219 -0.2414797 0.0494794 -4.880 1.09e-06 ***
## occ220 -0.2126282 0.0408854 -5.201 2.06e-07 ***
## occ221 -0.2884133 0.0380839 -7.573 4.30e-14 ***
## occ222 -0.4223936 0.0414626 -10.187 < 2e-16 ***
## ind23 -0.1168365 0.0983990 -1.187 0.235135
## ind24 -0.2444926 0.0772658 -3.164 0.001564 **
## ind25 -0.2735325 0.0810190 -3.376 0.000741 ***
## ind26 -0.2493683 0.0781049 -3.193 0.001418 **
## ind27 -0.1395884 0.0931442 -1.499 0.134032
## ind28 -0.2429480 0.0940642 -2.583 0.009828 **
## ind29 -0.3874847 0.0765762 -5.060 4.34e-07 ***
## ind210 -0.1938509 0.0842585 -2.301 0.021451 *
## ind211 -0.1690628 0.0823701 -2.052 0.040174 *
## ind212 -0.0774358 0.0789759 -0.980 0.326887
## ind213 -0.1726041 0.0901297 -1.915 0.055540 .
## ind214 -0.1870052 0.0768288 -2.434 0.014965 *
## ind215 -0.3253637 0.1489158 -2.185 0.028943 *
## ind216 -0.3153990 0.0815927 -3.866 0.000112 ***
## ind217 -0.3044052 0.0806806 -3.773 0.000163 ***
## ind218 -0.3353864 0.0777377 -4.314 1.63e-05 ***
## ind219 -0.3741207 0.0879131 -4.256 2.12e-05 ***
## ind220 -0.5519322 0.0816545 -6.759 1.54e-11 ***
## ind221 -0.3166788 0.0802596 -3.946 8.06e-05 ***
## ind222 -0.1189713 0.0791489 -1.503 0.132866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4761 on 5099 degrees of freedom
## Multiple R-squared: 0.31, Adjusted R-squared: 0.3033
## F-statistic: 45.83 on 50 and 5099 DF, p-value: < 2.2e-16
cat( "Number of regressors in the basic model:",length(regbasic$coef), '\n')
## Number of regressors in the basic model: 51
Python code
import statsmodels.api as sm
import statsmodels.formula.api as smf
basic = 'lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2'
basic_results = smf.ols(basic, data=data).fit()
print(basic_results.summary())  # estimated coefficients
## OLS Regression Results
## ==============================================================================
## Dep. Variable: lwage R-squared: 0.310
## Model: OLS Adj. R-squared: 0.303
## Method: Least Squares F-statistic: 45.83
## Date: Wed, 24 Nov 2021 Prob (F-statistic): 0.00
## Time: 12:19:42 Log-Likelihood: -3459.9
## No. Observations: 5150 AIC: 7022.
## Df Residuals: 5099 BIC: 7356.
## Df Model: 50
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 3.5284 0.054 65.317 0.000 3.422 3.634
## occ2[T.10] -0.0106 0.040 -0.268 0.789 -0.088 0.067
## occ2[T.11] -0.4558 0.059 -7.669 0.000 -0.572 -0.339
## occ2[T.12] -0.3076 0.056 -5.541 0.000 -0.416 -0.199
## occ2[T.13] -0.3614 0.046 -7.937 0.000 -0.451 -0.272
## occ2[T.14] -0.4995 0.051 -9.867 0.000 -0.599 -0.400
## occ2[T.15] -0.4645 0.052 -8.973 0.000 -0.566 -0.363
## occ2[T.16] -0.2337 0.032 -7.206 0.000 -0.297 -0.170
## occ2[T.17] -0.4126 0.028 -14.784 0.000 -0.467 -0.358
## occ2[T.18] -0.3404 0.197 -1.731 0.083 -0.726 0.045
## occ2[T.19] -0.2415 0.049 -4.880 0.000 -0.338 -0.144
## occ2[T.2] -0.0765 0.034 -2.236 0.025 -0.144 -0.009
## occ2[T.20] -0.2126 0.041 -5.201 0.000 -0.293 -0.132
## occ2[T.21] -0.2884 0.038 -7.573 0.000 -0.363 -0.214
## occ2[T.22] -0.4224 0.041 -10.187 0.000 -0.504 -0.341
## occ2[T.3] -0.0347 0.039 -0.895 0.371 -0.111 0.041
## occ2[T.4] -0.0962 0.052 -1.853 0.064 -0.198 0.006
## occ2[T.5] -0.1879 0.060 -3.111 0.002 -0.306 -0.070
## occ2[T.6] -0.4149 0.050 -8.263 0.000 -0.513 -0.316
## occ2[T.7] -0.0460 0.057 -0.814 0.416 -0.157 0.065
## occ2[T.8] -0.3778 0.044 -8.601 0.000 -0.464 -0.292
## occ2[T.9] -0.2158 0.046 -4.678 0.000 -0.306 -0.125
## ind2[T.11] 0.0248 0.058 0.427 0.669 -0.089 0.139
## ind2[T.12] 0.1164 0.053 2.201 0.028 0.013 0.220
## ind2[T.13] 0.0212 0.068 0.312 0.755 -0.112 0.155
## ind2[T.14] 0.0068 0.050 0.136 0.892 -0.092 0.106
## ind2[T.15] -0.1315 0.137 -0.959 0.338 -0.400 0.137
## ind2[T.16] -0.1215 0.056 -2.177 0.029 -0.231 -0.012
## ind2[T.17] -0.1106 0.056 -1.987 0.047 -0.220 -0.001
## ind2[T.18] -0.1415 0.051 -2.774 0.006 -0.242 -0.042
## ind2[T.19] -0.1803 0.065 -2.761 0.006 -0.308 -0.052
## ind2[T.2] 0.1939 0.084 2.301 0.021 0.029 0.359
## ind2[T.20] -0.3581 0.057 -6.332 0.000 -0.469 -0.247
## ind2[T.21] -0.1228 0.055 -2.239 0.025 -0.230 -0.015
## ind2[T.22] 0.0749 0.053 1.407 0.160 -0.029 0.179
## ind2[T.3] 0.0770 0.079 0.972 0.331 -0.078 0.232
## ind2[T.4] -0.0506 0.058 -0.874 0.382 -0.164 0.063
## ind2[T.5] -0.0797 0.056 -1.432 0.152 -0.189 0.029
## ind2[T.6] -0.0555 0.052 -1.072 0.284 -0.157 0.046
## ind2[T.7] 0.0543 0.072 0.756 0.450 -0.086 0.195
## ind2[T.8] -0.0491 0.072 -0.679 0.497 -0.191 0.093
## ind2[T.9] -0.1936 0.048 -4.017 0.000 -0.288 -0.099
## sex -0.0729 0.015 -4.848 0.000 -0.102 -0.043
## exp1 0.0086 0.001 13.106 0.000 0.007 0.010
## shs -0.5928 0.051 -11.726 0.000 -0.692 -0.494
## hsg -0.5043 0.027 -18.626 0.000 -0.557 -0.451
## scl -0.4120 0.025 -16.347 0.000 -0.461 -0.363
## clg -0.1822 0.023 -7.939 0.000 -0.227 -0.137
## mw -0.0275 0.019 -1.425 0.154 -0.065 0.010
## so -0.0345 0.019 -1.842 0.066 -0.071 0.002
## we 0.0172 0.020 0.859 0.391 -0.022 0.057
## ==============================================================================
## Omnibus: 437.645 Durbin-Watson: 1.885
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1862.313
## Skew: 0.322 Prob(JB): 0.00
## Kurtosis: 5.875 Cond. No. 541.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print( "Number of regressors in the basic model:",len(basic_results.params), '\n') # number of regressors in the Basic Model
## Number of regressors in the basic model: 51
The number of regressors in the basic model is 51.
2.3.2 Flexible Model:
R code
flex <- lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we +
  (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + occ2 + ind2 + mw + so + we)
regflex <- lm(flex, data = data)
summary(regflex) # estimated coefficients
##
## Call:
## lm(formula = flex, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.2782 -0.0041 0.2733 3.4934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8602606 0.4286188 9.006 < 2e-16 ***
## sex -0.0695532 0.0152180 -4.570 4.99e-06 ***
## shs -0.1233089 0.9068325 -0.136 0.891845
## hsg -0.5289024 0.1977559 -2.675 0.007508 **
## scl -0.2920581 0.1260155 -2.318 0.020510 *
## clg -0.0411641 0.0703862 -0.585 0.558688
## occ22 0.1613397 0.1297243 1.244 0.213665
## occ23 0.2101514 0.1686774 1.246 0.212869
## occ24 0.0708570 0.1837167 0.386 0.699746
## occ25 -0.3960076 0.1885398 -2.100 0.035745 *
## occ26 -0.2310611 0.1869662 -1.236 0.216576
## occ27 0.3147249 0.1941519 1.621 0.105077
## occ28 -0.1875417 0.1692988 -1.108 0.268022
## occ29 -0.3390270 0.1672301 -2.027 0.042685 *
## occ210 0.0209545 0.1564982 0.134 0.893490
## occ211 -0.6424177 0.3090899 -2.078 0.037723 *
## occ212 -0.0674774 0.2520486 -0.268 0.788929
## occ213 -0.2329781 0.2315379 -1.006 0.314359
## occ214 0.2562009 0.3226729 0.794 0.427236
## occ215 -0.1938585 0.2595082 -0.747 0.455086
## occ216 -0.0551256 0.1470658 -0.375 0.707798
## occ217 -0.4156093 0.1361144 -3.053 0.002275 **
## occ218 -0.4822168 1.0443540 -0.462 0.644290
## occ219 -0.2579412 0.3325215 -0.776 0.437956
## occ220 -0.3010203 0.2341022 -1.286 0.198556
## occ221 -0.4271811 0.2206486 -1.936 0.052922 .
## occ222 -0.8694527 0.2975222 -2.922 0.003490 **
## ind23 -1.2473654 0.6454941 -1.932 0.053365 .
## ind24 -0.0948281 0.4636021 -0.205 0.837935
## ind25 -0.5293860 0.4345990 -1.218 0.223244
## ind26 -0.6221688 0.4347226 -1.431 0.152441
## ind27 -0.5047497 0.5024770 -1.005 0.315176
## ind28 -0.7295442 0.4674008 -1.561 0.118623
## ind29 -0.8025334 0.4252462 -1.887 0.059190 .
## ind210 -0.5805840 0.4808776 -1.207 0.227358
## ind211 -0.9852350 0.4481566 -2.198 0.027966 *
## ind212 -0.7375777 0.4243260 -1.738 0.082232 .
## ind213 -1.0183283 0.4826544 -2.110 0.034922 *
## ind214 -0.5860174 0.4159033 -1.409 0.158892
## ind215 -0.3801359 0.5908517 -0.643 0.520014
## ind216 -0.5703905 0.4386579 -1.300 0.193556
## ind217 -0.8201843 0.4259846 -1.925 0.054239 .
## ind218 -0.7613604 0.4238287 -1.796 0.072495 .
## ind219 -0.8812815 0.4565671 -1.930 0.053635 .
## ind220 -0.9099021 0.4484198 -2.029 0.042499 *
## ind221 -0.7586534 0.4405801 -1.722 0.085143 .
## ind222 -0.4040775 0.4328735 -0.933 0.350620
## mw 0.1106834 0.0814463 1.359 0.174218
## so 0.0224244 0.0743855 0.301 0.763075
## we -0.0215659 0.0841591 -0.256 0.797767
## exp1 -0.0677247 0.1519756 -0.446 0.655885
## exp2 1.6362944 1.6909253 0.968 0.333246
## exp3 -0.9154735 0.6880249 -1.331 0.183388
## exp4 0.1429357 0.0907569 1.575 0.115337
## shs:exp1 -0.1919981 0.1955408 -0.982 0.326206
## hsg:exp1 -0.0173433 0.0572279 -0.303 0.761859
## scl:exp1 -0.0664505 0.0433730 -1.532 0.125570
## clg:exp1 -0.0550346 0.0310279 -1.774 0.076172 .
## occ22:exp1 -0.0736239 0.0501108 -1.469 0.141837
## occ23:exp1 -0.0714859 0.0637688 -1.121 0.262336
## occ24:exp1 -0.0723997 0.0747715 -0.968 0.332953
## occ25:exp1 0.0946732 0.0794005 1.192 0.233182
## occ26:exp1 -0.0348928 0.0712136 -0.490 0.624175
## occ27:exp1 -0.2279338 0.0784860 -2.904 0.003699 **
## occ28:exp1 -0.0727459 0.0645883 -1.126 0.260094
## occ29:exp1 0.0274143 0.0669517 0.409 0.682217
## occ210:exp1 0.0075628 0.0581715 0.130 0.896564
## occ211:exp1 0.1014221 0.1005094 1.009 0.312986
## occ212:exp1 -0.0862744 0.0874768 -0.986 0.324057
## occ213:exp1 0.0067149 0.0761825 0.088 0.929768
## occ214:exp1 -0.1369153 0.0974458 -1.405 0.160073
## occ215:exp1 -0.0400425 0.0898931 -0.445 0.656017
## occ216:exp1 -0.0539314 0.0520926 -1.035 0.300580
## occ217:exp1 0.0147277 0.0467903 0.315 0.752958
## occ218:exp1 0.1074099 0.4718440 0.228 0.819937
## occ219:exp1 0.0047165 0.1060745 0.044 0.964536
## occ220:exp1 0.0243156 0.0743274 0.327 0.743575
## occ221:exp1 0.0791776 0.0696947 1.136 0.255985
## occ222:exp1 0.1093246 0.0880828 1.241 0.214607
## ind23:exp1 0.4758891 0.2227484 2.136 0.032693 *
## ind24:exp1 0.0147304 0.1571102 0.094 0.925305
## ind25:exp1 0.1256987 0.1531626 0.821 0.411864
## ind26:exp1 0.1540275 0.1524289 1.010 0.312312
## ind27:exp1 0.1029245 0.1786939 0.576 0.564654
## ind28:exp1 0.2357669 0.1689203 1.396 0.162859
## ind29:exp1 0.1359079 0.1489486 0.912 0.361578
## ind210:exp1 0.1512578 0.1644341 0.920 0.357687
## ind211:exp1 0.3174885 0.1590023 1.997 0.045907 *
## ind212:exp1 0.2591089 0.1510588 1.715 0.086356 .
## ind213:exp1 0.3396094 0.1669241 2.035 0.041954 *
## ind214:exp1 0.1441411 0.1477994 0.975 0.329485
## ind215:exp1 -0.0568181 0.2349853 -0.242 0.808950
## ind216:exp1 0.0847295 0.1550425 0.546 0.584753
## ind217:exp1 0.1728867 0.1513280 1.142 0.253317
## ind218:exp1 0.1565399 0.1494171 1.048 0.294842
## ind219:exp1 0.1516103 0.1620851 0.935 0.349641
## ind220:exp1 0.1326629 0.1566883 0.847 0.397222
## ind221:exp1 0.2190905 0.1555052 1.409 0.158930
## ind222:exp1 0.1145814 0.1523427 0.752 0.452010
## mw:exp1 -0.0279931 0.0296572 -0.944 0.345274
## so:exp1 -0.0099678 0.0266868 -0.374 0.708786
## we:exp1 0.0063077 0.0301417 0.209 0.834248
## shs:exp2 1.9005060 1.4502480 1.310 0.190098
## hsg:exp2 0.1171642 0.5509729 0.213 0.831609
## scl:exp2 0.6217923 0.4629986 1.343 0.179344
## clg:exp2 0.4096746 0.3802171 1.077 0.281321
## occ22:exp2 0.6632173 0.5523220 1.201 0.229895
## occ23:exp2 0.6415456 0.7102783 0.903 0.366448
## occ24:exp2 0.9748422 0.8655351 1.126 0.260099
## occ25:exp2 -0.9778823 0.9737990 -1.004 0.315335
## occ26:exp2 0.1050860 0.8002267 0.131 0.895527
## occ27:exp2 3.1407119 0.9389423 3.345 0.000829 ***
## occ28:exp2 0.6710877 0.7192077 0.933 0.350818
## occ29:exp2 0.0231977 0.7629142 0.030 0.975744
## occ210:exp2 -0.2692292 0.6405270 -0.420 0.674267
## occ211:exp2 -1.0816539 1.0057575 -1.075 0.282221
## occ212:exp2 0.8323737 0.9341245 0.891 0.372933
## occ213:exp2 -0.2209813 0.7728463 -0.286 0.774942
## occ214:exp2 0.7511163 0.9272548 0.810 0.417955
## occ215:exp2 -0.0326858 0.9409116 -0.035 0.972290
## occ216:exp2 0.3635814 0.5509550 0.660 0.509342
## occ217:exp2 -0.2659285 0.4861131 -0.547 0.584369
## occ218:exp2 -2.5608762 5.1700911 -0.495 0.620393
## occ219:exp2 -0.1291756 1.0616901 -0.122 0.903165
## occ220:exp2 -0.3323297 0.7229071 -0.460 0.645743
## occ221:exp2 -0.9099997 0.6854114 -1.328 0.184349
## occ222:exp2 -0.8550536 0.8279414 -1.033 0.301773
## ind23:exp2 -5.9368948 2.4067939 -2.467 0.013670 *
## ind24:exp2 -1.1053411 1.7101982 -0.646 0.518100
## ind25:exp2 -2.0149181 1.6919190 -1.191 0.233748
## ind26:exp2 -2.2277748 1.6816902 -1.325 0.185325
## ind27:exp2 -1.4648099 2.0137888 -0.727 0.467022
## ind28:exp2 -2.9479949 1.8595425 -1.585 0.112955
## ind29:exp2 -1.7796219 1.6471248 -1.080 0.279999
## ind210:exp2 -2.1973300 1.7738638 -1.239 0.215507
## ind211:exp2 -3.8776807 1.7637372 -2.199 0.027956 *
## ind212:exp2 -3.1690425 1.6819362 -1.884 0.059602 .
## ind213:exp2 -3.9651983 1.8130709 -2.187 0.028789 *
## ind214:exp2 -2.0783289 1.6490355 -1.260 0.207610
## ind215:exp2 0.1911692 2.6075396 0.073 0.941559
## ind216:exp2 -1.3265850 1.7185648 -0.772 0.440202
## ind217:exp2 -2.2002873 1.6837183 -1.307 0.191341
## ind218:exp2 -2.2006232 1.6566630 -1.328 0.184125
## ind219:exp2 -1.9308536 1.7876673 -1.080 0.280152
## ind220:exp2 -1.9467267 1.7244008 -1.129 0.258983
## ind221:exp2 -3.1127363 1.7237908 -1.806 0.071019 .
## ind222:exp2 -1.8578340 1.6849542 -1.103 0.270254
## mw:exp2 0.2005611 0.3172911 0.632 0.527348
## so:exp2 0.0544354 0.2815662 0.193 0.846708
## we:exp2 0.0012717 0.3207873 0.004 0.996837
## shs:exp3 -0.6721239 0.4426627 -1.518 0.128987
## hsg:exp3 -0.0179937 0.2083176 -0.086 0.931171
## scl:exp3 -0.1997877 0.1855189 -1.077 0.281572
## clg:exp3 -0.1025230 0.1643648 -0.624 0.532819
## occ22:exp3 -0.2039403 0.2211386 -0.922 0.356455
## occ23:exp3 -0.2369620 0.2870372 -0.826 0.409103
## occ24:exp3 -0.4366958 0.3520168 -1.241 0.214830
## occ25:exp3 0.3885298 0.4118861 0.943 0.345577
## occ26:exp3 0.0484737 0.3293525 0.147 0.882997
## occ27:exp3 -1.3949288 0.4050109 -3.444 0.000578 ***
## occ28:exp3 -0.2053899 0.2895727 -0.709 0.478181
## occ29:exp3 -0.0909660 0.3143348 -0.289 0.772293
## occ210:exp3 0.1854753 0.2575565 0.720 0.471477
## occ211:exp3 0.3931553 0.3817758 1.030 0.303152
## occ212:exp3 -0.2202559 0.3660206 -0.602 0.547363
## occ213:exp3 0.0950356 0.2904370 0.327 0.743519
## occ214:exp3 -0.1443933 0.3341622 -0.432 0.665684
## occ215:exp3 0.1477077 0.3645191 0.405 0.685339
## occ216:exp3 -0.0378548 0.2151288 -0.176 0.860330
## occ217:exp3 0.1510497 0.1878081 0.804 0.421276
## occ218:exp3 1.4084443 1.8852467 0.747 0.455047
## occ219:exp3 0.0923425 0.4042308 0.228 0.819314
## occ220:exp3 0.1806994 0.2652079 0.681 0.495682
## occ221:exp3 0.3779083 0.2553031 1.480 0.138875
## occ222:exp3 0.2855058 0.2984206 0.957 0.338754
## ind23:exp3 2.6665808 0.9807497 2.719 0.006573 **
## ind24:exp3 0.7298431 0.6879811 1.061 0.288811
## ind25:exp3 0.9942250 0.6842435 1.453 0.146280
## ind26:exp3 1.0641428 0.6800948 1.565 0.117718
## ind27:exp3 0.7089089 0.8337963 0.850 0.395245
## ind28:exp3 1.2340948 0.7483474 1.649 0.099193 .
## ind29:exp3 0.8287315 0.6675904 1.241 0.214526
## ind210:exp3 1.0448162 0.7066717 1.479 0.139337
## ind211:exp3 1.6877578 0.7162155 2.356 0.018487 *
## ind212:exp3 1.3734455 0.6835570 2.009 0.044564 *
## ind213:exp3 1.6376669 0.7259301 2.256 0.024117 *
## ind214:exp3 1.0162910 0.6714525 1.514 0.130199
## ind215:exp3 0.1879483 1.0299675 0.182 0.855214
## ind216:exp3 0.6889680 0.6968028 0.989 0.322831
## ind217:exp3 1.0085540 0.6836992 1.475 0.140238
## ind218:exp3 1.0605598 0.6725232 1.577 0.114863
## ind219:exp3 0.8959865 0.7225602 1.240 0.215029
## ind220:exp3 0.9768944 0.6955822 1.404 0.160255
## ind221:exp3 1.4415215 0.6996480 2.060 0.039418 *
## ind222:exp3 0.9687884 0.6828498 1.419 0.156037
## mw:exp3 -0.0625771 0.1241291 -0.504 0.614194
## so:exp3 -0.0115842 0.1084217 -0.107 0.914917
## we:exp3 -0.0124875 0.1251376 -0.100 0.920515
## shs:exp4 0.0777418 0.0475427 1.635 0.102071
## hsg:exp4 0.0004913 0.0265964 0.018 0.985264
## scl:exp4 0.0210760 0.0245289 0.859 0.390256
## clg:exp4 0.0078695 0.0227528 0.346 0.729457
## occ22:exp4 0.0176389 0.0289257 0.610 0.542021
## occ23:exp4 0.0303057 0.0376552 0.805 0.420962
## occ24:exp4 0.0584146 0.0457704 1.276 0.201927
## occ25:exp4 -0.0515181 0.0549489 -0.938 0.348514
## occ26:exp4 -0.0170182 0.0440847 -0.386 0.699488
## occ27:exp4 0.1905353 0.0558757 3.410 0.000655 ***
## occ28:exp4 0.0196522 0.0379084 0.518 0.604195
## occ29:exp4 0.0190014 0.0421099 0.451 0.651841
## occ210:exp4 -0.0333347 0.0338825 -0.984 0.325246
## occ211:exp4 -0.0465914 0.0479018 -0.973 0.330778
## occ212:exp4 0.0110212 0.0470536 0.234 0.814820
## occ213:exp4 -0.0136895 0.0358988 -0.381 0.702970
## occ214:exp4 0.0055582 0.0400331 0.139 0.889581
## occ215:exp4 -0.0327444 0.0462379 -0.708 0.478872
## occ216:exp4 -0.0089706 0.0275729 -0.325 0.744937
## occ217:exp4 -0.0256735 0.0239306 -1.073 0.283400
## occ218:exp4 -0.2121372 0.2204003 -0.963 0.335841
## occ219:exp4 -0.0169398 0.0513428 -0.330 0.741463
## occ220:exp4 -0.0296125 0.0323353 -0.916 0.359819
## occ221:exp4 -0.0524577 0.0317251 -1.654 0.098291 .
## occ222:exp4 -0.0350646 0.0360687 -0.972 0.331018
## ind23:exp4 -0.3851791 0.1329065 -2.898 0.003771 **
## ind24:exp4 -0.1209478 0.0899580 -1.344 0.178852
## ind25:exp4 -0.1441045 0.0897994 -1.605 0.108616
## ind26:exp4 -0.1526110 0.0892689 -1.710 0.087410 .
## ind27:exp4 -0.1001993 0.1119398 -0.895 0.370768
## ind28:exp4 -0.1609664 0.0979780 -1.643 0.100471
## ind29:exp4 -0.1178080 0.0877821 -1.342 0.179642
## ind210:exp4 -0.1482842 0.0918416 -1.615 0.106469
## ind211:exp4 -0.2322961 0.0944506 -2.459 0.013949 *
## ind212:exp4 -0.1872911 0.0899985 -2.081 0.037481 *
## ind213:exp4 -0.2155617 0.0946011 -2.279 0.022731 *
## ind214:exp4 -0.1483524 0.0884992 -1.676 0.093740 .
## ind215:exp4 -0.0532195 0.1313815 -0.405 0.685439
## ind216:exp4 -0.1044336 0.0916252 -1.140 0.254429
## ind217:exp4 -0.1427349 0.0899315 -1.587 0.112543
## ind218:exp4 -0.1546248 0.0885883 -1.745 0.080973 .
## ind219:exp4 -0.1269592 0.0948784 -1.338 0.180918
## ind220:exp4 -0.1468554 0.0911188 -1.612 0.107094
## ind221:exp4 -0.2032619 0.0920972 -2.207 0.027358 *
## ind222:exp4 -0.1480951 0.0897937 -1.649 0.099154 .
## mw:exp4 0.0062439 0.0158699 0.393 0.694007
## so:exp4 0.0003145 0.0136275 0.023 0.981591
## we:exp4 0.0017685 0.0159602 0.111 0.911776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4708 on 4904 degrees of freedom
## Multiple R-squared: 0.3511, Adjusted R-squared: 0.3187
## F-statistic: 10.83 on 245 and 4904 DF, p-value: < 2.2e-16
cat( "Number of regressors in the flexible model:",length(regflex$coef))
## Number of regressors in the flexible model: 246
Python code
flex = 'lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
flex_results_0 = smf.ols(flex, data=data)
flex_results = smf.ols(flex, data=data).fit()
print(flex_results.summary())  # estimated coefficients
## OLS Regression Results
## ==============================================================================
## Dep. Variable: lwage R-squared: 0.351
## Model: OLS Adj. R-squared: 0.319
## Method: Least Squares F-statistic: 10.83
## Date: Wed, 24 Nov 2021 Prob (F-statistic): 2.69e-305
## Time: 12:19:43 Log-Likelihood: -3301.9
## No. Observations: 5150 AIC: 7096.
## Df Residuals: 4904 BIC: 8706.
## Df Model: 245
## Covariance Type: nonrobust
## ===================================================================================
## coef std err t P>|t| [0.025 0.975]
## -----------------------------------------------------------------------------------
## Intercept 3.2797 0.284 11.540 0.000 2.723 3.837
## occ2[T.10] 0.0210 0.156 0.134 0.893 -0.286 0.328
## occ2[T.11] -0.6424 0.309 -2.078 0.038 -1.248 -0.036
## occ2[T.12] -0.0675 0.252 -0.268 0.789 -0.562 0.427
## occ2[T.13] -0.2330 0.232 -1.006 0.314 -0.687 0.221
## occ2[T.14] 0.2562 0.323 0.794 0.427 -0.376 0.889
## occ2[T.15] -0.1939 0.260 -0.747 0.455 -0.703 0.315
## occ2[T.16] -0.0551 0.147 -0.375 0.708 -0.343 0.233
## occ2[T.17] -0.4156 0.136 -3.053 0.002 -0.682 -0.149
## occ2[T.18] -0.4822 1.044 -0.462 0.644 -2.530 1.565
## occ2[T.19] -0.2579 0.333 -0.776 0.438 -0.910 0.394
## occ2[T.2] 0.1613 0.130 1.244 0.214 -0.093 0.416
## occ2[T.20] -0.3010 0.234 -1.286 0.199 -0.760 0.158
## occ2[T.21] -0.4272 0.221 -1.936 0.053 -0.860 0.005
## occ2[T.22] -0.8695 0.298 -2.922 0.003 -1.453 -0.286
## occ2[T.3] 0.2102 0.169 1.246 0.213 -0.121 0.541
## occ2[T.4] 0.0709 0.184 0.386 0.700 -0.289 0.431
## occ2[T.5] -0.3960 0.189 -2.100 0.036 -0.766 -0.026
## occ2[T.6] -0.2311 0.187 -1.236 0.217 -0.598 0.135
## occ2[T.7] 0.3147 0.194 1.621 0.105 -0.066 0.695
## occ2[T.8] -0.1875 0.169 -1.108 0.268 -0.519 0.144
## occ2[T.9] -0.3390 0.167 -2.027 0.043 -0.667 -0.011
## ind2[T.11] -0.4047 0.314 -1.288 0.198 -1.021 0.211
## ind2[T.12] -0.1570 0.279 -0.562 0.574 -0.705 0.391
## ind2[T.13] -0.4377 0.362 -1.210 0.226 -1.147 0.271
## ind2[T.14] -0.0054 0.270 -0.020 0.984 -0.535 0.524
## ind2[T.15] 0.2004 0.501 0.400 0.689 -0.781 1.182
## ind2[T.16] 0.0102 0.300 0.034 0.973 -0.577 0.598
## ind2[T.17] -0.2396 0.285 -0.841 0.401 -0.798 0.319
## ind2[T.18] -0.1808 0.278 -0.649 0.516 -0.727 0.365
## ind2[T.19] -0.3007 0.327 -0.921 0.357 -0.941 0.339
## ind2[T.2] 0.5806 0.481 1.207 0.227 -0.362 1.523
## ind2[T.20] -0.3293 0.313 -1.052 0.293 -0.943 0.284
## ind2[T.21] -0.1781 0.304 -0.586 0.558 -0.773 0.417
## ind2[T.22] 0.1765 0.296 0.596 0.551 -0.404 0.757
## ind2[T.3] -0.6668 0.561 -1.189 0.234 -1.766 0.433
## ind2[T.4] 0.4858 0.350 1.389 0.165 -0.200 1.171
## ind2[T.5] 0.0512 0.298 0.172 0.863 -0.532 0.635
## ind2[T.6] -0.0416 0.299 -0.139 0.889 -0.628 0.545
## ind2[T.7] 0.0758 0.387 0.196 0.845 -0.683 0.835
## ind2[T.8] -0.1490 0.337 -0.441 0.659 -0.811 0.513
## ind2[T.9] -0.2219 0.275 -0.808 0.419 -0.760 0.316
## sex -0.0696 0.015 -4.570 0.000 -0.099 -0.040
## shs -0.1233 0.907 -0.136 0.892 -1.901 1.654
## hsg -0.5289 0.198 -2.675 0.008 -0.917 -0.141
## scl -0.2921 0.126 -2.318 0.021 -0.539 -0.045
## clg -0.0412 0.070 -0.585 0.559 -0.179 0.097
## mw 0.1107 0.081 1.359 0.174 -0.049 0.270
## so 0.0224 0.074 0.301 0.763 -0.123 0.168
## we -0.0216 0.084 -0.256 0.798 -0.187 0.143
## exp1 0.0835 0.094 0.884 0.377 -0.102 0.269
## exp1:occ2[T.10] 0.0076 0.058 0.130 0.897 -0.106 0.122
## exp1:occ2[T.11] 0.1014 0.101 1.009 0.313 -0.096 0.298
## exp1:occ2[T.12] -0.0863 0.087 -0.986 0.324 -0.258 0.085
## exp1:occ2[T.13] 0.0067 0.076 0.088 0.930 -0.143 0.156
## exp1:occ2[T.14] -0.1369 0.097 -1.405 0.160 -0.328 0.054
## exp1:occ2[T.15] -0.0400 0.090 -0.445 0.656 -0.216 0.136
## exp1:occ2[T.16] -0.0539 0.052 -1.035 0.301 -0.156 0.048
## exp1:occ2[T.17] 0.0147 0.047 0.315 0.753 -0.077 0.106
## exp1:occ2[T.18] 0.1074 0.472 0.228 0.820 -0.818 1.032
## exp1:occ2[T.19] 0.0047 0.106 0.044 0.965 -0.203 0.213
## exp1:occ2[T.2] -0.0736 0.050 -1.469 0.142 -0.172 0.025
## exp1:occ2[T.20] 0.0243 0.074 0.327 0.744 -0.121 0.170
## exp1:occ2[T.21] 0.0792 0.070 1.136 0.256 -0.057 0.216
## exp1:occ2[T.22] 0.1093 0.088 1.241 0.215 -0.063 0.282
## exp1:occ2[T.3] -0.0715 0.064 -1.121 0.262 -0.197 0.054
## exp1:occ2[T.4] -0.0724 0.075 -0.968 0.333 -0.219 0.074
## exp1:occ2[T.5] 0.0947 0.079 1.192 0.233 -0.061 0.250
## exp1:occ2[T.6] -0.0349 0.071 -0.490 0.624 -0.175 0.105
## exp1:occ2[T.7] -0.2279 0.078 -2.904 0.004 -0.382 -0.074
## exp1:occ2[T.8] -0.0727 0.065 -1.126 0.260 -0.199 0.054
## exp1:occ2[T.9] 0.0274 0.067 0.409 0.682 -0.104 0.159
## exp1:ind2[T.11] 0.1662 0.106 1.570 0.116 -0.041 0.374
## exp1:ind2[T.12] 0.1079 0.093 1.156 0.248 -0.075 0.291
## exp1:ind2[T.13] 0.1884 0.117 1.605 0.109 -0.042 0.418
## exp1:ind2[T.14] -0.0071 0.089 -0.080 0.936 -0.182 0.168
## exp1:ind2[T.15] -0.2081 0.203 -1.024 0.306 -0.607 0.190
## exp1:ind2[T.16] -0.0665 0.099 -0.671 0.502 -0.261 0.128
## exp1:ind2[T.17] 0.0216 0.095 0.229 0.819 -0.164 0.207
## exp1:ind2[T.18] 0.0053 0.091 0.058 0.954 -0.172 0.183
## exp1:ind2[T.19] 0.0004 0.111 0.003 0.997 -0.217 0.217
## exp1:ind2[T.2] -0.1513 0.164 -0.920 0.358 -0.474 0.171
## exp1:ind2[T.20] -0.0186 0.102 -0.182 0.855 -0.219 0.182
## exp1:ind2[T.21] 0.0678 0.101 0.673 0.501 -0.130 0.265
## exp1:ind2[T.22] -0.0367 0.096 -0.380 0.704 -0.226 0.152
## exp1:ind2[T.3] 0.3246 0.188 1.722 0.085 -0.045 0.694
## exp1:ind2[T.4] -0.1365 0.114 -1.194 0.233 -0.361 0.088
## exp1:ind2[T.5] -0.0256 0.096 -0.265 0.791 -0.215 0.163
## exp1:ind2[T.6] 0.0028 0.096 0.029 0.977 -0.185 0.191
## exp1:ind2[T.7] -0.0483 0.133 -0.364 0.716 -0.309 0.212
## exp1:ind2[T.8] 0.0845 0.118 0.715 0.475 -0.147 0.316
## exp1:ind2[T.9] -0.0153 0.087 -0.176 0.860 -0.186 0.156
## exp2 -0.5610 0.949 -0.591 0.554 -2.421 1.299
## exp2:occ2[T.10] -0.2692 0.641 -0.420 0.674 -1.525 0.986
## exp2:occ2[T.11] -1.0817 1.006 -1.075 0.282 -3.053 0.890
## exp2:occ2[T.12] 0.8324 0.934 0.891 0.373 -0.999 2.664
## exp2:occ2[T.13] -0.2210 0.773 -0.286 0.775 -1.736 1.294
## exp2:occ2[T.14] 0.7511 0.927 0.810 0.418 -1.067 2.569
## exp2:occ2[T.15] -0.0327 0.941 -0.035 0.972 -1.877 1.812
## exp2:occ2[T.16] 0.3636 0.551 0.660 0.509 -0.717 1.444
## exp2:occ2[T.17] -0.2659 0.486 -0.547 0.584 -1.219 0.687
## exp2:occ2[T.18] -2.5609 5.170 -0.495 0.620 -12.697 7.575
## exp2:occ2[T.19] -0.1292 1.062 -0.122 0.903 -2.211 1.952
## exp2:occ2[T.2] 0.6632 0.552 1.201 0.230 -0.420 1.746
## exp2:occ2[T.20] -0.3323 0.723 -0.460 0.646 -1.750 1.085
## exp2:occ2[T.21] -0.9100 0.685 -1.328 0.184 -2.254 0.434
## exp2:occ2[T.22] -0.8551 0.828 -1.033 0.302 -2.478 0.768
## exp2:occ2[T.3] 0.6415 0.710 0.903 0.366 -0.751 2.034
## exp2:occ2[T.4] 0.9748 0.866 1.126 0.260 -0.722 2.672
## exp2:occ2[T.5] -0.9779 0.974 -1.004 0.315 -2.887 0.931
## exp2:occ2[T.6] 0.1051 0.800 0.131 0.896 -1.464 1.674
## exp2:occ2[T.7] 3.1407 0.939 3.345 0.001 1.300 4.981
## exp2:occ2[T.8] 0.6711 0.719 0.933 0.351 -0.739 2.081
## exp2:occ2[T.9] 0.0232 0.763 0.030 0.976 -1.472 1.519
## exp2:ind2[T.11] -1.6804 1.080 -1.555 0.120 -3.798 0.438
## exp2:ind2[T.12] -0.9717 0.935 -1.039 0.299 -2.804 0.861
## exp2:ind2[T.13] -1.7679 1.156 -1.529 0.126 -4.035 0.499
## exp2:ind2[T.14] 0.1190 0.888 0.134 0.893 -1.622 1.860
## exp2:ind2[T.15] 2.3885 2.202 1.085 0.278 -1.928 6.705
## exp2:ind2[T.16] 0.8707 0.990 0.879 0.379 -1.070 2.812
## exp2:ind2[T.17] -0.0030 0.948 -0.003 0.998 -1.862 1.856
## exp2:ind2[T.18] -0.0033 0.889 -0.004 0.997 -1.746 1.740
## exp2:ind2[T.19] 0.2665 1.119 0.238 0.812 -1.927 2.460
## exp2:ind2[T.2] 2.1973 1.774 1.239 0.216 -1.280 5.675
## exp2:ind2[T.20] 0.2506 1.010 0.248 0.804 -1.729 2.231
## exp2:ind2[T.21] -0.9154 1.015 -0.902 0.367 -2.904 1.074
## exp2:ind2[T.22] 0.3395 0.949 0.358 0.721 -1.522 2.201
## exp2:ind2[T.3] -3.7396 1.959 -1.908 0.056 -7.581 0.102
## exp2:ind2[T.4] 1.0920 1.143 0.955 0.339 -1.149 3.333
## exp2:ind2[T.5] 0.1824 0.939 0.194 0.846 -1.658 2.023
## exp2:ind2[T.6] -0.0304 0.930 -0.033 0.974 -1.853 1.792
## exp2:ind2[T.7] 0.7325 1.440 0.509 0.611 -2.090 3.555
## exp2:ind2[T.8] -0.7507 1.203 -0.624 0.533 -3.108 1.607
## exp2:ind2[T.9] 0.4177 0.839 0.498 0.619 -1.227 2.062
## exp3 0.1293 0.357 0.363 0.717 -0.570 0.828
## exp3:occ2[T.10] 0.1855 0.258 0.720 0.471 -0.319 0.690
## exp3:occ2[T.11] 0.3932 0.382 1.030 0.303 -0.355 1.142
## exp3:occ2[T.12] -0.2203 0.366 -0.602 0.547 -0.938 0.497
## exp3:occ2[T.13] 0.0950 0.290 0.327 0.744 -0.474 0.664
## exp3:occ2[T.14] -0.1444 0.334 -0.432 0.666 -0.800 0.511
## exp3:occ2[T.15] 0.1477 0.365 0.405 0.685 -0.567 0.862
## exp3:occ2[T.16] -0.0379 0.215 -0.176 0.860 -0.460 0.384
## exp3:occ2[T.17] 0.1510 0.188 0.804 0.421 -0.217 0.519
## exp3:occ2[T.18] 1.4084 1.885 0.747 0.455 -2.287 5.104
## exp3:occ2[T.19] 0.0923 0.404 0.228 0.819 -0.700 0.885
## exp3:occ2[T.2] -0.2039 0.221 -0.922 0.356 -0.637 0.230
## exp3:occ2[T.20] 0.1807 0.265 0.681 0.496 -0.339 0.701
## exp3:occ2[T.21] 0.3779 0.255 1.480 0.139 -0.123 0.878
## exp3:occ2[T.22] 0.2855 0.298 0.957 0.339 -0.300 0.871
## exp3:occ2[T.3] -0.2370 0.287 -0.826 0.409 -0.800 0.326
## exp3:occ2[T.4] -0.4367 0.352 -1.241 0.215 -1.127 0.253
## exp3:occ2[T.5] 0.3885 0.412 0.943 0.346 -0.419 1.196
## exp3:occ2[T.6] 0.0485 0.329 0.147 0.883 -0.597 0.694
## exp3:occ2[T.7] -1.3949 0.405 -3.444 0.001 -2.189 -0.601
## exp3:occ2[T.8] -0.2054 0.290 -0.709 0.478 -0.773 0.362
## exp3:occ2[T.9] -0.0910 0.314 -0.289 0.772 -0.707 0.525
## exp3:ind2[T.11] 0.6429 0.411 1.564 0.118 -0.163 1.449
## exp3:ind2[T.12] 0.3286 0.347 0.947 0.344 -0.351 1.009
## exp3:ind2[T.13] 0.5929 0.426 1.392 0.164 -0.242 1.428
## exp3:ind2[T.14] -0.0285 0.328 -0.087 0.931 -0.672 0.615
## exp3:ind2[T.15] -0.8569 0.844 -1.016 0.310 -2.511 0.797
## exp3:ind2[T.16] -0.3558 0.368 -0.968 0.333 -1.077 0.365
## exp3:ind2[T.17] -0.0363 0.352 -0.103 0.918 -0.727 0.654
## exp3:ind2[T.18] 0.0157 0.325 0.048 0.961 -0.621 0.653
## exp3:ind2[T.19] -0.1488 0.421 -0.354 0.724 -0.974 0.676
## exp3:ind2[T.2] -1.0448 0.707 -1.479 0.139 -2.430 0.341
## exp3:ind2[T.20] -0.0679 0.370 -0.183 0.855 -0.794 0.658
## exp3:ind2[T.21] 0.3967 0.381 1.041 0.298 -0.351 1.144
## exp3:ind2[T.22] -0.0760 0.349 -0.218 0.827 -0.760 0.608
## exp3:ind2[T.3] 1.6218 0.785 2.067 0.039 0.084 3.160
## exp3:ind2[T.4] -0.3150 0.429 -0.735 0.463 -1.155 0.526
## exp3:ind2[T.5] -0.0506 0.340 -0.149 0.882 -0.717 0.616
## exp3:ind2[T.6] 0.0193 0.336 0.058 0.954 -0.639 0.678
## exp3:ind2[T.7] -0.3359 0.586 -0.573 0.567 -1.485 0.813
## exp3:ind2[T.8] 0.1893 0.452 0.419 0.675 -0.696 1.075
## exp3:ind2[T.9] -0.2161 0.300 -0.721 0.471 -0.804 0.372
## exp4 -0.0053 0.045 -0.120 0.904 -0.093 0.082
## exp4:occ2[T.10] -0.0333 0.034 -0.984 0.325 -0.100 0.033
## exp4:occ2[T.11] -0.0466 0.048 -0.973 0.331 -0.141 0.047
## exp4:occ2[T.12] 0.0110 0.047 0.234 0.815 -0.081 0.103
## exp4:occ2[T.13] -0.0137 0.036 -0.381 0.703 -0.084 0.057
## exp4:occ2[T.14] 0.0056 0.040 0.139 0.890 -0.073 0.084
## exp4:occ2[T.15] -0.0327 0.046 -0.708 0.479 -0.123 0.058
## exp4:occ2[T.16] -0.0090 0.028 -0.325 0.745 -0.063 0.045
## exp4:occ2[T.17] -0.0257 0.024 -1.073 0.283 -0.073 0.021
## exp4:occ2[T.18] -0.2121 0.220 -0.963 0.336 -0.644 0.220
## exp4:occ2[T.19] -0.0169 0.051 -0.330 0.741 -0.118 0.084
## exp4:occ2[T.2] 0.0176 0.029 0.610 0.542 -0.039 0.074
## exp4:occ2[T.20] -0.0296 0.032 -0.916 0.360 -0.093 0.034
## exp4:occ2[T.21] -0.0525 0.032 -1.654 0.098 -0.115 0.010
## exp4:occ2[T.22] -0.0351 0.036 -0.972 0.331 -0.106 0.036
## exp4:occ2[T.3] 0.0303 0.038 0.805 0.421 -0.044 0.104
## exp4:occ2[T.4] 0.0584 0.046 1.276 0.202 -0.031 0.148
## exp4:occ2[T.5] -0.0515 0.055 -0.938 0.349 -0.159 0.056
## exp4:occ2[T.6] -0.0170 0.044 -0.386 0.699 -0.103 0.069
## exp4:occ2[T.7] 0.1905 0.056 3.410 0.001 0.081 0.300
## exp4:occ2[T.8] 0.0197 0.038 0.518 0.604 -0.055 0.094
## exp4:occ2[T.9] 0.0190 0.042 0.451 0.652 -0.064 0.102
## exp4:ind2[T.11] -0.0840 0.052 -1.619 0.106 -0.186 0.018
## exp4:ind2[T.12] -0.0390 0.042 -0.918 0.359 -0.122 0.044
## exp4:ind2[T.13] -0.0673 0.052 -1.297 0.195 -0.169 0.034
## exp4:ind2[T.14] -6.822e-05 0.040 -0.002 0.999 -0.079 0.078
## exp4:ind2[T.15] 0.0951 0.104 0.911 0.362 -0.110 0.300
## exp4:ind2[T.16] 0.0439 0.045 0.970 0.332 -0.045 0.132
## exp4:ind2[T.17] 0.0055 0.043 0.129 0.898 -0.079 0.090
## exp4:ind2[T.18] -0.0063 0.039 -0.161 0.872 -0.084 0.071
## exp4:ind2[T.19] 0.0213 0.052 0.407 0.684 -0.081 0.124
## exp4:ind2[T.2] 0.1483 0.092 1.615 0.106 -0.032 0.328
## exp4:ind2[T.20] 0.0014 0.045 0.032 0.975 -0.087 0.089
## exp4:ind2[T.21] -0.0550 0.047 -1.160 0.246 -0.148 0.038
## exp4:ind2[T.22] 0.0002 0.042 0.004 0.996 -0.083 0.083
## exp4:ind2[T.3] -0.2369 0.107 -2.220 0.026 -0.446 -0.028
## exp4:ind2[T.4] 0.0273 0.053 0.511 0.609 -0.077 0.132
## exp4:ind2[T.5] 0.0042 0.041 0.103 0.918 -0.076 0.084
## exp4:ind2[T.6] -0.0043 0.040 -0.108 0.914 -0.083 0.075
## exp4:ind2[T.7] 0.0481 0.078 0.613 0.540 -0.106 0.202
## exp4:ind2[T.8] -0.0127 0.056 -0.226 0.821 -0.123 0.097
## exp4:ind2[T.9] 0.0305 0.035 0.861 0.389 -0.039 0.100
## exp1:shs -0.1920 0.196 -0.982 0.326 -0.575 0.191
## exp1:hsg -0.0173 0.057 -0.303 0.762 -0.130 0.095
## exp1:scl -0.0665 0.043 -1.532 0.126 -0.151 0.019
## exp1:clg -0.0550 0.031 -1.774 0.076 -0.116 0.006
## exp1:mw -0.0280 0.030 -0.944 0.345 -0.086 0.030
## exp1:so -0.0100 0.027 -0.374 0.709 -0.062 0.042
## exp1:we 0.0063 0.030 0.209 0.834 -0.053 0.065
## exp2:shs 1.9005 1.450 1.310 0.190 -0.943 4.744
## exp2:hsg 0.1172 0.551 0.213 0.832 -0.963 1.197
## exp2:scl 0.6218 0.463 1.343 0.179 -0.286 1.529
## exp2:clg 0.4097 0.380 1.077 0.281 -0.336 1.155
## exp2:mw 0.2006 0.317 0.632 0.527 -0.421 0.823
## exp2:so 0.0544 0.282 0.193 0.847 -0.498 0.606
## exp2:we 0.0013 0.321 0.004 0.997 -0.628 0.630
## exp3:shs -0.6721 0.443 -1.518 0.129 -1.540 0.196
## exp3:hsg -0.0180 0.208 -0.086 0.931 -0.426 0.390
## exp3:scl -0.1998 0.186 -1.077 0.282 -0.563 0.164
## exp3:clg -0.1025 0.164 -0.624 0.533 -0.425 0.220
## exp3:mw -0.0626 0.124 -0.504 0.614 -0.306 0.181
## exp3:so -0.0116 0.108 -0.107 0.915 -0.224 0.201
## exp3:we -0.0125 0.125 -0.100 0.921 -0.258 0.233
## exp4:shs 0.0777 0.048 1.635 0.102 -0.015 0.171
## exp4:hsg 0.0005 0.027 0.018 0.985 -0.052 0.053
## exp4:scl 0.0211 0.025 0.859 0.390 -0.027 0.069
## exp4:clg 0.0079 0.023 0.346 0.729 -0.037 0.052
## exp4:mw 0.0062 0.016 0.393 0.694 -0.025 0.037
## exp4:so 0.0003 0.014 0.023 0.982 -0.026 0.027
## exp4:we 0.0018 0.016 0.111 0.912 -0.030 0.033
## ==============================================================================
## Omnibus: 395.012 Durbin-Watson: 1.898
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1529.250
## Skew: 0.303 Prob(JB): 0.00
## Kurtosis: 5.600 Cond. No. 6.87e+04
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.87e+04. This might indicate that there are
## strong multicollinearity or other numerical problems.
print( "Number of regressors in the basic model:",len(flex_results.params), '\n')
## Number of regressors in the basic model: 246
The number of regressors in the flexible model is 246.
2.3.3 Lasso Model:
First, we import the essential libraries.
R code
library(hdm)
Python code
from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
Then, we estimate the lasso model.
R code
flex <- lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we +
  (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + occ2 + ind2 + mw + so + we)
lassoreg <- rlasso(flex, data = data)
sumlasso <- summary(lassoreg)
##
## Call:
## rlasso.formula(formula = flex, data = data)
##
## Post-Lasso Estimation: TRUE
##
## Total number of variables: 245
## Number of selected variables: 24
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03159 -0.29132 -0.01137 0.28472 3.63651
##
## Estimate
## (Intercept) 3.137
## sex 0.000
## shs -0.480
## hsg -0.404
## scl -0.306
## clg 0.000
## occ22 0.041
## occ23 0.109
## occ24 0.000
## occ25 0.000
## occ26 -0.327
## occ27 0.000
## occ28 -0.257
## occ29 0.000
## occ210 0.038
## occ211 -0.398
## occ212 0.000
## occ213 -0.131
## occ214 -0.189
## occ215 -0.387
## occ216 0.000
## occ217 -0.280
## occ218 0.000
## occ219 0.000
## occ220 0.000
## occ221 0.000
## occ222 -0.209
## ind23 0.000
## ind24 0.000
## ind25 0.000
## ind26 0.000
## ind27 0.000
## ind28 0.000
## ind29 -0.169
## ind210 0.000
## ind211 0.000
## ind212 0.193
## ind213 0.000
## ind214 0.076
## ind215 0.000
## ind216 0.000
## ind217 0.000
## ind218 0.000
## ind219 0.000
## ind220 -0.225
## ind221 0.000
## ind222 0.140
## mw 0.000
## so 0.000
## we 0.000
## exp1 0.009
## exp2 0.000
## exp3 0.000
## exp4 0.000
## shs:exp1 0.000
## hsg:exp1 0.000
## scl:exp1 0.000
## clg:exp1 0.000
## occ22:exp1 0.000
## occ23:exp1 0.000
## occ24:exp1 0.000
## occ25:exp1 0.000
## occ26:exp1 0.000
## occ27:exp1 0.000
## occ28:exp1 0.000
## occ29:exp1 0.000
## occ210:exp1 0.003
## occ211:exp1 0.000
## occ212:exp1 0.000
## occ213:exp1 -0.011
## occ214:exp1 -0.009
## occ215:exp1 0.000
## occ216:exp1 0.000
## occ217:exp1 0.000
## occ218:exp1 0.000
## occ219:exp1 0.000
## occ220:exp1 0.000
## occ221:exp1 0.000
## occ222:exp1 0.000
## ind23:exp1 0.000
## ind24:exp1 0.000
## ind25:exp1 0.000
## ind26:exp1 0.000
## ind27:exp1 0.000
## ind28:exp1 0.000
## ind29:exp1 0.000
## ind210:exp1 0.000
## ind211:exp1 0.000
## ind212:exp1 0.000
## ind213:exp1 0.000
## ind214:exp1 0.004
## ind215:exp1 0.000
## ind216:exp1 0.000
## ind217:exp1 0.000
## ind218:exp1 0.000
## ind219:exp1 0.000
## ind220:exp1 0.000
## ind221:exp1 0.000
## ind222:exp1 0.000
## mw:exp1 0.000
## so:exp1 0.000
## we:exp1 0.000
## shs:exp2 0.000
## hsg:exp2 0.000
## scl:exp2 0.000
## clg:exp2 0.000
## occ22:exp2 0.000
## occ23:exp2 0.000
## occ24:exp2 0.000
## occ25:exp2 0.000
## occ26:exp2 0.000
## occ27:exp2 0.000
## occ28:exp2 0.000
## occ29:exp2 0.000
## occ210:exp2 0.000
## occ211:exp2 0.000
## occ212:exp2 0.000
## occ213:exp2 0.000
## occ214:exp2 0.000
## occ215:exp2 0.000
## occ216:exp2 0.000
## occ217:exp2 0.000
## occ218:exp2 0.000
## occ219:exp2 0.000
## occ220:exp2 0.000
## occ221:exp2 0.000
## occ222:exp2 0.000
## ind23:exp2 0.000
## ind24:exp2 0.000
## ind25:exp2 0.000
## ind26:exp2 0.000
## ind27:exp2 0.000
## ind28:exp2 0.000
## ind29:exp2 0.000
## ind210:exp2 0.000
## ind211:exp2 0.000
## ind212:exp2 0.000
## ind213:exp2 0.000
## ind214:exp2 0.000
## ind215:exp2 0.000
## ind216:exp2 0.000
## ind217:exp2 0.000
## ind218:exp2 0.000
## ind219:exp2 0.000
## ind220:exp2 0.000
## ind221:exp2 0.000
## ind222:exp2 0.000
## mw:exp2 0.000
## so:exp2 0.000
## we:exp2 0.000
## shs:exp3 0.000
## hsg:exp3 0.000
## scl:exp3 0.000
## clg:exp3 0.000
## occ22:exp3 0.000
## occ23:exp3 0.000
## occ24:exp3 0.000
## occ25:exp3 0.000
## occ26:exp3 0.000
## occ27:exp3 0.000
## occ28:exp3 0.000
## occ29:exp3 0.000
## occ210:exp3 0.000
## occ211:exp3 0.000
## occ212:exp3 0.000
## occ213:exp3 0.000
## occ214:exp3 0.000
## occ215:exp3 0.000
## occ216:exp3 0.000
## occ217:exp3 0.000
## occ218:exp3 0.000
## occ219:exp3 0.000
## occ220:exp3 0.000
## occ221:exp3 0.000
## occ222:exp3 0.000
## ind23:exp3 0.000
## ind24:exp3 0.000
## ind25:exp3 0.000
## ind26:exp3 0.000
## ind27:exp3 0.000
## ind28:exp3 0.000
## ind29:exp3 0.000
## ind210:exp3 0.000
## ind211:exp3 0.000
## ind212:exp3 0.000
## ind213:exp3 0.000
## ind214:exp3 0.000
## ind215:exp3 0.000
## ind216:exp3 0.000
## ind217:exp3 0.000
## ind218:exp3 0.000
## ind219:exp3 0.000
## ind220:exp3 0.000
## ind221:exp3 0.000
## ind222:exp3 0.000
## mw:exp3 0.000
## so:exp3 0.000
## we:exp3 0.000
## shs:exp4 0.000
## hsg:exp4 0.000
## scl:exp4 0.000
## clg:exp4 0.000
## occ22:exp4 0.000
## occ23:exp4 0.000
## occ24:exp4 0.000
## occ25:exp4 0.000
## occ26:exp4 0.000
## occ27:exp4 0.000
## occ28:exp4 0.000
## occ29:exp4 0.000
## occ210:exp4 0.000
## occ211:exp4 0.000
## occ212:exp4 0.000
## occ213:exp4 0.000
## occ214:exp4 0.000
## occ215:exp4 0.000
## occ216:exp4 0.000
## occ217:exp4 0.000
## occ218:exp4 0.000
## occ219:exp4 0.000
## occ220:exp4 0.000
## occ221:exp4 0.000
## occ222:exp4 0.000
## ind23:exp4 0.000
## ind24:exp4 0.000
## ind25:exp4 0.000
## ind26:exp4 0.000
## ind27:exp4 0.000
## ind28:exp4 0.000
## ind29:exp4 0.000
## ind210:exp4 0.000
## ind211:exp4 0.000
## ind212:exp4 0.000
## ind213:exp4 0.000
## ind214:exp4 0.000
## ind215:exp4 0.000
## ind216:exp4 0.000
## ind217:exp4 0.000
## ind218:exp4 0.000
## ind219:exp4 0.000
## ind220:exp4 0.000
## ind221:exp4 0.000
## ind222:exp4 0.000
## mw:exp4 0.000
## so:exp4 0.000
## we:exp4 0.000
##
## Residual standard error: 0.4847
## Multiple R-squared: 0.2779
## Adjusted R-squared: 0.2745
## Joint significance test:
## the sup score statistic for joint significance test is 89.09 with a p-value of 0.008
Python code
# Get exogenous variables from the flexible model
X = flex_results_0.exog
X.shape
## (5150, 246)

# Set endogenous variable
lwage = data["lwage"]
lwage.shape
## (5150,)

# Set penalty value
alpha = 0.1
# reg = linear_model.Lasso(alpha=0.1/np.log(len(lwage)))
reg = linear_model.Lasso(alpha=alpha)

# LASSO regression for the flexible model
reg.fit(X, lwage)
## Lasso(alpha=0.1)
lwage_lasso_fitted = reg.fit(X, lwage).predict(X)

# coefficients
# reg.coef_
print('Lasso Regression: R^2 score', reg.score(X, lwage))
## Lasso Regression: R^2 score 0.16047849625520638
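The penalty level \(\alpha = 0.1\) above is an ad-hoc choice, and unlike hdm's rlasso, sklearn's Lasso has no built-in theory-driven penalty. A common data-driven alternative is to select \(\alpha\) by cross-validation with LassoCV (already imported above); a minimal sketch, with an arbitrary choice of 5 folds:
Python code
# Select the penalty by 5-fold cross-validation instead of fixing alpha = 0.1.
# In practice one would typically standardize the columns of X first.
reg_cv = LassoCV(cv=5, random_state=0).fit(X, lwage)
print("CV-selected penalty:", reg_cv.alpha_)
print("Lasso (CV) R^2 score:", reg_cv.score(X, lwage))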
Now, we can evaluate the in-sample performance of all three models based on the (adjusted) \(R^2_{sample}\) and the (adjusted) \(MSE_{sample}\):
- R-Squared \((R^2)\)
R code
# Summaries of the basic and flexible models
sumbasic <- summary(regbasic)
sumflex <- summary(regflex)

# R-squared from the basic, flexible and lasso models
R2.1 <- sumbasic$r.squared
R2.adj1 <- sumbasic$adj.r.squared

R2.2 <- sumflex$r.squared
R2.adj2 <- sumflex$adj.r.squared

R2.L <- sumlasso$r.squared
R2.adjL <- sumlasso$adj.r.squared
- R-squared for the basic model: 0.3100465.
- Adjusted R-squared for the basic model: 0.3032809.
- R-squared for the flexible model: 0.3511099.
- Adjusted R-squared for the flexible model: 0.3186919.
- R-squared for the lasso with flexible model: 0.2778653.
- Adjusted R-squared for the lasso with flexible model: 0.2744836.
Python code
# Assess the in-sample predictive performance
R2_1 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1, "\n")
## R-squared for the basic model: 0.31004650692219504
R2_adj1 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1, "\n")
## adjusted R-squared for the basic model: 0.3032809304064292
R2_2 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2, "\n")
## R-squared for the flexible model: 0.3511098950617233
R2_adj2 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2, "\n")
## adjusted R-squared for the flexible model: 0.31869185352218865
R2_L = reg.score(flex_results_0.exog, lwage)
print("R-squared for LASSO: ", R2_L, "\n")
## R-squared for LASSO: 0.16047849625520638
R2_adjL = 1 - (1-R2_L)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)
print("adjusted R-squared for LASSO: ", R2_adjL, "\n")
## adjusted R-squared for LASSO: 0.11835687889415825
- Mean Squared Error \(MSE\)
R code
# MSE and adjusted MSE for the basic model
MSE1 <- mean(sumbasic$res^2)
p1 <- sumbasic$df[1] # number of regressors
MSE.adj1 <- (n / (n - p1)) * MSE1

# MSE and adjusted MSE for the flexible model
MSE2 <- mean(sumflex$res^2)
p2 <- sumflex$df[1]
MSE.adj2 <- (n / (n - p2)) * MSE2

# MSE and adjusted MSE for the lasso model
MSEL <- mean(sumlasso$res^2)
pL <- length(sumlasso$coef)
MSE.adjL <- (n / (n - pL)) * MSEL
- MSE for the basic model: 0.2244251
- Adjusted MSE for the basic model: 0.2266697
- MSE for the flexible model: 0.2110681
- Adjusted MSE for the flexible model: 0.221656
- MSE for the lasso flexible model: 0.2348928
- Adjusted MSE for the lasso flexible model: 0.2466758
Python code
# calculating the MSE
MSE1 = np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1, "\n")
## MSE for the basic model: 0.22442505581164396
p1 = len(basic_results.params) # number of regressors
n = len(lwage)
MSE_adj1 = (n/(n-p1))*MSE1
print("adjusted MSE for the basic model: ", MSE_adj1, "\n")
## adjusted MSE for the basic model: 0.2266697465051905
MSE2 = np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2, "\n")
## MSE for the flexible model: 0.21106813644318217
p2 = len(flex_results.params) # number of regressors
n = len(lwage)
MSE_adj2 = (n/(n-p2))*MSE2
print("adjusted MSE for the flexible model: ", MSE_adj2, "\n")
## adjusted MSE for the flexible model: 0.2216559752614984
MSEL = mean_squared_error(lwage, lwage_lasso_fitted)
print("MSE for the LASSO model: ", MSEL, "\n")
## MSE for the LASSO model: 0.2730758844230591
pL = reg.coef_.shape[0] # number of regressors
n = len(lwage)
MSE_adjL = (n/(n-pL))*MSEL
print("adjusted MSE for LASSO model: ", MSE_adjL, "\n")
## adjusted MSE for LASSO model: 0.28677422609680964
LaTeX presentation
R code
<- c("Basic reg","Flexible reg","Lasso reg")
Models <- c(p1,p2,pL)
p <- c(R2.1,R2.2,R2.L)
R_2 <- c(MSE1,MSE2,MSEL)
MSE <- c(R2.adj1,R2.adj2,R2.adjL)
R_2_adj <- c(MSE.adj1,MSE.adj2,MSE.adjL) MSE_adj
data.frame(Models,p,R_2,MSE,R_2_adj,MSE_adj) %>%
kable("markdown",caption = "Descriptive Statistics")
Models | p | R_2 | MSE | R_2_adj | MSE_adj |
---|---|---|---|---|---|
Basic reg | 51 | 0.3100465 | 0.2244251 | 0.3032809 | 0.2266697 |
Flexible reg | 246 | 0.3511099 | 0.2110681 | 0.3186919 | 0.2216560 |
Lasso reg | 246 | 0.2778653 | 0.2348928 | 0.2744836 | 0.2466758 |
Python code
# import array_to_latex as a2l
table = np.zeros((3, 5))
table[0, 0:5] = [p1, R2_1, MSE1, R2_adj1, MSE_adj1]
table[1, 0:5] = [p2, R2_2, MSE2, R2_adj2, MSE_adj2]
table[2, 0:5] = [pL, R2_L, MSEL, R2_adjL, MSE_adjL]
table
## array([[5.10000000e+01, 3.10046507e-01, 2.24425056e-01, 3.03280930e-01,
## 2.26669747e-01],
## [2.46000000e+02, 3.51109895e-01, 2.11068136e-01, 3.18691854e-01,
## 2.21655975e-01],
## [2.46000000e+02, 1.60478496e-01, 2.73075884e-01, 1.18356879e-01,
## 2.86774226e-01]])
table = pd.DataFrame(table, columns=["p", "$R^2_{sample}$", "$MSE_{sample}$", "$R^2_{adjusted}$", "$MSE_{adjusted}$"], index=["basic reg", "flexible reg", "lasso flex"])
table
## p $R^2_{sample}$ ... $R^2_{adjusted}$ $MSE_{adjusted}$
## basic reg 51.0 0.310047 ... 0.303281 0.226670
## flexible reg 246.0 0.351110 ... 0.318692 0.221656
## lasso flex 246.0 0.160478 ... 0.118357 0.286774
##
## [3 rows x 5 columns]
Considering all measures above, the flexible model performs slightly better than the basic model. Note, however, that all of these are in-sample measures: with a large number of regressors they can be overly optimistic because of overfitting. One procedure to circumvent this issue is data splitting, which is described and applied in the following.
2.4 Data Splitting
Measure the prediction quality of the two models via data splitting:
Randomly split the data into one training sample and one testing sample. Here we use a simple random split; stratified splitting is a more sophisticated alternative.
Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
Use the testing sample for evaluation. Predict the \(wage\) of every observation in the testing sample based on the estimated parameters in the training sample.
Calculate the Mean Squared Prediction Error \(MSE_{test}\) based on the testing sample for both prediction models (a compact preview of these steps is sketched below).
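Before walking through the R and Python implementations, here is a minimal Python preview of the four steps for the basic model, using sklearn's train_test_split as an alternative to the manual splits below (it reuses data, basic, smf and np from above; note that with rare occupation or industry categories a random split can occasionally leave a test-only level that makes the prediction step fail):
Python code
from sklearn.model_selection import train_test_split

# Step 1: random 4/5 - 1/5 split
train_df, test_df = train_test_split(data, test_size=0.2, random_state=0)

# Step 2: estimate the basic model on the training sample
fit_train = smf.ols(basic, data=train_df).fit()

# Step 3: predict log wages for the testing sample
yhat_test = fit_train.predict(test_df)

# Step 4: out-of-sample MSE and R^2
y_test = test_df["lwage"]
MSE_test = np.mean((y_test - yhat_test)**2)
R2_test = 1 - MSE_test / np.var(y_test)
print("Test MSE:", MSE_test, "  Test R^2:", R2_test)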
R code
# to make the results replicable (generating random numbers)
set.seed(1)
# draw (4/5)*n random numbers from 1 to n without replacing them
random_2 <- sample(1:n, floor(n*4/5))
# training sample
train <- data[random_2,]
# testing sample
test <- data[-random_2,]
Python code
# Import relevant packages for splitting data
import random
import math
# Set Seed
# to make the results replicable (generating random numbers)
np.random.seed(0)
random = np.random.randint(0, n, size=math.floor(n))
data["random"] = random
random # the array does not change
## array([2732, 2607, 1653, ..., 4184, 2349, 3462])
data_2 = data.sort_values(by=['random'])
data_2.head()
## wage lwage sex shs hsg ... occ occ2 ind ind2 random
## rownames ...
## 2223 26.442308 3.274965 1.0 0.0 1.0 ... 340 1 8660 20 0
## 3467 19.230769 2.956512 0.0 0.0 0.0 ... 9620 22 1870 5 0
## 13501 48.076923 3.872802 1.0 0.0 0.0 ... 3060 10 8190 18 0
## 15588 12.019231 2.486508 0.0 0.0 1.0 ... 6440 19 770 4 2
## 16049 39.903846 3.686473 1.0 0.0 0.0 ... 1820 5 7860 17 2
##
## [5 rows x 21 columns]
# Create training and testing sample
train = data_2[ : math.floor(n*4/5)] # training sample
test = data_2[ math.floor(n*4/5) : ] # testing sample
print(train.shape)
## (4120, 21)
print(test.shape)
## (1030, 21)
The training sample has 4120 rows and the testing sample has 1030 rows; the 21st column in the Python data frames is the auxiliary random variable created for the split.
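As an aside, a simple random split can also be produced directly with scikit-learn's `train_test_split`, which permutes the rows without replacement (a minimal sketch, assuming scikit-learn is installed; `train_alt` and `test_alt` are illustrative names):

# a simple 80/20 random split of the rows; random_state fixes the seed
from sklearn.model_selection import train_test_split

train_alt, test_alt = train_test_split(data, train_size=0.8, random_state=0)
print(train_alt.shape, test_alt.shape)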
We estimate the parameters using the training data set.
R code
# basic model
# estimating the parameters in the training sample
regbasic <- lm(basic, data=train)
regbasic
##
## Call:
## lm(formula = basic, data = train)
##
## Coefficients:
## (Intercept) sex exp1 shs hsg scl
## 3.641716 -0.059065 0.008635 -0.570657 -0.508393 -0.405932
## clg mw so we occ22 occ23
## -0.178154 -0.044486 -0.051111 0.004004 -0.083481 -0.036320
## occ24 occ25 occ26 occ27 occ28 occ29
## -0.091457 -0.126383 -0.416548 -0.046615 -0.385957 -0.220534
## occ210 occ211 occ212 occ213 occ214 occ215
## -0.030423 -0.460487 -0.317680 -0.375180 -0.465495 -0.494731
## occ216 occ217 occ218 occ219 occ220 occ221
## -0.212026 -0.413355 -0.329054 -0.263839 -0.241109 -0.293004
## occ222 ind23 ind24 ind25 ind26 ind27
## -0.410747 -0.075902 -0.121251 -0.206172 -0.151398 -0.081742
## ind28 ind29 ind210 ind211 ind212 ind213
## -0.186304 -0.304987 -0.126226 -0.090402 0.028445 -0.116162
## ind214 ind215 ind216 ind217 ind218 ind219
## -0.105521 -0.227273 -0.223604 -0.208887 -0.256808 -0.266724
## ind220 ind221 ind222
## -0.459034 -0.226775 -0.047166
Python code
# basic model
# estimating the parameters in the training sample
basic_results = smf.ols(basic , data=train).fit()
print(basic_results.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: lwage R-squared: 0.316
## Model: OLS Adj. R-squared: 0.308
## Method: Least Squares F-statistic: 37.65
## Date: Wed, 24 Nov 2021 Prob (F-statistic): 4.85e-293
## Time: 12:19:56 Log-Likelihood: -2784.1
## No. Observations: 4120 AIC: 5670.
## Df Residuals: 4069 BIC: 5993.
## Df Model: 50
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 3.5365 0.061 58.134 0.000 3.417 3.656
## occ2[T.10] -0.0054 0.045 -0.120 0.904 -0.094 0.083
## occ2[T.11] -0.4594 0.067 -6.893 0.000 -0.590 -0.329
## occ2[T.12] -0.3300 0.061 -5.365 0.000 -0.451 -0.209
## occ2[T.13] -0.3767 0.050 -7.544 0.000 -0.475 -0.279
## occ2[T.14] -0.5026 0.056 -8.947 0.000 -0.613 -0.392
## occ2[T.15] -0.4511 0.059 -7.586 0.000 -0.568 -0.335
## occ2[T.16] -0.2482 0.036 -6.818 0.000 -0.320 -0.177
## occ2[T.17] -0.4286 0.031 -13.624 0.000 -0.490 -0.367
## occ2[T.18] -0.2957 0.216 -1.367 0.172 -0.720 0.128
## occ2[T.19] -0.2354 0.056 -4.191 0.000 -0.345 -0.125
## occ2[T.2] -0.0771 0.038 -2.029 0.043 -0.152 -0.003
## occ2[T.20] -0.2158 0.046 -4.669 0.000 -0.306 -0.125
## occ2[T.21] -0.3029 0.042 -7.171 0.000 -0.386 -0.220
## occ2[T.22] -0.4385 0.047 -9.385 0.000 -0.530 -0.347
## occ2[T.3] -0.0054 0.044 -0.121 0.904 -0.092 0.082
## occ2[T.4] -0.0867 0.061 -1.431 0.152 -0.206 0.032
## occ2[T.5] -0.2064 0.072 -2.866 0.004 -0.348 -0.065
## occ2[T.6] -0.4175 0.057 -7.317 0.000 -0.529 -0.306
## occ2[T.7] -0.0111 0.063 -0.177 0.860 -0.134 0.112
## occ2[T.8] -0.3633 0.049 -7.380 0.000 -0.460 -0.267
## occ2[T.9] -0.1928 0.052 -3.743 0.000 -0.294 -0.092
## ind2[T.11] 0.0622 0.066 0.937 0.349 -0.068 0.192
## ind2[T.12] 0.1328 0.060 2.220 0.026 0.016 0.250
## ind2[T.13] 0.0492 0.078 0.629 0.529 -0.104 0.203
## ind2[T.14] 0.0062 0.057 0.108 0.914 -0.106 0.118
## ind2[T.15] -0.1137 0.150 -0.759 0.448 -0.407 0.180
## ind2[T.16] -0.1072 0.063 -1.690 0.091 -0.232 0.017
## ind2[T.17] -0.1034 0.063 -1.640 0.101 -0.227 0.020
## ind2[T.18] -0.1331 0.058 -2.298 0.022 -0.247 -0.020
## ind2[T.19] -0.1591 0.073 -2.190 0.029 -0.301 -0.017
## ind2[T.2] 0.2190 0.097 2.248 0.025 0.028 0.410
## ind2[T.20] -0.3512 0.064 -5.518 0.000 -0.476 -0.226
## ind2[T.21] -0.0824 0.062 -1.325 0.185 -0.204 0.040
## ind2[T.22] 0.0795 0.060 1.321 0.186 -0.038 0.197
## ind2[T.3] 0.0533 0.088 0.603 0.547 -0.120 0.227
## ind2[T.4] -0.0416 0.065 -0.636 0.525 -0.170 0.087
## ind2[T.5] -0.0628 0.063 -1.004 0.315 -0.185 0.060
## ind2[T.6] -0.0394 0.059 -0.673 0.501 -0.154 0.075
## ind2[T.7] 0.0058 0.080 0.073 0.942 -0.152 0.163
## ind2[T.8] -0.0610 0.081 -0.754 0.451 -0.220 0.098
## ind2[T.9] -0.1683 0.055 -3.086 0.002 -0.275 -0.061
## sex -0.0763 0.017 -4.521 0.000 -0.109 -0.043
## exp1 0.0087 0.001 11.758 0.000 0.007 0.010
## shs -0.5928 0.057 -10.436 0.000 -0.704 -0.481
## hsg -0.5213 0.030 -17.127 0.000 -0.581 -0.462
## scl -0.4215 0.028 -14.848 0.000 -0.477 -0.366
## clg -0.1974 0.026 -7.655 0.000 -0.248 -0.147
## mw -0.0233 0.022 -1.075 0.283 -0.066 0.019
## so -0.0428 0.021 -2.048 0.041 -0.084 -0.002
## we -5.145e-05 0.023 -0.002 0.998 -0.044 0.044
## ==============================================================================
## Omnibus: 358.629 Durbin-Watson: 1.946
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1439.044
## Skew: 0.355 Prob(JB): 0.00
## Kurtosis: 5.807 Cond. No. 543.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Then we predict the outcome for the testing sample using the parameters estimated in the training sample.
R code
# calculating the out-of-sample MSE
trainregbasic <- predict(regbasic, newdata=test)
trainregbasic
##       12       44       71       84      129      149      191      221
## 3.395851 2.734493 3.151807 2.506919 2.368192 3.173461 3.065637 3.043757
##      248      264      281      368      464      467      496      540
## 2.699975 2.398635 3.080034 2.542424 2.700733 2.973891 2.940246 2.796404
## ...
##    32561    32564    32571    32591    32599    32626
## 3.481612 2.347673 2.475821 2.982382 2.551125 3.497855
## (output truncated: 1030 fitted values in total)
= test["lwage"].values
lwage_test = sm.add_constant(test) #add constant
test
= basic_results.predict(test) # predict out of sample
lwage_pred print(lwage_pred)
## rownames
## 29749 2.454760
## 32504 2.729422
## 4239 3.374858
## 985 3.451121
## 8477 2.883054
## ...
## 27533 3.039693
## 7218 2.669400
## 7204 3.271324
## 1380 2.943550
## 10451 3.462293
## Length: 1030, dtype: float64
Finally, we test the predictions.
R code
y.test <- log(test$wage)
MSE.test1 <- sum((y.test-trainregbasic)^2)/length(y.test)
R2.test1 <- 1- MSE.test1/var(y.test)
- Test MSE for the basic model: 0.1971044.
- Test R2 for the basic model: 0.3279559.
Python code
MSE_test1 = np.sum((lwage_test-lwage_pred)**2)/len(lwage_test)
R2_test1 = 1 - MSE_test1/np.var(lwage_test)

print("Test MSE for the basic model: ", MSE_test1, " ")
## Test MSE for the basic model: 0.21963534669163987
print("Test R2 for the basic model: ", R2_test1)
## Test R2 for the basic model: 0.27498431184537286
In the basic model, \(MSE_{test}\) is quite close to \(MSE_{sample}\).
R code
# flexible model
# estimating the parameters in the training sample
regflex <- lm(flex, data=train)

# calculating the out-of-sample MSE
trainregflex <- predict(regflex, newdata=test)

y.test <- log(test$wage)
MSE.test2 <- sum((y.test-trainregflex)^2)/length(y.test)
R2.test2 <- 1- MSE.test2/var(y.test)
- Test MSE for the flexible model: 0.2064107.
- Test R2 for the flexible model: 0.2962252.
Python code
# Flexible model
# estimating the parameters in the training sample
flex_results = smf.ols(flex , data=train).fit()

# calculating the out-of-sample MSE
lwage_flex_pred = flex_results.predict(test) # predict out of sample
lwage_test = test["lwage"].values

MSE_test2 = np.sum((lwage_test-lwage_flex_pred)**2)/len(lwage_test)
R2_test2 = 1 - MSE_test2/np.var(lwage_test)
print("Test MSE for the flexible model: ", MSE_test2, " ")
## Test MSE for the flexible model: 0.2332944574254981
print("Test R2 for the flexible model: ", R2_test2)
## Test R2 for the flexible model: 0.22989562408423558
In the flexible model, the discrepancy between \(MSE_{test}\) and \(MSE_{sample}\) is not large.
It is worth noticing that \(MSE_{test}\) varies across different data splits. Hence, it is a good idea to average the out-of-sample \(MSE\) over several data splits to obtain more reliable results, as sketched below.
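A minimal sketch of such averaging via \(K\)-fold cross-validation, assuming a numeric design matrix `X` and outcome `y` are available as numpy arrays (the names `X`, `y`, and `cv_mse` are illustrative, not from the text):

# average the out-of-sample MSE over K folds (K-fold cross-validation)
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def cv_mse(X, y, n_splits=5, seed=0):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mses = []
    for train_idx, test_idx in kf.split(X):
        # fit on the training folds, evaluate on the held-out fold
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        mses.append(mean_squared_error(y[test_idx], fit.predict(X[test_idx])))
    return np.mean(mses)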
Nevertheless, we observe that, based on the out-of-sample \(MSE\), the basic model estimated by OLS performs about as well as (or slightly better than) the flexible model.
Next, let us use lasso regression in the flexible model instead of OLS regression. Lasso (least absolute shrinkage and selection operator) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors \(p\) is relatively large in relation to \(n\).
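Concretely, lasso minimizes the least-squares criterion plus an \(\ell_1\) penalty on the coefficients, which shrinks some of them exactly to zero:

\[
\hat{\beta}_{lasso} = \arg\min_{\beta}\; \frac{1}{n}\sum_{i=1}^{n}\big(Y_i - X_i'\beta\big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|,
\]

where the penalty level \(\lambda\) governs the degree of shrinkage.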
Note that the out-of-sample \(MSE\) on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to OLS regression.
R code
# flexible model using lasso
# estimating the parameters
reglasso <- rlasso(flex, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglasso <- predict(reglasso, newdata=test)
MSE.lasso <- sum((y.test-trainreglasso)^2)/length(y.test)
R2.lasso <- 1- MSE.lasso/var(y.test)
- Test \(MSE\) for the lasso flexible model: 0.212698.
- Test \(R^2\) for the lasso flexible model: 0.2747882.
Python code
# flexible model using lasso
# get exogenous variables from training data used in flex model
flex_results_0 = smf.ols(flex , data=train)
X_train = flex_results_0.exog
print(X_train.shape)
## (4120, 246)
# Get endogenous variable
lwage_train = train["lwage"]
print(lwage_train.shape)
## (4120,)
# calculating the out-of-sample MSE
# note: the lasso must be fit on the design matrix X_train, not on the raw
# data frame; we build the test design matrix from the same formula
# (this assumes every factor level in test also appears in train)
alpha = 0.1
reg = linear_model.Lasso(alpha=alpha)
reg.fit(X_train, lwage_train)

X_test = smf.ols(flex, data=test).exog
lwage_lasso_fitted = reg.predict(X_test)

MSE_lasso = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso = 1 - MSE_lasso/np.var(lwage_test)

print("Test MSE for the lasso flexible model: ", MSE_lasso, " ")
print("Test R2 for the lasso flexible model: ", R2_lasso)
Finally, let us summarize the results:
R code
Models <- c("Basic regression","Flexible regression","Lasso regression")
R_2_TEST <- c(R2.test1,R2.test2,R2.lasso)
MSE_TEST <- c(MSE.test1,MSE.test2,MSE.lasso)

data.frame(Models,R_2_TEST,MSE_TEST) %>%
  kable("markdown",caption = "Out-of-Sample Performance")
Python code
# Package for latex table
# import array_to_latex as a2l

# summary table of test-sample performance
table2 = np.zeros((3, 2))
table2[0,0] = MSE_test1
table2[1,0] = MSE_test2
table2[2,0] = MSE_lasso
table2[0,1] = R2_test1
table2[1,1] = R2_test2
table2[2,1] = R2_lasso

table2 = pd.DataFrame(table2, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table2
# print(table2.to_latex()) # optional LaTeX export