Chapter 16 Inference on Predictive and Causal Effects in High-Dimensional Nonlinear Models
16.1 Impact of 401(k) on Financial Wealth
As a practical illustration of the methods developed in this lecture, we consider estimation of the effect of 401(k) eligibility and participation on accumulated net financial assets.
One can argue that eligibility for enrolling in a 401(k) plan in these data can be taken as exogenous after conditioning on a few observables, the most important of which is income. The basic idea is that, at least around the time 401(k)s initially became available, people were unlikely to base their employment decisions on whether an employer offered a 401(k), focusing instead on income and other aspects of the job.
16.1.1 Data
The data set can be loaded from the hdm package for R and hdmpy for Python by typing
library(hdm)
library(ggplot2)
data(pension)
data <- pension
dim(data)
## [1] 9915 44
import numpy as np
import pandas as pd
from doubleml.datasets import fetch_401K
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import seaborn as sns
import hdmpy
import pyreadr
import urllib.request
import os
import warnings
"ignore")
warnings.filterwarnings(
= pyreadr.read_r("./data/pension.Rdata")
rdata_read # Extracting the data frame from rdata_read
= rdata_read[ 'pension' ]
data = data.copy()
pension data.shape
## (9915, 44)
See the “Details” section of the data set description, which can be accessed by
help(pension)
"start \"\" https://search.r-project.org/CRAN/refmans/hdm/html/pension.html") os.system(
## 0
The data consist of 9,915 observations at the household level drawn from the 1991 Survey of Income and Program Participation (SIPP). All variables refer to 1990. We use net financial assets (net_tfa) as the outcome variable, \(Y\), in our analysis. Net financial assets are computed as the sum of IRA balances, 401(k) balances, checking accounts, savings bonds, other interest-earning accounts, other interest-earning assets, stocks, and mutual funds, less non-mortgage debt.
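To fix ideas, the outcome is simply an assets-minus-debt aggregate. The following minimal sketch, using made-up column names and numbers rather than the actual SIPP variables, illustrates how such a net financial assets measure is put together.

# Hypothetical illustration of how an outcome like net_tfa is assembled;
# column names and values below are made up, not the SIPP variables.
import pandas as pd

components = pd.DataFrame({
    "ira_bal": [2000, 0], "k401_bal": [5000, 0], "checking": [300, 150],
    "savings_bonds": [0, 0], "other_interest_assets": [1000, 200],
    "stocks_mutual_funds": [0, 500], "non_mortgage_debt": [1200, 400],
})
net_fin_assets = (components.drop(columns="non_mortgage_debt").sum(axis=1)
                  - components["non_mortgage_debt"])
print(net_fin_assets)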
Among the \(9915\) individuals, \(3682\) are eligible to participate in the program. The variable e401 indicates eligibility and p401 indicates participation.
hist_e401 = ggplot(data, aes(x = e401, fill = factor(e401))) +
  geom_bar()
hist_e401

sns.set()
colors = sns.color_palette()
sns.countplot(x="e401", hue="e401", data=data)
Eligibility is highly associated with financial wealth:
dens_net_tfa = ggplot(data, aes(x = net_tfa, color = factor(e401), fill = factor(e401))) +
  geom_density() + xlim(c(-20000, 150000)) +
  facet_wrap(.~e401)
dens_net_tfa

from matplotlib.pyplot import figure
figure(figsize=(8, 10), dpi=100)
sns.displot(data, x="net_tfa", hue="e401", col="e401",
            kind="kde", fill=True)
The unconditional APE of e401 is about \(19559\):
e1 <- data[data$e401==1,]
e0 <- data[data$e401==0,]
round(mean(e1$net_tfa)-mean(e0$net_tfa),0)
## [1] 19559
int(np.round( data[['e401', 'net_tfa']].groupby('e401').mean().diff().iloc[1, 0] ) )
## 19559
Among the \(3682\) individuals that are eligible, \(2594\) decided to participate in the program. The unconditional APE of p401 is about \(27372\):
p1 <- data[data$p401==1,]
p0 <- data[data$p401==0,]
round(mean(p1$net_tfa)-mean(p0$net_tfa),0)
## [1] 27372
int(np.round( data[['p401', 'net_tfa']].groupby('p401').mean().diff().iloc[1, 0]))
## 27372
As discussed, these estimates are biased since they do not account for saver heterogeneity and endogeneity of participation.
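A quick way to see the problem is to compare covariate means across eligibility groups; for example, eligible households have higher income on average, so the raw difference in means mixes the effect of eligibility with such compositional differences. A minimal check, assuming the data frame loaded above:

# Illustrative balance check: eligible and ineligible households differ
# systematically in observables such as income, age, and education.
print(data.groupby('e401')[['inc', 'age', 'educ']].mean())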
16.2 Double ML package
We are interested in valid estimators of the average treatment effect of e401 and p401 on net_tfa. To get those estimators, we use the DoubleML package, which internally builds on mlr3. You can find additional information on the package website https://docs.doubleml.org/ and the R documentation page https://docs.doubleml.org/r/stable/.
# installing Double ML
#remotes::install_github("DoubleML/doubleml-for-r",quiet=TRUE)
# loading the packages
library(DoubleML)
library(mlr3learners)
library(mlr3)
library(data.table)
library(randomForest)
import doubleml as dml
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier, XGBRegressor
As mentioned, in this tutorial we use the meta package mlr3 to generate predictions with machine learning methods. A comprehensive introduction to and description of the mlr3 package is provided in the mlr3book. A list of all learners that you can use in mlr3 can be found here. The entries in the columns mlr3 Package and Packages indicate which packages must be installed/loaded in your R session.
16.3 Estimating the ATE of 401(k) Eligibility on Net Financial Assets
We first look at the treatment effect of e401 on net total financial assets. We give estimates of the ATE and ATT that correspond to the linear model
\[\begin{equation*} Y = D \alpha + f(X)'\beta+ \epsilon, \end{equation*}\]
where \(f(X)\) includes indicators of marital status, two-earner status, defined benefit pension status, IRA participation status, and home ownership status, and orthogonal polynomials of degrees 2, 4, 6 and 8 in family size, education, age and income, respectively. The dimension of \(f(X)\) is 25.
In the first step, we report estimates of the average treatment effect (ATE) of 401(k) eligibility on net financial assets both in the partially linear regression (PLR) model and in the interactive regression model (IRM) allowing for heterogeneous treatment effects.
We lower the degree of the polynomial in the income variable from 8 to 6 because sklearn has problems with feature values that exceed the float32 range. We also cannot apply np.log to the inc variable, since it contains non-positive values; doing so would produce a column with missing (None) values. For more information about this problem, check this link.
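A couple of quick sanity checks behind this choice (illustrative only; they assume the data frame loaded above):

# inc contains non-positive values, for which the log is not defined
print((data['inc'] <= 0).sum())
print((~np.isfinite(np.log(data['inc']))).sum())  # entries np.log would turn into NaN/-inf
# a degree-8 raw polynomial in income exceeds the float32 range
print((data['inc'].astype(float) ** 8).max() > np.finfo(np.float32).max)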
# Constructing the data (as DoubleMLData)
= "net_tfa ~ e401 + poly(age, 6, raw=TRUE) + poly(inc, 6, raw=TRUE) + poly(educ, 4, raw=TRUE) + poly(fsize, 2, raw=TRUE) + marr + twoearn + db + pira + hown"
formula_flex
= as.data.table(model.frame(formula_flex, pension))
model_flex
= colnames(model_flex)[-c(1,2)]
x_cols
= DoubleMLData$new(model_flex,
data_ml_aux y_col = "net_tfa",
d_cols = "e401",
x_cols=x_cols)
<- dim(model_flex)[2]-2
p
p
## [1] 23
# complex model with two-way interactions
#data_interactions = fetch_401k(polynomial_features = TRUE, instrument = FALSE)
# Constructing the data (as DoubleMLData)
features = data.copy()[['marr', 'twoearn', 'db', 'pira', 'hown']]

poly_dict = {'age': 6,
             'inc': 6,
             'educ': 4,
             'fsize': 2}
for key, degree in poly_dict.items():
    poly = PolynomialFeatures(degree, include_bias=False)
    data_transf = poly.fit_transform(data[[key]])
    x_cols = poly.get_feature_names([key])
    data_transf = pd.DataFrame(data_transf, columns=x_cols)

    features = pd.concat((features, data_transf),
                         axis=1, sort=False)

model_flex = pd.concat((data.copy()[['net_tfa', 'e401']], features.copy()),
                       axis=1, sort=False)

x_cols = model_flex.columns.to_list()[2:]

# Initialize DoubleMLData (data-backend of DoubleML)
data_ml_aux = dml.DoubleMLData(model_flex, y_col='net_tfa',
                               d_cols='e401', x_cols=x_cols)

# complex model with two-way interactions
# data_interactions = fetch_401K(return_type='DataFrame', polynomial_features=True)

p = model_flex.shape[1] - 2
print(p)
## 23
16.4 Partially Linear Regression Models (PLR)
We start by using lasso to estimate the functions \(g_0\) and \(m_0\) in the following PLR model:
\[\begin{eqnarray} & Y = D\theta_0 + g_0(X) + \zeta, & E[\zeta \mid D,X]= 0,\\ & D = m_0(X) + V, & E[V \mid X] = 0. \end{eqnarray}\]
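Before turning to the package, it may help to see what the orthogonalized estimator does under the hood. The following is a minimal partialling-out sketch, not the DoubleML implementation: it cross-fits \(\hat{\ell}(X)=E[Y \mid X]\) and \(\hat{m}(X)=E[D \mid X]\) and then runs a residual-on-residual regression. It assumes the Python objects model_flex and x_cols constructed above.

# Minimal "by hand" partialling-out sketch (illustrative, not the DoubleML code path)
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = model_flex[x_cols].to_numpy()
y = model_flex['net_tfa'].to_numpy()
d = model_flex['e401'].to_numpy()

# cross-fitted nuisance predictions E[Y|X] and E[D|X]
ell_hat = cross_val_predict(make_pipeline(StandardScaler(), LassoCV(cv=5)), X, y, cv=3)
m_hat = cross_val_predict(make_pipeline(StandardScaler(),
                                        LogisticRegressionCV(cv=5, max_iter=1000)),
                          X, d, cv=3, method='predict_proba')[:, 1]

# residual-on-residual regression yields the orthogonalized estimate of theta
res_y, res_d = y - ell_hat, d - m_hat
theta_hat = np.sum(res_d * res_y) / np.sum(res_d * res_d)
print(theta_hat)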
# Estimating the PLR
lgr::get_logger("mlr3")$set_threshold("warn")
set.seed(123)
lasso <- lrn("regr.cv_glmnet", nfolds = 5, s = "lambda.min")
lasso_class <- lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min")

dml_plr <- DoubleMLPLR$new(data_ml_aux, ml_g = lasso, ml_m = lasso_class, n_folds=3)
dml_plr$fit(store_predictions=TRUE)
dml_plr$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 8952 1269 7.057 1.7e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lasso_plr <- dml_plr$coef
lasso_std_plr <- dml_plr$se
# Estimating the PLR
# Initialize learners
Cs = 0.0001*np.logspace(0, 4, 10)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso_class = make_pipeline(StandardScaler(),
                            LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear',
                                                 Cs=Cs, max_iter=1000))

np.random.seed(123)
# Initialize DoubleMLPLR model
dml_plr = dml.DoubleMLPLR(data_ml_aux,
                          ml_g = lasso,
                          ml_m = lasso_class,
                          n_folds = 3)

dml_plr.fit(store_predictions=True)
## <doubleml.double_ml_plr.DoubleMLPLR object at 0x00000000245CEAF0>
lasso_plr = dml_plr.summary.coef[0]
lasso_std_plr = dml_plr.summary['std err'][0]
dml_plr.summary
## coef std err ... 2.5 % 97.5 %
## e401 8919.577401 1325.857662 ... 6320.944135 11518.210667
##
## [1 rows x 6 columns]
Let us check the predictive performance of this model.
dml_plr$params_names()
## [1] "ml_g" "ml_m"
g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o
m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o
g_hat = dml_plr.predictions['ml_g'].flatten() # predictions of g_o
m_hat = dml_plr.predictions['ml_m'].flatten() # predictions of m_o
print(dml_plr.params_names)
## ['ml_g', 'ml_m']
# cross-fitted RMSE: outcome
y <- as.matrix(pension$net_tfa) # true observations
theta <- as.numeric(dml_plr$coef) # estimated regression coefficient
d <- as.matrix(pension$e401)
predictions_y <- as.matrix(d*theta)+g_hat # predictions for y
lasso_y_rmse <- sqrt(mean((y-predictions_y)^2))
lasso_y_rmse
## [1] 53735.74
# cross-fitted RMSE: outcome
y = pension.net_tfa.to_numpy() # true observations
theta = dml_plr.coef[0] # estimated regression coefficient
d = pension.e401.to_numpy()
predictions_y = d*theta + g_hat # predictions for y
lasso_y_rmse = np.sqrt(np.mean((y - predictions_y) ** 2))
lasso_y_rmse
## 54342.024308477914
# cross-fitted RMSE: treatment
d <- as.matrix(pension$e401)
lasso_d_rmse <- sqrt(mean((d-m_hat)^2))
lasso_d_rmse
## [1] 0.4443213
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3167927
# cross-fitted RMSE: treatment
d = pension.e401.to_numpy()
lasso_d_rmse = np.sqrt(np.mean((d - m_hat) ** 2))
print(lasso_d_rmse)
## 0.4585164421000309
# cross-fitted ce: treatment
print(np.mean((m_hat > 0.5) * 1 != d))
## 0.31376701966717097
Then, we repeat this procedure for various machine learning methods.
# Random Forest
lgr::get_logger("mlr3")$set_threshold("warn")

randomForest <- lrn("regr.ranger")
randomForest_class <- lrn("classif.ranger")

dml_plr <- DoubleMLPLR$new(data_ml_aux,
                           ml_g = randomForest,
                           ml_m = randomForest_class,
                           n_folds=3)
dml_plr$fit(store_predictions=TRUE) # set store_predictions=TRUE to evaluate the model
forest_plr <- dml_plr$coef
forest_std_plr <- dml_plr$se
dml_plr$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 8359 1267 6.599 4.14e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Random Forest
randomForest = RandomForestRegressor(
    n_estimators = 500)
randomForest_class = RandomForestClassifier(
    n_estimators = 500)

dml_plr = dml.DoubleMLPLR(data_ml_aux, ml_g = randomForest,
                          ml_m = randomForest_class, n_folds = 3)
dml_plr.fit(store_predictions=True) # set store_predictions=True to evaluate the model
## <doubleml.double_ml_plr.DoubleMLPLR object at 0x0000000064929BE0>
forest_plr = dml_plr.coef
forest_std_plr = dml_plr.summary['std err']

dml_plr.summary
## coef std err ... 2.5 % 97.5 %
## e401 8702.588054 1310.423641 ... 6134.204913 11270.971196
##
## [1 rows x 6 columns]
We can compare the accuracy of this model to the model that has been estimated with lasso.
# Evaluation predictions
g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o
m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o
theta <- as.numeric(dml_plr$coef) # estimated regression coefficient
predictions_y <- as.matrix(d*theta)+g_hat # predictions for y
forest_y_rmse <- sqrt(mean((y-predictions_y)^2))

# cross-fitted RMSE: treatment
forest_d_rmse <- sqrt(mean((d-m_hat)^2))
#### Results
forest_y_rmse
## [1] 56078.76
forest_d_rmse
## [1] 0.4558187
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3324256
# Evaluation predictions
g_hat = dml_plr.predictions['ml_g'].flatten() # predictions of g_o
m_hat = dml_plr.predictions['ml_m'].flatten() # predictions of m_o
y = pension.net_tfa.to_numpy()
theta = dml_plr.coef[0] # estimated regression coefficient
predictions_y = d*theta + g_hat # predictions for y
forest_y_rmse = np.sqrt(np.mean((y - predictions_y) ** 2))

# cross-fitted RMSE: treatment
forest_d_rmse = np.sqrt(np.mean((d - m_hat) ** 2))

#### Results
print(forest_y_rmse)
## 58456.527347918534
print(forest_d_rmse)
## 0.4659644938522823
# cross-fitted ce: treatment
np.mean((m_hat > 0.5) * 1 != d)
## 0.3495713565305093
# Trees
lgr::get_logger("mlr3")$set_threshold("warn")

trees <- lrn("regr.rpart")
trees_class <- lrn("classif.rpart")

dml_plr <- DoubleMLPLR$new(data_ml_aux, ml_g = trees, ml_m = trees_class, n_folds=3)
dml_plr$fit(store_predictions=TRUE)
dml_plr$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 8835 1317 6.708 1.98e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tree_plr <- dml_plr$coef
tree_std_plr <- dml_plr$se

# Evaluation predictions
g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o
m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o
theta <- as.numeric(dml_plr$coef) # estimated regression coefficient
predictions_y <- as.matrix(d*theta)+g_hat # predictions for y
tree_y_rmse <- sqrt(mean((y-predictions_y)^2))
tree_y_rmse
## [1] 56708.52
# cross-fitted RMSE: treatment
tree_d_rmse <- sqrt(mean((d-m_hat)^2))
tree_d_rmse
## [1] 0.4553888
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3136662
# Trees
trees = DecisionTreeRegressor(
    max_depth=30, ccp_alpha=0.01, min_samples_split=20,
    min_samples_leaf=np.round(20/3).astype(int))

trees_class = DecisionTreeClassifier(
    max_depth=30, ccp_alpha=0.01,
    min_samples_split=20,
    min_samples_leaf=np.round(20/3).astype(int))

np.random.seed(123)
dml_plr = dml.DoubleMLPLR(data_ml_aux,
                          ml_g = trees,
                          ml_m = trees_class,
                          n_folds = 3)
dml_plr.fit(store_predictions=True)
## <doubleml.double_ml_plr.DoubleMLPLR object at 0x00000000C4DD61F0>
tree_summary = dml_plr.summary
print(tree_summary)
## coef std err ... 2.5 % 97.5 %
## e401 8594.661891 1383.808638 ... 5882.446799 11306.876983
##
## [1 rows x 6 columns]
dml_plr.fit(store_predictions=True)
## <doubleml.double_ml_plr.DoubleMLPLR object at 0x00000000C4DD61F0>
dml_plr.summary
## coef std err ... 2.5 % 97.5 %
## e401 8601.584926 1383.825475 ... 5889.336834 11313.833019
##
## [1 rows x 6 columns]
tree_plr = dml_plr.coef
tree_std_plr = dml_plr.summary['std err']

# Evaluation predictions
g_hat = dml_plr.predictions['ml_g'].flatten() # predictions of g_o
m_hat = dml_plr.predictions['ml_m'].flatten() # predictions of m_o

y = pension.net_tfa.to_numpy()
theta = dml_plr.coef[0] # estimated regression coefficient
predictions_y = d*theta + g_hat # predictions for y
tree_y_rmse = np.sqrt(np.mean((y - predictions_y) ** 2))
print(tree_y_rmse)
## 58017.721042321384
# cross-fitted RMSE: treatment
tree_d_rmse = np.sqrt(np.mean((d - m_hat) ** 2))
print(tree_d_rmse)
## 0.4546690531536573
# cross-fitted ce: treatment
np.mean((m_hat > 0.5) * 1 != d)
## 0.35269793242561776
# Boosting
lgr::get_logger("mlr3")$set_threshold("warn")
boost <- lrn("regr.xgboost", objective = "reg:squarederror")
boost_class <- lrn("classif.xgboost", objective = "binary:logistic", eval_metric = "logloss")

dml_plr <- DoubleMLPLR$new(data_ml_aux, ml_g = boost, ml_m = boost_class, n_folds=3)
dml_plr$fit(store_predictions=TRUE)
dml_plr$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 8867 1342 6.606 3.95e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boost_plr <- dml_plr$coef
boost_std_plr <- dml_plr$se

# Evaluation predictions
g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o
m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o
theta <- as.numeric(dml_plr$coef) # estimated regression coefficient
predictions_y <- as.matrix(d*theta)+g_hat # predictions for y
boost_y_rmse <- sqrt(mean((y-predictions_y)^2))
boost_y_rmse
## [1] 59816.19
# cross-fitted RMSE: treatment
boost_d_rmse <- sqrt(mean((d-m_hat)^2))
boost_d_rmse
## [1] 0.4726804
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3237519
# Boosted Trees
boost = XGBRegressor(n_jobs=1, objective = "reg:squarederror")
boost_class = XGBClassifier(use_label_encoder=False, n_jobs=1,
                            objective = "binary:logistic",
                            eval_metric = "logloss")

np.random.seed(123)
dml_plr = dml.DoubleMLPLR(data_ml_aux,
                          ml_g = boost,
                          ml_m = boost_class,
                          n_folds = 3)
dml_plr.fit(store_predictions=True)
## <doubleml.double_ml_plr.DoubleMLPLR object at 0x00000000C4DC2160>
boost_summary = dml_plr.summary
print(boost_summary)
## coef std err ... 2.5 % 97.5 %
## e401 9129.263678 1471.781415 ... 6244.625111 12013.902244
##
## [1 rows x 6 columns]
boost_plr = dml_plr.coef
boost_std_plr = dml_plr.summary['std err']

# Evaluation predictions
g_hat = dml_plr.predictions['ml_g'].flatten() # predictions of g_o
m_hat = dml_plr.predictions['ml_m'].flatten() # predictions of m_o
y = pension.net_tfa.to_numpy()
theta = dml_plr.coef[0] # estimated regression coefficient
predictions_y = d*theta + g_hat # predictions for y

boost_y_rmse = np.sqrt(np.mean((y - predictions_y) ** 2))
print(boost_y_rmse)
## 62126.68237198471
# cross-fitted RMSE: treatment
boost_d_rmse = np.sqrt(np.mean((d - m_hat) ** 2))
print(boost_d_rmse)
## 0.4629817464588606
# cross-fitted ce: treatment
np.mean((m_hat > 0.5) * 1 != d)
## 0.33605648008068584
Let’s sum up the results:
library(xtable)
table <- matrix(0, 4, 4)
table[1,1:4] <- c(lasso_plr,forest_plr,tree_plr,boost_plr)
table[2,1:4] <- c(lasso_std_plr,forest_std_plr,tree_std_plr,boost_std_plr)
table[3,1:4] <- c(lasso_y_rmse,forest_y_rmse,tree_y_rmse,boost_y_rmse)
table[4,1:4] <- c(lasso_d_rmse,forest_d_rmse,tree_d_rmse,boost_d_rmse)
rownames(table) <- c("Estimate","Std.Error","RMSE Y","RMSE D")
colnames(table) <- c("Lasso","Random Forest","Trees","Boosting")
tab <- xtable(table, digits = 2)
tab
## % latex table generated in R 4.0.4 by xtable 1.8-4 package
## % Wed Nov 24 15:05:32 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrr}
## \hline
## & Lasso & Random Forest & Trees & Boosting \\
## \hline
## Estimate & 8952.25 & 8359.37 & 8834.64 & 8867.42 \\
## Std.Error & 1268.61 & 1266.75 & 1317.08 & 1342.33 \\
## RMSE Y & 53735.74 & 56078.76 & 56708.52 & 59816.19 \\
## RMSE D & 0.44 & 0.46 & 0.46 & 0.47 \\
## \hline
## \end{tabular}
## \end{table}
table = np.zeros((4, 4))

table[0,0:4] = lasso_plr,forest_plr[0],tree_plr[0],boost_plr[0]
table[1,0:4] = lasso_std_plr,forest_std_plr,tree_std_plr,boost_std_plr
table[2,0:4] = lasso_y_rmse,forest_y_rmse,tree_y_rmse,boost_y_rmse
table[3,0:4] = lasso_d_rmse,forest_d_rmse,tree_d_rmse,boost_d_rmse
table_pd = pd.DataFrame(table, index = ["Estimate","Std.Error","RMSE Y","RMSE D"],
                        columns = ["Lasso","Random Forest","Trees","Boosting"])

table_pd.to_latex()
## '\\begin{tabular}{lrrrr}\n\\toprule\n{} & Lasso & Random Forest & Trees & Boosting \\\\\n\\midrule\nEstimate & 8919.577401 & 8702.588054 & 8601.584926 & 9129.263678 \\\\\nStd.Error & 1325.857662 & 1310.423641 & 1383.825475 & 1471.781415 \\\\\nRMSE Y & 54342.024308 & 58456.527348 & 58017.721042 & 62126.682372 \\\\\nRMSE D & 0.458516 & 0.465964 & 0.454669 & 0.462982 \\\\\n\\bottomrule\n\\end{tabular}\n'
The model with the lowest RMSE in both equations is the PLR model estimated via lasso. It gives the following estimate:
lasso_plr
## e401
## 8952.253
lasso_plr
## 8919.577401231152
16.5 Interactive Regression Model (IRM)
Next, we consider estimation of average treatment effects when treatment effects are fully heterogeneous:
\[\begin{eqnarray}\label{eq: HetPL1} & Y = g_0(D, X) + U, & \quad E[U \mid X, D]= 0,\\ & D = m_0(X) + V, & \quad E[V\mid X] = 0. \end{eqnarray}\]
To reduce the disproportionate impact of extreme propensity score weights in the interactive model we trim the propensity scores which are close to the bounds.
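The estimate is based on the doubly robust (AIPW) orthogonal score for the ATE. The sketch below spells this score out, including the propensity-score trimming; it is illustrative only, not the DoubleML implementation, and assumes cross-fitted predictions g0_hat, g1_hat, m_hat and arrays y, d of the kind computed below.

# Illustrative AIPW/orthogonal score for the ATE with propensity trimming
import numpy as np

def ate_aipw(y, d, g0_hat, g1_hat, m_hat, trimming_threshold=0.01):
    # clip propensity scores away from 0 and 1 to limit extreme weights
    m = np.clip(m_hat, trimming_threshold, 1 - trimming_threshold)
    psi = (g1_hat - g0_hat
           + d * (y - g1_hat) / m
           - (1 - d) * (y - g0_hat) / (1 - m))
    ate = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))  # large-sample standard error of the mean
    return ate, se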
set.seed(123)
lgr::get_logger("mlr3")$set_threshold("warn")

dml_irm = DoubleMLIRM$new(data_ml_aux,
                          ml_g = lasso,
                          ml_m = lasso_class,
                          trimming_threshold = 0.01,
                          n_folds=3)
dml_irm$fit(store_predictions=TRUE)

dml_irm$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 7941 1106 7.177 7.15e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lasso_irm <- dml_irm$coef
lasso_std_irm <- dml_irm$se

# predictions
dml_irm$params_names()
## [1] "ml_g0" "ml_g1" "ml_m"
g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X)
g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X)
g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0
m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o
# Lasso
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=20000))

# Initialize DoubleMLIRM model
np.random.seed(123)
dml_irm = dml.DoubleMLIRM(data_ml_aux, ml_g = lasso, ml_m = lasso_class, trimming_threshold = 0.01, n_folds = 3)

dml_irm.fit(store_predictions=True)
## <doubleml.double_ml_irm.DoubleMLIRM object at 0x00000000C4DCA070>
lasso_summary = dml_irm.summary
print(dml_irm.summary)
## coef std err ... 2.5 % 97.5 %
## e401 8089.763322 1270.047728 ... 5600.515516 10579.011127
##
## [1 rows x 6 columns]
lasso_irm = dml_irm.coef[0]
lasso_std_irm = dml_irm.se[0]

# predictions
print(dml_irm.params_names)
## ['ml_g0', 'ml_g1', 'ml_m']
g0_hat = dml_irm.predictions['ml_g0'].flatten() # predictions of g_0(D=0, X)
g1_hat = dml_irm.predictions['ml_g1'].flatten() # predictions of g_0(D=1, X)
g_hat = d * g1_hat + (1 - d)*g0_hat # predictions of g_0
m_hat = dml_irm.predictions['ml_m'].flatten() # predictions of m_o
# cross-fitted RMSE: outcome
y <- as.matrix(pension$net_tfa) # true observations
d <- as.matrix(pension$e401)
lasso_y_irm <- sqrt(mean((y-g_hat)^2))
lasso_y_irm
## [1] 55439.51
# cross-fitted RMSE: treatment
lasso_d_irm <- sqrt(mean((d-m_hat)^2))
lasso_d_irm
## [1] 0.4440579
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3153807
# cross-fitted RMSE: outcome
y = pension.net_tfa.to_numpy() # true observations
d = pension.e401.to_numpy()
lasso_y_irm = np.sqrt(np.mean((y - g_hat) ** 2))
print(lasso_y_irm)
## 54497.27607070185
# cross-fitted RMSE: treatment
lasso_d_irm = np.sqrt(np.mean((d - m_hat) ** 2))
print(lasso_d_irm)
## 0.45851619758764794
# cross-fitted ce: treatment
np.mean((m_hat > 0.5) * 1 != d)
## 0.31376701966717097
##### forest #####
dml_irm = DoubleMLIRM$new(data_ml_aux,
                          ml_g = randomForest,
                          ml_m = randomForest_class,
                          trimming_threshold = 0.01,
                          n_folds=3)
dml_irm$fit(store_predictions=TRUE)

# Summary
dml_irm$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 7809 1311 5.955 2.6e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
forest_irm <- dml_irm$coef
forest_std_irm <- dml_irm$se

# predictions
g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X)
g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X)
g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0
m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o

# cross-fitted RMSE: outcome
y <- as.matrix(pension$net_tfa) # true observations
d <- as.matrix(pension$e401)
forest_y_irm <- sqrt(mean((y-g_hat)^2))
forest_y_irm
## [1] 55502.95
# cross-fitted RMSE: treatment
forest_d_irm <- sqrt(mean((d-m_hat)^2))
forest_d_irm
## [1] 0.4559869
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3290973
##### trees #####
dml_irm <- DoubleMLIRM$new(data_ml_aux,
                           ml_g = trees,
                           ml_m = trees_class,
                           trimming_threshold = 0.01,
                           n_folds=3)
dml_irm$fit(store_predictions=TRUE)

dml_irm$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 7477 1187 6.3 2.97e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tree_irm <- dml_irm$coef
tree_std_irm <- dml_irm$se

# predictions
g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X)
g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X)
g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0
m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o

# cross-fitted RMSE: outcome
y <- as.matrix(pension$net_tfa) # true observations
d <- as.matrix(pension$e401)
tree_y_irm <- sqrt(mean((y-g_hat)^2))
tree_y_irm
## [1] 57305.26
# cross-fitted RMSE: treatment
tree_d_irm <- sqrt(mean((d-m_hat)^2))
tree_d_irm
## [1] 0.455894
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3135653
##### boosting #####
dml_irm <- DoubleMLIRM$new(data_ml_aux,
                           ml_g = boost,
                           ml_m = boost_class,
                           trimming_threshold = 0.01,
                           n_folds=3)
dml_irm$fit(store_predictions=TRUE)

dml_irm$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 8353 1228 6.803 1.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boost_irm <- dml_irm$coef
boost_std_irm <- dml_irm$se

# predictions
g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X)
g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X)
g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0
m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o

# cross-fitted RMSE: outcome
y <- as.matrix(pension$net_tfa) # true observations
d <- as.matrix(pension$e401)
boost_y_irm <- sqrt(mean((y-g_hat)^2))
boost_y_irm
## [1] 61055.91
# cross-fitted RMSE: treatment
boost_d_irm <- sqrt(mean((d-m_hat)^2))
boost_d_irm
## [1] 0.4732346
# cross-fitted ce: treatment
mean(ifelse(m_hat > 0.5, 1, 0) != d)
## [1] 0.3234493
##### forest #####
randomForest = RandomForestRegressor(n_estimators=500)
randomForest_class = RandomForestClassifier(n_estimators=500)

np.random.seed(123)
dml_irm = dml.DoubleMLIRM(data_ml_aux,
                          ml_g = randomForest,
                          ml_m = randomForest_class,
                          trimming_threshold = 0.01,
                          n_folds = 3)
dml_irm.fit(store_predictions = True)
## <doubleml.double_ml_irm.DoubleMLIRM object at 0x0000000064929A30>
print( dml_irm.summary )
## coef std err ... 2.5 % 97.5 %
## e401 8429.821206 1521.201806 ... 5448.320453 11411.32196
##
## [1 rows x 6 columns]
forest_irm = dml_irm.coef[0]
forest_std_irm = dml_irm.se[0]

# predictions
g0_hat = dml_irm.predictions['ml_g0'].flatten() # predictions of g_0(D=0, X)
g1_hat = dml_irm.predictions['ml_g1'].flatten() # predictions of g_0(D=1, X)
g_hat = d * g1_hat + (1 - d)*g0_hat # predictions of g_0
m_hat = dml_irm.predictions['ml_m'].flatten() # predictions of m_o

# cross-fitted RMSE: outcome
y = pension.net_tfa.to_numpy()
d = pension.e401.to_numpy()
forest_y_irm = np.sqrt(np.mean((y - g_hat) ** 2))
print(forest_y_irm)
## 56839.92174410771
# cross-fitted RMSE: treatment
forest_d_irm = np.sqrt(np.mean((d - m_hat) ** 2))
print(forest_d_irm)
## 0.46312674664653625
# cross-fitted ce: treatment
np.mean((m_hat > 0.5) * 1 != d)
## 0.33767019667170955

##### trees #####
trees = DecisionTreeRegressor(max_depth=30, ccp_alpha=0.01, min_samples_split=20, min_samples_leaf=67)
trees_class = DecisionTreeClassifier(max_depth=30, ccp_alpha=0.01, min_samples_split=20, min_samples_leaf=34)

np.random.seed(123)
dml_irm = dml.DoubleMLIRM(data_ml_aux,
                          ml_g = trees,
                          ml_m = trees_class,
                          trimming_threshold = 0.01,
                          n_folds = 3)

dml_irm.fit(store_predictions = True)
## <doubleml.double_ml_irm.DoubleMLIRM object at 0x00000000C4DD63D0>
print(dml_irm.summary)
## coef std err ... 2.5 % 97.5 %
## e401 7462.262513 1171.316712 ... 5166.523942 9758.001083
##
## [1 rows x 6 columns]
tree_irm = dml_irm.coef[0]
tree_std_irm = dml_irm.se[0]

# predictions
g0_hat = dml_irm.predictions['ml_g0'].flatten() # predictions of g_0(D=0, X)
g1_hat = dml_irm.predictions['ml_g1'].flatten() # predictions of g_0(D=1, X)
g_hat = d * g1_hat + (1 - d)*g0_hat # predictions of g_0
m_hat = dml_irm.predictions['ml_m'].flatten() # predictions of m_o

# cross-fitted RMSE: outcome
y = pension.net_tfa.to_numpy()
d = pension.e401.to_numpy()
tree_y_irm = np.sqrt(np.mean((y - g_hat) ** 2))
print(tree_y_irm)
## 56517.04167292204
# cross-fitted RMSE: treatment
tree_d_irm = np.sqrt(np.mean((d - m_hat) ** 2))
print(tree_d_irm)
## 0.4546690531536573
# cross-fitted ce: treatment
np.mean((m_hat > 0.5) * 1 != d)
## 0.35269793242561776

##### boosting #####
boost = XGBRegressor(n_jobs=1, objective = "reg:squarederror")
boost_class = XGBClassifier(use_label_encoder=False, n_jobs=1,
                            objective = "binary:logistic",
                            eval_metric = "logloss")

np.random.seed(123)
dml_irm = dml.DoubleMLIRM(data_ml_aux, ml_g = boost, ml_m = boost_class,
                          trimming_threshold = 0.01, n_folds = 3)
dml_irm.fit(store_predictions = True)
## <doubleml.double_ml_irm.DoubleMLIRM object at 0x00000000C4DCA070>
print( dml_irm.summary )
## coef std err t P>|t| 2.5 % 97.5 %
## e401 9756.926817 2500.320649 3.90227 0.000095 4856.388395 14657.46524
boost_irm = dml_irm.coef[0]
boost_std_irm = dml_irm.se[0]

# predictions
g0_hat = dml_irm.predictions['ml_g0'].flatten() # predictions of g_0(D=0, X)
g1_hat = dml_irm.predictions['ml_g1'].flatten() # predictions of g_0(D=1, X)
g_hat = d * g1_hat + (1 - d)*g0_hat # predictions of g_0
m_hat = dml_irm.predictions['ml_m'].flatten() # predictions of m_o

# cross-fitted RMSE: outcome
y = pension.net_tfa.to_numpy()
d = pension.e401.to_numpy()
boost_y_irm = np.sqrt(np.mean((y - g_hat) ** 2))
print(boost_y_irm)
## 61594.80967416296
# cross-fitted RMSE: treatment
boost_d_irm = np.sqrt(np.mean((d - m_hat) ** 2))
print(boost_d_irm)
## 0.46296289075787445
# cross-fitted ce: treatment
np.mean((m_hat > 0.5) * 1 != d)
## 0.33605648008068584
library(xtable)
table <- matrix(0, 4, 4)
table[1,1:4] <- c(lasso_irm,forest_irm,tree_irm,boost_irm)
table[2,1:4] <- c(lasso_std_irm,forest_std_irm,tree_std_irm,boost_std_irm)
table[3,1:4] <- c(lasso_y_irm,forest_y_irm,tree_y_irm,boost_y_irm)
table[4,1:4] <- c(lasso_d_irm,forest_d_irm,tree_d_irm,boost_d_irm)
rownames(table) <- c("Estimate","Std.Error","RMSE Y","RMSE D")
colnames(table) <- c("Lasso","Random Forest","Trees","Boosting")
tab <- xtable(table, digits = 2)
tab
## % latex table generated in R 4.0.4 by xtable 1.8-4 package
## % Wed Nov 24 15:08:21 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrr}
## \hline
## & Lasso & Random Forest & Trees & Boosting \\
## \hline
## Estimate & 7941.05 & 7808.95 & 7477.07 & 8353.33 \\
## Std.Error & 1106.52 & 1342.33 & 1186.75 & 1227.92 \\
## RMSE Y & 55439.51 & 55502.95 & 57305.26 & 61055.91 \\
## RMSE D & 0.44 & 0.46 & 0.46 & 0.47 \\
## \hline
## \end{tabular}
## \end{table}
table2 = np.zeros((4, 4))
table2[0,0:4] = lasso_irm,forest_irm,tree_irm,boost_irm
table2[1,0:4] = lasso_std_irm,forest_std_irm,tree_std_irm,boost_std_irm
table2[2,0:4] = lasso_y_irm,forest_y_irm,tree_y_irm,boost_y_irm
table2[3,0:4] = lasso_d_irm,forest_d_irm,tree_d_irm,boost_d_irm
table2_pd = pd.DataFrame(table2,
                         index = ["Estimate","Std.Error","RMSE Y","RMSE D"],
                         columns = ["Lasso","Random Forest","Trees","Boosting"])
table2_pd.to_latex()
## '\\begin{tabular}{lrrrr}\n\\toprule\n{} & Lasso & Random Forest & Trees & Boosting \\\\\n\\midrule\nEstimate & 8089.763322 & 8429.821206 & 7462.262513 & 9756.926817 \\\\\nStd.Error & 1270.047728 & 1521.201806 & 1171.316712 & 2500.320649 \\\\\nRMSE Y & 54497.276071 & 56839.921744 & 56517.041673 & 61594.809674 \\\\\nRMSE D & 0.458516 & 0.463127 & 0.454669 & 0.462963 \\\\\n\\bottomrule\n\\end{tabular}\n'
Here, Random Forest gives the best prediction rule for \(g_0\) and Lasso the best prediction rule for \(m_0\). Let us fit the IRM model using the best ML method for each equation to get a final estimate for the treatment effect of eligibility.
set.seed(123)
lgr::get_logger("mlr3")$set_threshold("warn")

dml_irm = DoubleMLIRM$new(data_ml_aux,
                          ml_g = randomForest,
                          ml_m = lasso_class,
                          trimming_threshold = 0.01,
                          n_folds=3)
dml_irm$fit(store_predictions=TRUE)

dml_irm$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## e401 8286 1118 7.411 1.25e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
best_irm <- dml_irm$coef
best_std_irm <- dml_irm$se
np.random.seed(123)

dml_irm = dml.DoubleMLIRM(data_ml_aux,
                          ml_g = randomForest,
                          ml_m = lasso_class,
                          trimming_threshold = 0.01,
                          n_folds = 3)

dml_irm.fit(store_predictions=True)
## <doubleml.double_ml_irm.DoubleMLIRM object at 0x000000000C648C10>
print( dml_irm.summary )
## coef std err ... 2.5 % 97.5 %
## e401 8403.399751 1348.088788 ... 5761.194279 11045.605224
##
## [1 rows x 6 columns]
best_irm = dml_irm.coef[0]
best_std_irm = dml_irm.se[0]
These estimates, which flexibly account for confounding, are substantially attenuated relative to the baseline estimate (19559) that does not account for confounding. They suggest much smaller causal effects of 401(k) eligibility on financial asset holdings.
16.6 Local Average Treatment Effects of 401(k) Participation on Net Financial Assets
16.7 Interactive IV Model (IIVM)
Now, we consider estimation of local average treatment effects (LATE) of participation using the binary instrument e401. As before, \(Y\) denotes the outcome net_tfa, and \(X\) is the vector of covariates. Here the structural equation model is:
\[\begin{eqnarray} & Y = g_0(Z,X) + U, &\quad E[U\mid Z,X] = 0,\\ & D = r_0(Z,X) + V, &\quad E[V\mid Z, X] = 0,\\ & Z = m_0(X) + \zeta, &\quad E[\zeta \mid X] = 0. \end{eqnarray}\]
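The LATE is estimated as a ratio of two AIPW-type terms: the effect of the instrument on the outcome over its effect on the treatment. The sketch below is illustrative only, not the DoubleML implementation; it assumes cross-fitted predictions g0_hat, g1_hat, r0_hat, r1_hat, m_hat and arrays y, d, z of the kind computed below (with no always-takers, r0_hat can simply be set to zero).

# Illustrative orthogonal score for the LATE (ratio of two AIPW-type terms)
import numpy as np

def late_orthogonal(y, d, z, g0_hat, g1_hat, r0_hat, r1_hat, m_hat,
                    trimming_threshold=0.01):
    m = np.clip(m_hat, trimming_threshold, 1 - trimming_threshold)
    psi_b = (g1_hat - g0_hat
             + z * (y - g1_hat) / m - (1 - z) * (y - g0_hat) / (1 - m))  # effect of Z on Y
    psi_a = (r1_hat - r0_hat
             + z * (d - r1_hat) / m - (1 - z) * (d - r0_hat) / (1 - m))  # effect of Z on D
    return psi_b.mean() / psi_a.mean()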
# Constructing the data (as DoubleMLData)
= "net_tfa ~ p401+ e401 +
formula_flex2 poly(age, 6, raw=TRUE) +
poly(inc, 6, raw=TRUE) +
poly(educ, 4, raw=TRUE) +
poly(fsize, 2, raw=TRUE) +
marr + twoearn + db +
pira + hown"
= as.data.table(model.frame(formula_flex2, data))
model_flex2
= colnames(model_flex2)[-c(1,2,3)]
x_cols
= DoubleMLData$new(model_flex2, y_col = "net_tfa", d_cols = "p401", z_cols ="e401",x_cols=x_cols) data_IV_aux
# Constructing the data (as DoubleMLData)
features = data.copy()[['marr', 'twoearn', 'db', 'pira', 'hown']]
poly_dict = {'age': 6, 'inc': 6, 'educ': 4, 'fsize': 2}
for key, degree in poly_dict.items():
    poly = PolynomialFeatures(degree, include_bias=False)
    data_transf = poly.fit_transform(data[[key]])
    x_cols = poly.get_feature_names([key])
    data_transf = pd.DataFrame(data_transf, columns=x_cols)
    features = pd.concat((features, data_transf),
                         axis=1, sort=False)
model_flex2 = pd.concat((data.copy()[['net_tfa', 'p401', 'e401']], features.copy()),
                        axis=1, sort=False)
x_cols = model_flex2.columns.to_list()[3:]
# Initialize DoubleMLData (data-backend of DoubleML)
data_IV_aux = dml.DoubleMLData(model_flex2, y_col='net_tfa',
                               d_cols='p401', z_cols='e401',
                               x_cols=x_cols)
set.seed(123)
lgr::get_logger("mlr3")$set_threshold("warn")

dml_MLIIVM = DoubleMLIIVM$new(data_IV_aux,
                              ml_g = lasso,
                              ml_m = lasso_class,
                              ml_r = lasso_class,
                              n_folds=3,
                              subgroups = list(always_takers = FALSE,
                                               never_takers = TRUE))
dml_MLIIVM$fit(store_predictions=TRUE)

dml_MLIIVM$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## p401 11484 1596 7.195 6.25e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lasso_MLIIVM <- dml_MLIIVM$coef
lasso_std_MLIIVM <- dml_MLIIVM$se
# Lasso
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=20000))

# Initialize DoubleMLIIVM model
np.random.seed(123)
dml_MLIIVM = dml.DoubleMLIIVM(data_IV_aux,
                              ml_g = lasso,
                              ml_m = lasso_class,
                              ml_r = lasso_class,
                              subgroups = {'always_takers': False,
                                           'never_takers': True},
                              trimming_threshold = 0.01,
                              n_folds = 3)

dml_MLIIVM.fit(store_predictions=True)
## <doubleml.double_ml_iivm.DoubleMLIIVM object at 0x00000000614DAF70>
print( dml_MLIIVM.summary )
## coef std err ... 2.5 % 97.5 %
## p401 11764.551405 1842.601194 ... 8153.119428 15375.983383
##
## [1 rows x 6 columns]
lasso_MLIIVM = dml_MLIIVM.coef[0]
lasso_std_MLIIVM = dml_MLIIVM.se[0]
The confidence interval for the local average treatment effect of participation is given by
dml_MLIIVM$confint(level = 0.95)
## 2.5 % 97.5 %
## p401 8356.038 14612.92
dml_MLIIVM.confint()
## 2.5 % 97.5 %
## p401 8153.119428 15375.983383
Here we can also check the accuracy of the model:
# variables
y <- as.matrix(pension$net_tfa) # true observations
d <- as.matrix(pension$p401)
z <- as.matrix(pension$e401)

# predictions
dml_MLIIVM$params_names()
## [1] "ml_g0" "ml_g1" "ml_m" "ml_r0" "ml_r1"
g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(z=0, X)
g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(z=1, X)
g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0
r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(z=0, X)
r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(z=1, X)
r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0
m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o
# variables
y = pension.net_tfa.to_numpy()
d = pension.p401.to_numpy()
z = pension.e401.to_numpy()

# predictions
print(dml_MLIIVM.params_names)
## ['ml_g0', 'ml_g1', 'ml_m', 'ml_r0', 'ml_r1']
g0_hat = dml_MLIIVM.predictions['ml_g0'].flatten() # predictions of g_0(z=0, X)
g1_hat = dml_MLIIVM.predictions['ml_g1'].flatten() # predictions of g_0(z=1, X)
g_hat = z * g1_hat + (1 - z)*g0_hat # predictions of g_0
r0_hat = dml_MLIIVM.predictions['ml_r0'].flatten() # predictions of r_0(z=0, X)
r1_hat = dml_MLIIVM.predictions['ml_r1'].flatten() # predictions of r_0(z=1, X)
r_hat = z * r1_hat + (1 - z) * r0_hat # predictions of r_0
m_hat = dml_MLIIVM.predictions['ml_m'].flatten() # predictions of m_o
# cross-fitted RMSE: outcome
lasso_y_MLIIVM <- sqrt(mean((y-g_hat)^2))
lasso_y_MLIIVM
## [1] 55439.51
# cross-fitted RMSE: treatment
lasso_d_MLIIVM <- sqrt(mean((d-r_hat)^2))
lasso_d_MLIIVM
## [1] 0.2730651
# cross-fitted RMSE: instrument
lasso_z_MLIIVM <- sqrt(mean((z-m_hat)^2))
lasso_z_MLIIVM
## [1] 0.4440579
# cross-fitted RMSE: outcome
lasso_y_MLIIVM = np.sqrt(np.mean((y - g_hat) ** 2))
print(lasso_y_MLIIVM)
## 54127.27882639754
# cross-fitted RMSE: treatment
lasso_d_MLIIVM = np.sqrt(np.mean((d - r_hat) ** 2))
print(lasso_d_MLIIVM)
## 0.2786204170023307
# cross-fitted RMSE: instrument
lasso_z_MLIIVM = np.sqrt(np.mean((z - m_hat) ** 2))
print(lasso_z_MLIIVM)
## 0.45851619758764794
Again, we repeat the procedure for the other machine learning methods:
### random forest ###
set.seed(123)
lgr::get_logger("mlr3")$set_threshold("warn")

dml_MLIIVM = DoubleMLIIVM$new(data_IV_aux,
                              ml_g = randomForest,
                              ml_m = randomForest_class,
                              ml_r = randomForest_class,
                              n_folds=3,
                              subgroups = list(always_takers = FALSE,
                                               never_takers = TRUE))
dml_MLIIVM$fit(store_predictions=TRUE)

dml_MLIIVM$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## p401 12386 1861 6.655 2.84e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
forest_MLIIVM <- dml_MLIIVM$coef
forest_std_MLIIVM <- dml_MLIIVM$se

# predictions
g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(Z=0, X)
g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(Z=1, X)
g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0
r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(Z=0, X)
r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(Z=1, X)
r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0
m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o

# cross-fitted RMSE: outcome
forest_y_MLIIVM <- sqrt(mean((y-g_hat)^2))
forest_y_MLIIVM
## [1] 55894.92
# cross-fitted RMSE: treatment
forest_d_MLIIVM <- sqrt(mean((d-r_hat)^2))
forest_d_MLIIVM
## [1] 0.281485
# cross-fitted RMSE: instrument
forest_z_MLIIVM <- sqrt(mean((z-m_hat)^2))
forest_z_MLIIVM
## [1] 0.4558207
### trees ###
dml_MLIIVM = DoubleMLIIVM$new(data_IV_aux,
                              ml_g = trees,
                              ml_m = trees_class,
                              ml_r = trees_class,
                              n_folds=3,
                              subgroups = list(always_takers = FALSE,
                                               never_takers = TRUE))
dml_MLIIVM$fit(store_predictions=TRUE)

dml_MLIIVM$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## p401 11341 1719 6.596 4.23e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tree_MLIIVM <- dml_MLIIVM$coef
tree_std_MLIIVM <- dml_MLIIVM$se

# predictions
g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(Z=0, X)
g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(Z=1, X)
g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0
r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(Z=0, X)
r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(Z=1, X)
r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0
m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o

# cross-fitted RMSE: outcome
tree_y_MLIIVM <- sqrt(mean((y-g_hat)^2))
tree_y_MLIIVM
## [1] 58138.87
# cross-fitted RMSE: treatment
tree_d_MLIIVM <- sqrt(mean((d-r_hat)^2))
tree_d_MLIIVM
## [1] 0.2780659
# cross-fitted RMSE: instrument
tree_z_MLIIVM <- sqrt(mean((z-m_hat)^2))
tree_z_MLIIVM
## [1] 0.4556836
### boosting ###
dml_MLIIVM = DoubleMLIIVM$new(data_IV_aux,
                              ml_g = boost,
                              ml_m = boost_class,
                              ml_r = boost_class,
                              n_folds=3,
                              subgroups = list(always_takers = FALSE,
                                               never_takers = TRUE))
dml_MLIIVM$fit(store_predictions=TRUE)

dml_MLIIVM$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## p401 12529 1822 6.878 6.08e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boost_MLIIVM <- dml_MLIIVM$coef
boost_std_MLIIVM <- dml_MLIIVM$se

# predictions
g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(Z=0, X)
g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(Z=1, X)
g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0
r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(Z=0, X)
r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(Z=1, X)
r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0
m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o

# cross-fitted RMSE: outcome
boost_y_MLIIVM <- sqrt(mean((y-g_hat)^2))
boost_y_MLIIVM
## [1] 61377.34
# cross-fitted RMSE: treatment
boost_d_MLIIVM <- sqrt(mean((d-r_hat)^2))
boost_d_MLIIVM
## [1] 0.2897971
# cross-fitted RMSE: instrument
boost_z_MLIIVM <- sqrt(mean((z-m_hat)^2))
boost_z_MLIIVM
## [1] 0.4726815
### random forest ###
randomForest = RandomForestRegressor(n_estimators=500)
randomForest_class = RandomForestClassifier(n_estimators=500)

np.random.seed(123)
dml_MLIIVM = dml.DoubleMLIIVM(data_IV_aux,
                              ml_g = randomForest,
                              ml_m = randomForest_class,
                              ml_r = randomForest_class,
                              subgroups = {'always_takers': False,
                                           'never_takers': True},
                              trimming_threshold = 0.01,
                              n_folds = 3)

dml_MLIIVM.fit(store_predictions=True)
## <doubleml.double_ml_iivm.DoubleMLIIVM object at 0x0000000064929880>
print( dml_MLIIVM.summary )
## coef std err ... 2.5 % 97.5 %
## p401 11823.312322 2162.030668 ... 7585.810079 16060.814565
##
## [1 rows x 6 columns]
forest_MLIIVM = dml_MLIIVM.coef[0]
forest_std_MLIIVM = dml_MLIIVM.se[0]

# predictions
g0_hat = dml_MLIIVM.predictions['ml_g0'].flatten() # predictions of g_0(z=0, X)
g1_hat = dml_MLIIVM.predictions['ml_g1'].flatten() # predictions of g_0(z=1, X)
g_hat = z * g1_hat + (1 - z)*g0_hat # predictions of g_0
r0_hat = dml_MLIIVM.predictions['ml_r0'].flatten() # predictions of r_0(z=0, X)
r1_hat = dml_MLIIVM.predictions['ml_r1'].flatten() # predictions of r_0(z=1, X)
r_hat = z * r1_hat + (1 - z) * r0_hat # predictions of r_0
m_hat = dml_MLIIVM.predictions['ml_m'].flatten() # predictions of m_o

# cross-fitted RMSE: outcome
forest_y_MLIIVM = np.sqrt(np.mean((y - g_hat) ** 2))
print(forest_y_MLIIVM)
## 57014.35978884377
# cross-fitted RMSE: treatment
forest_d_MLIIVM = np.sqrt(np.mean((d - r_hat) ** 2))
print(forest_d_MLIIVM)
## 0.28397007364006993
# cross-fitted RMSE: instrument
forest_z_MLIIVM = np.sqrt(np.mean((z - m_hat) ** 2))
print(forest_z_MLIIVM)
## 0.46312674664653625

### trees ###
np.random.seed(123)
dml_MLIIVM = dml.DoubleMLIIVM(data_IV_aux,
                              ml_g = trees,
                              ml_m = trees_class,
                              ml_r = trees_class,
                              subgroups = {'always_takers': False,
                                           'never_takers': True},
                              trimming_threshold = 0.01,
                              n_folds = 3)
dml_MLIIVM.fit(store_predictions=True)
## <doubleml.double_ml_iivm.DoubleMLIIVM object at 0x00000000649291F0>
print( dml_MLIIVM.summary )
## coef std err ... 2.5 % 97.5 %
## p401 10652.033245 1665.615224 ... 7387.487395 13916.579095
##
## [1 rows x 6 columns]
tree_MLIIVM = dml_MLIIVM.coef[0]
tree_std_MLIIVM = dml_MLIIVM.se[0]

# predictions
g0_hat = dml_MLIIVM.predictions['ml_g0'].flatten() # predictions of g_0(z=0, X)
g1_hat = dml_MLIIVM.predictions['ml_g1'].flatten() # predictions of g_0(z=1, X)
g_hat = z * g1_hat + (1 - z)*g0_hat # predictions of g_0
r0_hat = dml_MLIIVM.predictions['ml_r0'].flatten() # predictions of r_0(z=0, X)
r1_hat = dml_MLIIVM.predictions['ml_r1'].flatten() # predictions of r_0(z=1, X)
r_hat = z * r1_hat + (1 - z) * r0_hat # predictions of r_0
m_hat = dml_MLIIVM.predictions['ml_m'].flatten() # predictions of m_o

# cross-fitted RMSE: outcome
tree_y_MLIIVM = np.sqrt(np.mean((y - g_hat) ** 2))
print(tree_y_MLIIVM)
## 56317.410732152974
# cross-fitted RMSE: treatment
tree_d_MLIIVM = np.sqrt(np.mean((d - r_hat) ** 2))
print(tree_d_MLIIVM)
## 0.278215621782697
# cross-fitted RMSE: instrument
tree_z_MLIIVM = np.sqrt(np.mean((z - m_hat) ** 2))
print(tree_z_MLIIVM)
## 0.4546690531536573

### boosting ###
np.random.seed(123)
dml_MLIIVM = dml.DoubleMLIIVM(data_IV_aux,
                              ml_g = boost,
                              ml_m = boost_class,
                              ml_r = boost_class,
                              subgroups = {'always_takers': False,
                                           'never_takers': True},
                              trimming_threshold = 0.01,
                              n_folds = 3)
dml_MLIIVM.fit(store_predictions=True)
## <doubleml.double_ml_iivm.DoubleMLIIVM object at 0x000000000C669B20>
print( dml_MLIIVM.summary )
## coef std err t P>|t| 2.5 % 97.5 %
## p401 13593.184614 3490.834608 3.893964 0.000099 6751.274506 20435.094721
boost_MLIIVM = dml_MLIIVM.coef[0]
boost_std_MLIIVM = dml_MLIIVM.se[0]

# predictions
g0_hat = dml_MLIIVM.predictions['ml_g0'].flatten() # predictions of g_0(z=0, X)
g1_hat = dml_MLIIVM.predictions['ml_g1'].flatten() # predictions of g_0(z=1, X)
g_hat = z * g1_hat + (1 - z)*g0_hat # predictions of g_0
r0_hat = dml_MLIIVM.predictions['ml_r0'].flatten() # predictions of r_0(z=0, X)
r1_hat = dml_MLIIVM.predictions['ml_r1'].flatten() # predictions of r_0(z=1, X)
r_hat = z * r1_hat + (1 - z) * r0_hat # predictions of r_0
m_hat = dml_MLIIVM.predictions['ml_m'].flatten() # predictions of m_o

# cross-fitted RMSE: outcome
boost_y_MLIIVM = np.sqrt(np.mean((y - g_hat) ** 2))
print(boost_y_MLIIVM)
## 61407.11615253673
# cross-fitted RMSE: treatment
boost_d_MLIIVM = np.sqrt(np.mean((d - r_hat) ** 2))
print(boost_d_MLIIVM)
## 0.2960100530837048
# cross-fitted RMSE: instrument
boost_z_MLIIVM = np.sqrt(np.mean((z - m_hat) ** 2))
print(boost_z_MLIIVM)
## 0.46296289075787445
# cross-fitted RMSE: outcome
lasso_y_MLIIVM <- sqrt(mean((y-g_hat)^2))
lasso_y_MLIIVM
## [1] 61377.34
# cross-fitted RMSE: treatment
lasso_d_MLIIVM <- sqrt(mean((d-r_hat)^2))
lasso_d_MLIIVM
## [1] 0.2897971
# cross-fitted RMSE: instrument
lasso_z_MLIIVM <- sqrt(mean((z-m_hat)^2))
lasso_z_MLIIVM
## [1] 0.4726815
# cross-fitted RMSE: outcome
lasso_y_MLIIVM = np.sqrt(np.mean((y - g_hat) ** 2))
print(lasso_y_MLIIVM)
## 61407.11615253673
# cross-fitted RMSE: treatment
lasso_d_MLIIVM = np.sqrt(np.mean((d - r_hat) ** 2))
print(lasso_d_MLIIVM)
## 0.2960100530837048
# cross-fitted RMSE: instrument
lasso_z_MLIIVM = np.sqrt(np.mean((z - m_hat) ** 2))
print(lasso_z_MLIIVM)
## 0.46296289075787445
library(xtable)
table <- matrix(0, 5, 4)
table[1,1:4] <- c(lasso_MLIIVM,forest_MLIIVM,tree_MLIIVM,boost_MLIIVM)
table[2,1:4] <- c(lasso_std_MLIIVM,forest_std_MLIIVM,tree_std_MLIIVM,boost_std_MLIIVM)
table[3,1:4] <- c(lasso_y_MLIIVM,forest_y_MLIIVM,tree_y_MLIIVM,boost_y_MLIIVM)
table[4,1:4] <- c(lasso_d_MLIIVM,forest_d_MLIIVM,tree_d_MLIIVM,boost_d_MLIIVM)
table[5,1:4] <- c(lasso_z_MLIIVM,forest_z_MLIIVM,tree_z_MLIIVM,boost_z_MLIIVM)
rownames(table) <- c("Estimate","Std.Error","RMSE Y","RMSE D","RMSE Z")
colnames(table) <- c("Lasso","Random Forest","Trees","Boosting")
tab <- xtable(table, digits = 2)
tab
## % latex table generated in R 4.0.4 by xtable 1.8-4 package
## % Wed Nov 24 15:13:38 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrr}
## \hline
## & Lasso & Random Forest & Trees & Boosting \\
## \hline
## Estimate & 11484.48 & 12385.92 & 11340.90 & 12529.11 \\
## Std.Error & 1596.17 & 1861.19 & 1719.41 & 1821.69 \\
## RMSE Y & 61377.34 & 55894.92 & 58138.87 & 61377.34 \\
## RMSE D & 0.29 & 0.28 & 0.28 & 0.29 \\
## RMSE Z & 0.47 & 0.46 & 0.46 & 0.47 \\
## \hline
## \end{tabular}
## \end{table}
table3 = np.zeros((5, 4))
table3[0,0:4] = lasso_MLIIVM,forest_MLIIVM,tree_MLIIVM,boost_MLIIVM
table3[1,0:4] = lasso_std_MLIIVM,forest_std_MLIIVM,tree_std_MLIIVM,boost_std_MLIIVM
table3[2,0:4] = lasso_y_MLIIVM,forest_y_MLIIVM,tree_y_MLIIVM,boost_y_MLIIVM
table3[3,0:4] = lasso_d_MLIIVM,forest_d_MLIIVM,tree_d_MLIIVM,boost_d_MLIIVM
table3[4,0:4] = lasso_z_MLIIVM,forest_z_MLIIVM,tree_z_MLIIVM,boost_z_MLIIVM
table3_pd = pd.DataFrame(table3,
                         index = ["Estimate","Std.Error","RMSE Y","RMSE D","RMSE Z"],
                         columns = ["Lasso","Random Forest","Trees","Boosting"])

table3_pd.to_latex()
## '\\begin{tabular}{lrrrr}\n\\toprule\n{} & Lasso & Random Forest & Trees & Boosting \\\\\n\\midrule\nEstimate & 11764.551405 & 11823.312322 & 10652.033245 & 13593.184614 \\\\\nStd.Error & 1842.601194 & 2162.030668 & 1665.615224 & 3490.834608 \\\\\nRMSE Y & 61407.116153 & 57014.359789 & 56317.410732 & 61407.116153 \\\\\nRMSE D & 0.296010 & 0.283970 & 0.278216 & 0.296010 \\\\\nRMSE Z & 0.462963 & 0.463127 & 0.454669 & 0.462963 \\\\\n\\bottomrule\n\\end{tabular}\n'
We report results based on four ML methods for estimating the nuisance functions used in forming the orthogonal estimating equations. We find again that the estimates of the treatment effect are stable across ML methods. The estimates are highly significant; hence we would reject the hypothesis that 401(k) participation has no effect on financial health.
We might rerun the model using the best ML method for each equation to get a final estimate for the treatment effect of participation:
set.seed(123)
lgr::get_logger("mlr3")$set_threshold("warn")
dml_MLIIVM = DoubleMLIIVM$new(data_IV_aux,
                              ml_g = randomForest,
                              ml_m = lasso_class,
                              ml_r = lasso_class,
                              n_folds=3,
                              subgroups = list(always_takers = FALSE,
                                               never_takers = TRUE))
dml_MLIIVM$fit(store_predictions=TRUE)

dml_MLIIVM$summary()
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## p401 11984 1612 7.433 1.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
best_MLIIVM <- dml_MLIIVM$coef
best_std_MLIIVM <- dml_MLIIVM$se
# Best ML method for each equation
np.random.seed(123)
dml_MLIIVM = dml.DoubleMLIIVM(data_IV_aux,
                              ml_g = randomForest,
                              ml_m = lasso_class,
                              ml_r = lasso_class,
                              subgroups = {'always_takers': False,
                                           'never_takers': True},
                              trimming_threshold = 0.01,
                              n_folds = 3)
dml_MLIIVM.fit(store_predictions=True)
## <doubleml.double_ml_iivm.DoubleMLIIVM object at 0x000000000C654A00>
print( dml_MLIIVM.summary )
## coef std err ... 2.5 % 97.5 %
## p401 12220.624723 1957.387719 ... 8384.21529 16057.034157
##
## [1 rows x 6 columns]
best_MLIIVM = dml_MLIIVM.coef[0]
best_std_MLIIVM = dml_MLIIVM.se[0]