Chapter 19 Sensititivy Analysis for Unobserved Confounder with DML and Sensmakr
19.1 Here we experiment with using package “sensemakr” in conjunction with debiased ML
19.2 We will work on:
Mimic the partialling out procedure with machine learning tools,
And invoke Sensmakr to compute \(\phi^2\) and plot sensitivity results.
# loads package
#install.packages("sensemakr")
library(sensemakr)
# loads data
data("darfur")
import warnings
"ignore")
warnings.filterwarnings(from sensemakr import sensemakr
from sensemakr import sensitivity_stats
from sensemakr import bias_functions
from sensemakr import ovb_bounds
from sensemakr import ovb_plots
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# loads data
= pd.read_csv("data/darfur.csv") darfur
Data is described here https://cran.r-project.org/web/packages/sensemakr/vignettes/sensemakr.html
The main outcome is attitude towards peace – the peacefactor. The key variable of interest is whether the responders were directly harmed (directlyharmed). We want to know if being directly harmed in the conflict causes people to support peace-enforcing measures. The measured confounders include female indicator, age, farmer, herder, voted in the past, and household size. There is also a village indicator, which we will treat as fixed effect and partial it out before conducting the analysis. The standard errors will be clustered at the village level.
19.3 Take out village fixed effects and run basic linear analysis
#get rid of village fixed effects
attach(darfur)
library(lfe)
<- lm(peacefactor~village)$res
peacefactorR<- lm(directlyharmed~village)$res
directlyharmedR<- lm(female~village)$res
femaleR<- lm(age~village)$res
ageR<- lm(farmer_dar~village)$res
farmerR<- lm(herder_dar~village)$res
herderR<- lm(pastvoted~village)$res
pastvotedR<- lm(hhsize_darfur~village)$res hhsizeR
# get rid of village fixed effects
import statsmodels.api as sm
import statsmodels.formula.api as smf
= smf.ols('peacefactor~village' , data=darfur).fit().resid
peacefactorR = smf.ols('directlyharmed~village' , data=darfur).fit().resid
directlyharmedR = smf.ols('female~village' , data=darfur).fit().resid
femaleR = smf.ols('age~village' , data=darfur).fit().resid
ageR = smf.ols('farmer_dar~village' , data=darfur).fit().resid
farmerR = smf.ols('herder_dar~village' , data=darfur).fit().resid
herderR = smf.ols('pastvoted~village' , data=darfur).fit().resid
pastvotedR = smf.ols('hhsize_darfur~village' , data=darfur).fit().resid
hhsizeR
### Auxiliary code to rearrange data
= pd.concat([peacefactorR, directlyharmedR, femaleR,
darfurR
ageR, farmerR, herderR, pastvotedR, 'village']], axis=1)
hhsizeR, darfur[= ["peacefactorR", "directlyharmedR", "femaleR",
darfurR.columns "ageR", "farmerR", "herderR", "pastvotedR",
"hhsize_darfurR", "village"]
# Preliminary linear model analysis
# here we are clustering standard errors at the village level
summary(felm(peacefactorR~ directlyharmedR+
+ ageR +
femaleR + herderR + pastvotedR +
farmerR|0|0|village)) hhsizeR
##
## Call:
## felm(formula = peacefactorR ~ directlyharmedR + femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR | 0 | 0 | village)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67487 -0.14712 0.00000 0.09857 0.90307
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## (Intercept) -3.681e-18 6.704e-16 -0.005 0.99562
## directlyharmedR 9.732e-02 2.382e-02 4.085 4.68e-05 ***
## femaleR -2.321e-01 2.444e-02 -9.495 < 2e-16 ***
## ageR -2.072e-03 7.441e-04 -2.784 0.00545 **
## farmerR -4.044e-02 2.956e-02 -1.368 0.17156
## herderR 1.428e-02 3.650e-02 0.391 0.69569
## pastvotedR -4.802e-02 2.688e-02 -1.787 0.07420 .
## hhsizeR 1.230e-03 2.166e-03 0.568 0.57034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2437 on 1268 degrees of freedom
## Multiple R-squared(full model): 0.1542 Adjusted R-squared: 0.1496
## Multiple R-squared(proj model): 0.1542 Adjusted R-squared: 0.1496
## F-statistic(full model, *iid*):33.03 on 7 and 1268 DF, p-value: < 2.2e-16
## F-statistic(proj model): 25.44 on 7 and 485 DF, p-value: < 2.2e-16
# Preliminary linear model analysis
# here we are clustering standard errors at the village level
= smf.ols('peacefactorR~ directlyharmedR+ femaleR + ageR + farmerR+ herderR + pastvotedR + hhsizeR'
linear_model_1 =darfurR ).fit().get_robustcov_results(cov_type = "cluster", groups= darfurR['village'])
,data= linear_model_1.summary2().tables[1]
linear_model_1_table linear_model_1_table
## Coef. Std.Err. ... [0.025 0.975]
## Intercept -2.003606e-15 8.016956e-16 ... -3.578831e-15 -4.283801e-16
## directlyharmedR 9.731582e-02 2.382281e-02 ... 5.050716e-02 1.441245e-01
## femaleR -2.320514e-01 2.443857e-02 ... -2.800700e-01 -1.840329e-01
## ageR -2.071749e-03 7.441260e-04 ... -3.533858e-03 -6.096402e-04
## farmerR -4.044295e-02 2.956411e-02 ... -9.853250e-02 1.764661e-02
## herderR 1.427910e-02 3.649802e-02 ... -5.743466e-02 8.599286e-02
## pastvotedR -4.802496e-02 2.687661e-02 ... -1.008339e-01 4.784016e-03
## hhsizeR 1.229812e-03 2.166312e-03 ... -3.026704e-03 5.486328e-03
##
## [8 rows x 6 columns]
# here we are clustering standard errors at the village level
summary(felm(peacefactorR~ femaleR +
+ farmerR+ herderR +
ageR + hhsizeR |0|0|village)) pastvotedR
##
## Call:
## felm(formula = peacefactorR ~ femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR | 0 | 0 | village)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63765 -0.15168 0.00000 0.09859 0.90298
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## (Intercept) -2.635e-18 6.584e-16 -0.004 0.99681
## femaleR -2.415e-01 2.536e-02 -9.522 < 2e-16 ***
## ageR -2.187e-03 7.453e-04 -2.934 0.00341 **
## farmerR -4.071e-02 2.923e-02 -1.393 0.16390
## herderR 2.623e-02 3.968e-02 0.661 0.50871
## pastvotedR -4.414e-02 2.784e-02 -1.585 0.11313
## hhsizeR 1.336e-03 2.127e-03 0.628 0.52991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2463 on 1269 degrees of freedom
## Multiple R-squared(full model): 0.1353 Adjusted R-squared: 0.1312
## Multiple R-squared(proj model): 0.1353 Adjusted R-squared: 0.1312
## F-statistic(full model, *iid*): 33.1 on 6 and 1269 DF, p-value: < 2.2e-16
## F-statistic(proj model): 23.07 on 6 and 485 DF, p-value: < 2.2e-16
# Linear model 2
= smf.ols('peacefactorR~ femaleR + ageR + farmerR+ herderR + pastvotedR + hhsizeR'
linear_model_2 =darfurR ).fit().get_robustcov_results(cov_type = "cluster", groups= darfurR['village'])
,data= linear_model_2.summary2().tables[1]
linear_model_2_table linear_model_2_table
## Coef. Std.Err. ... [0.025 0.975]
## Intercept -1.946360e-15 7.328661e-16 ... -3.386344e-15 -5.063751e-16
## femaleR -2.415042e-01 2.536306e-02 ... -2.913393e-01 -1.916692e-01
## ageR -2.186810e-03 7.453429e-04 ... -3.651310e-03 -7.223106e-04
## farmerR -4.071431e-02 2.923021e-02 ... -9.814780e-02 1.671917e-02
## herderR 2.622875e-02 3.967810e-02 ... -5.173345e-02 1.041909e-01
## pastvotedR -4.414131e-02 2.784297e-02 ... -9.884904e-02 1.056643e-02
## hhsizeR 1.336220e-03 2.126678e-03 ... -2.842419e-03 5.514859e-03
##
## [7 rows x 6 columns]
# here we are clustering standard errors at the village level
summary(felm(directlyharmedR~ femaleR +
+ farmerR+ herderR +
ageR + hhsizeR |0|0|village)) pastvotedR
##
## Call:
## felm(formula = directlyharmedR ~ femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR | 0 | 0 | village)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8285 -0.3129 0.0000 0.2630 0.9076
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## (Intercept) 1.075e-17 5.196e-16 0.021 0.9835
## femaleR -9.714e-02 5.129e-02 -1.894 0.0585 .
## ageR -1.182e-03 1.151e-03 -1.028 0.3044
## farmerR -2.789e-03 4.280e-02 -0.065 0.9481
## herderR 1.228e-01 5.064e-02 2.425 0.0155 *
## pastvotedR 3.991e-02 3.366e-02 1.186 0.2360
## hhsizeR 1.093e-03 3.286e-03 0.333 0.7394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3744 on 1269 degrees of freedom
## Multiple R-squared(full model): 0.0179 Adjusted R-squared: 0.01326
## Multiple R-squared(proj model): 0.0179 Adjusted R-squared: 0.01326
## F-statistic(full model, *iid*):3.856 on 6 and 1269 DF, p-value: 0.0008089
## F-statistic(proj model): 3.828 on 6 and 485 DF, p-value: 0.0009698
# Linear model 3
= smf.ols('directlyharmedR~ femaleR + ageR + farmerR+ herderR + pastvotedR + hhsizeR'
linear_model_3 =darfurR ).fit().get_robustcov_results(cov_type = "cluster", groups= darfurR['village'])
,data= linear_model_3.summary2().tables[1]
linear_model_3_table linear_model_3_table
## Coef. Std.Err. ... [0.025 0.975]
## Intercept 5.867702e-16 7.973710e-16 ... -9.799580e-16 2.153498e-15
## femaleR -9.713517e-02 5.128637e-02 ... -1.979061e-01 3.635740e-03
## ageR -1.182350e-03 1.150680e-03 ... -3.443283e-03 1.078584e-03
## farmerR -2.788538e-03 4.280159e-02 ... -8.688797e-02 8.131090e-02
## herderR 1.227925e-01 5.064352e-02 ... 2.328466e-02 2.223002e-01
## pastvotedR 3.990773e-02 3.366253e-02 ... -2.623468e-02 1.060501e-01
## hhsizeR 1.093431e-03 3.286160e-03 ... -5.363437e-03 7.550298e-03
##
## [7 rows x 6 columns]
19.4 We first use Lasso for Partilling Out Controls
library(hdm)
= rlasso(peacefactorR ~ (femaleR +
resY +
ageR +
farmerR+
herderR +
pastvotedR ^3, post=F)$res
hhsizeR)
= rlasso(directlyharmedR ~ (femaleR +
resD +
ageR +
farmerR +
herderR +
pastvotedR ^3 , post=F)$res hhsizeR)
import hdmpy
import patsy
from patsy import ModelDesc, Term, EvalFactor
= patsy.dmatrix("(femaleR + ageR + farmerR+ herderR + pastvotedR + hhsizeR)**3", darfurR)
X = darfurR['peacefactorR'].to_numpy()
Y = darfurR['directlyharmedR'].to_numpy()
D
= hdmpy.rlasso(X[: , 1:],Y, post = False).est['residuals'].reshape( Y.size,)
resY = hdmpy.rlasso(X[: , 1:],D, post = False).est['residuals'].reshape( D.size,)
resD = 1 - np.var(resY)/np.var(peacefactorR)
FVU_Y = 1 - np.var(resD)/np.var(directlyharmedR) FVU_D
summary( rlasso(peacefactorR ~ (femaleR +
+
ageR +
farmerR+
herderR +
pastvotedR ^3,
hhsizeR)post=F) )
##
## Call:
## rlasso.formula(formula = peacefactorR ~ (femaleR + ageR + farmerR +
## herderR + pastvotedR + hhsizeR)^3, post = F)
##
## Post-Lasso Estimation: FALSE
##
## Total number of variables: 41
## Number of selected variables: 5
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6045408 -0.1552131 0.0005935 0.0899936 0.8968939
##
## Estimate
## (Intercept) -0.001
## femaleR -0.201
## ageR 0.000
## farmerR -0.004
## herderR 0.000
## pastvotedR 0.000
## hhsizeR 0.000
## femaleR:ageR 0.000
## femaleR:farmerR 0.000
## femaleR:herderR 0.000
## femaleR:pastvotedR 0.026
## femaleR:hhsizeR 0.000
## ageR:farmerR 0.000
## ageR:herderR 0.000
## ageR:pastvotedR 0.000
## ageR:hhsizeR 0.000
## farmerR:herderR 0.000
## farmerR:pastvotedR 0.000
## farmerR:hhsizeR 0.000
## herderR:pastvotedR 0.000
## herderR:hhsizeR 0.000
## pastvotedR:hhsizeR 0.000
## femaleR:ageR:farmerR 0.000
## femaleR:ageR:herderR 0.000
## femaleR:ageR:pastvotedR 0.000
## femaleR:ageR:hhsizeR 0.000
## femaleR:farmerR:herderR 0.000
## femaleR:farmerR:pastvotedR 0.000
## femaleR:farmerR:hhsizeR 0.000
## femaleR:herderR:pastvotedR 0.000
## femaleR:herderR:hhsizeR 0.000
## femaleR:pastvotedR:hhsizeR 0.000
## ageR:farmerR:herderR 0.000
## ageR:farmerR:pastvotedR 0.002
## ageR:farmerR:hhsizeR 0.000
## ageR:herderR:pastvotedR 0.000
## ageR:herderR:hhsizeR 0.000
## ageR:pastvotedR:hhsizeR 0.000
## farmerR:herderR:pastvotedR 0.000
## farmerR:herderR:hhsizeR 0.000
## farmerR:pastvotedR:hhsizeR 0.000
## herderR:pastvotedR:hhsizeR 0.000
##
## Residual standard error: 0.2472
## Multiple R-squared: 0.1251
## Adjusted R-squared: 0.1217
## Joint significance test:
## the sup score statistic for joint significance test is 16.89 with a p-value of 0.626
= hdmpy.rlasso(X[: , 1:],Y, post = False)
rlasso_1
def summary_rlasso( mod , X1):
= mod.est['coefficients'].rename(columns = { 0 : "Est."})
ob1 = X1.design_info.column_names
ob1.index return ob1
summary_rlasso(rlasso_1 , X)
## Est.
## Intercept -0.000593
## femaleR -0.200735
## ageR -0.000381
## farmerR -0.004220
## herderR 0.000000
## pastvotedR 0.000000
## hhsizeR 0.000000
## femaleR:ageR 0.000000
## femaleR:farmerR 0.000000
## femaleR:herderR 0.000000
## femaleR:pastvotedR 0.025755
## femaleR:hhsizeR 0.000000
## ageR:farmerR 0.000000
## ageR:herderR 0.000000
## ageR:pastvotedR 0.000000
## ageR:hhsizeR 0.000000
## farmerR:herderR 0.000000
## farmerR:pastvotedR 0.000000
## farmerR:hhsizeR 0.000000
## herderR:pastvotedR 0.000000
## herderR:hhsizeR 0.000000
## pastvotedR:hhsizeR 0.000000
## femaleR:ageR:farmerR 0.000000
## femaleR:ageR:herderR 0.000000
## femaleR:ageR:pastvotedR 0.000000
## femaleR:ageR:hhsizeR 0.000000
## femaleR:farmerR:herderR 0.000000
## femaleR:farmerR:pastvotedR 0.000000
## femaleR:farmerR:hhsizeR 0.000000
## femaleR:herderR:pastvotedR 0.000000
## femaleR:herderR:hhsizeR 0.000000
## femaleR:pastvotedR:hhsizeR 0.000000
## ageR:farmerR:herderR 0.000000
## ageR:farmerR:pastvotedR 0.001897
## ageR:farmerR:hhsizeR 0.000000
## ageR:herderR:pastvotedR 0.000000
## ageR:herderR:hhsizeR 0.000000
## ageR:pastvotedR:hhsizeR 0.000000
## farmerR:herderR:pastvotedR 0.000000
## farmerR:herderR:hhsizeR 0.000000
## farmerR:pastvotedR:hhsizeR 0.000000
## herderR:pastvotedR:hhsizeR 0.000000
print(c("Controls explain the following fraction of variance of Outcome", 1-var(resY)/var(peacefactorR)))
## [1] "Controls explain the following fraction of variance of Outcome"
## [2] "0.125108054898996"
print("Controls explain the following fraction of variance of Outcome", FVU_Y)
## Controls explain the following fraction of variance of Outcome 0.12510805781802825
print(c("Controls explain the following fraction of variance of Treatment", 1-var(resD)/var(directlyharmedR)))
## [1] "Controls explain the following fraction of variance of Treatment"
## [2] "0.0119842379295229"
print("Controls explain the following fraction of variance of treatment", FVU_D)
## Controls explain the following fraction of variance of treatment 0.011984223484257428
library(lfe)
= felm(resY ~ resD|0|0|village) # cluster SEs by village
dml.darfur.model
summary(dml.darfur.model,
robust=T) #culster SE by village
##
## Call:
## felm(formula = resY ~ resD | 0 | 0 | village)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63410 -0.15299 0.00069 0.09305 0.89593
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## (Intercept) -1.250e-18 1.594e-04 0.000 1
## resD 1.003e-01 2.443e-02 4.108 4.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2444 on 1274 degrees of freedom
## Multiple R-squared(full model): 0.02312 Adjusted R-squared: 0.02235
## Multiple R-squared(proj model): 0.02312 Adjusted R-squared: 0.02235
## F-statistic(full model, *iid*):30.15 on 1 and 1274 DF, p-value: 4.82e-08
## F-statistic(proj model): 16.87 on 1 and 485 DF, p-value: 4.693e-05
# Filan estimation
'resY'] = resY
darfurR['resD'] = resD
darfurR[# Culster SE by village
= smf.ols('resY~ resD',data=darfurR ).fit().get_robustcov_results(cov_type = "cluster", groups= darfurR['village'])
dml_darfur_model = dml_darfur_model.summary2().tables[1]
dml_darfur_model_table dml_darfur_model_table
## Coef. Std.Err. t P>|t| [0.025 0.975]
## Intercept -6.372846e-10 0.000159 -0.000004 0.999997 -0.000313 0.000313
## resD 1.003364e-01 0.024427 4.107546 0.000047 0.052340 0.148333
= lm(resY ~ resD) #lineaer model to use as input in sensemakr dml.darfur.model
# linear model to use as input in sensemakr
= smf.ols('resY~ resD',data=darfurR ).fit()
dml_darfur_model= dml_darfur_model.summary2().tables[1] dml_darfur_model_table
19.5 Manual Bias Analysis
# Main estimate
= dml.darfur.model$coef[2]
beta
# Hypothetical values of partial R2s
= .16; R2.DC = .01
R2.YC
# Elements of the formal
<- (R2.YC * R2.DC)/(1- R2.DC)
kappa<- mean(dml.darfur.model$res^2)/mean(dml.darfur.model$res^2)
varianceRatio
# Compute square bias
<- kappa*varianceRatio
BiasSq
# Compute absolute value of the bias
print(sqrt(BiasSq))
## [1] 0.04020151
import matplotlib.pyplot as plt
= dml_darfur_model_table['Coef.'][1]
beta
# Hypothetical values of partial R2s
= .16
R2_YC = .01
R2_DC
# Elements of the formal
= (R2_YC * R2_DC)/(1- R2_DC)
kappa = np.mean(dml_darfur_model.resid**2)/np.mean(dml_darfur_model.resid**2)
varianceRatio
# Compute square bias
= kappa*varianceRatio
BiasSq
# Compute absolute value of the bias
print(np.sqrt(BiasSq))
## 0.04020151261036848
# plotting
<- seq(0,.3, by=.001)
gridR2.DC<- kappa*(1 - gridR2.DC)/gridR2.DC
gridR2.YC<- ifelse(gridR2.YC> 1, 1, gridR2.YC);
gridR2.YC
plot(gridR2.DC,
gridR2.YC, type="l", col=4,
xlab="Partial R2 of Treatment with Confounder",
ylab="Partial R2 of Outcome with Confounder",
main= c("Combo of R2 such that |Bias|< ", round(sqrt(BiasSq), digits=4))
)
# plotting
= np.arange(0,0.3,0.001)
gridR2_DC = kappa*(1 - gridR2_DC)/gridR2_DC
gridR2_YC = np.where(gridR2_YC > 1, 1, gridR2_YC)
gridR2_YC
"Combo of R2 such that |Bias|<{}".format(round(np.sqrt(BiasSq), 5))) plt.title(
## Text(0.5, 1.0, 'Combo of R2 such that |Bias|<0.0402')
"Partial R2 of Treatment with Confounder") plt.xlabel(
## Text(0.5, 0, 'Partial R2 of Treatment with Confounder')
"Partial R2 of Outcome with Confounder") plt.ylabel(
## Text(0, 0.5, 'Partial R2 of Outcome with Confounder')
plt.plot(gridR2_DC,gridR2_YC)
## [<matplotlib.lines.Line2D object at 0x000000004ABF09A0>]
plt.show()
19.6 Bias Analysis with Sensemakr
<- sensemakr(model = dml.darfur.model,
dml.darfur.sensitivity treatment = "resD")
summary(dml.darfur.sensitivity)
## Sensitivity Analysis to Unobserved Confounding
##
## Model Formula: resY ~ resD
##
## Null hypothesis: q = 1 and reduce = TRUE
## -- This means we are considering biases that reduce the absolute value of the current estimate.
## -- The null hypothesis deemed problematic is H0:tau = 0
##
## Unadjusted Estimates of 'resD':
## Coef. estimate: 0.1003
## Standard Error: 0.0183
## t-value (H0:tau = 0): 5.491
##
## Sensitivity Statistics:
## Partial R2 of treatment with outcome: 0.0231
## Robustness Value, q = 1: 0.1425
## Robustness Value, q = 1, alpha = 0.05: 0.0941
##
## Verbal interpretation of sensitivity statistics:
##
## -- Partial R2 of the treatment with the outcome: an extreme confounder (orthogonal to the covariates) that explains 100% of the residual variance of the outcome, would need to explain at least 2.31% of the residual variance of the treatment to fully account for the observed estimated effect.
##
## -- Robustness Value, q = 1: unobserved confounders (orthogonal to the covariates) that explain more than 14.25% of the residual variance of both the treatment and the outcome are strong enough to bring the point estimate to 0 (a bias of 100% of the original estimate). Conversely, unobserved confounders that do not explain more than 14.25% of the residual variance of both the treatment and the outcome are not strong enough to bring the point estimate to 0.
##
## -- Robustness Value, q = 1, alpha = 0.05: unobserved confounders (orthogonal to the covariates) that explain more than 9.41% of the residual variance of both the treatment and the outcome are strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0 (a bias of 100% of the original estimate), at the significance level of alpha = 0.05. Conversely, unobserved confounders that do not explain more than 9.41% of the residual variance of both the treatment and the outcome are not strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0, at the significance level of alpha = 0.05.
# Imports
from sensemakr import sensemakr
from sensemakr import sensitivity_stats
from sensemakr import bias_functions
from sensemakr import ovb_bounds
from sensemakr import ovb_plots
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# We need to double check why the function does not allow to run withour the benchmark_covariates argument
= sensemakr.Sensemakr(dml_darfur_model, "resD", benchmark_covariates = "Intercept")
dml_darfur_sensitivity ovb_plots.extract_from_sense_obj( dml_darfur_sensitivity )
## ('resD', 0.10033644720396692, 0.018272934212004013, 1274.0, 0 2.166013e-18
## Name: r2dz_x, dtype: float64, 0 6.808156e-18
## Name: r2yz_dx, dtype: float64, 0 1x Intercept
## Name: bound_label, dtype: object, True, 0.0, 1.9618292555617416)
# Make a contour plot for the estimate
plot(dml.darfur.sensitivity, nlevels = 15)
# Make a contour plot for the estimate
=dml_darfur_sensitivity, sensitivity_of='estimate')
ovb_plots.ovb_contour_plot(sense_obj plt.show()
19.7 Next We use Random Forest as ML tool for Partialling Out
The following code does DML with clsutered standard errors by ClusterID
<- function(x, d, y, dreg, yreg, nfold=2, clusterID) {
DML2.for.PLM <- nrow(x) #number of observations
nobs
<- rep.int(1:nfold,times = ceiling(nobs/nfold))[sample.int(nobs)] #define folds indices
foldid
<- split(1:nobs, foldid) #split observation indices into folds
I
<- dtil <- rep(NA, nobs)
ytil
cat("fold: ")
for(b in 1:length(I)){
<- dreg(x[-I[[b]],], d[-I[[b]]]) #take a fold out
dfit
<- yreg(x[-I[[b]],], y[-I[[b]]]) # take a foldt out
yfit
<- predict(dfit, x[I[[b]],], type="response") #predict the left-out fold
dhat
<- predict(yfit, x[I[[b]],], type="response") #predict the left-out fold
yhat
<- (d[I[[b]]] - dhat) #record residual for the left-out fold
dtil[I[[b]]]
<- (y[I[[b]]] - yhat) #record residial for the left-out fold
ytil[I[[b]]]
cat(b," ")
}
<- felm(ytil ~ dtil |0|0|clusterID) #get clustered standard errors using felm
rfit
<- summary(rfit)
rfitSummary
<- rfitSummary$coef[2] #extract coefficient
coef.est
<- rfitSummary$coef[2,2] #record robust standard error
se
cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est , se)) #printing output
return( list(coef.est =coef.est , se=se, dtil=dtil, ytil=ytil) ) #save output and residuals
}
import itertools
from itertools import compress
def DML2_for_PLM(x, d, y, dreg, yreg, nfold, clu):
# Num ob observations
= x.shape[0]
nobs
# Define folds indices
= [*range(0, nfold, 1)]*nobs
list_1 = np.random.choice(nobs,nobs, replace=False).tolist()
sample = [list_1[index] for index in sample]
foldid
# Create split function(similar to R)
def split(z, f):
= max(f) + 1
count return tuple( list(itertools.compress(z, (el == i for el in f))) for i in range(count) )
# Split observation indices into folds
= [*range(0, nobs, 1)]
list_2 = split(list_2, foldid)
I
# loop to save results
for b in range(0,len(I)):
# Split data - index to keep are in mask as booleans
= set(I[b]) #Here should go I[b] Set is more efficient, but doesn't reorder your elements if that is desireable
include_idx = np.array([(i in include_idx) for i in range(len(x))])
mask
# Lasso regression, excluding folds selected
= dreg(x[~mask,], d[~mask,])
dfit = yreg(x[~mask,], y[~mask,])
yfit
# predict estimates using the
= dfit.predict( x[mask,] )
dhat = yfit.predict( x[mask,] )
yhat
# Create array to save errors
= np.zeros( len(x) ).reshape( len(x) , 1 )
dtil = np.zeros( len(x) ).reshape( len(x) , 1 )
ytil
# save errors
= d[mask,] - dhat.reshape( len(I[b]) , 1 )
dtil[mask] = y[mask,] - yhat.reshape( len(I[b]) , 1 )
ytil[mask] print(b, " ")
# Create dataframe
= pd.DataFrame(np.concatenate( ( ytil, dtil,clu ), axis = 1), columns = ['ytil','dtil','CountyCode'])
data_2
# OLS clustering at the County level
= "ytil ~ dtil"
model = smf.ols(model , data=data_2).fit().get_robustcov_results(cov_type = "cluster", groups= data_2['CountyCode'])
baseline_ols = baseline_ols.summary2().tables[1]['Coef.']['dtil']
coef_est = baseline_ols.summary2().tables[1]['Std.Err.']['dtil']
se
print("Coefficient is {}, SE is equal to {}".format(coef_est, se))
return coef_est, se, dtil, ytil, data_2
library(randomForest) #random Forest library
= model.matrix(~ femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR)
x= directlyharmedR
d= peacefactorR;
y dim(x)
## [1] 1276 7
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder
# This new matrix include intercept
= patsy.dmatrix("~ femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR", darfurR)
x = darfurR['peacefactorR'].to_numpy().reshape( len(Y) , 1 )
y = darfurR['directlyharmedR'].to_numpy().reshape( len(Y) , 1 )
d x.shape
## (1276, 7)
#DML with Random Forest:
<- function(x,d){ randomForest(x, d) } #ML method=Forest
dreg <- function(x,y){ randomForest(x, y) } #ML method=Forest
yreg set.seed(1)
= DML2.for.PLM(x, d, y, dreg, yreg, nfold=10, clusterID=village) DML2.RF
## fold: 1 2 3 4 5 6 7 8 9 10
## coef (se) = 0.0963494 (0.0243608)
= DML2.RF$ytil
resY
= DML2.RF$dtil resD
# creating instance of labelencoder
= LabelEncoder()
labelencoder
# Assigning numerical values and storing in another column
'village_clu'] = labelencoder.fit_transform(darfurR['village'])
darfurR[
# Create cluster object
= darfurR['village_clu']
CLU = CLU.to_numpy().reshape( len(Y) , 1 )
clu
#DML with cross-validated Lasso:
def dreg(x,d):
= RandomForestRegressor( random_state = 0 ).fit( x, d )
result return result
def yreg(x,y):
= RandomForestRegressor( random_state = 0 ).fit( x, y )
result return result
= DML2_for_PLM(x, d, y, dreg, yreg, 2, clu) # set to 2 due to computation time DML2_RF
## 0
## 1
## Coefficient is 0.12079217531137135, SE is equal to 0.033980547376392126
= DML2_RF[2]
resY = DML2_RF[3] resD
print(c("Controls explain the following fraction of variance of Outcome", max(1-var(resY)/var(peacefactorR),0)))
## [1] "Controls explain the following fraction of variance of Outcome"
## [2] "0.0489042746089093"
= 1 - np.var(resY)/np.var(peacefactorR)
FVU_Y = 1 - np.var(resD)/np.var(directlyharmedR)
FVU_D
print("Controls explain the following fraction of variance of Outcome", FVU_Y)
## Controls explain the following fraction of variance of Outcome -0.16888432052333724
print(c("Controls explain the following fraction of variance of Treatment", max(1-var(resD)/var(directlyharmedR),0)))
## [1] "Controls explain the following fraction of variance of Treatment"
## [2] "0"
print("Controls explain the following fraction of variance of treatment", FVU_D)
## Controls explain the following fraction of variance of treatment 0.7327860005131104
= lm(resY~resD)
dml.darfur.model
<- sensemakr(model = dml.darfur.model,
dml.darfur.sensitivity treatment = "resD")
summary(dml.darfur.sensitivity)
## Sensitivity Analysis to Unobserved Confounding
##
## Model Formula: resY ~ resD
##
## Null hypothesis: q = 1 and reduce = TRUE
## -- This means we are considering biases that reduce the absolute value of the current estimate.
## -- The null hypothesis deemed problematic is H0:tau = 0
##
## Unadjusted Estimates of 'resD':
## Coef. estimate: 0.0963
## Standard Error: 0.0182
## t-value (H0:tau = 0): 5.305
##
## Sensitivity Statistics:
## Partial R2 of treatment with outcome: 0.0216
## Robustness Value, q = 1: 0.138
## Robustness Value, q = 1, alpha = 0.05: 0.0894
##
## Verbal interpretation of sensitivity statistics:
##
## -- Partial R2 of the treatment with the outcome: an extreme confounder (orthogonal to the covariates) that explains 100% of the residual variance of the outcome, would need to explain at least 2.16% of the residual variance of the treatment to fully account for the observed estimated effect.
##
## -- Robustness Value, q = 1: unobserved confounders (orthogonal to the covariates) that explain more than 13.8% of the residual variance of both the treatment and the outcome are strong enough to bring the point estimate to 0 (a bias of 100% of the original estimate). Conversely, unobserved confounders that do not explain more than 13.8% of the residual variance of both the treatment and the outcome are not strong enough to bring the point estimate to 0.
##
## -- Robustness Value, q = 1, alpha = 0.05: unobserved confounders (orthogonal to the covariates) that explain more than 8.94% of the residual variance of both the treatment and the outcome are strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0 (a bias of 100% of the original estimate), at the significance level of alpha = 0.05. Conversely, unobserved confounders that do not explain more than 8.94% of the residual variance of both the treatment and the outcome are not strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0, at the significance level of alpha = 0.05.
'resY_rf'] = resY
darfurR['resD_rf'] = resD
darfurR[
# linear model to use as input in sensemakr
= smf.ols('resY_rf~ resD_rf',data=darfurR ).fit()
dml_darfur_model_rf
# We need to double check why the function does not allow to run withour the benchmark_covariates argument
"resD_rf", benchmark_covariates = "Intercept") sensemakr.Sensemakr(dml_darfur_model_rf,
## <sensemakr.sensemakr.Sensemakr object at 0x0000000054938430>
dml_darfur_sensitivity.summary()
## Sensitivity Analysis to Unobserved Confounding
##
## Model Formula: resY ~ Intercept + resD
##
## Null hypothesis: q = 1 and reduce = True
##
## -- This means we are considering biases that reduce the absolute value of the current estimate.
## -- The null hypothesis deemed problematic is H0:tau = 0.0
##
## Unadjusted Estimates of ' resD ':
## Coef. estimate: 0.1
## Standard Error: 0.018
## t-value: 5.491
##
## Sensitivity Statistics:
## Partial R2 of treatment with outcome: 0.023
## Robustness Value, q = 1 : 0.142
## Robustness Value, q = 1 alpha = 0.05 : 0.094
##
## Verbal interpretation of sensitivity statistics:
##
## -- Partial R2 of the treatment with the outcome: an extreme confounder (orthogonal to the covariates) that explains 100% of the residual variance of the outcome, would need to explain at least 2.311921044413477 % of the residual variance of the treatment to fully account for the observed estimated effect.
##
## -- Robustness Value, q = 1 : unobserved confounders (orthogonal to the covariates) that explain more than 14.245999027817676 % of the residual variance of both the treatment and the outcome are strong enough to bring the point estimate to 0.0 (a bias of 100.0 % of the original estimate). Conversely, unobserved confounders that do not explain more than 14.245999027817676 % of the residual variance of both the treatment and the outcome are not strong enough to bring the point estimate to 0.0 .
##
## -- Robustness Value, q = 1 , alpha = 0.05 : unobserved confounders (orthogonal to the covariates) that explain more than 9.408806814165754 % of the residual variance of both the treatment and the outcome are strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0.0 (a bias of 100.0 % of the original estimate), at the significance level of alpha = 0.05 . Conversely, unobserved confounders that do not explain more than 9.408806814165754 % of the residual variance of both the treatment and the outcome are not strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0.0 , at the significance level of alpha = 0.05 .
##
## Bounds on omitted variable bias:
## --The table below shows the maximum strength of unobserved confounders with association with the treatment and the outcome bounded by a multiple of the observed explanatory power of the chosen benchmark covariate(s).
##
## bound_label r2dz_x ... adjusted_lower_CI adjusted_upper_CI
## 0 1x Intercept 2.166013e-18 ... 0.064474 0.136199
##
## [1 rows x 9 columns]
# Make a contour plot for the estimate
plot(dml.darfur.sensitivity,nlevels = 15)
# Make a contour plot for the estimate
=dml_darfur_sensitivity, sensitivity_of='estimate')
ovb_plots.ovb_contour_plot(sense_obj plt.show()