Chapter 3 Predictive Inference The Gender Wage Gap
In the previous lab, we already analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and answered the question how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:
What is the difference in predicted wages between men and women with the same job-relevant characteristics?
Thus, we analyze if there is a difference in the payment of men and women (gender wage gap). The gender wage gap may partly reflect discrimination against women in the labor market or may partly reflect a selection effect, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).
To investigate the gender wage gap, we consider the following log-linear regression model
\[log(Y) = \beta'X + \epsilon \] \[log(Y) = \beta_1 D + \beta_2' W + \epsilon\]
where \(D\) is the indicator of being female (\(1\) if female and \(0\) otherwise) and the \(W\)’s are controls explaining variation in wages. Considering transformed wages by the logarithm, we are analyzing the relative difference in the payment of men and women.
3.1 Data analysis
We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set.
library(dplyr)
library(kableExtra)
import pandas as pd
import numpy as np
import pyreadr
import math
load("./data/wage2015_subsample_inference.Rdata")
# Dimensions
dim(data)
## [1] 5150 20
= pyreadr.read_r("./data/wage2015_subsample_inference.Rdata")
rdata_read = rdata_read[ 'data' ]
data # Dimensions
data.shape
## (5150, 20)
To start our (causal) analysis, we compare the sample means given gender:
<- data %>%
table1 select(lwage,sex,shs,hsg,scl,clg,ad,ne,mw,so,we,exp1)
<- apply(table1,2,mean)
All <- apply(table1[table1$sex==0,],2,mean)
Men <- apply(table1[table1$sex==1,],2,mean)
Women
<- c("Log Wage","Sex","Less then High School","High School Graduate","Some College","Collage Graduate","Advanced Degree", "Northeast","Midwest","South","West","Experience")
Variables
data.frame(Variables,All,Men,Women) %>%
kable("markdown",caption = "Sample means given gender")
Variables | All | Men | Women | |
---|---|---|---|---|
lwage | Log Wage | 2.9707867 | 2.9878296 | 2.9494849 |
sex | Sex | 0.4444660 | 0.0000000 | 1.0000000 |
shs | Less then High School | 0.0233010 | 0.0318071 | 0.0126693 |
hsg | High School Graduate | 0.2438835 | 0.2943027 | 0.1808650 |
scl | Some College | 0.2780583 | 0.2733310 | 0.2839668 |
clg | Collage Graduate | 0.3176699 | 0.2939532 | 0.3473132 |
ad | Advanced Degree | 0.1370874 | 0.1066061 | 0.1751857 |
ne | Northeast | 0.2277670 | 0.2219504 | 0.2350371 |
mw | Midwest | 0.2596117 | 0.2590003 | 0.2603757 |
so | South | 0.2965049 | 0.2981475 | 0.2944517 |
we | West | 0.2161165 | 0.2209018 | 0.2101354 |
exp1 | Experience | 13.7605825 | 13.7839916 | 13.7313237 |
= data[ ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] ]
Z
= data[data[ 'sex' ] == 1 ]
data_female = data_female[ ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] ]
Z_female
= data[ data[ 'sex' ] == 0 ]
data_male = data_male[ [ "lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1" ] ]
Z_male
= np.zeros( (12, 3) )
table 0] = Z.mean().values
table[:, 1] = Z_male.mean().values
table[:, 2] = Z_female.mean().values
table[:, = pd.DataFrame( table, columns = [ 'All', 'Men', 'Women'])
table_pandas = ["Log Wage","Sex","Less then High School","High School Graduate","Some College","Gollage Graduate","Advanced Degree", "Northeast","Midwest","South","West","Experience"]
table_pandas.index = table_pandas.to_html()
table_html
table_pandas
## All Men Women
## Log Wage 2.970787 2.987830 2.949485
## Sex 0.444466 0.000000 1.000000
## Less then High School 0.023301 0.031807 0.012669
## High School Graduate 0.243883 0.294303 0.180865
## Some College 0.278058 0.273331 0.283967
## Gollage Graduate 0.317670 0.293953 0.347313
## Advanced Degree 0.137087 0.106606 0.175186
## Northeast 0.227767 0.221950 0.235037
## Midwest 0.259612 0.259000 0.260376
## South 0.296505 0.298148 0.294452
## West 0.216117 0.220902 0.210135
## Experience 13.760583 13.783992 13.731324
Alternatively, we can also print the table as latex.
print(data.frame(Variables,All,Men,Women) ,type="latex")
## Variables All Men Women
## lwage Log Wage 2.97078670 2.98782963 2.94948490
## sex Sex 0.44446602 0.00000000 1.00000000
## shs Less then High School 0.02330097 0.03180706 0.01266929
## hsg High School Graduate 0.24388350 0.29430269 0.18086501
## scl Some College 0.27805825 0.27333100 0.28396680
## clg Collage Graduate 0.31766990 0.29395316 0.34731324
## ad Advanced Degree 0.13708738 0.10660608 0.17518567
## ne Northeast 0.22776699 0.22195037 0.23503713
## mw Midwest 0.25961165 0.25900035 0.26037571
## so South 0.29650485 0.29814750 0.29445173
## we West 0.21611650 0.22090178 0.21013543
## exp1 Experience 13.76058252 13.78399161 13.73132372
# print(table_pandas, type="latex")
In particular, the table above shows that the difference in average logwage between men and women is equal to \(0,038\).
mean(table1$lwage[table1$sex==1])-mean(table1$lwage[table1$sex==0])
## [1] -0.03834473
'lwage'].mean() - data_male['lwage'].mean() data_female[
## -0.03834473367441493
Thus, the unconditional gender wage gap is about \(3,8 \%\) for the group of never married workers (women get paid less on average in our sample). We also observe that never married working women are relatively more educated than working men and have lower working experience.
This unconditional (predictive) effect of gender equals the coefficient \(\beta\) in the univariate ols regression of \(Y\) on \(D\):
\[log(Y) =\beta D + \epsilon\] We verify this by running an ols regression in R and Python.
3.1.1 OLS Regression
library(sandwich)
import statsmodels.api as sm
import statsmodels.formula.api as smf
<- lm(lwage ~ sex, data = data)
nocontrol.fit <- summary(nocontrol.fit)$coef["sex",1]
nocontrol.est <- vcovHC(nocontrol.fit, type = 'HC');
HCV.coefs <- sqrt(diag(HCV.coefs))[2] # Estimated std errors
nocontrol.se cat ("The estimated gender coefficient is",nocontrol.est," and the corresponding robust standard error is",nocontrol.se)
## The estimated gender coefficient is -0.03834473 and the corresponding robust standard error is 0.01590194
= smf.ols( formula = 'lwage ~ sex', data = data )
nocontrol_model = nocontrol_model.fit().summary2().tables[1]['Coef.']['sex']
nocontrol_est = nocontrol_model.fit().cov_HC0
HCV_coefs = np.power( HCV_coefs.diagonal() , 0.5)[1]
nocontrol_se print( f'The estimated gender coefficient is {nocontrol_est} and the corresponding robust standard error is {nocontrol_se}' )
## The estimated gender coefficient is -0.038344733674415696 and the corresponding robust standard error is 0.015901935079095802
The estimated gender coefficient is -0.0383447, and the corresponding robust standard error is 0.0159019.
Note that the standard error is computed with the R package sandwich to be robust to heteroskedasticity.
Next, we run an ols regression of \(Y\) on \((D,W)\) to control for the effect of covariates summarized in \(W\):
\[log(Y) =\beta_1 D + \beta_2' W + \epsilon\]
Here, we are considering the flexible model from the previous lab. Hence, \(W\) controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.
Let us run the ols regression with controls.
3.1.2 Ols regression with controls
<- lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)
flex
# Note that ()*() operation in formula objects in R creates a formula of the sort:
# (exp1+exp2+exp3+exp4)+ (shs+hsg+scl+clg+occ2+ind2+mw+so+we) + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)
# This is not intuitive at all, but that's what it does.
<- lm(flex, data=data)
control.fit <- summary(control.fit)$coef[2,1]
control.est
summary(control.fit)
##
## Call:
## lm(formula = flex, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.2782 -0.0041 0.2733 3.4934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8602606 0.4286188 9.006 < 2e-16 ***
## sex -0.0695532 0.0152180 -4.570 4.99e-06 ***
## exp1 -0.0677247 0.1519756 -0.446 0.655885
## exp2 1.6362944 1.6909253 0.968 0.333246
## exp3 -0.9154735 0.6880249 -1.331 0.183388
## exp4 0.1429357 0.0907569 1.575 0.115337
## shs -0.1233089 0.9068325 -0.136 0.891845
## hsg -0.5289024 0.1977559 -2.675 0.007508 **
## scl -0.2920581 0.1260155 -2.318 0.020510 *
## clg -0.0411641 0.0703862 -0.585 0.558688
## occ22 0.1613397 0.1297243 1.244 0.213665
## occ23 0.2101514 0.1686774 1.246 0.212869
## occ24 0.0708570 0.1837167 0.386 0.699746
## occ25 -0.3960076 0.1885398 -2.100 0.035745 *
## occ26 -0.2310611 0.1869662 -1.236 0.216576
## occ27 0.3147249 0.1941519 1.621 0.105077
## occ28 -0.1875417 0.1692988 -1.108 0.268022
## occ29 -0.3390270 0.1672301 -2.027 0.042685 *
## occ210 0.0209545 0.1564982 0.134 0.893490
## occ211 -0.6424177 0.3090899 -2.078 0.037723 *
## occ212 -0.0674774 0.2520486 -0.268 0.788929
## occ213 -0.2329781 0.2315379 -1.006 0.314359
## occ214 0.2562009 0.3226729 0.794 0.427236
## occ215 -0.1938585 0.2595082 -0.747 0.455086
## occ216 -0.0551256 0.1470658 -0.375 0.707798
## occ217 -0.4156093 0.1361144 -3.053 0.002275 **
## occ218 -0.4822168 1.0443540 -0.462 0.644290
## occ219 -0.2579412 0.3325215 -0.776 0.437956
## occ220 -0.3010203 0.2341022 -1.286 0.198556
## occ221 -0.4271811 0.2206486 -1.936 0.052922 .
## occ222 -0.8694527 0.2975222 -2.922 0.003490 **
## ind23 -1.2473654 0.6454941 -1.932 0.053365 .
## ind24 -0.0948281 0.4636021 -0.205 0.837935
## ind25 -0.5293860 0.4345990 -1.218 0.223244
## ind26 -0.6221688 0.4347226 -1.431 0.152441
## ind27 -0.5047497 0.5024770 -1.005 0.315176
## ind28 -0.7295442 0.4674008 -1.561 0.118623
## ind29 -0.8025334 0.4252462 -1.887 0.059190 .
## ind210 -0.5805840 0.4808776 -1.207 0.227358
## ind211 -0.9852350 0.4481566 -2.198 0.027966 *
## ind212 -0.7375777 0.4243260 -1.738 0.082232 .
## ind213 -1.0183283 0.4826544 -2.110 0.034922 *
## ind214 -0.5860174 0.4159033 -1.409 0.158892
## ind215 -0.3801359 0.5908517 -0.643 0.520014
## ind216 -0.5703905 0.4386579 -1.300 0.193556
## ind217 -0.8201843 0.4259846 -1.925 0.054239 .
## ind218 -0.7613604 0.4238287 -1.796 0.072495 .
## ind219 -0.8812815 0.4565671 -1.930 0.053635 .
## ind220 -0.9099021 0.4484198 -2.029 0.042499 *
## ind221 -0.7586534 0.4405801 -1.722 0.085143 .
## ind222 -0.4040775 0.4328735 -0.933 0.350620
## mw 0.1106834 0.0814463 1.359 0.174218
## so 0.0224244 0.0743855 0.301 0.763075
## we -0.0215659 0.0841591 -0.256 0.797767
## exp1:shs -0.1919981 0.1955408 -0.982 0.326206
## exp1:hsg -0.0173433 0.0572279 -0.303 0.761859
## exp1:scl -0.0664505 0.0433730 -1.532 0.125570
## exp1:clg -0.0550346 0.0310279 -1.774 0.076172 .
## exp1:occ22 -0.0736239 0.0501108 -1.469 0.141837
## exp1:occ23 -0.0714859 0.0637688 -1.121 0.262336
## exp1:occ24 -0.0723997 0.0747715 -0.968 0.332953
## exp1:occ25 0.0946732 0.0794005 1.192 0.233182
## exp1:occ26 -0.0348928 0.0712136 -0.490 0.624175
## exp1:occ27 -0.2279338 0.0784860 -2.904 0.003699 **
## exp1:occ28 -0.0727459 0.0645883 -1.126 0.260094
## exp1:occ29 0.0274143 0.0669517 0.409 0.682217
## exp1:occ210 0.0075628 0.0581715 0.130 0.896564
## exp1:occ211 0.1014221 0.1005094 1.009 0.312986
## exp1:occ212 -0.0862744 0.0874768 -0.986 0.324057
## exp1:occ213 0.0067149 0.0761825 0.088 0.929768
## exp1:occ214 -0.1369153 0.0974458 -1.405 0.160073
## exp1:occ215 -0.0400425 0.0898931 -0.445 0.656017
## exp1:occ216 -0.0539314 0.0520926 -1.035 0.300580
## exp1:occ217 0.0147277 0.0467903 0.315 0.752958
## exp1:occ218 0.1074099 0.4718440 0.228 0.819937
## exp1:occ219 0.0047165 0.1060745 0.044 0.964536
## exp1:occ220 0.0243156 0.0743274 0.327 0.743575
## exp1:occ221 0.0791776 0.0696947 1.136 0.255985
## exp1:occ222 0.1093246 0.0880828 1.241 0.214607
## exp1:ind23 0.4758891 0.2227484 2.136 0.032693 *
## exp1:ind24 0.0147304 0.1571102 0.094 0.925305
## exp1:ind25 0.1256987 0.1531626 0.821 0.411864
## exp1:ind26 0.1540275 0.1524289 1.010 0.312312
## exp1:ind27 0.1029245 0.1786939 0.576 0.564654
## exp1:ind28 0.2357669 0.1689203 1.396 0.162859
## exp1:ind29 0.1359079 0.1489486 0.912 0.361578
## exp1:ind210 0.1512578 0.1644341 0.920 0.357687
## exp1:ind211 0.3174885 0.1590023 1.997 0.045907 *
## exp1:ind212 0.2591089 0.1510588 1.715 0.086356 .
## exp1:ind213 0.3396094 0.1669241 2.035 0.041954 *
## exp1:ind214 0.1441411 0.1477994 0.975 0.329485
## exp1:ind215 -0.0568181 0.2349853 -0.242 0.808950
## exp1:ind216 0.0847295 0.1550425 0.546 0.584753
## exp1:ind217 0.1728867 0.1513280 1.142 0.253317
## exp1:ind218 0.1565399 0.1494171 1.048 0.294842
## exp1:ind219 0.1516103 0.1620851 0.935 0.349641
## exp1:ind220 0.1326629 0.1566883 0.847 0.397222
## exp1:ind221 0.2190905 0.1555052 1.409 0.158930
## exp1:ind222 0.1145814 0.1523427 0.752 0.452010
## exp1:mw -0.0279931 0.0296572 -0.944 0.345274
## exp1:so -0.0099678 0.0266868 -0.374 0.708786
## exp1:we 0.0063077 0.0301417 0.209 0.834248
## exp2:shs 1.9005060 1.4502480 1.310 0.190098
## exp2:hsg 0.1171642 0.5509729 0.213 0.831609
## exp2:scl 0.6217923 0.4629986 1.343 0.179344
## exp2:clg 0.4096746 0.3802171 1.077 0.281321
## exp2:occ22 0.6632173 0.5523220 1.201 0.229895
## exp2:occ23 0.6415456 0.7102783 0.903 0.366448
## exp2:occ24 0.9748422 0.8655351 1.126 0.260099
## exp2:occ25 -0.9778823 0.9737990 -1.004 0.315335
## exp2:occ26 0.1050860 0.8002267 0.131 0.895527
## exp2:occ27 3.1407119 0.9389423 3.345 0.000829 ***
## exp2:occ28 0.6710877 0.7192077 0.933 0.350818
## exp2:occ29 0.0231977 0.7629142 0.030 0.975744
## exp2:occ210 -0.2692292 0.6405270 -0.420 0.674267
## exp2:occ211 -1.0816539 1.0057575 -1.075 0.282221
## exp2:occ212 0.8323737 0.9341245 0.891 0.372933
## exp2:occ213 -0.2209813 0.7728463 -0.286 0.774942
## exp2:occ214 0.7511163 0.9272548 0.810 0.417955
## exp2:occ215 -0.0326858 0.9409116 -0.035 0.972290
## exp2:occ216 0.3635814 0.5509550 0.660 0.509342
## exp2:occ217 -0.2659285 0.4861131 -0.547 0.584369
## exp2:occ218 -2.5608762 5.1700911 -0.495 0.620393
## exp2:occ219 -0.1291756 1.0616901 -0.122 0.903165
## exp2:occ220 -0.3323297 0.7229071 -0.460 0.645743
## exp2:occ221 -0.9099997 0.6854114 -1.328 0.184349
## exp2:occ222 -0.8550536 0.8279414 -1.033 0.301773
## exp2:ind23 -5.9368948 2.4067939 -2.467 0.013670 *
## exp2:ind24 -1.1053411 1.7101982 -0.646 0.518100
## exp2:ind25 -2.0149181 1.6919190 -1.191 0.233748
## exp2:ind26 -2.2277748 1.6816902 -1.325 0.185325
## exp2:ind27 -1.4648099 2.0137888 -0.727 0.467022
## exp2:ind28 -2.9479949 1.8595425 -1.585 0.112955
## exp2:ind29 -1.7796219 1.6471248 -1.080 0.279999
## exp2:ind210 -2.1973300 1.7738638 -1.239 0.215507
## exp2:ind211 -3.8776807 1.7637372 -2.199 0.027956 *
## exp2:ind212 -3.1690425 1.6819362 -1.884 0.059602 .
## exp2:ind213 -3.9651983 1.8130709 -2.187 0.028789 *
## exp2:ind214 -2.0783289 1.6490355 -1.260 0.207610
## exp2:ind215 0.1911692 2.6075396 0.073 0.941559
## exp2:ind216 -1.3265850 1.7185648 -0.772 0.440202
## exp2:ind217 -2.2002873 1.6837183 -1.307 0.191341
## exp2:ind218 -2.2006232 1.6566630 -1.328 0.184125
## exp2:ind219 -1.9308536 1.7876673 -1.080 0.280152
## exp2:ind220 -1.9467267 1.7244008 -1.129 0.258983
## exp2:ind221 -3.1127363 1.7237908 -1.806 0.071019 .
## exp2:ind222 -1.8578340 1.6849542 -1.103 0.270254
## exp2:mw 0.2005611 0.3172911 0.632 0.527348
## exp2:so 0.0544354 0.2815662 0.193 0.846708
## exp2:we 0.0012717 0.3207873 0.004 0.996837
## exp3:shs -0.6721239 0.4426627 -1.518 0.128987
## exp3:hsg -0.0179937 0.2083176 -0.086 0.931171
## exp3:scl -0.1997877 0.1855189 -1.077 0.281572
## exp3:clg -0.1025230 0.1643648 -0.624 0.532819
## exp3:occ22 -0.2039403 0.2211386 -0.922 0.356455
## exp3:occ23 -0.2369620 0.2870372 -0.826 0.409103
## exp3:occ24 -0.4366958 0.3520168 -1.241 0.214830
## exp3:occ25 0.3885298 0.4118861 0.943 0.345577
## exp3:occ26 0.0484737 0.3293525 0.147 0.882997
## exp3:occ27 -1.3949288 0.4050109 -3.444 0.000578 ***
## exp3:occ28 -0.2053899 0.2895727 -0.709 0.478181
## exp3:occ29 -0.0909660 0.3143348 -0.289 0.772293
## exp3:occ210 0.1854753 0.2575565 0.720 0.471477
## exp3:occ211 0.3931553 0.3817758 1.030 0.303152
## exp3:occ212 -0.2202559 0.3660206 -0.602 0.547363
## exp3:occ213 0.0950356 0.2904370 0.327 0.743519
## exp3:occ214 -0.1443933 0.3341622 -0.432 0.665684
## exp3:occ215 0.1477077 0.3645191 0.405 0.685339
## exp3:occ216 -0.0378548 0.2151288 -0.176 0.860330
## exp3:occ217 0.1510497 0.1878081 0.804 0.421276
## exp3:occ218 1.4084443 1.8852467 0.747 0.455047
## exp3:occ219 0.0923425 0.4042308 0.228 0.819314
## exp3:occ220 0.1806994 0.2652079 0.681 0.495682
## exp3:occ221 0.3779083 0.2553031 1.480 0.138875
## exp3:occ222 0.2855058 0.2984206 0.957 0.338754
## exp3:ind23 2.6665808 0.9807497 2.719 0.006573 **
## exp3:ind24 0.7298431 0.6879811 1.061 0.288811
## exp3:ind25 0.9942250 0.6842435 1.453 0.146280
## exp3:ind26 1.0641428 0.6800948 1.565 0.117718
## exp3:ind27 0.7089089 0.8337963 0.850 0.395245
## exp3:ind28 1.2340948 0.7483474 1.649 0.099193 .
## exp3:ind29 0.8287315 0.6675904 1.241 0.214526
## exp3:ind210 1.0448162 0.7066717 1.479 0.139337
## exp3:ind211 1.6877578 0.7162155 2.356 0.018487 *
## exp3:ind212 1.3734455 0.6835570 2.009 0.044564 *
## exp3:ind213 1.6376669 0.7259301 2.256 0.024117 *
## exp3:ind214 1.0162910 0.6714525 1.514 0.130199
## exp3:ind215 0.1879483 1.0299675 0.182 0.855214
## exp3:ind216 0.6889680 0.6968028 0.989 0.322831
## exp3:ind217 1.0085540 0.6836992 1.475 0.140238
## exp3:ind218 1.0605598 0.6725232 1.577 0.114863
## exp3:ind219 0.8959865 0.7225602 1.240 0.215029
## exp3:ind220 0.9768944 0.6955822 1.404 0.160255
## exp3:ind221 1.4415215 0.6996480 2.060 0.039418 *
## exp3:ind222 0.9687884 0.6828498 1.419 0.156037
## exp3:mw -0.0625771 0.1241291 -0.504 0.614194
## exp3:so -0.0115842 0.1084217 -0.107 0.914917
## exp3:we -0.0124875 0.1251376 -0.100 0.920515
## exp4:shs 0.0777418 0.0475427 1.635 0.102071
## exp4:hsg 0.0004913 0.0265964 0.018 0.985264
## exp4:scl 0.0210760 0.0245289 0.859 0.390256
## exp4:clg 0.0078695 0.0227528 0.346 0.729457
## exp4:occ22 0.0176389 0.0289257 0.610 0.542021
## exp4:occ23 0.0303057 0.0376552 0.805 0.420962
## exp4:occ24 0.0584146 0.0457704 1.276 0.201927
## exp4:occ25 -0.0515181 0.0549489 -0.938 0.348514
## exp4:occ26 -0.0170182 0.0440847 -0.386 0.699488
## exp4:occ27 0.1905353 0.0558757 3.410 0.000655 ***
## exp4:occ28 0.0196522 0.0379084 0.518 0.604195
## exp4:occ29 0.0190014 0.0421099 0.451 0.651841
## exp4:occ210 -0.0333347 0.0338825 -0.984 0.325246
## exp4:occ211 -0.0465914 0.0479018 -0.973 0.330778
## exp4:occ212 0.0110212 0.0470536 0.234 0.814820
## exp4:occ213 -0.0136895 0.0358988 -0.381 0.702970
## exp4:occ214 0.0055582 0.0400331 0.139 0.889581
## exp4:occ215 -0.0327444 0.0462379 -0.708 0.478872
## exp4:occ216 -0.0089706 0.0275729 -0.325 0.744937
## exp4:occ217 -0.0256735 0.0239306 -1.073 0.283400
## exp4:occ218 -0.2121372 0.2204003 -0.963 0.335841
## exp4:occ219 -0.0169398 0.0513428 -0.330 0.741463
## exp4:occ220 -0.0296125 0.0323353 -0.916 0.359819
## exp4:occ221 -0.0524577 0.0317251 -1.654 0.098291 .
## exp4:occ222 -0.0350646 0.0360687 -0.972 0.331018
## exp4:ind23 -0.3851791 0.1329065 -2.898 0.003771 **
## exp4:ind24 -0.1209478 0.0899580 -1.344 0.178852
## exp4:ind25 -0.1441045 0.0897994 -1.605 0.108616
## exp4:ind26 -0.1526110 0.0892689 -1.710 0.087410 .
## exp4:ind27 -0.1001993 0.1119398 -0.895 0.370768
## exp4:ind28 -0.1609664 0.0979780 -1.643 0.100471
## exp4:ind29 -0.1178080 0.0877821 -1.342 0.179642
## exp4:ind210 -0.1482842 0.0918416 -1.615 0.106469
## exp4:ind211 -0.2322961 0.0944506 -2.459 0.013949 *
## exp4:ind212 -0.1872911 0.0899985 -2.081 0.037481 *
## exp4:ind213 -0.2155617 0.0946011 -2.279 0.022731 *
## exp4:ind214 -0.1483524 0.0884992 -1.676 0.093740 .
## exp4:ind215 -0.0532195 0.1313815 -0.405 0.685439
## exp4:ind216 -0.1044336 0.0916252 -1.140 0.254429
## exp4:ind217 -0.1427349 0.0899315 -1.587 0.112543
## exp4:ind218 -0.1546248 0.0885883 -1.745 0.080973 .
## exp4:ind219 -0.1269592 0.0948784 -1.338 0.180918
## exp4:ind220 -0.1468554 0.0911188 -1.612 0.107094
## exp4:ind221 -0.2032619 0.0920972 -2.207 0.027358 *
## exp4:ind222 -0.1480951 0.0897937 -1.649 0.099154 .
## exp4:mw 0.0062439 0.0158699 0.393 0.694007
## exp4:so 0.0003145 0.0136275 0.023 0.981591
## exp4:we 0.0017685 0.0159602 0.111 0.911776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4708 on 4904 degrees of freedom
## Multiple R-squared: 0.3511, Adjusted R-squared: 0.3187
## F-statistic: 10.83 on 245 and 4904 DF, p-value: < 2.2e-16
cat("Coefficient for OLS with controls", control.est)
## Coefficient for OLS with controls -0.0695532
<- vcovHC(control.fit, type = 'HC');
HCV.coefs <- sqrt(diag(HCV.coefs))[2] # Estimated std errors control.se
= 'lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
flex
# The smf api replicates R script when it transform data
= smf.ols( formula = flex, data = data )
control_model = control_model.fit().summary2().tables[1]['Coef.']['sex']
control_est
print(control_model.fit().summary2().tables[1])
## Coef. Std.Err. t P>|t| [0.025 0.975]
## Intercept 3.279677 0.284196 11.540202 2.037819e-30 2.722526 3.836828
## occ2[T.10] 0.020954 0.156498 0.133896 8.934903e-01 -0.285852 0.327761
## occ2[T.11] -0.642418 0.309090 -2.078417 3.772286e-02 -1.248372 -0.036463
## occ2[T.12] -0.067477 0.252049 -0.267716 7.889294e-01 -0.561605 0.426651
## occ2[T.13] -0.232978 0.231538 -1.006220 3.143593e-01 -0.686896 0.220940
## ... ... ... ... ... ... ...
## exp4:scl 0.021076 0.024529 0.859230 3.902557e-01 -0.027012 0.069164
## exp4:clg 0.007869 0.022753 0.345868 7.294565e-01 -0.036736 0.052475
## exp4:mw 0.006244 0.015870 0.393446 6.940073e-01 -0.024868 0.037356
## exp4:so 0.000314 0.013628 0.023075 9.815913e-01 -0.026402 0.027031
## exp4:we 0.001768 0.015960 0.110804 9.117763e-01 -0.029521 0.033058
##
## [246 rows x 6 columns]
print( f"Coefficient for OLS with controls {control_est}" )
## Coefficient for OLS with controls -0.06955320329684715
= control_model.fit().cov_HC0
HCV_coefs = np.power( HCV_coefs.diagonal() , 0.5)[1] control_se
The estimated regression coefficient \(\beta_1\approx-0.0696\) measures how our linear prediction of wage changes if we set the gender variable \(D\) from 0 to 1, holding the controls \(W\) fixed. We can call this the predictive effect (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of size \(4\%\) for women increases to about \(7\%\) after controlling for worker characteristics.
Next, we are using the Frisch-Waugh-Lovell theorem from the lecture partialling-out the linear effect of the controls via ols.
3.2 Partialling-Out using ols
# Partialling-Out using ols
# models
<- lwage ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we) # model for Y
flex.y <- sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we) # model for D
flex.d
# partialling-out the linear effect of W from Y
<- lm(flex.y, data=data)$res
t.Y # partialling-out the linear effect of W from D
<- lm(flex.d, data=data)$res
t.D
# regression of Y on D after partialling-out the effect of W
<- lm(t.Y~t.D)
partial.fit <- summary(partial.fit)$coef[2,1]
partial.est
cat("Coefficient for D via partialling-out", partial.est)
## Coefficient for D via partialling-out -0.0695532
# standard error
<- vcovHC(partial.fit, type = 'HC')
HCV.coefs <- sqrt(diag(HCV.coefs))[2]
partial.se
# confidence interval
confint(partial.fit)[2,]
## 2.5 % 97.5 %
## -0.09867142 -0.04043498
# models
# model for Y
= 'lwage ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
flex_y # model for D
= 'sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
flex_d
# partialling-out the linear effect of W from Y
= smf.ols( formula = flex_y , data = data ).fit().resid
t_Y
# partialling-out the linear effect of W from D
= smf.ols( formula = flex_d , data = data ).fit().resid
t_D
= pd.DataFrame( np.vstack(( t_Y.values , t_D.values )).T , columns = [ 't_Y', 't_D' ] )
data_res # regression of Y on D after partialling-out the effect of W
= smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_fit = partial_fit.summary2().tables[1]['Coef.']['t_D']
partial_est
print("Coefficient for D via partialling-out", partial_est)
# standard error
## Coefficient for D via partialling-out -0.06955320329684608
= partial_fit.cov_HC0
HCV_coefs = np.power( HCV_coefs.diagonal() , 0.5)[1]
partial_se
# confidence interval
=0.05 ).iloc[1, :] partial_fit.conf_int( alpha
## 0 -0.098671
## 1 -0.040435
## Name: t_D, dtype: float64
Again, the estimated coefficient measures the linear predictive effect (PE) of \(D\) on \(Y\) after taking out the linear effect of \(W\) on both of these variables. This coefficient equals the estimated coefficient from the ols regression with controls.
We know that the partialling-out approach works well when the dimension of \(W\) is low in relation to the sample size \(n\). When the dimension of \(W\) is relatively high, we need to use variable selection or penalization for regularization purposes.
In the following, we illustrate the partialling-out approach using lasso instead of ols.
3.3 Partialling-Out using lasso
# Partialling-Out using lasso
library(hdm)
# models
<- lwage ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we) # model for Y
flex.y <- sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we) # model for D
flex.d
# partialling-out the linear effect of W from Y
<- rlasso(flex.y, data=data)$res
t.Y # partialling-out the linear effect of W from D
<- rlasso(flex.d, data=data)$res
t.D
# regression of Y on D after partialling-out the effect of W
<- lm(t.Y~t.D)
partial.lasso.fit <- summary(partial.lasso.fit)$coef[2,1]
partial.lasso.est
cat("Coefficient for D via partialling-out using lasso", partial.lasso.est)
## Coefficient for D via partialling-out using lasso -0.07193551
# standard error
<- vcovHC(partial.lasso.fit, type = 'HC')
HCV.coefs <- sqrt(diag(HCV.coefs))[2] partial.lasso.se
from sklearn import linear_model
# flex_y
= linear_model.Lasso( alpha = 0.1 )
lasso_model
= smf.ols(formula = flex_y, data = data)
flex_y_covariables = lasso_model.fit( flex_y_covariables.exog, data[[ 'lwage' ]] ).predict( flex_y_covariables.exog )
Y_lasso_fitted = data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)
t_Y
# extraflex_d
= smf.ols( flex_d, data=data)
flex_d_covariables = lasso_model.fit( flex_d_covariables.exog, data[[ 'sex' ]] ).predict( flex_d_covariables.exog )
D_lasso_fitted = data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)
t_D
= pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )
data_res
# regression of Y on D after partialling-out the effect of W
= smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_fit = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']
partial_lasso_est
print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )
# standard error
## Coefficient for D via partialling-out using lasso -0.06487798557263097
= partial_lasso_fit.cov_HC0
HCV_coefs = np.power( HCV_coefs.diagonal() , 0.5)[1] partial_lasso_se
Using lasso for partialling-out here provides similar results as using ols.
Next, we summarize the results.
<- c(nocontrol.est,control.est,partial.est,partial.lasso.est)
Estimate <- c(nocontrol.se,control.se,partial.se,partial.lasso.se)
StdError
<- data.frame(Estimate,StdError)
table
colnames(table)<- c("Estimate","Std. Error")
rownames(table)<- c("Without controls", "full reg", "partial reg", "partial reg via lasso")
kable(table,format = "markdown",caption = "Partialling-out-Lasso")
Estimate | Std. Error | |
---|---|---|
Without controls | -0.0383447 | 0.0159019 |
full reg | -0.0695532 | 0.0150005 |
partial reg | -0.0695532 | 0.0150005 |
partial reg via lasso | -0.0719355 | 0.0153910 |
= np.zeros( (4, 2) )
table2 0,0] = nocontrol_est
table2[0,1] = nocontrol_se
table2[1,0] = control_est
table2[1,1] = control_se
table2[2,0] = partial_est
table2[2,1] = partial_se
table2[3,0] = partial_lasso_est
table2[3,1] = partial_lasso_se
table2[= pd.DataFrame( table2, columns = [ "Estimate","Std. Error" ])
table2_pandas = [ "Without controls", "full reg", "partial reg", "partial reg via lasso" ]
table2_pandas.index = table2_pandas.to_html()
table2_html table2_pandas
## Estimate Std. Error
## Without controls -0.038345 0.015902
## full reg -0.069553 0.144608
## partial reg -0.069553 0.015000
## partial reg via lasso -0.064878 0.015557
It it worth to notice that controlling for worker characteristics increases the gender wage gap from less that 4% to 7%. The controls we used in our analysis include 5 educational attainment indicators (less than high school graduates, high school graduates, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast); a quartic term (first, second, third, and fourth power) in experience and 22 occupation and 23 industry indicators.
Keep in mind that the predictive effect (PE) does not only measures discrimination (causal effect of being female), it also may reflect selection effects of unobserved differences in covariates between men and women in our sample.
Next we try “extra” flexible model, where we take interactions of all controls, giving us about 1000 controls.
3.4 “Extra” flexible model
<- lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2
extraflex <- lm(extraflex, data=data)
control.fit #summary(control.fit)
<- summary(control.fit)$coef[2,1]
control.est cat("Number of Extra-Flex Controls", length(control.fit$coef)-1, "\n")
## Number of Extra-Flex Controls 979
cat("Coefficient for OLS with extra flex controls", control.est)
## Coefficient for OLS with extra flex controls -0.06127046
<- vcovHC(control.fit, type = 'HC');
HCV.coefs
= length(data$wage); p =length(control.fit$coef);
n<- sqrt(diag(HCV.coefs))[2]*sqrt(n/(n-p)) # Estimated std errors
control.se # crude adjustment for the effect of dimensionality on OLS standard errors, motivated by Cattaneo, Jannson, and Newey (2018)
# for really correct way of doing this, we need to implement Cattaneo, Jannson, and Newey (2018)'s procedure.
= 'lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
extraflex = smf.ols( formula = extraflex, data=data).fit()
control_fit #summary( control_fit )
= control_fit.summary2().tables[1]['Coef.']['sex']
control_est print( f"Number of Extra-Flex Controls {control_fit.summary2().tables[1].shape[0]-1} \nCoefficient for OLS with extra flex controls {control_est}" )
# standard error
## Number of Extra-Flex Controls 979
## Coefficient for OLS with extra flex controls -0.06127046379396306
= control_fit.cov_HC0
HCV_coefs
= len(data[ 'wage' ])
n= len(control_fit.summary2().tables[1]['Coef.'])
p
= control_fit.summary2().tables[1]['Std.Err.']['sex']*math.sqrt(n/(n-p))
control_se # control_se
# crude adjustment for the effect of dimensionality on OLS standard errors, motivated by Cattaneo, Jannson, and Newey (2018)
# for really correct way of doing this, we need to implement Cattaneo, Jannson, and Newey (2018)'s procedure.
3.5 Laso “Extra” Flexible model
library(hdm)
# models
<- lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2 # model for Y
extraflex.y <- sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2 # model for D
extraflex.d
# partialling-out the linear effect of W from Y
<- rlasso(extraflex.y, data=data)$res
t.Y # partialling-out the linear effect of W from D
<- rlasso(extraflex.d, data=data)$res
t.D
# regression of Y on D after partialling-out the effect of W
<- lm(t.Y~t.D)
partial.lasso.fit <- summary(partial.lasso.fit)$coef[2,1]
partial.lasso.est
cat("Coefficient for D via partialling-out using lasso", partial.lasso.est)
## Coefficient for D via partialling-out using lasso -0.07159928
# standard error
<- vcovHC(partial.lasso.fit, type = 'HC')
HCV.coefs <- sqrt(diag(HCV.coefs))[2] partial.lasso.se
# models
# model for Y
= 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
extraflex_y # model for
= 'sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
extraflex_d
# extraflex_y
= linear_model.Lasso( alpha = 0.1 )
lasso_model = smf.ols(formula = extraflex_y, data = data)
extraflex_y_covariables = lasso_model.fit( extraflex_y_covariables.exog, data[[ 'lwage' ]] ).predict( extraflex_y_covariables.exog ) Y_lasso_fitted
## C:\Users\MSI-NB\ANACON~1\envs\TENSOR~2\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:530: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 351.031340445538, tolerance: 0.1675169484647612
## model = cd_fast.enet_coordinate_descent(
= data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)
t_Y
# extraflex_d
= smf.ols( extraflex_d, data=data)
extraflex_d_covariables = lasso_model.fit( extraflex_d_covariables.exog, data[[ 'sex' ]] ).predict( extraflex_d_covariables.exog ) D_lasso_fitted
## C:\Users\MSI-NB\ANACON~1\envs\TENSOR~2\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:530: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 2.703140950271518, tolerance: 0.12716172815533988
## model = cd_fast.enet_coordinate_descent(
= data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)
t_D = pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )
data_res
# regression of Y on D after partialling-out the effect of W
= smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_fit = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']
partial_lasso_est print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )
# standard error
## Coefficient for D via partialling-out using lasso -0.06676571026846852
= partial_lasso_fit.cov_HC0
HCV_coefs = np.power( HCV_coefs.diagonal() , 0.5)[1] partial_lasso_se
3.6 Summarize the results
<- c(control.est,partial.lasso.est)
Estimate <- c(control.se,partial.lasso.se)
StdError
<- data.frame(Estimate,StdError)
table2
colnames(table2)<- c("Estimate","Std. Error")
rownames(table2)<- c("full reg","partial reg via lasso")
kable(table2,format = "markdown",caption = "Summarize the results")
Estimate | Std. Error | |
---|---|---|
full reg | -0.0612705 | 0.0168996 |
partial reg via lasso | -0.0715993 | 0.0152640 |
= np.zeros( ( 2, 2 ) )
table3
0,0] = control_est
table3[0,1] = control_se
table3[1,0] = partial_lasso_est
table3[1,1] = partial_lasso_se
table3[
= pd.DataFrame( table3, columns = [ "Estimate","Std. Error" ])
table3_pandas = [ "full reg","partial reg via lasso" ]
table3_pandas.index round(8) table3_pandas.
## Estimate Std. Error
## full reg -0.061270 0.017760
## partial reg via lasso -0.066766 0.015468