Impute itemized expense amounts for non-itemizers in PUF data #275

Merged
merged 25 commits into from
Aug 24, 2018


@martinholmer (Contributor) commented Aug 20, 2018

This pull request does what the title says. The fact that there are no itemized expense amounts for non-itemizers is inherent in the PUF, which is a sample of tax returns. This lack of itemized expenses for non-itemizers makes it impossible to produce sensible estimates of policy reforms that reduce the standard deduction, which has forced us to develop a set of warning messages that are costly to maintain and confusing to some users. See issue #32 for a discussion of an earlier attempt to impute these missing data.

Read the docstring at the top of the new puf_data/impute_itmexp.py file for a description of the methods used to conduct the imputation. As described there, the imputation targets the following JCT estimates:

[image: jct-nonitemizer-impuations]

Note that the impute_itmexp.py logic was developed with the DUMP and CALIBRATING variables set to True, but they have all been set to False for production work. This comment shows the full DUMP output.
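For readers who have not opened the docstring: judging only from the DUMP output in this thread (a Logit fit followed by an OLS fit for each expense variable, with the `max ... value = 9.35010231435` lines equal to ln(11500)), the imputation looks like a two-part model in which amounts are handled in logs. The following is a minimal sketch of that general technique, not the code in impute_itmexp.py. The function names, the `married` regressor, the coefficient values, and the use of ln(11500) as a cap are all illustrative assumptions.

```python
import math
import random


def logit_probability(coefs, features):
    """Return P(amount > 0) implied by logit coefficients for one filing unit."""
    index = sum(coefs[name] * features.get(name, 0.0) for name in coefs)
    return 1.0 / (1.0 + math.exp(-index))


def impute_amount(prob_coefs, amt_coefs, features, sigma, log_cap, rng):
    """Two-part imputation: draw a positive/zero indicator, then a log amount."""
    prob = logit_probability(prob_coefs, features)
    if rng.random() >= prob:
        return 0.0  # unit is imputed to have none of this expense
    log_amt = sum(amt_coefs[name] * features.get(name, 0.0) for name in amt_coefs)
    log_amt += rng.gauss(0.0, sigma)  # add residual noise to the log prediction
    return math.exp(min(log_amt, log_cap))  # cap in logs, then convert to dollars


# Hypothetical coefficients, NOT the estimates reported in the DUMP output.
PROB_COEFS = {"constant": 2.0, "married": -0.3}
AMT_COEFS = {"constant": 7.0, "married": 0.6}
LOG_CAP = math.log(11500)  # assumed cap, inferred from the "max ... value" lines

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
unit = {"constant": 1.0, "married": 1.0}
amounts = [impute_amount(PROB_COEFS, AMT_COEFS, unit, 0.8, LOG_CAP, rng)
           for _ in range(5)]
print(amounts)
```

Each draw is either zero or a positive dollar amount no larger than the cap. The sketch omits any calibration step that adjusts the imputed amounts toward external targets such as the JCT estimates.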

There will be a corresponding Tax-Calculator pull request that uses the new puf.csv file and eliminates the warning messages about lowering the values of the standard deduction policy parameters.

@martinholmer (Contributor, Author)

Here is the output of python impute_itmexp.py when the puf.csv file is the one on the master branch and the three DUMP variables and the CALIBRATING variable are all set to True.

ALL raw count = 248591
PUF raw count = 241245
CPS raw count =   7346
PUF fraction of ALL = 0.9704
ALL itemizer mean = 0.4374
PUF itemizer mean = 0.4507
CPS itemizer mean = 0.0000
frac and mean for itemizers with e18400>0 = 0.8852  61248.02
frac and mean for itemizers with e18500>0 = 0.9056  11911.55
frac and mean for itemizers with e19200>0 = 0.7910  21883.09
frac and mean for itemizers with e19800>0 = 0.8359  26695.59
frac and mean for itemizers with e20100>0 = 0.4723  20923.73
frac and mean for itemizers with e20400>0 = 0.6477  20803.07
frac and mean for itemizers with e17500>0 = 0.1443  15949.11
frac and mean for itemizers with g20500>0 = 0.0020  75433.10
itmexp correlation coefficients for itemizers:
          e18400    e18500    e19200    e19800
e18400  1.000000  0.363621  0.255311  0.226451
e18500  0.363621  1.000000  0.291070  0.176777
e19200  0.255311  0.291070  1.000000  0.104885
e19800  0.226451  0.176777  0.104885  1.000000
e20100  0.153225  0.111082  0.063946  0.218948
e20400  0.335436  0.338850  0.342370  0.179658
e17500 -0.029363 -0.028708 -0.027123 -0.004438
g20500 -0.001406  0.006668  0.000184 -0.000774
          e20100    e20400    e17500    g20500
e18400  0.153225  0.335436 -0.029363 -0.001406
e18500  0.111082  0.338850 -0.028708  0.006668
e19200  0.063946  0.342370 -0.027123  0.000184
e19800  0.218948  0.179658 -0.004438 -0.000774
e20100  1.000000  0.197059  0.008778 -0.000472
e20400  0.197059  1.000000  0.000177 -0.000260
e17500  0.008778  0.000177  1.000000 -0.002747
g20500 -0.000472 -0.000260 -0.002747  1.000000
frac of non-itemizers with e18400>0 = 0.0060
frac of non-itemizers with e18500>0 = 0.0047
frac of non-itemizers with e19200>0 = 0.0037
frac of non-itemizers with e19800>0 = 0.0050
frac of non-itemizers with e20100>0 = 0.0022
frac of non-itemizers with e20400>0 = 0.0033
frac of non-itemizers with e17500>0 = 0.0005
frac of non-itemizers with g20500>0 = 0.0000
****** IMPUTE e18400 ******
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 e18400   No. Observations:               108740
Model:                          Logit   Df Residuals:                   108731
Method:                           MLE   Df Model:                            8
Date:                Mon, 20 Aug 2018   Pseudo R-squ.:                0.003426
Time:                        12:30:34   Log-Likelihood:                -38615.
converged:                       True   LL-Null:                       -38748.
                                        LLR p-value:                 8.778e-53
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
constant       2.1013      0.023     89.581      0.000       2.055       2.147
MARS2         -0.3257      0.031    -10.613      0.000      -0.386      -0.266
MARS3         -0.1762      0.068     -2.594      0.009      -0.309      -0.043
MARS4         -0.0077      0.055     -0.141      0.888      -0.115       0.100
XTOT           0.0817      0.010      8.260      0.000       0.062       0.101
e00200     -2.733e-08   7.17e-09     -3.810      0.000   -4.14e-08   -1.33e-08
e00600      -9.67e-08   1.61e-08     -6.001      0.000   -1.28e-07   -6.51e-08
e00900     -2.527e-07   3.51e-08     -7.208      0.000   -3.21e-07   -1.84e-07
e02000     -2.254e-08   6.61e-09     -3.410      0.001   -3.55e-08   -9.59e-09
==============================================================================
0    0.970723
2    0.967901
3    0.967901
4    0.978530
6    0.972955
dtype: float64
0.9712
139851
size of e18400 OLS sample = 53673
max e18400 value = 9.35010231435
avg e18400 value = 7.84
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 e18400   R-squared:                       0.260
Model:                            OLS   Adj. R-squared:                  0.260
Method:                 Least Squares   F-statistic:                     2358.
Date:                Mon, 20 Aug 2018   Prob (F-statistic):               0.00
Time:                        12:30:34   Log-Likelihood:                -64006.
No. Observations:               53673   AIC:                         1.280e+05
Df Residuals:                   53664   BIC:                         1.281e+05
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       7.1500      0.008    927.897      0.000       7.135       7.165
MARS2          0.6333      0.011     57.887      0.000       0.612       0.655
MARS3          0.1575      0.025      6.252      0.000       0.108       0.207
MARS4          0.1669      0.016     10.559      0.000       0.136       0.198
XTOT           0.0260      0.004      6.672      0.000       0.018       0.034
e00200      3.018e-06   4.03e-08     74.804      0.000    2.94e-06     3.1e-06
e00600      6.939e-07   6.36e-08     10.909      0.000    5.69e-07    8.19e-07
e00900      1.749e-06   7.86e-08     22.255      0.000    1.59e-06     1.9e-06
e02000      2.528e-07   2.31e-08     10.958      0.000    2.08e-07    2.98e-07
==============================================================================
Omnibus:                    23956.002   Durbin-Watson:                   1.151
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           510174.106
Skew:                          -1.647   Prob(JB):                         0.00
Kurtosis:                      17.740   Cond. No.                     1.11e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.11e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS std error of regression = 0.80
mean cap_imputed_amt = 7.487
mean adj_imputed_amt = 6.734
mean imputed_amount = 1156.65
****** IMPUTE e18500 ******
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 e18500   No. Observations:               108740
Model:                          Logit   Df Residuals:                   108730
Method:                           MLE   Df Model:                            9
Date:                Mon, 20 Aug 2018   Pseudo R-squ.:                 0.09520
Time:                        12:30:34   Log-Likelihood:                -30748.
converged:                       True   LL-Null:                       -33983.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
constant       1.1881      0.022     54.971      0.000       1.146       1.230
MARS2          1.6058      0.036     44.234      0.000       1.535       1.677
MARS3          0.0815      0.055      1.474      0.141      -0.027       0.190
MARS4          0.2881      0.045      6.359      0.000       0.199       0.377
XTOT           0.0479      0.014      3.364      0.001       0.020       0.076
e00200      8.635e-08   1.89e-08      4.567      0.000    4.93e-08    1.23e-07
e00600      1.947e-07    5.5e-08      3.540      0.000    8.69e-08    3.03e-07
e00900      3.474e-07   7.11e-08      4.889      0.000    2.08e-07    4.87e-07
e02000      1.083e-07   1.49e-08      7.292      0.000    7.92e-08    1.37e-07
e18400      1.106e-07    8.7e-08      1.272      0.204   -5.99e-08    2.81e-07
==============================================================================
0    0.176307
2    0.196081
3    0.196091
4    0.257537
6    0.183416
dtype: float64
0.3135
139851
size of e18500 OLS sample = 67517
max e18500 value = 9.35010231435
avg e18500 value = 8.07
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 e18500   R-squared:                       0.176
Model:                            OLS   Adj. R-squared:                  0.176
Method:                 Least Squares   F-statistic:                     1606.
Date:                Mon, 20 Aug 2018   Prob (F-statistic):               0.00
Time:                        12:30:34   Log-Likelihood:                -75443.
No. Observations:               67517   AIC:                         1.509e+05
Df Residuals:                   67507   BIC:                         1.510e+05
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       7.5097      0.007   1064.932      0.000       7.496       7.524
MARS2          0.5662      0.009     60.849      0.000       0.548       0.584
MARS3         -0.0368      0.024     -1.531      0.126      -0.084       0.010
MARS4          0.1930      0.015     13.031      0.000       0.164       0.222
XTOT           0.0184      0.003      6.180      0.000       0.013       0.024
e00200      3.267e-07   9.07e-09     36.016      0.000    3.09e-07    3.45e-07
e00600      3.201e-07   2.99e-08     10.695      0.000    2.61e-07    3.79e-07
e00900      3.687e-07   2.13e-08     17.293      0.000    3.27e-07    4.11e-07
e02000      1.055e-07   7.07e-09     14.922      0.000    9.16e-08    1.19e-07
e18400      1.166e-06   7.25e-08     16.078      0.000    1.02e-06    1.31e-06
==============================================================================
Omnibus:                    12670.923   Durbin-Watson:                   1.188
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            28762.186
Skew:                          -1.078   Prob(JB):                         0.00
Kurtosis:                       5.360   Cond. No.                     4.06e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.06e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS std error of regression = 0.74
mean cap_imputed_amt = 7.741
mean adj_imputed_amt = 6.811
mean imputed_amount = 434.61
****** IMPUTE e19200 ******
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 e19200   No. Observations:               108740
Model:                          Logit   Df Residuals:                   108729
Method:                           MLE   Df Model:                           10
Date:                Mon, 20 Aug 2018   Pseudo R-squ.:                 0.05435
Time:                        12:30:35   Log-Likelihood:                -52717.
converged:                       True   LL-Null:                       -55746.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
constant       0.1535      0.017      9.070      0.000       0.120       0.187
MARS2          0.1490      0.024      6.335      0.000       0.103       0.195
MARS3          0.3842      0.051      7.510      0.000       0.284       0.484
MARS4          0.3031      0.041      7.434      0.000       0.223       0.383
XTOT           0.4256      0.010     44.553      0.000       0.407       0.444
e00200     -2.471e-08    8.8e-09     -2.806      0.005    -4.2e-08   -7.45e-09
e00600     -8.424e-08   1.73e-08     -4.859      0.000   -1.18e-07   -5.03e-08
e00900     -5.438e-08   3.39e-08     -1.604      0.109   -1.21e-07    1.21e-08
e02000     -8.395e-08   7.53e-09    -11.145      0.000   -9.87e-08   -6.92e-08
e18400      1.005e-07   5.11e-08      1.966      0.049    3.28e-10    2.01e-07
e18500      8.973e-06   6.01e-07     14.924      0.000    7.79e-06    1.02e-05
==============================================================================
0    0.060635
2    0.125968
3    0.126435
4    0.322665
6    0.089400
dtype: float64
0.1483
139851
size of e19200 OLS sample = 39045
max e19200 value = 9.35010231435
avg e19200 value = 8.15
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 e19200   R-squared:                       0.085
Model:                            OLS   Adj. R-squared:                  0.085
Method:                 Least Squares   F-statistic:                     364.2
Date:                Mon, 20 Aug 2018   Prob (F-statistic):               0.00
Time:                        12:30:35   Log-Likelihood:                -66811.
No. Observations:               39045   AIC:                         1.336e+05
Df Residuals:                   39034   BIC:                         1.337e+05
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       7.5777      0.018    412.391      0.000       7.542       7.614
MARS2          0.3565      0.023     15.515      0.000       0.312       0.402
MARS3         -0.0328      0.057     -0.574      0.566      -0.145       0.079
MARS4          0.4791      0.036     13.199      0.000       0.408       0.550
XTOT           0.1635      0.007     23.329      0.000       0.150       0.177
e00200      -7.74e-08   9.71e-09     -7.971      0.000   -9.64e-08   -5.84e-08
e00600     -1.881e-07   2.61e-08     -7.196      0.000   -2.39e-07   -1.37e-07
e00900     -3.097e-07   4.02e-08     -7.707      0.000   -3.88e-07   -2.31e-07
e02000     -1.147e-07   8.55e-09    -13.411      0.000   -1.31e-07   -9.79e-08
e18400     -3.454e-08   5.95e-08     -0.581      0.562   -1.51e-07    8.21e-08
e18500     -1.521e-05   5.79e-07    -26.280      0.000   -1.63e-05   -1.41e-05
==============================================================================
Omnibus:                    21348.936   Durbin-Watson:                   1.651
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           163946.801
Skew:                          -2.578   Prob(JB):                         0.00
Kurtosis:                      11.613   Cond. No.                     7.82e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.82e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS std error of regression = 1.34
mean cap_imputed_amt = 7.841
mean adj_imputed_amt = 7.559
mean imputed_amount = 523.47
****** IMPUTE e19800 ******
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 e19800   No. Observations:               108740
Model:                          Logit   Df Residuals:                   108728
Method:                           MLE   Df Model:                           11
Date:                Mon, 20 Aug 2018   Pseudo R-squ.:                 0.09614
Time:                        12:30:35   Log-Likelihood:                -43868.
converged:                       True   LL-Null:                       -48534.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
constant       0.6713      0.018     36.892      0.000       0.636       0.707
MARS2          1.0346      0.028     37.402      0.000       0.980       1.089
MARS3          0.0381      0.053      0.724      0.469      -0.065       0.141
MARS4          0.1852      0.038      4.905      0.000       0.111       0.259
XTOT          -0.0813      0.010     -8.203      0.000      -0.101      -0.062
e00200      1.011e-07   2.77e-08      3.657      0.000    4.69e-08    1.55e-07
e00600      4.517e-07      1e-07      4.510      0.000    2.55e-07    6.48e-07
e00900      6.696e-07   8.58e-08      7.802      0.000    5.01e-07    8.38e-07
e02000       3.28e-07   2.45e-08     13.401      0.000     2.8e-07    3.76e-07
e18400      3.796e-06   2.84e-07     13.372      0.000    3.24e-06    4.35e-06
e18500      5.087e-05   1.58e-06     32.113      0.000    4.78e-05     5.4e-05
e19200      -6.71e-07   2.52e-07     -2.668      0.008   -1.16e-06   -1.78e-07
==============================================================================
0    0.502115
2    0.482706
3    0.489161
4    0.461088
6    0.473349
dtype: float64
0.5493
139851
size of e19800 OLS sample = 66888
max e19800 value = 9.35010231435
avg e19800 value = 7.15
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 e19800   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.075
Method:                 Least Squares   F-statistic:                     491.6
Date:                Mon, 20 Aug 2018   Prob (F-statistic):               0.00
Time:                        12:30:35   Log-Likelihood:            -1.1800e+05
No. Observations:               66888   AIC:                         2.360e+05
Df Residuals:                   66876   BIC:                         2.361e+05
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       6.4964      0.014    478.142      0.000       6.470       6.523
MARS2          0.7396      0.018     41.232      0.000       0.704       0.775
MARS3          0.0395      0.044      0.888      0.374      -0.048       0.127
MARS4          0.3062      0.029     10.545      0.000       0.249       0.363
XTOT          -0.0199      0.006     -3.488      0.000      -0.031      -0.009
e00200      1.107e-07   9.05e-09     12.223      0.000    9.29e-08    1.28e-07
e00600      5.023e-08   3.38e-08      1.487      0.137    -1.6e-08    1.16e-07
e00900      3.701e-07    3.6e-08     10.271      0.000    2.99e-07    4.41e-07
e02000      1.771e-07   1.03e-08     17.266      0.000    1.57e-07    1.97e-07
e18400      7.893e-09   5.09e-08      0.155      0.877   -9.19e-08    1.08e-07
e18500      1.181e-05    4.9e-07     24.079      0.000    1.08e-05    1.28e-05
e19200     -4.166e-07   1.59e-07     -2.616      0.009   -7.29e-07   -1.04e-07
==============================================================================
Omnibus:                     8270.290   Durbin-Watson:                   1.419
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12353.615
Skew:                          -0.914   Prob(JB):                         0.00
Kurtosis:                       4.046   Cond. No.                     6.18e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.18e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS std error of regression = 1.41
mean cap_imputed_amt = 6.713
mean adj_imputed_amt = 5.243
mean imputed_amount = 248.60
****** IMPUTE e20100 ******
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 e20100   No. Observations:               108740
Model:                          Logit   Df Residuals:                   108727
Method:                           MLE   Df Model:                           12
Date:                Mon, 20 Aug 2018   Pseudo R-squ.:                 0.01234
Time:                        12:30:35   Log-Likelihood:                -74277.
converged:                       True   LL-Null:                       -75205.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
constant      -0.5738      0.015    -38.638      0.000      -0.603      -0.545
MARS2          0.4320      0.020     21.785      0.000       0.393       0.471
MARS3         -0.0548      0.046     -1.196      0.232      -0.145       0.035
MARS4          0.3254      0.033     10.005      0.000       0.262       0.389
XTOT           0.0701      0.006     11.078      0.000       0.058       0.082
e00200      9.307e-09   6.35e-09      1.465      0.143   -3.14e-09    2.18e-08
e00600       1.18e-08    1.5e-08      0.788      0.431   -1.76e-08    4.12e-08
e00900     -2.106e-07   3.27e-08     -6.437      0.000   -2.75e-07   -1.46e-07
e02000     -6.938e-08   6.81e-09    -10.189      0.000   -8.27e-08    -5.6e-08
e18400      3.746e-08   3.24e-08      1.157      0.247    -2.6e-08    1.01e-07
e18500     -2.009e-06   3.67e-07     -5.474      0.000   -2.73e-06   -1.29e-06
e19200     -1.706e-07   1.11e-07     -1.542      0.123   -3.87e-07    4.63e-08
e19800      1.729e-08   3.41e-08      0.506      0.613   -4.96e-08    8.42e-08
==============================================================================
0    0.213298
2    0.216128
3    0.215974
4    0.332306
6    0.225542
dtype: float64
0.2738
139851
size of e20100 OLS sample = 48559
max e20100 value = 9.35010231435
avg e20100 value = 6.37
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 e20100   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     72.92
Date:                Mon, 20 Aug 2018   Prob (F-statistic):          5.83e-178
Time:                        12:30:35   Log-Likelihood:                -73231.
No. Observations:               48559   AIC:                         1.465e+05
Df Residuals:                   48546   BIC:                         1.466e+05
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       6.0884      0.013    471.311      0.000       6.063       6.114
MARS2          0.1865      0.017     11.127      0.000       0.154       0.219
MARS3         -0.0670      0.042     -1.592      0.111      -0.149       0.015
MARS4          0.1558      0.027      5.871      0.000       0.104       0.208
XTOT           0.0325      0.005      6.470      0.000       0.023       0.042
e00200      5.703e-08   6.51e-09      8.762      0.000    4.43e-08    6.98e-08
e00600     -5.661e-08    1.6e-08     -3.532      0.000    -8.8e-08   -2.52e-08
e00900      1.406e-07   2.97e-08      4.734      0.000    8.24e-08    1.99e-07
e02000      4.371e-08   6.67e-09      6.557      0.000    3.06e-08    5.68e-08
e18400     -1.266e-07   3.84e-08     -3.296      0.001   -2.02e-07   -5.13e-08
e18500      3.364e-06   3.91e-07      8.606      0.000     2.6e-06    4.13e-06
e19200     -3.248e-07   1.17e-07     -2.786      0.005   -5.53e-07   -9.63e-08
e19800      -8.23e-08   5.55e-08     -1.482      0.138   -1.91e-07    2.66e-08
==============================================================================
Omnibus:                     3881.755   Durbin-Watson:                   1.570
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14836.602
Skew:                          -0.340   Prob(JB):                         0.00
Kurtosis:                       5.621   Cond. No.                     8.25e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.25e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS std error of regression = 1.09
mean cap_imputed_amt = 6.225
mean adj_imputed_amt = 5.595
mean imputed_amount = 135.33
****** IMPUTE e20400 ******
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 e20400   No. Observations:               108740
Model:                          Logit   Df Residuals:                   108726
Method:                           MLE   Df Model:                           13
Date:                Mon, 20 Aug 2018   Pseudo R-squ.:                 0.02149
Time:                        12:30:36   Log-Likelihood:                -69038.
converged:                       True   LL-Null:                       -70554.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
constant       0.4865      0.015     32.006      0.000       0.457       0.516
MARS2          0.1674      0.021      8.077      0.000       0.127       0.208
MARS3         -0.1072      0.046     -2.322      0.020      -0.198      -0.017
MARS4          0.1135      0.034      3.376      0.001       0.048       0.179
XTOT          -0.0764      0.007    -11.328      0.000      -0.090      -0.063
e00200      1.062e-07   1.26e-08      8.414      0.000    8.15e-08    1.31e-07
e00600      1.182e-06   8.24e-08     14.341      0.000    1.02e-06    1.34e-06
e00900     -3.357e-07   3.28e-08    -10.238      0.000      -4e-07   -2.71e-07
e02000     -5.449e-08   8.83e-09     -6.170      0.000   -7.18e-08   -3.72e-08
e18400      7.075e-07    8.7e-08      8.136      0.000    5.37e-07    8.78e-07
e18500      1.168e-05   6.15e-07     19.011      0.000    1.05e-05    1.29e-05
e19200      4.087e-08   1.93e-07      0.212      0.832   -3.37e-07    4.18e-07
e19800      3.935e-07   1.06e-07      3.707      0.000    1.85e-07    6.02e-07
e20100      2.537e-06   3.57e-07      7.110      0.000    1.84e-06    3.24e-06
==============================================================================
0    0.147446
2    0.125098
3    0.125789
4    0.124589
6    0.137223
dtype: float64
0.1398
139851
size of e20400 OLS sample = 51955
max e20400 value = 9.35010231435
avg e20400 value = 6.64
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 e20400   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.039
Method:                 Least Squares   F-statistic:                     161.3
Date:                Mon, 20 Aug 2018   Prob (F-statistic):               0.00
Time:                        12:30:36   Log-Likelihood:                -98654.
No. Observations:               51955   AIC:                         1.973e+05
Df Residuals:                   51941   BIC:                         1.975e+05
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       6.1478      0.018    349.198      0.000       6.113       6.182
MARS2          0.4009      0.023     17.234      0.000       0.355       0.446
MARS3          0.1040      0.058      1.782      0.075      -0.010       0.218
MARS4          0.4597      0.038     12.043      0.000       0.385       0.535
XTOT          -0.0023      0.007     -0.312      0.755      -0.017       0.012
e00200      8.296e-08   1.13e-08      7.339      0.000    6.08e-08    1.05e-07
e00600      2.443e-07   4.78e-08      5.114      0.000    1.51e-07    3.38e-07
e00900     -4.855e-08   4.38e-08     -1.108      0.268   -1.34e-07    3.73e-08
e02000      3.015e-08   1.01e-08      2.972      0.003    1.03e-08       5e-08
e18400      2.047e-07    8.2e-08      2.498      0.013    4.41e-08    3.65e-07
e18500       1.51e-05   6.64e-07     22.741      0.000    1.38e-05    1.64e-05
e19200      1.102e-06   2.33e-07      4.734      0.000    6.46e-07    1.56e-06
e19800       1.21e-07   7.77e-08      1.557      0.119   -3.13e-08    2.73e-07
e20100     -2.796e-07   1.01e-07     -2.776      0.006   -4.77e-07   -8.22e-08
==============================================================================
Omnibus:                     1658.902   Durbin-Watson:                   1.442
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1805.332
Skew:                          -0.451   Prob(JB):                         0.00
Kurtosis:                       2.854   Cond. No.                     7.25e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.25e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS std error of regression = 1.62
mean cap_imputed_amt = 6.312
mean adj_imputed_amt = 6.032
mean imputed_amount = 161.56
****** IMPUTE e17500 ******
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 e17500   No. Observations:               108740
Model:                          Logit   Df Residuals:                   108725
Method:                           MLE   Df Model:                           14
Date:                Mon, 20 Aug 2018   Pseudo R-squ.:                  0.2812
Time:                        12:30:36   Log-Likelihood:                -32263.
converged:                       True   LL-Null:                       -44885.
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
constant       0.1285      0.022      5.760      0.000       0.085       0.172
MARS2          0.6483      0.030     21.403      0.000       0.589       0.708
MARS3         -0.2880      0.070     -4.093      0.000      -0.426      -0.150
MARS4          0.1371      0.046      2.993      0.003       0.047       0.227
XTOT          -0.1491      0.013    -11.858      0.000      -0.174      -0.124
e00200     -2.031e-05   2.69e-07    -75.532      0.000   -2.08e-05   -1.98e-05
e00600     -3.253e-06   2.93e-07    -11.088      0.000   -3.83e-06   -2.68e-06
e00900     -5.294e-06   1.96e-07    -27.007      0.000   -5.68e-06   -4.91e-06
e02000     -1.149e-06   5.59e-08    -20.567      0.000   -1.26e-06   -1.04e-06
e18400     -2.391e-05   1.11e-06    -21.597      0.000   -2.61e-05   -2.17e-05
e18500     -1.298e-05   1.65e-06     -7.878      0.000   -1.62e-05   -9.75e-06
e19200     -2.982e-05   1.22e-06    -24.507      0.000   -3.22e-05   -2.74e-05
e19800     -5.267e-06   7.36e-07     -7.161      0.000   -6.71e-06   -3.83e-06
e20100     -3.947e-07   3.74e-07     -1.055      0.292   -1.13e-06    3.39e-07
e20400     -3.471e-07   5.48e-07     -0.634      0.526   -1.42e-06    7.26e-07
==============================================================================

Possibly complete quasi-separation: A fraction 0.21 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
0    0.065852
2    0.033229
3    0.032592
4    0.020861
6    0.053572
dtype: float64
0.0486
139851
size of e17500 OLS sample = 6820
max e17500 value = 9.35010231435
avg e17500 value = 8.64
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 e17500   R-squared:                       0.429
Model:                            OLS   Adj. R-squared:                  0.427
Method:                 Least Squares   F-statistic:                     364.7
Date:                Mon, 20 Aug 2018   Prob (F-statistic):               0.00
Time:                        12:30:36   Log-Likelihood:                -3486.9
No. Observations:                6820   AIC:                             7004.
Df Residuals:                    6805   BIC:                             7106.
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       8.1848      0.011    719.361      0.000       8.163       8.207
MARS2          0.7894      0.015     52.578      0.000       0.760       0.819
MARS3          0.0894      0.044      2.041      0.041       0.004       0.175
MARS4          0.3746      0.021     17.893      0.000       0.334       0.416
XTOT          -0.0501      0.006     -8.049      0.000      -0.062      -0.038
e00200      2.524e-06   1.76e-07     14.319      0.000    2.18e-06    2.87e-06
e00600      4.319e-06   5.78e-07      7.477      0.000    3.19e-06    5.45e-06
e00900      1.415e-06   2.67e-07      5.291      0.000    8.91e-07    1.94e-06
e02000      1.972e-08   7.39e-08      0.267      0.790   -1.25e-07    1.65e-07
e18400      1.626e-06   1.12e-06      1.455      0.146   -5.65e-07    3.82e-06
e18500      2.927e-07   1.06e-06      0.276      0.783   -1.79e-06    2.37e-06
e19200     -5.686e-06   5.13e-07    -11.080      0.000   -6.69e-06   -4.68e-06
e19800      1.717e-06   1.03e-06      1.672      0.095   -2.96e-07    3.73e-06
e20100     -2.901e-06   1.53e-06     -1.895      0.058    -5.9e-06    9.95e-08
e20400     -6.616e-07   8.88e-07     -0.745      0.456    -2.4e-06    1.08e-06
==============================================================================
Omnibus:                     1689.975   Durbin-Watson:                   1.299
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4017.142
Skew:                          -1.380   Prob(JB):                         0.00
Kurtosis:                       5.554   Cond. No.                     6.21e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.21e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS std error of regression = 0.40
mean cap_imputed_amt = 8.440
mean adj_imputed_amt = 8.130
mean imputed_amount = 185.34
BEFORE: num of nonitemizers with sum>stdded = 4923
BEFORE: frac of nonitemizers with sum>stdded = 0.0352
AFTER: num of nonitemizers with sum>stdded = 0
AFTER: frac of nonitemizers with sum>stdded = 0.0000

@andersonfrailey
Collaborator

Thanks for working on this @martinholmer. It'll be a huge improvement to the data. I gave this a quick look over today and initial review looks good. I want to take a closer look tomorrow though if you don't mind waiting to merge.

@martinholmer
Contributor Author

@andersonfrailey said:

Thanks for working on this. It'll be a huge improvement to the data. I gave this a quick look over today and initial review looks good. I want to take a closer look tomorrow though if you don't mind waiting to merge.

I'm in no rush. I'd appreciate a thorough review. Meanwhile I'm checking (with an overnight run) that the PUF weights and ratios don't change. I'm also preparing a Tax-Calculator pull request that incorporates the slightly changed test results associated with the new puf.csv file.

print(round(positive_imputed.mean(), 4))
print(len(nonitemizer_data))
# estimate OLS parameters for the positive amount using a sample of
# itemizers who have positive ievar amounts than are less than the
Collaborator

Did you mean "...amounts that are..."?

Contributor Author

Yes, thanks for catching this mistake.

@hdoupe
Collaborator

hdoupe commented Aug 21, 2018

@martinholmer this is very impressive work. Thanks for imputing these variables. I'm excited to hear that the warning logic can be removed once Tax-Calculator is updated for the new puf.csv file.

I tried to review your work here, but most of this stuff is over my head. My only question is: do you think the order that the variables are imputed in matters? In the statistical methodology paper, it mentions ordering them by how many values are missing. However, it doesn't seem like that ordering is really applicable in this case. I was just curious if you experimented with the order of the variables since they get fed into the regression models after they are imputed.

@martinholmer
Contributor Author

@hdoupe said:

Thanks for imputing these variables. I'm excited to hear that the warning logic can be removed once Tax-Calculator is updated for the new puf.csv file.

Actually, the leading comment in PR #275 is a little misleading. It turns out that we can eliminate the warning messages issued when users reduce the value of a standard deduction policy parameter. Those were the most frequently encountered warning messages, the ones that caused the most user confusion, and the ones that required considerable code complexity in Tax-Calculator. But there are other warnings that do not go away, so the warning message framework cannot be completely eliminated.

@martinholmer
Contributor Author

@hdoupe asked:

My only question is: do you think the order that the variables are imputed in matters? In the statistical methodology paper, it mentions ordering them by how many values are missing. However, it doesn't seem like that ordering is really applicable in this case. I was just curious if you experimented with the order of the variables since they get fed into the regression models after they are imputed.

I experimented only a little bit. I ordered the itmexp variables roughly by prevalence of positive amounts among itemizers. I also thought a little about causation. In short, there was no "rocket science" involved in the ordering of the imputed variables.
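
As a rough illustration of ordering variables by the prevalence of positive amounts among itemizers, here is a minimal sketch. It is not the logic in impute_itmexp.py: the records, the amounts, and the order_by_prevalence helper are all hypothetical, and only the variable names are taken from the regression output above.

```python
# Hypothetical toy itemizer records; the amounts are invented for illustration
# and are not PUF data.
toy_itemizers = [
    {"e18400": 1200.0, "e19200": 5400.0, "e20400": 300.0, "e17500": 0.0},
    {"e18400": 800.0, "e19200": 0.0, "e20400": 0.0, "e17500": 0.0},
    {"e18400": 2500.0, "e19200": 7100.0, "e20400": 150.0, "e17500": 9000.0},
    {"e18400": 400.0, "e19200": 3300.0, "e20400": 0.0, "e17500": 0.0},
]


def order_by_prevalence(records, variables):
    """Sort variables from most to least often positive in records."""
    def share_positive(var):
        # fraction of records with a strictly positive amount for var
        return sum(1 for rec in records if rec[var] > 0) / len(records)
    return sorted(variables, key=share_positive, reverse=True)


imputation_order = order_by_prevalence(
    toy_itemizers, ["e17500", "e18400", "e19200", "e20400"]
)
print(imputation_order)
```

In this toy data the variable that is positive for every record comes first and the rarest one comes last, so each later imputation can use the earlier variables as regressors.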

@hdoupe
Collaborator

hdoupe commented Aug 21, 2018

@martinholmer said:

Actually, the leading comment in PR #275 is a little misleading. It turns out that we can eliminate the warning messages issued when users reduce the value of a standard deduction policy parameter. Those were the most frequently encountered warning messages, the ones that caused the most user confusion, and the ones that required considerable code complexity in Tax-Calculator. But there are other warnings that do not go away, so the warning message framework cannot be completely eliminated.

Ah, ok. Thanks for clearing up my confusion. I'll follow your development related to this in PSLmodels/Tax-Calculator#2052.

@hdoupe
Collaborator

hdoupe commented Aug 21, 2018

@martinholmer said:

I experimented only a little bit. I ordered the itmexp variables roughly by prevalence of positive amounts among itemizers. I also thought a little about causation. In short, there was no "rocket science" involved in the ordering of the imputed variables.

OK, that makes sense. I wasn't sure if there was enough of a relationship between the itmexp variables for the order to have an effect. Thanks for explaining your thought process on this.

@MaxGhenis
Contributor

MaxGhenis commented Aug 21, 2018

Fantastic, I've been looking to include standard deduction repeal in an analysis, so this is great timing.

Evaluating on a holdout set could inform modeling choices like whether to include imputed variables in other imputations.

Also BTW I've continued some of Avi's work by comparing performance of quantile regression methods: OLS, linear quantile regression, random forests, and deep learning. Here's a notebook, and I'll write it up in a blog post soon, but the crux is: on a relatively small sample dataset (Boston housing), random forests perform best, then deep learning (though this could be tuned more), then linear quantile regression, then OLS. Random forests reduced average quantile loss in the holdout by 30% over OLS.

With this more standardized imputation code, it seems like it'll be easier to update models if they show room for improvement. I'd be happy to help with this, and will be trying out the approaches in the CPS data next.
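
For readers unfamiliar with the quantile loss mentioned above, here is a minimal sketch of the usual average quantile (pinball) loss that would be computed on a holdout set. The function name and the toy numbers are assumptions for illustration; they are not taken from the notebook.

```python
def quantile_loss(y_true, y_pred, q):
    """Average pinball loss of predictions y_pred at quantile q (0 < q < 1)."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        diff = actual - predicted
        # under-predictions are weighted by q, over-predictions by (1 - q)
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Hypothetical holdout values and predictions, invented for illustration.
holdout_actual = [10.0, 20.0]
holdout_predicted = [12.0, 15.0]
median_loss = quantile_loss(holdout_actual, holdout_predicted, 0.5)
upper_loss = quantile_loss(holdout_actual, holdout_predicted, 0.9)
print(median_loss, upper_loss)
```

At q = 0.5 the loss is half the mean absolute error, while at higher quantiles under-predictions are penalized more heavily than over-predictions.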

@MattHJensen
Contributor

Here is a direct link to the methodological description.

@martinholmer
Contributor Author

@MaxGhenis said:

Fantastic, I've been looking to include standard deduction repeal in an analysis, so this is great timing.

We're glad that the timing of this data enhancement is good for you. It's been a long time coming.

@andersonfrailey
Collaborator

Thanks for giving me time to review, @martinholmer. This is a bit over my head as well so I might have missed it, but how did you determine the values in the logit_prob_af and log_amount_af dictionaries? Are they based on the JCT numbers?

@martinholmer
Contributor Author

@andersonfrailey asked:

... how did you determine the values in the logit_prob_af and log_amount_af dictionaries? Are they based on the JCT numbers?

Yes, they are related to the JCT targets in an indirect way. The af (additive factor) values are determined by hand calibration, which means python impute_itmexp.py is run several times, adjusting an af value until the error_msg generated by the check function for that af value goes away. Look at the logic of the check function. This is yet another example of a technique we discussed months ago: finding the root of an equation. The equation in this case is the difference between the tabulated value of the check statistic and its JCT target, as a function of the af value. The root of an equation is the input value that produces a value of zero for the equation. Finding the af value that is the root means we have found the af value that makes the tabulated statistic close to its JCT target.
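
The root-finding idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the hand calibration in impute_itmexp.py: the toy logit index values, the 0.25 target, and the mean_probability, gap, and bisect_root names are all hypothetical. The additive factor shifts every logit index, and bisection searches for the shift at which the mean predicted probability equals the target.

```python
import math

# Hypothetical logit index values (x'b) for a few filing units and a
# hypothetical target share; neither comes from the PUF or from JCT.
toy_index = [-3.0, -2.0, -1.5, -0.5, 0.2]
TARGET = 0.25


def mean_probability(index_values, af):
    """Mean logistic probability after adding af to every index value."""
    probs = [1.0 / (1.0 + math.exp(-(x + af))) for x in index_values]
    return sum(probs) / len(probs)


def gap(af):
    """The equation whose root is sought: tabulated statistic minus target."""
    return mean_probability(toy_index, af) - TARGET


def bisect_root(func, lo, hi, tol=1e-8, max_iter=200):
    """Find a root of func in [lo, hi] by repeatedly halving the bracket."""
    f_lo, f_hi = func(lo), func(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed by lo and hi")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if abs(f_mid) < tol or (hi - lo) < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid  # sign change is in the lower half
        else:
            lo, f_lo = mid, f_mid  # sign change is in the upper half
    return 0.5 * (lo + hi)


af = bisect_root(gap, -10.0, 10.0)
print(round(af, 4), round(mean_probability(toy_index, af), 4))
```

Hand calibration follows the same loop, with a person choosing each new af value instead of the midpoint rule.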

As I mentioned months ago, the best place to read about using bisection and interpolation techniques for root finding is one of the chapters in Press, et al., Numerical Recipes in C: The Art of Scientific Computing. But if you don't have access to that classic book, look at this tutorial, which uses the inefficient bisection method (rather than using linear interpolation once you've bracketed the root).
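
A minimal sketch of the linear interpolation alternative (often called false position or regula falsi) is shown here for comparison with bisection. The false_position and toy_gap names, and the cubic stand-in for "tabulated statistic minus target", are assumptions made for illustration only.

```python
def false_position(func, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of func in [lo, hi] by linear interpolation.

    Returns (root, iterations). Each new guess is the point where the
    straight line through (lo, f(lo)) and (hi, f(hi)) crosses zero.
    """
    f_lo, f_hi = func(lo), func(hi)
    if f_lo == 0.0:
        return lo, 0
    if f_hi == 0.0:
        return hi, 0
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed by lo and hi")
    x = lo
    for iteration in range(1, max_iter + 1):
        x = hi - f_hi * (hi - lo) / (f_hi - f_lo)
        f_x = func(x)
        if abs(f_x) < tol:
            return x, iteration
        if f_lo * f_x < 0:
            hi, f_hi = x, f_x  # root lies between lo and x
        else:
            lo, f_lo = x, f_x  # root lies between x and hi
    return x, max_iter


def toy_gap(af):
    """Hypothetical smooth stand-in for the statistic-minus-target equation."""
    return af + 0.1 * af ** 3 - 1.0


root, iterations = false_position(toy_gap, 0.0, 2.0)
print(round(root, 6), iterations)
```

On a smooth, nearly linear function like this one, the interpolated guess usually lands much closer to the root than the bracket midpoint would, which is why it tends to need fewer function evaluations than bisection; on strongly curved functions that advantage can shrink.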

@martinholmer
Contributor Author

I'm happy to leave taxdata PR #275 open for a longer period of time if people are actively reviewing it.
But if the reviewing is now complete or almost complete, I hope to merge #275 this Friday afternoon.
Does anybody need more time than that to review this pull request?

@andersonfrailey
Collaborator

@martinholmer a Friday afternoon merge is good with me.

@martinholmer
Contributor Author

@andersonfrailey said:

a Friday afternoon merge is good with me.

OK. Thanks for looking over this PR.
