Impute itemized expense amounts for non-itemizers in PUF data #275
Here is the output of
Thanks for working on this @martinholmer. It'll be a huge improvement to the data. I gave this a quick look over today and initial review looks good. I want to take a closer look tomorrow though if you don't mind waiting to merge.
@andersonfrailey said:
I'm in no rush. I'd appreciate a thorough review. Meanwhile I'm checking (with an overnight run) that the PUF weights and ratios don't change. I'm also preparing a Tax-Calculator pull request that incorporates the slightly changed test results associated with the new
puf_data/impute_itmexp.py
print(round(positive_imputed.mean(), 4))
print(len(nonitemizer_data))
# estimate OLS parameters for the positive amount using a sample of
# itemizers who have positive ievar amounts than are less than the
Did you mean "...amounts that are..."?
Yes, thanks for catching this mistake.
@martinholmer this is very impressive work. Thanks for imputing these variables. I'm excited to hear that the warning logic can be removed once Tax-Calculator is updated for the new
I tried to review your work here, but most of this is over my head. My only question is: do you think the order in which the variables are imputed matters? The statistical methodology paper mentions ordering them by how many values are missing. However, that ordering doesn't seem applicable in this case. I was just curious whether you experimented with the order of the variables, since they get fed into the regression models after they are imputed.
@hdoupe said:
Actually, the leading comment in PR #275 is a little misleading. It turns out that we can eliminate the warning messages issued when users reduce the value of a standard deduction policy parameter. Those were the most frequently encountered warning messages, the ones that caused the most user confusion, and the ones that required considerable code complexity in Tax-Calculator. But there are other warnings that do not go away, so the warning message framework cannot be completely eliminated.
@hdoupe asked:
I experimented only a little bit. I ordered the itmexp variables roughly by prevalence of positive amounts among itemizers. I also thought a little about causation. In short, there was no "rocket science" involved in the ordering of the imputed variables.
@martinholmer said:
Ah, ok. Thanks for clearing up my confusion. I'll follow your development related to this in PSLmodels/Tax-Calculator#2052.
@martinholmer said:
OK, that makes sense. I wasn't sure if there was enough of a relationship between the itemexp variables for the order to have an effect. Thanks for explaining your thought process on this.
Fantastic, I've been looking to include standard deduction repeal in an analysis, so this is great timing. Evaluating on a holdout set could inform modeling choices like whether to include imputed variables in other imputations. Also BTW I've continued some of Avi's work by comparing performance of quantile regression methods: OLS, linear quantile regression, random forests, and deep learning. Here's a notebook, and I'll write it up in a blog post soon, but the crux is: on a relatively small sample dataset (Boston housing), random forests perform best, then deep learning (though this could be tuned more), then linear quantile regression, then OLS. Random forests reduced average quantile loss in the holdout by 30% over OLS. With this more standardized imputation code, it seems like it'll be easier to update models if they show room for improvement. I'd be happy to help with this, and will be trying out the approaches in the CPS data next.
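For readers unfamiliar with the metric mentioned above, here is a minimal sketch of quantile (pinball) loss as it is commonly defined. It is not taken from the notebook; the function name `quantile_loss` and the toy numbers are illustrative assumptions.

```python
def quantile_loss(y_true, y_pred, q):
    """Average pinball loss of predictions y_pred for quantile q (0 < q < 1).

    Hypothetical helper for illustration only; not code from the notebook
    or from impute_itmexp.py.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    losses = []
    for actual, predicted in zip(y_true, y_pred):
        err = actual - predicted
        # under-prediction (err > 0) costs q per unit,
        # over-prediction (err < 0) costs (1 - q) per unit
        losses.append(max(q * err, (q - 1.0) * err))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    holdout_actual = [1.0, 2.0, 3.0]   # toy holdout values (made up)
    holdout_pred = [2.0, 2.0, 2.0]     # toy predictions (made up)
    for q in (0.1, 0.5, 0.9):
        print(q, quantile_loss(holdout_actual, holdout_pred, q))
```

Averaging this loss over a grid of quantiles on a holdout set gives one number per model, which is the kind of comparison described in the comment above.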
Here is a direct link to the methodological description.
@MaxGhenis said:
We're glad that the timing of this data enhancement is good for you. It's been a long time coming.
Thanks for giving me time to review, @martinholmer. This is a bit over my head as well so I might have missed it, but how did you determine the values in the
@andersonfrailey asked:
Yes, they are related to the JCT targets in an indirect way. The
As I mentioned months ago, the best place to read about using bisection and interpolation techniques for root finding is one of the chapters in Press, et al., Numerical Recipes in C: The Art of Scientific Computing. But if you don't have access to that classic book, look at this tutorial, which uses the inefficient bisection method (rather than using linear interpolation once you've bracketed the root).
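The idea of bracketing a root with bisection and then switching to linear interpolation can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the calibration code in impute_itmexp.py; the function name `find_root`, its parameters, and the cubic test function are all hypothetical.

```python
def find_root(func, lo, hi, bisect_steps=5, tol=1e-9, max_iter=100):
    """Find x in [lo, hi] with func(x) close to zero.

    Hypothetical helper for illustration only. Phase 1 narrows the bracket
    by bisection; phase 2 uses linear interpolation between the bracket
    endpoints (the false-position method) to home in on the root.
    """
    f_lo, f_hi = func(lo), func(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0.0:
        raise ValueError("root is not bracketed by [lo, hi]")
    # phase 1: plain bisection halves the bracket on each step
    for _ in range(bisect_steps):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_mid == 0.0:
            return mid
        if f_lo * f_mid < 0.0:
            hi, f_hi = mid, f_mid
        else:
            lo, f_lo = mid, f_mid
    # phase 2: linear interpolation between the bracket endpoints
    x = 0.5 * (lo + hi)
    for _ in range(max_iter):
        x = lo - f_lo * (hi - lo) / (f_hi - f_lo)
        f_x = func(x)
        if abs(f_x) < tol:
            return x
        if f_lo * f_x < 0.0:
            hi, f_hi = x, f_x
        else:
            lo, f_lo = x, f_x
    return x


if __name__ == "__main__":
    # toy objective: the real root of x**3 - 2 is the cube root of 2
    print(find_root(lambda x: x ** 3 - 2.0, 0.0, 2.0))
```

In a calibration setting, `func` would be something like "simulated aggregate minus target" as a function of a single scaling parameter, so each evaluation is expensive and the faster convergence of interpolation over pure bisection matters.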
@martinholmer a Friday afternoon merge is good with me.
@andersonfrailey said:
OK. Thanks for looking over this PR. |
This pull request does what the title says. The fact that there are no itemized expense amounts for non-itemizers is inherent in the PUF, which is a sample of tax returns. This lack of itemized expenses for non-itemizers makes it impossible to produce sensible estimates of policy reforms that reduce the standard deduction, which has forced us to develop a set of warning messages that are costly to maintain and confusing to some users. See issue #32 for a discussion of an earlier attempt to impute these missing data.
Read the docstring at the top of the new puf_data/impute_itmexp.py file for a description of the methods used to conduct the imputation. As described there, the imputation targets the following JCT estimates:
Note that the impute_itmexp.py logic was developed with the DUMP and CALIBRATING variables set to True, but they have all been set to False for production work. This comment shows the full DUMP output.
There will be a corresponding Tax-Calculator pull request that uses the new puf.csv file and eliminates the warning messages about lowering the values of the standard deduction policy parameters.