TaxData refactor #367
Merged: andersonfrailey merged 18 commits into PSLmodels:master from andersonfrailey:datarefactor on Dec 31, 2020
Looks like this PR has little to no effect on Tax-Calculator projections. A comparison can be seen in this PDF.
Since this PR has been open for a while, I'm going to review it again this afternoon. If I don't find anything wrong, I'll merge it tomorrow.
This PR probably should've been two or three PRs, but I got a little caught up in the work and it all ended up being one big PR. It's not quite ready: I still need to clean up some files and run an analysis of the results, but I believe the bulk of the work is done. The goal of this PR is to start the process of making taxdata flexible enough to work with different datasets and become a standalone package.
PUF
These are the biggest changes. I got rid of all the extra code that was used to produce the CPS-based tax units that are matched on the PUF in favor of using the same code that we use to create cps.csv.gz. The big advantage here is that it drops a lot of duplicate code, so any time we need to change the CPS tax units for any reason, we only need to do so in one spot instead of two. I then broke out the statistical matching process into its own module as well so that it can be used to match any two datasets.
The two important files in the new puf directory are preppuf.py and finalprep.py. preppuf.py does some basic modification of the raw PUF, like dropping aggregated variables. finalprep.py hasn't changed much, so it's still geared specifically towards preparing the matched PUF for tax-calc. Eventually we may want to change this so that it is just for the PUF in this directory and have a separate final prep file for the matched file.
One change I did make to the final prep process is that I removed the itemized deduction imputations for non-filers because the variables we were imputing were available in the CPS, so we just use those.
See createpuf.py to see how to create puf.csv now.
CPS
There are minimal substantive changes to the CPS. I moved all of the scripts used to create tax units from the CPS to taxdata/cps/. See createcps.py to see the new way to create cps.csv.gz.
The one substantive change I made was adding a condition for determining a married filing separately filing type. While working on the statistical matching process, I noticed that there were a few individuals in the CPS listed as married but appearing as a single filer. I made this group married filing separately filers. Before, we just assumed that anyone who was married would be a joint filer.
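To make the new condition concrete, here is a minimal sketch of the decision it describes. The function name, its arguments, and the string labels are hypothetical and are not taken from the taxdata code; the sketch also leaves out other filing types such as head of household.

```python
def assign_filing_status(is_married: bool, spouse_in_unit: bool) -> str:
    """Hypothetical illustration of the filing-type rule described above.

    Names and labels are assumptions for this sketch, not taxdata's own.
    """
    if is_married and spouse_in_unit:
        # Married with both spouses in the tax unit: joint filer.
        return "joint"
    if is_married:
        # Listed as married but the unit has a single filer. Under the old
        # assumption this case would also have been treated as joint.
        return "separate"
    return "single"


statuses = [
    assign_filing_status(True, True),
    assign_filing_status(True, False),
    assign_filing_status(False, False),
]
```

The only behavioral difference from the old assumption is the middle branch, which is why the change affects just a few records.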
Statistical matching
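As a rough picture of what matching any two datasets can mean, the sketch below pairs each record in one dataset with the closest record in another using plain nearest-neighbor distance on shared variables. This is a stand-in technique chosen for brevity, not the method in statmatch.py; the function name, the variable names, and the toy records are all assumptions.

```python
from typing import Dict, List, Sequence

Record = Dict[str, float]


def match_nearest(recipient: List[Record], donor: List[Record],
                  keys: Sequence[str]) -> List[int]:
    """Return, for each recipient record, the index of the closest donor.

    Distance is unscaled squared Euclidean distance over `keys`; a real
    matching routine would normally scale variables and respect weights.
    """
    def distance(a: Record, b: Record) -> float:
        return sum((a[k] - b[k]) ** 2 for k in keys)

    return [
        min(range(len(donor)), key=lambda j: distance(rec, donor[j]))
        for rec in recipient
    ]


# Toy records standing in for two datasets that share two variables.
dataset_a = [{"income": 50_000.0, "age": 40.0},
             {"income": 12_000.0, "age": 67.0}]
dataset_b = [{"income": 11_500.0, "age": 70.0},
             {"income": 52_000.0, "age": 38.0},
             {"income": 90_000.0, "age": 55.0}]

matches = match_nearest(dataset_a, dataset_b, keys=["income", "age"])
```

Because the function only needs a list of records and the names of the shared variables, nothing in it is tied to a particular dataset, which is the property this section is after.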
I created a new statmatch directory that right now only contains statmatch.py, which takes the statistical matching process we use and makes it more general so you can pass in any two datasets and match them. I'd like for this directory to eventually contain a bunch of different modules that contain different matching methods users can choose from. If it gets big enough, we might even want to spin this off into its own package.
Other
A few other changes that needed to be made:
.gitignore to ignore the right files
Makefile
doc directory
taxcalc to the development environment
I think that just about covers the changes. I'll be doing some file clean up over the next couple of days. Looking forward to y'all's comments and suggestions!
cc @MaxGhenis @jdebacker @MattHJensen @rickecon