TaxData refactor #367

Merged: 18 commits, Dec 31, 2020

andersonfrailey
Collaborator

This PR probably should have been two or three PRs, but I got a little caught up in the work and it all ended up as one big PR. It's not quite ready: I still need to clean up some files and run an analysis of the results, but I believe the bulk of the work is done. The goal of this PR is to start making taxdata flexible enough to work with different datasets and to become a standalone package.

PUF

These are the biggest changes. I got rid of all the extra code that was used to produce the CPS-based tax units that are matched to the PUF, in favor of using the same code that we use to create cps.csv.gz. The big advantage is that this drops a lot of duplicate code, so any time we need to change the CPS tax units for any reason we only need to do so in one spot instead of two. I also broke out the statistical matching process into its own module so that it can be used to match any two datasets.

The two important files in the new puf directory are preppuf.py and finalprep.py. preppuf.py makes some basic modifications to the raw PUF, like dropping aggregated variables. finalprep.py hasn't changed much, so it's still geared specifically towards preparing the matched PUF for tax-calc. Eventually we may want to change this so that it is just for the PUF in this directory and have a separate final prep file for the matched file.

One change I did make to the final prep process: I removed the itemized deduction imputations for non-filers, because the variables we were imputing are available in the CPS, so we just use those.

See createpuf.py for how to create puf.csv now.

CPS

There are minimal substantive changes to the CPS. I moved all of the scripts used to create tax units from the CPS to taxdata/cps/. See createcps.py for the new way to create cps.csv.gz.

The one substantive change I made was adding a condition for assigning the married filing separately status. While working on the statistical matching process, I noticed that a few individuals in the CPS were listed as married but as a single filer. I made this group married filing separately filers. Before, we just assumed that anyone who was married would be a joint filer.
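
As a rough illustration of the rule described above, here is a minimal sketch in plain Python. The field names (`married`, `filing_status`) and the status labels are hypothetical and are not taken from the taxdata code; the real condition in the CPS scripts may be expressed differently.

```python
def reclassify_married_singles(units):
    """Return copies of the tax units with married single filers
    relabeled as married filing separately.

    `units` is a list of dicts. The keys "married" and "filing_status"
    and the labels "single", "joint", "separate" are assumptions made
    for this sketch only.
    """
    updated = []
    for unit in units:
        unit = dict(unit)  # copy so the caller's records are not mutated
        if unit["married"] and unit["filing_status"] == "single":
            unit["filing_status"] = "separate"
        updated.append(unit)
    return updated


# Example: only the married unit flagged as a single filer changes.
example_units = [
    {"id": 1, "married": True, "filing_status": "joint"},
    {"id": 2, "married": True, "filing_status": "single"},
    {"id": 3, "married": False, "filing_status": "single"},
]
fixed_units = reclassify_married_singles(example_units)
```

Unmarried single filers and married joint filers pass through unchanged, so the rule only touches the small inconsistent group.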

Statistical matching

I created a new statmatch directory that right now only contains statmatch.py, which takes the statistical matching process we use and makes it more general, so you can pass in any two datasets and match them. I'd like this directory to eventually contain a bunch of modules with different matching methods users can choose from. If it gets big enough, we might even want to spin it off into its own package.
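
To give a feel for what "pass in any two datasets and match them" could look like, here is a deliberately simple sketch. It is not the algorithm in statmatch.py: it swaps in a plain nearest-neighbor match on squared distance, and the function name, arguments, and record layout are all hypothetical.

```python
def nearest_neighbor_match(recipients, donors, keys):
    """Pair each recipient record with its closest donor record.

    Distance is the sum of squared differences over the shared numeric
    fields listed in `keys`. This is an illustrative stand-in, not the
    matching method used in taxdata; all names here are assumptions.
    """
    matches = {}
    for rec in recipients:
        # Pick the donor with the smallest squared distance to this recipient.
        best = min(
            donors,
            key=lambda don: sum((rec[k] - don[k]) ** 2 for k in keys),
        )
        matches[rec["id"]] = best["id"]
    return matches


# Example: two small datasets that share "income" and "age".
recipients = [
    {"id": "r1", "income": 10, "age": 30},
    {"id": "r2", "income": 50, "age": 60},
]
donors = [
    {"id": "d1", "income": 12, "age": 31},
    {"id": "d2", "income": 48, "age": 58},
]
pairs = nearest_neighbor_match(recipients, donors, keys=["income", "age"])
```

Because the function only depends on a list of shared keys, any two datasets with common numeric variables can be passed in, which is the kind of generality the new module is aiming for. A fuller method would also need to handle record weights and variables on different scales.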

Other

A few other changes that needed to be made:

  • Updated .gitignore to ignore the right files
  • Updated the paths in Makefile
  • Moved a bunch of documentation from deep in a variety of subdirectories to the single high-level doc directory
  • Added taxcalc to the development environment

I think that just about covers the changes. I'll be doing some file cleanup over the next couple of days. Looking forward to y'all's comments and suggestions!

cc @MaxGhenis @jdebacker @MattHJensen @rickecon

@andersonfrailey
Collaborator Author

Looks like this PR has little to no effect on Tax-Calculator projections. A comparison can be seen in this PDF.

@andersonfrailey
Collaborator Author

Since this PR has been open for a while, I'm going to review it again this afternoon. If I don't find anything wrong, I'll merge it tomorrow.
