Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add puf_stage3 #58

Merged
merged 27 commits into from
Feb 10, 2017
Merged

Add puf_stage3 #58

merged 27 commits into from
Feb 10, 2017

Conversation

andersonfrailey
Copy link
Collaborator

This PR adds the necessary files to add a "stage 3" step to the extrapolation/blowup process, as discussed in TaxCalc issue #1110.

The methodology of this has been discussed in the previously mentioned PR and can be found in Stage III.md, so I won't go over it again here, but I'm happy to answer any questions and open to any suggestions that would improve it.

The biggest aspect of this PR is adding the Stage III directory, which includes the stage 3 script, targets used to determine income distributions, stage 1 factors, and weights file.

I've also adjusted the stage 1 and stage 2 scripts so they will write a copy of Stage_I_factors.csv and WEIGHTS.csv to Stage III directory.

After talking with @MattHJensen, interest income will be the only variable adjusted at first, but the stage 3 script is written so that adding additional variables in the future won't be difficult.

I'm running the final tests to implement this into TaxCalc and will open a corresponding PR when that is finished.

@Amy-Xu @martinholmer @codykallen

@martinholmer
Copy link
Contributor

Why are the values in the adjustment_factors.csv file not rounded in the same manner as the factors in the StageIfactors.csv` file? As it stands now you're serving up another 50+ MB file that GitHub is going to complain about. Is there is reason for this degree of precision in the adjustment factors?

@andersonfrailey
Copy link
Collaborator Author

@martinholmer the precision is there so that the aggregate total for interest income doesn't change with the adjustments. I tried rounding off the factors to carrying degrees but in each case total interest income changed.

@martinholmer
Copy link
Contributor

@andersonfrailey said:

the precision is there so that the aggregate total for interest income doesn't change with the adjustments. I tried rounding off the factors to carrying degrees but in each case total interest income changed.

You need to be more forthcoming: how much rounding caused exactly how much change in aggregate interest income? Show us the quantitative results of the rounding experiments you carried out.

@andersonfrailey
Copy link
Collaborator Author

Decimal Places File Size (MB) 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 Average
0 11.3 -1.520E+09 3.711E+08 -1.107E+09 -2.091E+09 -4.117E+09 -6.309E+09 5.322E+08 -6.370E+08 -2.131E+09 -2.083E+09 3.192E+08 -5.816E+09 -7.362E+09 -2457868222.82
1 14.9 7.175E+07 -1.122E+08 -9.291E+08 -7.600E+08 3.271E+08 -1.731E+09 -6.030E+08 -1.464E+08 -1.253E+09 -1.390E+09 -2.159E+09 -1.503E+09 4.026E+08 -752721112.94
2 17.9 -7.960E+07 1.228E+08 -6.236E+07 -1.555E+08 1.581E+08 7.167E+07 -1.338E+08 -1.658E+08 -2.337E+08 -2.494E+06 -3.362E+07 -3.178E+08 1.753E+08 -50528751.55
3 22.1 1835011.34 10598768.44 6497875.99 -348412.70 5731224.04 7633024.65 -3812229.34 -20078424.56 16570007.02 8672310.85 -4813411.98 6894223.22 -10410730.69 1920710.48
4 25.8 1297920.78 -125631.54 -724752.53 7756.60 534477.88 -331840.58 -2363954.77 -1404327.45 784129.51 -9698.48 354585.69 3287426.40 -264741.41 80103.85
5 29.6 40393.68 26361.11 679.17 84300.90 13486.23 -37383.36 124319.17 34012.57 -34982.72 160456.96 106237.19 -100765.31 150211.37 43640.53
6 33.1 564.21 -6630.67 7031.73 8040.60 -11719.94 14119.08 14161.92 16986.15 -1000.36 13652.63 -11018.65 -491.86 13319.46 4385.72
7 36.9 201.95 930.66 -999.86 -2166.97 748.02 587.84 790.36 343.20 -1092.20 594.42 -677.57 -2110.39 -539.24 -260.75
8 40.6 -90.06 35.84 40.92 -107.72 -161.39 15.67 -43.92 -120.66 45.39 3.57 -117.44 -56.75 -131.77 -52.95
9 44.5 -2.38 6.66 -1.57 -5.67 9.79 -17.82 -24.14 8.31 -2.56 -5.05 5.73 23.63 2.23 -0.22
10 48.3 1.25 1.65 -1.65 -0.61 -0.35 -0.30 -0.32 -0.76 1.68 -0.52 0.86 -0.63 2.01 0.18
11 52 -0.01 0.10 0.03 0.09 -0.10 0.03 -0.14 -0.04 0.18 0.06 0.06 -0.01 -0.38 -0.01
12 54.3 -0.23 -0.19 -0.25 -0.37 -0.35 -0.33 -0.41 -0.35 -0.39 -0.30 -0.45 -0.42 -0.25 -0.33

@martinholmer here are levels of precision and the difference between interest income with and without the adjustment for the years 2014-2026 as well as the average. If the goal is to stay under 50MB, up to 10 decimal places will work, and the difference will be pretty small.

Keep in mind this is also factors for just one variable. If more are added over time the file size will continue to increase unless a method to compare TaxCalc distributions to SOI distributions that doesn't rely on AGI is used so that assigning each record a factor before running TaxCalc becomes unnecessary.

@andersonfrailey
Copy link
Collaborator Author

@martinholmer @MattHJensen, what are y'all's thoughts on how precise to make the factors given the information above? Like WEIGHTS.csv, adjustment_factors.csv will be specific to the PUF so it could go into the separate TaxPuf repo like was discussed in TaxCalc issue #1146 before the size of the weights file was reduced significantly.

@martinholmer
Copy link
Contributor

@andersonfrailey asked about issues related to taxdata pull request #58.

Anderson, I'd like to schedule a phone call so that we can discuss a variety of issues related to #58.
@MattHJensen has given me your phone number. I know you are very busy, but is there sometime this week we could talk? Why don't we arrange this via private email?

@martinholmer martinholmer changed the title Add stage3 [WAIT] Add stage3 Feb 5, 2017
@andersonfrailey
Copy link
Collaborator Author

Ignore most recent commits. Didn't notice the merge conflicts. Will commit again after fixing.

@andersonfrailey
Copy link
Collaborator Author

andersonfrailey commented Feb 8, 2017

I did some more work on stage 3 making it fit the redesigned TaxData repo. This changes the directory and file names to fit the new convention we're going with in TaxData and added the necessary files for stage 3 to work and adds documentation.

I've also updated stage3.py so that it reads input data files from the stage1, puf_stage2, and puf_data directories rather than needed a second copy of them.

Files in stage 3 directory:

  1. stage3.py: contains the script to create the adjustment factors
  2. stage3_targets.csv: contains bin amounts for interest income for years 2009-2014
  3. pufadj_factors.csv: contains adjustment factors for each year. Rather than having each record be directly assigned a factor, it only has 19 factors, one for each AGI bin

@martinholmer and I talked about adding documentation for each step in the data process to its specific directory rather than putting it all in one high-level docs directory. I didn't do that in this PR to fit with our current documentation, but if we decide to go that route the stage 3 documentation can be moved.

The results of adding stage 3 can be seen in this notebook

I'll open a corresponding PR in the TaxCalc repo after this has been reviewed and any problems addressed.

cc: @MattHJensen

@martinholmer martinholmer changed the title [WAIT] Add stage3 Add puf_stage3 Feb 9, 2017
@martinholmer
Copy link
Contributor

@andersonfrailey, Thanks for the revisions to taxdata pull request #58, which adds a Stage 3 to the puf taxdata preparations. This pull request looks good on substance, but needs a few cosmetic changes that will reduce the chance that others will be confused by Stage 3 logic. Here is what I have in mind:

(1) We need a unique noun to name what is produced by Stage 3. As you know, Stage 1 produces grow factors in the growfactors.csv file and Stage 2 produces sample weights in the puf_weights.csv file. It seems to me what your Stage 3 produces AGI-bin-specific adjustment ratios that should be stored in a puf_ratios.csv file. Using the noun factors to refer to two entirely different things is likely to lead to confusion. Can you suggest a better noun that ratios? If so, use that; otherwise please change the code and file names in #58 to use ratios instead of factors.

(2) The puf_ratios.csv output file has a problematic structure. Would something like this be more clear?

factor_year,bin0,bin1,bin2,...,bin17,bin18
INT2010,0.894570657511,...
INT2011,...
INT2012,...
...
INT2025,...
INT2026,...

This file structure makes the bin numbering explicit and is better suited to adding extra years and extra ratio types (like DIV). It is better suited because adding new ratio types or years involves simply adding rows without changing any existing rows. Does this change in the structure of the puf_ratios.csv file make sense? If not, what am I missing? If it does make sense, please revise the puf_stage3/stage3.py script to write out a puf_ratios.csv file with this structure.

@andersonfrailey
Copy link
Collaborator Author

@martinholmer changing from factors to ratios is good with me.

With regards to the restructure to puf_ratios.csv you suggested, I agree with the point you're making about making adding extra ratio types easier and I think it may improve the overall look of the file, but I did find that it is easier to implement into TaxCalc if it is structured the way it is now. This way the ratios can be applied in TaxCalc with one line of code whereas when I tried to implement it with the change there were a lot more errors to work around. This however would be easily worked around by simply transposing the data after it is read into TaxCalc.

@martinholmer
Copy link
Contributor

@andersonfrailey said:

Changing from factors to ratios is good with me.

OK, go ahead and make the changes when you have time.

@andersonfrailey said:

With regards to the restructure to puf_ratios.csv you suggested, I agree with the point you're making about making adding extra ratio types easier and I think it may improve the overall look of the file, but I did find that it is easier to implement into TaxCalc if it is structured the way it is now. This way the ratios can be applied in TaxCalc with one line of code whereas when I tried to implement it with the change there were a lot more errors to work around. This however would be easily worked around by simply transposing the data after it is read into TaxCalc.

OK, these are important considerations. Let talk about this. What is the one line of code in Tax-Calculator that makes it so easy to apply the adjustment ratios when the puf_ratios.csv file is structured as it is now? If the structure of the puf_ratios.csv file was to be changed like I suggested, would a simple DataFrame transpose() be the only additional code in Tax-Calculator? If it is this simple, then I would say we should use the file structure I suggested and you just add the one extra transpose() statement in Tax-Calculator. Or maybe I don't understand completely because I haven't seen your using-Stage3 code for Tax-Calculator.

@andersonfrailey
Copy link
Collaborator Author

andersonfrailey commented Feb 10, 2017

@martinholmer, right now I can just use
self.e00300 *= self.ADJ['INT{}'.format(year)][self.agi_bin]
in TaxCalc to apply the adjustment ratios. With the new structure, something as simple as this runs into key errors. But a simple transpose() when reading in the ratios fixes this so the new structure would work.

@martinholmer
Copy link
Contributor

@andersonfrailey said:

right now I can just use

self.e00300 *= self.ADJ['INT{}'.format(year)][self.agi_bin]

in TaxCalc to apply the adjustment ratios. With the new structure, something as simple as this runs into key errors. But a simple transpose() when reading in the ratios fixes this so the new structure would work.

OK, please do it that way: read in the puf_ratios.csv file and immediately save the transpose() as the self.ADJ DataFrame. That will add only one line of code to Tax-Calculator but make the structure of the puf_ratios.csv file much clearer.

@andersonfrailey
Copy link
Collaborator Author

@martinholmer taken care of in latest commit.

@martinholmer
Copy link
Contributor

@andersonfrailey, Thanks for all your work on taxdata pull request #58.
I'll merge this now and you can continue your work developing a taxcalc pull request that uses the new puf_ratios.csv file to make variable adjustments.

@martinholmer martinholmer merged commit 4fa1d0a into PSLmodels:master Feb 10, 2017
@andersonfrailey andersonfrailey deleted the AddStage3 branch June 13, 2020 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants