Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Records variable metadata in evars.json #1179

Merged
merged 9 commits into from
Apr 12, 2017

Conversation

zrisher
Copy link
Contributor

@zrisher zrisher commented Feb 4, 2017

This PR introduces a new file taxcalc/evars.json that can act as a single source of truth for records variables, similar to how taxcalc/current_law_policy.json describes policy parameters.

While declaring the variables directly as sets in the records class is simpler (and clearly doesn't take nearly as many lines), metadata about those variables has to be described separately, e.g. descriptions in e_variable_info.csv and types represented by holding the names in different variables.

By declaring records variable data in a json file, we improve its accessibility:

  • All info about the variables is available in once place
  • Data is read as a dict, so it's easy to query and map to different formats
  • Data can be read by other processes, e.g. client-side javascript

This PR changes the records class to pull its variable data from evars.json. Tests are included that confirm the generated variable lists are the same, but they're removed in the final commit. To run the tests, checkout my branch evars-json-test.

These tests also confirm that the descriptions provided exactly match those from e_variable_info.csv. I believe that once this PR is incorporated, stats_summary.py could be refactored to derive its information from evars.json instead and these issues could be closed: #1089, #633.

Obviously the amount of provided descriptions is inadequate. We can fix that over time, but I wanted to have them exactly match e_variable_info initially.

Tax logic is unaffected by this PR, all tests pass on my machine, and I believe coverage should be unaffected.

@codecov-io
Copy link

codecov-io commented Feb 4, 2017

Codecov Report

Merging #1179 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1179      +/-   ##
==========================================
+ Coverage   98.83%   98.83%   +<.01%     
==========================================
  Files          39       39              
  Lines        3002     3007       +5     
==========================================
+ Hits         2967     2972       +5     
  Misses         35       35
Impacted Files Coverage Δ
taxcalc/records.py 99.13% <100%> (+0.01%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2884037...a944872. Read the comment docs.

@MattHJensen
Copy link
Contributor

@zrisher, is there any other meta data that you are planning to add to this file in the short term? For example, would it contain form and line numbers correspondences to each years' tax forms?

I think we'll probably want a more general name than evars, but I don't have a great suggestion right now

@martinholmer
Copy link
Collaborator

@MattHJensen asked about @zrisher pull request #1179:

I think we'll probably want a more general name than evars.json.

I agree. The proposed JSON file contains information about both input and output variables.
Why not call the new file iovars.json instead of evars.json? Or is there a better name?

@MattHJensen also asked:

Is there any other meta data that you are planning to add to this file in the short term?
For example, would it contain form and line numbers correspondences to each years' tax forms?

The form/line numbers are in both the taxcalc/validation/CSV_INPUT_VARS.md and taxcalc/e_variable_info.csv files. If this new JSON file is supposed to replace them, then the new file needs to include that information.

Another problem is that too many "description" values are empty in the current version of the new JSON file. Can the best descriptions from taxcalc/validation/CSV_INPUT_VARS.md and taxcalc/e_variable_info.csv be integrated into the new JSON file now?

@zrisher
Copy link
Contributor Author

zrisher commented Feb 7, 2017

@MattHJensen @martinholmer

file name
I was hoping to get your input on the official name for the full set of Records variables. I thought it might be evars, but clearly that only refers to variables starting with e. How about something like record_vars, records_variables, or records_cols? iovars.json might not be specific enough, since there are many methods of io contained within the taxcalc package.

inclusion of form line correspondence info
Form field to Records variable mappings can change by year and are not 1-1, so I had planned on keeping this data within taxcalc/filings. We could move the direct mappings into this file, but some of them would need to be indexed by year.

descriptions
This PR actually includes all the descriptions from e_variable_info.csv already, as well as a test that confirms they match. I was not aware additional descriptions were available in taxcalc/validation/CSV_INPUT_VARS.md, and it does appear it has some information that wasn't available in e_variable_info.csv. I can update this PR with that additional data as soon as I'm done with the form parsing PR I'm currently working to finish.

@martinholmer
Copy link
Collaborator

@zrisher first said:

file name
I was hoping to get your input on the official name for the full set of Records variables. I thought it might be evars, but clearly that only refers to variables starting with e. How about something like record_vars, records_variables, or records_cols? iovars.json might not be specific enough, since there are many methods of io contained within the taxcalc package.

I have a strong preference for records_vars.json.

@zrisher then said:

inclusion of form line correspondence info
Form field to Records variable mappings can change by year and are not 1-1, so I had planned on keeping this data within taxcalc/filings. We could move the direct mappings into this file, but some of them would need to be indexed by year.

I think that would be extremely important to include the Form/Line# in the new records_vars.json file because it should be right next to the Records class, which it is documenting. I know it is not very much information, but the work behind taxcalc/validation/CSV_INPUT_VARS.md has confirmed that the Form/Line# in that file are correct and the same for 2015 and 2016.

@zrisher finally said:

descriptions
This PR actually includes all the descriptions from taxcalc/e_variable_info.csv already, as well as a test that confirms they match. I was not aware additional descriptions were available in taxcalc/validation/CSV_INPUT_VARS.md, and it does appear it has some information that wasn't available in e_variable_info.csv. I can update this PR with that additional data as soon as I'm done with the form parsing PR I'm currently working to finish.

That would be very good. And if there is a difference in descriptions between taxcalc/e_variable_info.csv and taxcalc/validation/CSV_INPUT_VARS.md, and you are not sure which is better, please bring the difference to our attention here in PR #1179.

@MattHJensen @Amy-Xu

@martinholmer
Copy link
Collaborator

@zrisher said:

PR #1179 actually includes all the descriptions from taxcalc/e_variable_info.csv already, as well as a test that confirms they match.

Such a test is an excellent idea for the transition work, but isn't the idea of #1179 to replace both the taxcalc/e_variable_info.csv and taxcalc/validation/CSV_INPUT_VARS.md files with just the single new JSON file?

And such a test is problematic when you start merging and choosing between descriptions in those two old files. What, exactly, is the goal of #1179.

@martinholmer
Copy link
Collaborator

@zrisher, If the new records_vars.json file replaced the two older files containing similar information, do you have any ideas about how documentation (like that in the existing taxcalc/validation/CSV_INPUT_VARS.md file) could be generated automatically from the records_vars.json file?

@zrisher
Copy link
Contributor Author

zrisher commented Feb 8, 2017

@martinholmer said:

Such a test is an excellent idea for the transition work, but isn't the idea of #1179 to replace both the taxcalc/e_variable_info.csv and taxcalc/validation/CSV_INPUT_VARS.md files with just the single new JSON file? ... What, exactly, is the goal of #1179.

Yep, the goal is to provide a single source of truth for record variable metadata.

And such a test is problematic when you start merging and choosing between descriptions in those two old files.

That is why I removed the test in the final commit c01e237. 😄

@zrisher, If the new records_vars.json file replaced the two older files containing similar information, do you have any ideas about how documentation (like that in the existing taxcalc/validation/CSV_INPUT_VARS.md file) could be generated automatically from the records_vars.json file?

Well anything that can parse JSON could construct more user-centric documentation using the file. Clearly python can parse it, so we could build a python notebook similar to those in /docs/notebooks. webapp-public could create a page displaying it, loading the data either on the server or directly client side. We could also set up a python script that would generate a github-compatible .md file using this json and attach it to the release process.

@zrisher
Copy link
Contributor Author

zrisher commented Feb 8, 2017

How about simply rvars.json for the file name?

@zrisher
Copy link
Contributor Author

zrisher commented Feb 8, 2017

Hey @martinholmer sorry I missed your first comment.

I have a strong preference for records_vars.json.

Sounds great to me!

I think that would be extremely important to include the Form/Line# in the new records_vars.json file because it should be right next to the Records class, which it is documenting.

Thinking about it more, I agree. After #1184 is polished and merged I will amend its direct mapping logic to pull data from record_vars.json instead of its internal file.

If there is a difference in descriptions between taxcalc/e_variable_info.csv and taxcalc/validation/CSV_INPUT_VARS.md, and you are not sure which is better, please bring the difference to our attention here in PR #1179.

Will do.

@zrisher
Copy link
Contributor Author

zrisher commented Feb 20, 2017

^ Rebased to latest master & fixed conflicts.

@zrisher
Copy link
Contributor Author

zrisher commented Feb 20, 2017

@martinholmer

I have a strong preference for records_vars.json.

Done.

I can update this PR with [the descriptions from validation/CSV_INPUT_VARS.md]

Done.

If there is a difference in descriptions between e_variable_info.csv and validation/CSV_INPUT_VARS.md, and you are not sure which is better, please bring the difference to our attention here

Here are the descriptions that differ between the above that which I'm not sure how to merge:

Var e_variable_info descr. CSV_INPUT_VARS descr. notes
e00700 Taxable refunds, credits, or offsets of state and local income taxes taxable refunds of state and local income tax Is the former more accurate?
e01500 Pensions and annuities in income partial total of pensions and annuities Perhaps we could clarify why it's partial?
e01700 Pensions and annuities in AGI taxable pensions and annuities Latter is simpler?
e02000 Rental real estate, royalties, partnerships, S corporations, trusts, etc. (Sch E) Sch E rental, royalty, S-corp, etc, income/loss Latter is sufficient?
e03500 Adjustments: Alimony paid alimony paid Latter is sufficient?
e19200 Sch A: Total interest deduction total interest paid Former is more accurate?
e87530 Form 8863: Lifetime Learning Total Qualified Exp.s total LLC adjusted qualified expenses for all students Latter is more accurate, but use of "LLC" a little confusing? Perhaps simply "Lifetime learning qualified expenses for all students"?
p08000 Other statutory credit (computer) other tax credits (but not including Sch R credit) Unsure best descr here.

Note that the placement of the source form name in the description is inconsistent between the variables, e.g. "General business credit (Form 3800)", "Taxes: Recapture tax - Form 4255", "Sch A: Medical and dental expenses". But since we plan to provide the tax form mapping directly via additional metadata I think we can ignore those for now.

@MattHJensen
Copy link
Contributor

MattHJensen commented Apr 7, 2017

@andersonfrailey , could you give @zrisher guidance when you have a chance. Not urgent, but it could be a nice break from other work.

@martinholmer
Copy link
Collaborator

martinholmer commented Apr 11, 2017

@zrisher, Thanks for the insightful idea behind PR#1179. It does make much more sense to centralize the metadata about the variables read into and calculated by the Records class in a json file separate from the Records class source code.

Because we have been so slow to act on your pull request, it has needed some updating in order to be merged into the master Tax-Calculator branch. I've done that on the martinholmer:pr-1179 branch, which appears on GitHub as Tax-Calculator pull request #1285.

Thanks again for all your work on this. It is a very helpful contribution to Tax-Calculator.

@MattHJensen MattHJensen merged commit a944872 into PSLmodels:master Apr 12, 2017
MattHJensen added a commit that referenced this pull request Apr 12, 2017
Prepare zrisher PR #1179 for merging into master branch
@MattHJensen MattHJensen mentioned this pull request Apr 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stats_summary.py errors
4 participants