Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need to move read_egg_csv logic from Tax-Calculator to TaxBrain #501

Closed
martinholmer opened this issue Mar 6, 2017 · 10 comments
Closed
Milestone

Comments

@martinholmer
Copy link
Contributor

Currently the Tax-Calculator repo contains TaxBrain logic, which logically should be part of TaxBrain.

There is a read_egg_csv function in taxcalc/utils.py and it is called from the records.py file and from the growfactors.py file.

Can this utility function and its calls be moved to TaxBrain? Tax-Calculator knows nothing about eggs. Isn't that a TaxBrain concept?

@MattHJensen @PeterDSteinberg @brittainhard

@MattHJensen MattHJensen added this to the b4 milestone Mar 9, 2017
@martinholmer
Copy link
Contributor Author

No concerns have been expressed over the past week with regards to the request in #501 to move egg-reading-logic from Tax-Calculator code to TaxBrain code. Plus #501 has been added to the milestones in the webapp-public repo. So, I'm going to remove the egg logic from Tax-Calculator code in the next few days.

@MattHJensen @PeterDSteinberg @brittainhard

@MattHJensen
Copy link
Contributor

@martinholmer, it makes more sense to me to wait until the logic is in webapp-public before removing it from Tax-Calculator so that we don't have a period of brokenness.

@martinholmer
Copy link
Contributor Author

@MattHJensen said:

it makes more sense to me to wait until the logic is in webapp-public before removing it from Tax-Calculator so that we don't have a period of brokenness.

How long will that wait be? #501 has such a low priority that it seems to me that is unlikely to be dealt with over the next half year.

@MattHJensen MattHJensen modified the milestones: b3 - follow ons from b2, b4 - cleanup after b2/b3 Mar 15, 2017
@MattHJensen
Copy link
Contributor

MattHJensen commented Mar 15, 2017

@martinholmer, did you mean to close this issue?

Perhaps I don't understand why #501 is higher priority than the urgent issues in b2 and b3. Could you explain why #501 is urgent? Is it blocking other work?

@martinholmer
Copy link
Contributor Author

Reopening #501 after mistaken closing of issue.

It is not so much that I think #501 is more urgent than the issues/bugs in the b2 and b3 milestone lists, but rather that it is a trivial thing to move the few lines of egg code from Tax-Calculator to TaxBrain. Tasks that can be helpful to progress in other repositories and that take little time should be handled quickly, in my view. If the future is anything like the recent past, as the weeks go on new items will be added to b2 and b3 and so items in b4 (like #501) will run the risk never getting addressed.

@martinholmer martinholmer reopened this Mar 15, 2017
@talumbau
Copy link
Member

Hi,

Egg files don't have anything to do with Tax Brain. Egg files are a format from (one part of) the long history of Python packaging. Here is a reasonable summary from a couple of the top five links in a google search:

http://stackoverflow.com/questions/2051192/what-is-a-python-egg
https://packaging.python.org/wheel_egg/

It is correct that the act of reading a data file from an installed package is not tax-policy related logic. The utils.py file in Tax-Calculator is a good place for such things. Here is a nice Stack Overflow answer that encapsulates the problem the read_egg_csv function solves. There are other ways to solve the problem, but the existing function seems to be a reasonable one.

Removing the read_egg_csv function from Tax-Calculator will have the following impacts:

  • the Tax-Calculator will no longer be usable as an installed package
  • the Tax-Calculator will only be usable if you clone the source repo (which members of the development team do)
  • All clients of Tax-Calculator (OGUSA, BTax, the TaxBrain front end, and the worker nodes) will no longer be able to use Tax-Calculator (due to the first reason listed)
  • Each client of Tax-Calculator will need some kind of way of reading the data files that are packaged with Tax-Calculator, so they will each have to write a different function to deal with this issue. Even then, it's not clear how one would extract the data files and feed them back in to the Tax-Calculator at various places where the Tax-Calculator uses the data inside the files for its calculations.

The benefit of this action seems to be "remove code that doesn't encapsulate tax logic from a Python package about tax logic." The disadvantages seem worse than the benefit as far as I can tell.

@martinholmer
Copy link
Contributor Author

martinholmer commented Mar 15, 2017

@talumbau, Thanks for your informative response to TaxBrain issue #501 on the best location for egg logic. Given what you say, I've got a few questions.

(1) So, the egg is inside (that is, part of) the taxcalc package, right or wrong?

(2) Why are we using the obsolete egg technology instead of the PEP427 wheel technology?

(3) Given that the puf.csv file does not seem to be in the egg (because it is not listed in the MANIFEST.in file), how do all the Tax-Calculator clients get the puf.csv file? Seems like your statement that "All clients of Tax-Calculator (OGUSA, BTax, the TaxBrain front end, and the worker nodes) will no longer be able to use Tax-Calculator (due to the first reason listed)" should already be true. For example, how does TaxBrain get a copy of the puf.csv file if it is not in the taxcalc package egg?

@talumbau
Copy link
Member

(1) So, the egg is inside (that is, part of) the taxcalc package, right or wrong?

No, that is not correct. One way to say it is that the egg file is the physical manifestation of the package on the filesystem of the computer. The egg file format is the specification of how this file is laid out on disk. The data files that are distributed with the Tax-Calculator package are inside this file, in a way governed by the egg file format (which is essentially a zip file).

(2) Why are we using the obsolete egg technology instead of the PEP427 wheel technology?

The wheel format has seen less adoption than the egg format, although the official recommendation is wheel. The wheel format was meant to solve many of the shortcomings of egg (including issues of binary dependencies) but the problem is that it didn't actually solve all of the issues. There's much that could be said here, and many people are working hard to make wheels better. The momentum behind wheels is not as great anymore because many people choose to solve the problem by using conda packages. Some good reading is this blog post that discusses how conda fits in to this whole picture. As I said in my original post, the history of Python packaging is long.

(3) Given that the puf.csv file does not seem to be in the egg (because it is not listed in the MANIFEST.in file), how do all the Tax-Calculator clients get the puf.csv file?

I thought all of this was hashed out in PSLmodels/Tax-Calculator#1119 on Tax-Calculator (and follow-on issues). @PeterDSteinberg knows the details. One salient point: some clients of Tax-Calculator don't need the PUF. For example, the front end for Tax Brain needs to install Tax-Calculator as a dependency but it does not need the PUF. The reason is because it needs to call certain functions like Policy.default_data to populate the initial TaxBrain page with defaults for the given year. Obviously, the compute nodes need the PUF to do the actual calculations.

Seems like your statement that "All clients of Tax-Calculator (OGUSA, BTax, the TaxBrain front end, and the worker nodes) will no longer be able to use Tax-Calculator (due to the first reason listed)" should already be true

I don't know what this means. Are you asking why everything works right now? The package works right now because, for every file in the MANIFEST.in, we use a standard practice for distributing data files with the package and for reading data files from that package. For example, when calling Policy.default_data as I describe above, the class method _params_dict_from_json_file in ParametersBase must be called. For this call to succeed, a JSON file for the Policy class must be read in. This succeeds on TaxBrain because we distribute current_law_policy.json with the package.

For example, how does TaxBrain get a copy of the puf.csv file if it is not in the taxcalc package egg?

This is solved via the deploy repo. I think @MattHJensen could set up a call between you and @PeterDSteinberg to provide the details.

My main question would be the following: what is the benefit of the removal of this capability from the Tax-Calculator? This issue is already fairly long (and could become even longer) but there doesn't appear to be a clear stated benefit for this proposed change. I have seen several PRs recently on Tax-Calculator mention that some of the few remaining lines that are not under test coverage involve these lines of code. Is that the driver here? Surely the better path is to pursue a script to run under Travis CI that exercises this code, rather than remove the ability to use Tax-Calculator as a package that one can have as a dependency in some other code.

@martinholmer
Copy link
Contributor Author

@talumbau, Thanks for your extremely helpful answers to my questions in issue #501.

I'm in total agreement with your final thought:

I have seen several PRs recently on Tax-Calculator mention that some of the few remaining lines that are not under test coverage involve these [read_egg_csv or read_egg_json] lines of code. Is that the driver here? Surely the better path is to pursue a script to run under Travis CI that exercises this code, rather than remove the ability to use Tax-Calculator as a package that one can have as a dependency in some other code.

The problem is that I've asked how to construct such tests but didn't get an answer.

I'm closing issue #501 now. Thanks again for the tutorial on Python eggs and Conda packages.

@martinholmer
Copy link
Contributor Author

martinholmer commented Jul 28, 2017

Since this TaxBrain issue #501 is being read again in July, 2017, let me clarify that the read_egg_* utility functions should remain in the Tax-Calculator repository. The reason is that the Tax-Calculator command line interface, tc, runs (like TaxBrain) from a conda taxcalc package, which now contains the new CPS input data. This means that there is a need for these two utility functions in Tax-Calculator when executing something like this command:

tc cps.csv 2020 --tables

@andersonfrailey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants