Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve dissemination / deployment related to Public Use File (puf.csv) #53

Closed
PeterDSteinberg opened this issue Dec 14, 2016 · 22 comments

Comments

@PeterDSteinberg
Copy link

Issue - Better track and distribute puf.csv

We have found OG-USA regression testing difficult because the Tax-Calculator, OG-USA and B-Tax packages that require a puf.csv public use file do not maintain metadata about which version of that puf.csv is okay.

We have discussed the idea of coming up with a taxpuf private Anaconda package to help ensure installations of Tax-Calculator, B-Tax, OG-USA get the right puf.csv. Each of the packages may require a version of taxpuf just like requiring a specific version of numpy.

There are a few ideas about how to do this with minimal interruption of existing code in Tax-Calculator, B-Tax and OG-USA.

Idea 1 - write_puf

At the top of our modules in B-Tax, Tax-Calculator, and OG-USA, just put the following:

from taxpuf import write_puf
write_puf()
# the rest of the import statements

write_puf() would write puf.csv to the top level of the repo where it is currently expected, but would not write it if the md5sum indicated no change was needed.

Idea 2 - from taxpuf import PUF

from taxpuf import PUF then PUF is the string contents of the CSV file puf.csv. This would require some modifications in the related repositories where currently a CSV path is expected. I don't think we want from taxpuf import PUF to import a path to a CSV because the path may need to be processed differently, depending on whether it is a path within an egg.

Auto-build taxpuf package

In either case of Idea 1 or 2 above, I have some helper scripts started for automatically building a new taxpuf package whenever we need to distribute a new puf.csv. It is a one line package creation script used like:

python build_puf_package.py puf.csv "testing it out" 0.0.4

The above would take puf.csv from the current directory and make it the 0.0.4 version of taxpuf package, adding the metadata "testing it out".

Thoughts on Idea 1 vs 2?

@talumbau
Copy link
Member

talumbau commented Dec 14, 2016

In my version of Idea 2, PUF would be a DataFrame, therefore one would type:

from taxpuf import PUF
rec = Records(data=PUF)

@PeterDSteinberg
Copy link
Author

In your view @talumbau of Idea 2, should PUF always be a dataframe constructed from the CSV installed as part of taxpuf (the correct version) or should taxpuf prefer a local puf.csv over the installed one (may be helpful for testing several puf.csv files)?

@talumbau
Copy link
Member

talumbau commented Dec 14, 2016

the first one - that is, when executing from taxpuf import PUF, you would get the DataFrame of the PUF associated with that version of taxpuf. The only nontrivial aspect of it is that the data loading would not happen until the execution of the import statement that involves the PUF variable. So, import taxpuf; print(taxpuf.__version__) would not incur any data loading penalties.

@martinholmer
Copy link
Contributor

@PeterDSteinberg said:

We have found OG-USA regression testing difficult because the Tax-Calculator, OG-USA and B-Tax packages that require a puf.csv public use file do not maintain metadata about which version of that puf.csv is okay.

What is the nature of these problems exactly?
I'm not aware of any problems of this sort with Tax-Calculator.

@PeterDSteinberg
Copy link
Author

@martinholmer We have set up Jenkins CI testing in a way that allows running any OG-USA / Tax-Calculator new versions with the same set of reforms that were tested with OG-USA 0.5.5 and Tax-Calc 0.6.6 several months ago. We made an automated system of installing the right versions of the whole stack but found full automation of the regression tests on every PR to OG-USA were not possible because the puf.csv is often changed in conjunction with OG-USA and Tax-Calculator changes.

@talumbau
Copy link
Member

What is the nature of these problems exactly?
I'm not aware of any problems of this sort with Tax-Calculator.

The issue is more precisely defined as follows: the PUF is a moving target and the PUF comes with no metadata that allows one to know, for a given instance of the PUF sitting on your computer, which version(s) of Tax-Calculator that PUF is compatible with.

This issue presents as a problem in primarily two contexts:

  1. If one is only occasionally using Tax-Calculator, or just not keeping up with day-to-day activities on Tax-Calculator, when one goes to use Tax-Calculator (either the latest released version or the latest master on the repo), there can frequently be issues in getting everything to run. Then, it's necessary to go find an email that describes any PUF changes (if one is on the list of recipients), find the link to the dropbox containing the appropriate PUF, which might be the most recent PUF, or a previous version if one is using the latest release of Tax-Calculator and the latest PUF has a breaking change. It is sometimes not obvious which one to get.

  2. Deploying Tax-Calculator as a component of a live web application, or even as a component of a regression testing setup. The situation here is that we need several (maybe 10 or more) machines, likely cloud-based machines, to come online and run either Tax-Calculator itself, or a package that depends on Tax-Calculator (like OGUSA or B-Tax). These packages need a certain version of the PUF to go with the version of Tax-Calculator they depend on. Since there is no meta-data embedded in the PUF and there is no way of getting the PUF except via the process I just described above, deployments are error-prone.

@martinholmer
Copy link
Contributor

@talumbau said in the conversation about taxdata issue #53:

The issue is more precisely defined as follows: the PUF is a moving target and the PUF comes with no metadata that allows one to know, for a given instance of the PUF sitting on your computer, which version(s) of Tax-Calculator that PUF is compatible with.

Yes, I can see that there is no metadata embedded in the puf.csv file, but I don't understand why that is a problem for Tax-Calculator. I was requested to make the changes in Tax-Calculator pull request #1035, which was merged on 06-Nov-2016, in order to allow virtually any version of the puf.csv file to be used as input to Tax-Calculator. Given those changes, I don't see why there is any need to change Tax-Calculator logic again.

I can imagine that various clients of Tax-Calculator may need to have a mechanism to select an older version of the puf.csv file that they want (need?) to use. But it seems to me that some kind of database of puf.csv file versions is what is needed for those clients. Such a database would have a single table with two fields: a version number or date of the puf.csv file and the contents of the puf.csv file. Won't that work fine for these Tax-Calculator clients?

In any case, whatever solution you come up with for the clients should leave Tax-Calculator logic unchanged.

@MattHJensen @PeterDSteinberg

@PeterDSteinberg
Copy link
Author

@martinholmer The Anaconda private package would accomplish essentially what the database you mentioned would accomplish, but in a more portable way. Then the Tax-Calculator, B-Tax, and OG-USA repositories can each choose how they want to make use of the new taxpuf package. I think for deployment and CI testing purposes it would be helpful for those packages to pin to a specific puf.csv. In each of the last two B-Tax deployments we have had some live experimentation needed to figure out whether we were using a compatible puf.csv. I'm open to ideas. cc @MattHJensen

@martinholmer
Copy link
Contributor

@PeterDSteinberg said:

The Anaconda private package would accomplish essentially what the database you mentioned would accomplish, but in a more portable way. Then the Tax-Calculator, B-Tax, and OG-USA repositories can each choose how they want to make use of the new taxpuf package.

Can you be more specific? What do you mean by "Then the Tax-Calculator, B-Tax, and OG-USA repositories can each choose how they want to make use of the new taxpuf package"?
Are you saying that Tax-Calculator logic is going to have to change in this new regime?
If so, you haven't convinced me why this is necessary.

@PeterDSteinberg
Copy link
Author

@martinholmer I mean that Tax-Calculator, B-Tax and OG-USA could name in their conda.recipe's meta.yaml, a specific version of taxpuf to ensure they are installed with the correct version of puf.csv. Currently there is no way for us to do the OG-USA regression testing in an automated way because there is no metadata linking packages to the puf.csv with which they are compatible. If your PR has changed Tax-Calculator to make it agnostic with respect to puf version, that helps with interactive use and many users, but we still in production deployments and regression testing need a way to ensure the right puf.csv gets installed.

@MattHJensen
Copy link
Contributor

MattHJensen commented Dec 15, 2016

@PeterDSteinberg, I'm also having trouble understanding the specific problems after Tax-Calculator PSLmodels/Tax-Calculator#1035.

Is this an accurate example of a problem situation?

  1. OG-USA regression testing is turned on and automatically compares every new OG-USA PR to baseline.
  2. A new version of the puf.csv is released and installed on the testing server.
  3. Some changes are made to OG-USA and a PR is opened
  4. The test is run and there are differences between HEAD and MASTER.
  5. We don't know whether the differences stem from OG-USA or the data update?

If so, why not just always adopt the latest version of puf.csv whenever it comes out and trigger a test at that time? Is the problem that "always adopt the latest version of puf.csv whenever it comes" is a fraught human process?

@jdebacker, could you also chime in whith a description of when and why puf.csv releases are painful for you, particularly after PSLmodels/Tax-Calculator#1035?

cc @Amy-Xu @andersonfrailey @zrisher

@MattHJensen
Copy link
Contributor

@martinholmer, could you describe the downside you see to having a Conda-versioned puf.csv file?

@MattHJensen
Copy link
Contributor

MattHJensen commented Dec 15, 2016

@PeterDSteinberg @talumbau, would we be able to (easily) maintain backwards compatibility between the current Tax-Calculator and earlier versions of the puf.csv files that have already been distributed with the Anaconda private package system?

@PeterDSteinberg
Copy link
Author

@talumbau @MattHJensen If we put something like taxpuf=0.0.4 into the meta.yaml of OG-USA, Tax-Calculator, and B-Tax, the installation would only work if the installing machine has the Anaconda token for taxpuf. Then, after the install, we can have it set up so the taxpuf CSV can be overriden by one in the local directory (which I'm sure we'll need for testing so that we don't have to repackage the puf.csv every time we want to try a modification to it). The backwards compat would be up to each of Tax-Calculator, B-Tax and OG-USA.

@MattHJensen
Copy link
Contributor

I asked:

would we be able to (easily) maintain backwards compatibility between the current Tax-Calculator and earlier versions of the puf.csv files that have already been distributed with the Anaconda private package system?

Which @PeterDSteinberg answered:

we can have it set up so the taxpuf CSV can be overriden by one in the local directory (which I'm sure we'll need for testing so that we don't have to repackage the puf.csv every time we want to try a modification to it). The backwards compat would be up to each of Tax-Calculator, B-Tax and OG-USA.

Got it, thanks.

@jdebacker
Copy link
Member

@MattHJensen says:

If so, why not just always adopt the latest version of puf.csv whenever it comes out and trigger a test at that time? Is the problem that "always adopt the latest version of puf.csv whenever it comes" is a fraught human process?

Yes. If we one is not following each change to the TC, it's hard to know a new PUF is needed. And when one is aware, several manual steps are necessary to acquire the file. In my case, I then need to drop any updated PUF into three repos on each machine I'm running TC, B-Tax, and OG-USA on. With 4+ versions of the PUF since July, it can be difficult to keep track of which version is where.

@jdebacker, could you also chime in whith a description of when and why puf.csv releases are painful for you, particularly after PSLmodels/Tax-Calculator#1035?

PR #1035 has been very helpful. Now model runs don't break. But as @PeterDSteinberg and @talumbau note above, this PR doesn't solve issues with testing when values in the PUF change across versions.

To me, @PeterDSteinberg and @talumbau propose a very elegant solution to this without altering the logic of any of the models and with only a few minor changes to the source code to use the proposed conda package.

@talumbau
Copy link
Member

One area to clarify would be to describe how this would impact Tax-Calculator, as this seemed to be an area of concern for @martinholmer. To answer that, I would say that it impacts Tax-Calculator to the extent that the Tax-Calculator package is concerned with where the PUF comes from. To my knowledge, the current repo cares very little about this. A Records object can either be created by passing a path to the PUF file, or passing a DataFrame that one gets from reading in the PUF. There's essentially no logic, as far as I'm aware, as to what happens before this in a user's script. So, as far as I can tell, I don't think it would really matter to the Tax-Calculator, if we added the following capability:

from taxpuf import PUF
from taxcalc import *
rec = Records(data=PUF)

instead of what is often done:

from taxcalc import *
rec = Records(data=path_to_puf)

or

from taxcalc import *
import pandas as pd
df = pd.read_csv("puf.csv.gz", compression="gzip")
rec = Records(data=df)

Or similar functionality addition if we use Idea 1 above. In the example above, PUF is a dataframe, so it's essentially a way of executing the last example (manual reading of the file into a pandas DataFrame) except that it comes with all of the configuration/dependency resolution mechanisms that Peter has been describing in this issue.

@zrisher
Copy link

zrisher commented Dec 22, 2016

@talumbau @martinholmer @MattHJensen @PeterDSteinberg @jdebacker I like the idea of using conda to manage the puf_csv dependency. As mentioned, it provides all the needed versioning support in a very lightweight way. Providing the dataframe directly seems reasonable since this repo and its dependent projects all use pandas.

My biggest concern is this:

How would we declare the dependency in tax-calculator in such a way that contributors without access to that package could still build their environment?

Right now puf.csv is an optional dependency, but Conda doesn't understand such things.

We could distribute a separate pip requirements file (which we just got rid of) to contributors with access and make it part of their setup process. It could also hold the package access credentials. But that's replacing a manual dependency deployment problem with a less frequent manual deployment. And it would mean tax-calculator is still failing to declare its preferred puf.csv version as an installed package via Conda.

Hoping someone has an easy answer to this.

@martinholmer
Copy link
Contributor

@zrisher asked:

How would we declare the dependency in tax-calculator in such a way that contributors without access to that [private] package could still build their environment?

This is an excellent question, particularly because the change
being discussed in taxdata issue #53 is billed as a simplification.

And I have two more questions about this plan:

(1) Will the puf.csv file still be distributed as now (for those
who view the change being discussed in taxdata issue #53
as a complication rather than simplification)?

(2) Will there be absolutely no changes to the records.py file?

If we are going to make this sort of change, the discussion
about what is involved in the change needs to get much more
explicit.

@MattHJensen @PeterDSteinberg @talumbau

@martinholmer
Copy link
Contributor

The creation of the private taxpuf repository resolves taxdata issue #53, right? Unless somebody corrects my understanding, I'll close this issue on Monday, Feb 6th.

@andersonfrailey @MattHJensen @Amy-Xu @PeterDSteinberg

@martinholmer
Copy link
Contributor

@PeterDSteinberg, The creation of the private taxpuf repository resolves taxdata issue #53, right?
Can you confirm this?

@PeterDSteinberg
Copy link
Author

Yes, this is resolved by taxpuf. taxpuf is now in use in the deploy repo for deploying dropq workers. If anyone needs help with taxpuf installs or updates, let me know at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants