Improve dissemination / deployment related to Public Use File (puf.csv) #53

PeterDSteinberg · 2016-12-14T23:16:53Z

Issue - Better track and distribute `puf.csv`

We have found OG-USA regression testing difficult because the Tax-Calculator, OG-USA and B-Tax packages that require a puf.csv public use file do not maintain metadata about which version of that puf.csv is okay.

We have discussed the idea of coming up with a taxpuf private Anaconda package to help ensure installations of Tax-Calculator, B-Tax, OG-USA get the right puf.csv. Each of the packages may require a version of taxpuf just like requiring a specific version of numpy.

There are a few ideas about how to do this with minimal interruption of existing code in Tax-Calculator, B-Tax and OG-USA.

Idea 1 - `write_puf`

At the top of our modules in B-Tax, Tax-Calculator, and OG-USA, just put the following:

from taxpuf import write_puf
write_puf()
# the rest of the import statements

write_puf() would write puf.csv to the top level of the repo where it is currently expected, but would not write it if the md5sum indicated no change was needed.

Idea 2 - `from taxpuf import PUF`

from taxpuf import PUF then PUF is the string contents of the CSV file puf.csv. This would require some modifications in the related repositories where currently a CSV path is expected. I don't think we want from taxpuf import PUF to import a path to a CSV because the path may need to be processed differently, depending on whether it is a path within an egg.

Auto-build `taxpuf` package

In either case of Idea 1 or 2 above, I have some helper scripts started for automatically building a new taxpuf package whenever we need to distribute a new puf.csv. It is a one line package creation script used like:

python build_puf_package.py puf.csv "testing it out" 0.0.4

The above would take puf.csv from the current directory and make it the 0.0.4 version of taxpuf package, adding the metadata "testing it out".

Thoughts on Idea 1 vs 2?

The text was updated successfully, but these errors were encountered:

talumbau · 2016-12-14T23:18:20Z

In my version of Idea 2, PUF would be a DataFrame, therefore one would type:

from taxpuf import PUF
rec = Records(data=PUF)

PeterDSteinberg · 2016-12-14T23:37:02Z

In your view @talumbau of Idea 2, should PUF always be a dataframe constructed from the CSV installed as part of taxpuf (the correct version) or should taxpuf prefer a local puf.csv over the installed one (may be helpful for testing several puf.csv files)?

talumbau · 2016-12-14T23:39:27Z

the first one - that is, when executing from taxpuf import PUF, you would get the DataFrame of the PUF associated with that version of taxpuf. The only nontrivial aspect of it is that the data loading would not happen until the execution of the import statement that involves the PUF variable. So, import taxpuf; print(taxpuf.__version__) would not incur any data loading penalties.

martinholmer · 2016-12-14T23:45:55Z

@PeterDSteinberg said:

We have found OG-USA regression testing difficult because the Tax-Calculator, OG-USA and B-Tax packages that require a puf.csv public use file do not maintain metadata about which version of that puf.csv is okay.

What is the nature of these problems exactly?
I'm not aware of any problems of this sort with Tax-Calculator.

PeterDSteinberg · 2016-12-15T00:01:32Z

@martinholmer We have set up Jenkins CI testing in a way that allows running any OG-USA / Tax-Calculator new versions with the same set of reforms that were tested with OG-USA 0.5.5 and Tax-Calc 0.6.6 several months ago. We made an automated system of installing the right versions of the whole stack but found full automation of the regression tests on every PR to OG-USA were not possible because the puf.csv is often changed in conjunction with OG-USA and Tax-Calculator changes.

talumbau · 2016-12-15T14:49:41Z

What is the nature of these problems exactly?
I'm not aware of any problems of this sort with Tax-Calculator.

The issue is more precisely defined as follows: the PUF is a moving target and the PUF comes with no metadata that allows one to know, for a given instance of the PUF sitting on your computer, which version(s) of Tax-Calculator that PUF is compatible with.

This issue presents as a problem in primarily two contexts:

If one is only occasionally using Tax-Calculator, or just not keeping up with day-to-day activities on Tax-Calculator, when one goes to use Tax-Calculator (either the latest released version or the latest master on the repo), there can frequently be issues in getting everything to run. Then, it's necessary to go find an email that describes any PUF changes (if one is on the list of recipients), find the link to the dropbox containing the appropriate PUF, which might be the most recent PUF, or a previous version if one is using the latest release of Tax-Calculator and the latest PUF has a breaking change. It is sometimes not obvious which one to get.
Deploying Tax-Calculator as a component of a live web application, or even as a component of a regression testing setup. The situation here is that we need several (maybe 10 or more) machines, likely cloud-based machines, to come online and run either Tax-Calculator itself, or a package that depends on Tax-Calculator (like OGUSA or B-Tax). These packages need a certain version of the PUF to go with the version of Tax-Calculator they depend on. Since there is no meta-data embedded in the PUF and there is no way of getting the PUF except via the process I just described above, deployments are error-prone.

martinholmer · 2016-12-15T15:12:58Z

@talumbau said in the conversation about taxdata issue #53:

The issue is more precisely defined as follows: the PUF is a moving target and the PUF comes with no metadata that allows one to know, for a given instance of the PUF sitting on your computer, which version(s) of Tax-Calculator that PUF is compatible with.

Yes, I can see that there is no metadata embedded in the puf.csv file, but I don't understand why that is a problem for Tax-Calculator. I was requested to make the changes in Tax-Calculator pull request #1035, which was merged on 06-Nov-2016, in order to allow virtually any version of the puf.csv file to be used as input to Tax-Calculator. Given those changes, I don't see why there is any need to change Tax-Calculator logic again.

I can imagine that various clients of Tax-Calculator may need to have a mechanism to select an older version of the puf.csv file that they want (need?) to use. But it seems to me that some kind of database of puf.csv file versions is what is needed for those clients. Such a database would have a single table with two fields: a version number or date of the puf.csv file and the contents of the puf.csv file. Won't that work fine for these Tax-Calculator clients?

In any case, whatever solution you come up with for the clients should leave Tax-Calculator logic unchanged.

@MattHJensen @PeterDSteinberg

PeterDSteinberg · 2016-12-15T19:45:11Z

@martinholmer The Anaconda private package would accomplish essentially what the database you mentioned would accomplish, but in a more portable way. Then the Tax-Calculator, B-Tax, and OG-USA repositories can each choose how they want to make use of the new taxpuf package. I think for deployment and CI testing purposes it would be helpful for those packages to pin to a specific puf.csv. In each of the last two B-Tax deployments we have had some live experimentation needed to figure out whether we were using a compatible puf.csv. I'm open to ideas. cc @MattHJensen

martinholmer · 2016-12-15T19:57:54Z

@PeterDSteinberg said:

The Anaconda private package would accomplish essentially what the database you mentioned would accomplish, but in a more portable way. Then the Tax-Calculator, B-Tax, and OG-USA repositories can each choose how they want to make use of the new taxpuf package.

Can you be more specific? What do you mean by "Then the Tax-Calculator, B-Tax, and OG-USA repositories can each choose how they want to make use of the new taxpuf package"?
Are you saying that Tax-Calculator logic is going to have to change in this new regime?
If so, you haven't convinced me why this is necessary.

PeterDSteinberg · 2016-12-15T21:12:21Z

@martinholmer I mean that Tax-Calculator, B-Tax and OG-USA could name in their conda.recipe's meta.yaml, a specific version of taxpuf to ensure they are installed with the correct version of puf.csv. Currently there is no way for us to do the OG-USA regression testing in an automated way because there is no metadata linking packages to the puf.csv with which they are compatible. If your PR has changed Tax-Calculator to make it agnostic with respect to puf version, that helps with interactive use and many users, but we still in production deployments and regression testing need a way to ensure the right puf.csv gets installed.

MattHJensen · 2016-12-15T21:36:04Z

@PeterDSteinberg, I'm also having trouble understanding the specific problems after Tax-Calculator PSLmodels/Tax-Calculator#1035.

Is this an accurate example of a problem situation?

OG-USA regression testing is turned on and automatically compares every new OG-USA PR to baseline.
A new version of the puf.csv is released and installed on the testing server.
Some changes are made to OG-USA and a PR is opened
The test is run and there are differences between HEAD and MASTER.
We don't know whether the differences stem from OG-USA or the data update?

If so, why not just always adopt the latest version of puf.csv whenever it comes out and trigger a test at that time? Is the problem that "always adopt the latest version of puf.csv whenever it comes" is a fraught human process?

@jdebacker, could you also chime in whith a description of when and why puf.csv releases are painful for you, particularly after PSLmodels/Tax-Calculator#1035?

cc @Amy-Xu @andersonfrailey @zrisher

MattHJensen · 2016-12-15T21:41:08Z

@martinholmer, could you describe the downside you see to having a Conda-versioned puf.csv file?

MattHJensen · 2016-12-15T21:43:26Z

@PeterDSteinberg @talumbau, would we be able to (easily) maintain backwards compatibility between the current Tax-Calculator and earlier versions of the puf.csv files that have already been distributed with the Anaconda private package system?

PeterDSteinberg · 2016-12-15T21:50:26Z

@talumbau @MattHJensen If we put something like taxpuf=0.0.4 into the meta.yaml of OG-USA, Tax-Calculator, and B-Tax, the installation would only work if the installing machine has the Anaconda token for taxpuf. Then, after the install, we can have it set up so the taxpuf CSV can be overriden by one in the local directory (which I'm sure we'll need for testing so that we don't have to repackage the puf.csv every time we want to try a modification to it). The backwards compat would be up to each of Tax-Calculator, B-Tax and OG-USA.

MattHJensen · 2016-12-15T21:57:40Z

I asked:

would we be able to (easily) maintain backwards compatibility between the current Tax-Calculator and earlier versions of the puf.csv files that have already been distributed with the Anaconda private package system?

Which @PeterDSteinberg answered:

we can have it set up so the taxpuf CSV can be overriden by one in the local directory (which I'm sure we'll need for testing so that we don't have to repackage the puf.csv every time we want to try a modification to it). The backwards compat would be up to each of Tax-Calculator, B-Tax and OG-USA.

Got it, thanks.

jdebacker · 2016-12-15T21:58:43Z

@MattHJensen says:

If so, why not just always adopt the latest version of puf.csv whenever it comes out and trigger a test at that time? Is the problem that "always adopt the latest version of puf.csv whenever it comes" is a fraught human process?

Yes. If we one is not following each change to the TC, it's hard to know a new PUF is needed. And when one is aware, several manual steps are necessary to acquire the file. In my case, I then need to drop any updated PUF into three repos on each machine I'm running TC, B-Tax, and OG-USA on. With 4+ versions of the PUF since July, it can be difficult to keep track of which version is where.

@jdebacker, could you also chime in whith a description of when and why puf.csv releases are painful for you, particularly after PSLmodels/Tax-Calculator#1035?

PR #1035 has been very helpful. Now model runs don't break. But as @PeterDSteinberg and @talumbau note above, this PR doesn't solve issues with testing when values in the PUF change across versions.

To me, @PeterDSteinberg and @talumbau propose a very elegant solution to this without altering the logic of any of the models and with only a few minor changes to the source code to use the proposed conda package.

talumbau · 2016-12-16T04:39:34Z

One area to clarify would be to describe how this would impact Tax-Calculator, as this seemed to be an area of concern for @martinholmer. To answer that, I would say that it impacts Tax-Calculator to the extent that the Tax-Calculator package is concerned with where the PUF comes from. To my knowledge, the current repo cares very little about this. A Records object can either be created by passing a path to the PUF file, or passing a DataFrame that one gets from reading in the PUF. There's essentially no logic, as far as I'm aware, as to what happens before this in a user's script. So, as far as I can tell, I don't think it would really matter to the Tax-Calculator, if we added the following capability:

from taxpuf import PUF
from taxcalc import *
rec = Records(data=PUF)

instead of what is often done:

from taxcalc import *
rec = Records(data=path_to_puf)

or

from taxcalc import *
import pandas as pd
df = pd.read_csv("puf.csv.gz", compression="gzip")
rec = Records(data=df)

Or similar functionality addition if we use Idea 1 above. In the example above, PUF is a dataframe, so it's essentially a way of executing the last example (manual reading of the file into a pandas DataFrame) except that it comes with all of the configuration/dependency resolution mechanisms that Peter has been describing in this issue.

zrisher · 2016-12-22T07:33:34Z

@talumbau @martinholmer @MattHJensen @PeterDSteinberg @jdebacker I like the idea of using conda to manage the puf_csv dependency. As mentioned, it provides all the needed versioning support in a very lightweight way. Providing the dataframe directly seems reasonable since this repo and its dependent projects all use pandas.

My biggest concern is this:

How would we declare the dependency in tax-calculator in such a way that contributors without access to that package could still build their environment?

Right now puf.csv is an optional dependency, but Conda doesn't understand such things.

We could distribute a separate pip requirements file (which we just got rid of) to contributors with access and make it part of their setup process. It could also hold the package access credentials. But that's replacing a manual dependency deployment problem with a less frequent manual deployment. And it would mean tax-calculator is still failing to declare its preferred puf.csv version as an installed package via Conda.

Hoping someone has an easy answer to this.

martinholmer · 2016-12-22T14:18:02Z

@zrisher asked:

How would we declare the dependency in tax-calculator in such a way that contributors without access to that [private] package could still build their environment?

This is an excellent question, particularly because the change
being discussed in taxdata issue #53 is billed as a simplification.

And I have two more questions about this plan:

(1) Will the puf.csv file still be distributed as now (for those
who view the change being discussed in taxdata issue #53
as a complication rather than simplification)?

(2) Will there be absolutely no changes to the records.py file?

If we are going to make this sort of change, the discussion
about what is involved in the change needs to get much more
explicit.

@MattHJensen @PeterDSteinberg @talumbau

martinholmer · 2017-02-03T14:01:31Z

The creation of the private taxpuf repository resolves taxdata issue #53, right? Unless somebody corrects my understanding, I'll close this issue on Monday, Feb 6th.

@andersonfrailey @MattHJensen @Amy-Xu @PeterDSteinberg

martinholmer · 2017-02-06T14:57:53Z

@PeterDSteinberg, The creation of the private taxpuf repository resolves taxdata issue #53, right?
Can you confirm this?

PeterDSteinberg · 2017-02-06T15:05:25Z

Yes, this is resolved by taxpuf. taxpuf is now in use in the deploy repo for deploying dropq workers. If anyone needs help with taxpuf installs or updates, let me know at any time.

talumbau mentioned this issue Jan 3, 2017

TaxBrain and Tax-Calculator results are not the same in 0.7.3 PSLmodels/Tax-Calculator#1119

Closed

PeterDSteinberg closed this as completed Feb 6, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve dissemination / deployment related to Public Use File (puf.csv) #53

Improve dissemination / deployment related to Public Use File (puf.csv) #53

PeterDSteinberg commented Dec 14, 2016

talumbau commented Dec 14, 2016 •

edited

Loading

PeterDSteinberg commented Dec 14, 2016

talumbau commented Dec 14, 2016 •

edited

Loading

martinholmer commented Dec 14, 2016

PeterDSteinberg commented Dec 15, 2016

talumbau commented Dec 15, 2016

martinholmer commented Dec 15, 2016

PeterDSteinberg commented Dec 15, 2016

martinholmer commented Dec 15, 2016

PeterDSteinberg commented Dec 15, 2016

MattHJensen commented Dec 15, 2016 •

edited

Loading

MattHJensen commented Dec 15, 2016

MattHJensen commented Dec 15, 2016 •

edited

Loading

PeterDSteinberg commented Dec 15, 2016

MattHJensen commented Dec 15, 2016

jdebacker commented Dec 15, 2016

talumbau commented Dec 16, 2016

zrisher commented Dec 22, 2016

martinholmer commented Dec 22, 2016

martinholmer commented Feb 3, 2017

martinholmer commented Feb 6, 2017

PeterDSteinberg commented Feb 6, 2017

Improve dissemination / deployment related to Public Use File (puf.csv) #53

Improve dissemination / deployment related to Public Use File (puf.csv) #53

Comments

PeterDSteinberg commented Dec 14, 2016

Issue - Better track and distribute puf.csv

Idea 1 - write_puf

Idea 2 - from taxpuf import PUF

Auto-build taxpuf package

talumbau commented Dec 14, 2016 • edited Loading

PeterDSteinberg commented Dec 14, 2016

talumbau commented Dec 14, 2016 • edited Loading

martinholmer commented Dec 14, 2016

PeterDSteinberg commented Dec 15, 2016

talumbau commented Dec 15, 2016

martinholmer commented Dec 15, 2016

PeterDSteinberg commented Dec 15, 2016

martinholmer commented Dec 15, 2016

PeterDSteinberg commented Dec 15, 2016

MattHJensen commented Dec 15, 2016 • edited Loading

MattHJensen commented Dec 15, 2016

MattHJensen commented Dec 15, 2016 • edited Loading

PeterDSteinberg commented Dec 15, 2016

MattHJensen commented Dec 15, 2016

jdebacker commented Dec 15, 2016

talumbau commented Dec 16, 2016

zrisher commented Dec 22, 2016

martinholmer commented Dec 22, 2016

martinholmer commented Feb 3, 2017

martinholmer commented Feb 6, 2017

PeterDSteinberg commented Feb 6, 2017

Issue - Better track and distribute `puf.csv`

Idea 1 - `write_puf`

Idea 2 - `from taxpuf import PUF`

Auto-build `taxpuf` package

talumbau commented Dec 14, 2016 •

edited

Loading

talumbau commented Dec 14, 2016 •

edited

Loading

MattHJensen commented Dec 15, 2016 •

edited

Loading

MattHJensen commented Dec 15, 2016 •

edited

Loading