Sample from weights when not reading full dataset #844

Merged 5 commits into master on Aug 10, 2016

Conversation

talumbau
Member

  • There are separate data sources for tax records and the weights
    associated with those records. This change allows one to read in a
    sample from the full set of tax records and then get the
    corresponding weights from the full data set of weights.
@codecov-io

codecov-io commented Jul 27, 2016

Current coverage is 98.12% (diff: 100%)

Merging #844 into master will increase coverage by 0.04%

@@             master       #844   diff @@
==========================================
  Files            13         13          
  Lines          1777       1816    +39   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           1743       1782    +39   
  Misses           34         34          
  Partials          0          0          

Powered by Codecov. Last update 0474d8f...e59ad62

@martinholmer
Collaborator

@talumbau, Just curious why the following import is added to records.py:

from __future__ import print_function

@talumbau
Member Author

Ah, this was a leftover import from some debug printing I was doing. I'll remove it and push.

@martinholmer
Collaborator

@talumbau, When (say) reading a 10% sample, are the sample weights going to be multiplied by 10? Or, when reading a 5% sample, multiplied by 20?

@MattHJensen @feenberg @Amy-Xu

@talumbau
Member Author

That is a good point. In fact, when running example dropq calculations with this feature, I am getting very low revenue estimates compared to the full dataset, and it is likely due to this issue (this is in support of a desired webapp feature, ospc-org/ospc.org#274). So, in order to get a reasonable approximation of the true revenue estimate with 2% of the data, I think I would have to do the following:

  • sample the desired fraction for each binned group (say income group)
  • increase the weights in each group so that, in a weighted sense, the number of taxpayers is roughly the same.

Does that sound like a good way to implement this feature, @MattHJensen? Given this additional work, perhaps the right thing to do is to offer a weights vector that does this correctly for some pre-determined fractions (e.g. 2%, 5%, 25% of the total data)?

@martinholmer
Collaborator

@talumbau, The basic idea from statistics is this:

If you are drawing a sample of N out of a population of P, then the sampling fraction is f = N / P. If an observation is drawn in the sample and has a population weight of W, then to get the same sort of aggregate totals from the sample as from the population, you need to use a new weight for that observation equal to W / f.
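
In code, that adjustment is just (a toy illustration with made-up numbers):

    P = 100000        # population size
    N = 10000         # sample size (a 10% sample)
    f = float(N) / P  # sampling fraction = 0.1
    W = 500.0         # population weight of one sampled observation
    new_W = W / f     # 5000.0: the record now stands in for 10x as many units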

Does that make sense?

If you are going to allow TaxBrain users to do this sampling (in order to reduce execution time, I guess), you are going to have to be clear about the lack of accuracy inherent in a small sample.

@MattHJensen @feenberg @Amy-Xu @GoFroggyRun @zrisher

@talumbau
Member Author

If you are drawing a sample of N out of a population of P, then the sampling fraction is f = N / P. If an observation is drawn in the sample and has a population weight of W, then to get the same sort of aggregate totals from the sample as from the population, you need to use a new weight for that observation equal to W / f.

Yes, that makes sense to me. My comment above was regarding the way the web app presents statistics on the reforms. In order to have reasonably similar results for a quick calculation, my thinking was that we would need to consider each bin as its own population and then modify the weights as you suggest. So we would actually have sub-populations P_1, ..., P_n, and we would have to make sure that we are sampling at some fraction f within each P_i, instead of just sampling at fraction f across the global P of all the records. Does that idea seem like the right way to proceed?
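
Something like this per-bin sampling (a sketch only; the function and column names are hypothetical, not PUF fields or Tax-Calculator API):

    import pandas as pd

    def stratified_sample(records, weight_col, bin_col, frac):
        """Sample `frac` of the rows within each bin and scale that bin's
        weights by 1/frac so each stratum keeps its weighted total."""
        sample = records.groupby(bin_col, group_keys=False).apply(
            lambda g: g.sample(frac=frac))
        sample[weight_col] = sample[weight_col] / frac
        return sample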

@feenberg
Contributor

In my experience, taking a 2% sample (selecting records where mod(n, 50) == 0) and multiplying the weights by 50 doesn't create any seriously misleading results. This is over many years of requests.
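
That systematic selection rule, sketched in pandas (hypothetical names; assumes n is just the row position):

    import pandas as pd

    def two_percent_sample(records, weight_col):
        """Keep every 50th record and scale its weight by 50."""
        sample = records.iloc[::50].copy()  # rows where position % 50 == 0
        sample[weight_col] = sample[weight_col] * 50
        return sample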

@martinholmer
Collaborator

@talumbau, Let me correct the weight-adjustment formulas so that they work for a population where each observation (or filing unit) has a different weight. Earlier I said:

If you are drawing a sample of N out of a population of P, then the sampling fraction is f = N / P. If an observation is drawn in the sample and has a population weight of W, then to get the same sort of aggregate totals from the sample as from the population, you need to use a new weight for that observation equal to W / f.

The formula for f needs to be generalized for our case, in which observation i in the population has a weight denoted by W_i (that is, capital W with an i subscript). The index i runs from 1 to P, and the sum of the population weights can be denoted by SUM_pop (the sum of the W_i for i running from 1 to P). In a similar fashion, if you have a sample of size N where the original population weight of sample observation j is denoted by W_j, then SUM_sam is the sum of the W_j for j running from 1 to N.

Then f = SUM_sam / SUM_pop and the adjusted sample weights are equal to W_j / f.

If you want to do stratified sampling, I think you can apply these concepts to each stratum.
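
A sketch of this generalized adjustment (hypothetical names, not Tax-Calculator API):

    import pandas as pd

    def adjust_weights(pop_weights, sample_index):
        """Reweight sampled records by f = SUM_sam / SUM_pop so the sample
        reproduces the population's weighted aggregates."""
        sum_pop = pop_weights.sum()                    # SUM_pop
        sum_sam = pop_weights.loc[sample_index].sum()  # SUM_sam
        f = sum_sam / sum_pop
        return pop_weights.loc[sample_index] / f       # W_j / f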

@MattHJensen @feenberg @Amy-Xu @GoFroggyRun @zrisher

@talumbau
Member Author

talumbau commented Jul 28, 2016

Yes, to me the concept of stratified sampling seems like the right idea, since TaxBrain users are still going to get breakdowns by income decile (or something similar). Therefore, I wonder if we should precompute these weights and ship them as part of Tax-Calculator. Then one could load an "approved" fraction like this:

   recs = Records(puf_df, fraction=0.02)

And then we would have a certain fixed number of allowed fractions (e.g. 2%, 5%, 10%, 25%). @MattHJensen, do you have thoughts here?
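
Internally, the validation might look something like this (purely a sketch; neither the fraction argument nor the weights-file naming is existing Tax-Calculator API):

    # Hypothetical helper, not actual Records code.
    ALLOWED_FRACTIONS = (0.02, 0.05, 0.10, 0.25)

    def resolve_weights_file(fraction):
        """Map an approved fraction to its precomputed weights file."""
        if fraction not in ALLOWED_FRACTIONS:
            raise ValueError('fraction must be one of '
                             '{}'.format(ALLOWED_FRACTIONS))
        return 'puf_weights_{:.0f}pct.csv'.format(fraction * 100)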

@feenberg
Contributor

A variety of fractions isn't really necessary. I run with 2% until the end, then do a final run with 100%. The answer always changes a little, but I have never been greatly disappointed. For years, when the 2% results were off, I would try with 100%, just in case it was the small sample that was causing the trouble. It never was. If there is trouble, it is always an unforeseen interaction among the tax provisions, which is why it is important to have lots of intermediate values available.

dan


@talumbau
Member Author

talumbau commented Aug 8, 2016

I believe this is now ready to merge. Any comments @martinholmer @MattHJensen @feenberg ?

@MattHJensen
Contributor

@talumbau, this looks good to me, although I haven't given it a deep review. Since it doesn't affect TC results, I'd be happy for it to be merged and then reviewed in more depth later. I do have a few follow-up questions that I hope don't hang things up:

Once this is incorporated on TB, should I expect to get the same result every time I run the same reform, or will there be some fluctuation from random samples? @feenberg, do you have a strong preference for the first iteration of this feature? My initial view is that we should start with what's easy.

I also wonder if we should add a regression test for the results of the 2% sample, since we will be displaying those on TB -- it could be the same tests as we have for the 100% sample. This would also let us track how close the 2% sample and 100% sample results are. Again, I don't think this needs to be in the first iteration though.

@talumbau
Member Author

talumbau commented Aug 9, 2016

Once this is incorporated on TB, should I expect to get the same result every time I run the same reform, or will there be some fluctuation from random samples? @feenberg, do you have a strong preference for the first iteration of this feature?

This is a good point. I will check. I think the desire would be to make it the same each time. This PR can't affect that, though, because it just matches up the weights (and multiplies them) after the sample has been chosen.

I also wonder if we should add a regression test for the results of the 2% sample, since we will be displaying those on TB -- it could be the same tests as we have for the 100% sample. This would also let us track how close the 2% sample and 100% sample results are.

Also a very good point. I think I can add something to one of the requires_puf tests with this. When do you want to do the release?

@MattHJensen
Contributor

TC is good to go for the release as soon as this is merged, and I think you can merge this whenever you'd like.

@talumbau
Member Author

talumbau commented Aug 9, 2016

@martinholmer @MattHJensen @feenberg I added a PUF-required test to assert on the sample data accuracy. I made the margins quite wide: with a 10% sample, I get a max relative difference in combined liability estimate of 0.6%, but the test will pass on any max relative difference up to 5%. We can tighten these tolerances at your discretion.
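
The shape of such a tolerance test (a sketch only; the names and estimate arrays are hypothetical, not the actual test added here):

    import numpy as np

    def check_sample_accuracy(full_estimates, sample_estimates, tol=0.05):
        """Assert the sampled run's combined-liability estimates are
        within `tol` (relative) of the full-data estimates."""
        rel_diff = np.abs(sample_estimates - full_estimates) / full_estimates
        assert rel_diff.max() < tol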

@MattHJensen
Contributor

I added a PUF-required test to assert on the sample data accuracy.

Thanks @talumbau.

I made the margins quite wide: with a 10% sample, I get a max relative difference in combined liability estimate of 0.6%,

Is the plan to use the 10% sample for the quick calculation on TB or a 2% sample? I think we should test whatever is used on TB.

@talumbau
Member Author

talumbau commented Aug 9, 2016

That's a good idea. I changed it to a 2% fraction. The max relative difference with a 2% sample is 2.45%, but I kept the pass criterion at less than 5%.

@talumbau talumbau merged commit 4fb4aba into PSLmodels:master Aug 10, 2016