Sample from weights when not reading full dataset #844

Merged 5 commits into master on Aug 10, 2016

Conversation

talumbau
Member

  • There are separate data sources for tax records and the weights
    associated with those records. This change allows one to read in a
    sample from the full set of tax records and then get the
    corresponding weights from the full data set of weights.
@codecov-io

codecov-io commented Jul 27, 2016

Current coverage is 98.12% (diff: 100%)

Merging #844 into master will increase coverage by 0.04%

@@             master       #844   diff @@
==========================================
  Files            13         13          
  Lines          1777       1816    +39   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           1743       1782    +39   
  Misses           34         34          
  Partials          0          0          

Powered by Codecov. Last update 0474d8f...e59ad62

@martinholmer
Collaborator

@talumbau, Just curious why the following import is added to records.py:

from __future__ import print_function

@talumbau
Member Author

Ah, this was a leftover import from some debug printing I was doing. I'll remove it and push.

@martinholmer
Collaborator

@talumbau, When (say) reading a 10% sample, are the sample weights going to be multiplied by 10? Or, when reading a 5% sample, multiplied by 20?

@MattHJensen @feenberg @Amy-Xu

@talumbau
Member Author

That is a good point. In fact, when running example dropq calculations with this feature, I am getting very low revenue estimates compared to the full dataset, and it is likely due to this issue (this is in support of a desired webapp feature, ospc-org/ospc.org#274). So, in order to get a reasonable approximation of the true revenue estimate with 2% of the data, I think I would have to do the following:

  • sample the desired fraction for each binned group (say income group)
  • increase the weights in each group so that, in a weighted sense, the number of taxpayers is roughly the same.

Does that sound like a good way to implement this feature, @MattHJensen? Given this additional work, perhaps the right thing to do is to offer a weights vector that does this correctly for some pre-determined fractions (e.g. 2%, 5%, 25% of the total data)?

@martinholmer
Collaborator

@talumbau, The basic idea from statistics is this:

If you are drawing a sample of N out of a population of P, then the sampling fraction is f = N / P. If an observation is drawn in the sample and has a population weight of W, then to get the same sort of aggregate totals from the sample as from the population, you need to use a new weight for that observation equal to W / f.
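
In code, that adjustment is just (a toy illustration with made-up numbers):

    P = 100000        # population size
    N = 10000         # sample size (a 10% sample)
    f = float(N) / P  # sampling fraction = 0.1
    W = 500.0         # population weight of one sampled observation
    new_W = W / f     # 5000.0: the record now stands in for 10x as many units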

Does that make sense?

If you are going to allow TaxBrain users to do this sampling (in order to reduce execution time, I guess), you are going to have to be clear about the lack of accuracy inherent in a small sample.

@MattHJensen @feenberg @Amy-Xu @GoFroggyRun @zrisher

@talumbau
Member Author

If you are drawing a sample of N out of a population of P, then the sampling fraction is f = N / P. If an observation is drawn in the sample and has a population weight of W, then to get the same sort of aggregate totals from the sample as from the population, you need to use a new weight for that observation equal to W / f.

Yes, that makes sense to me. My comment above was regarding the way the web app presents statistics on the reforms. In order to have reasonably similar results for a quick calculation, my thinking was that we would need to consider each bin as its own population and then modify the weights as you suggest. So we would actually have sub-populations P_1, ..., P_n, and we would have to make sure that we are sampling at some fraction f within each P_i, instead of just sampling at fraction f across the global P of all the records. Does that idea seem like the right way to proceed?
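
Something like this per-bin sampling (a sketch only; the function and column names are hypothetical, not PUF fields or Tax-Calculator API):

    import pandas as pd

    def stratified_sample(records, weight_col, bin_col, frac):
        """Sample `frac` of the rows within each bin and scale that bin's
        weights by 1/frac so each stratum keeps its weighted total."""
        sample = records.groupby(bin_col, group_keys=False).apply(
            lambda g: g.sample(frac=frac))
        sample[weight_col] = sample[weight_col] / frac
        return sample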

@feenberg
Contributor

In my experience, taking a 2% sample (selecting records where mod(n, 50) == 0) and multiplying the weights by 50 doesn't create any seriously misleading results. This is over many years of requests.
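
That systematic selection rule, sketched in pandas (hypothetical names; assumes n is just the row position):

    import pandas as pd

    def two_percent_sample(records, weight_col):
        """Keep every 50th record and scale its weight by 50."""
        sample = records.iloc[::50].copy()  # rows where position % 50 == 0
        sample[weight_col] = sample[weight_col] * 50
        return sample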

@martinholmer
Collaborator

@talumbau, Let me correct the weight-adjustment formulas so that they work for a population where each observation (or filing unit) has a different weight. Earlier I said:

If you are drawing a sample of N out of a population of P, then the sampling fraction is f = N / P. If an observation is drawn in the sample and has a population weight of W, then to get the same sort of aggregate totals from the sample as from the population, you need to use a new weight for that observation equal to W / f.

The formula for f needs to be generalized for our case, in which observation i in the population has a weight denoted by W_i (that is, capital W with an i subscript). The index i runs from 1 to P, and the sum of the population weights can be denoted by SUM_pop (the sum of the W_i for i running from 1 to P). In a similar fashion, if you have a sample of size N where the original population weight of sample observation j is denoted by W_j, then SUM_sam is the sum of the W_j for j running from 1 to N.

Then f = SUM_sam / SUM_pop and the adjusted sample weights are equal to W_j / f.

If you want to do stratified sampling, I think you can apply these concepts to each stratum.
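
A sketch of this generalized adjustment (hypothetical names, not Tax-Calculator API):

    import pandas as pd

    def adjust_weights(pop_weights, sample_index):
        """Reweight sampled records by f = SUM_sam / SUM_pop so the sample
        reproduces the population's weighted aggregates."""
        sum_pop = pop_weights.sum()                    # SUM_pop
        sum_sam = pop_weights.loc[sample_index].sum()  # SUM_sam
        f = sum_sam / sum_pop
        return pop_weights.loc[sample_index] / f       # W_j / f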

@MattHJensen @feenberg @Amy-Xu @GoFroggyRun @zrisher

@talumbau
Member Author

talumbau commented Jul 28, 2016

Yes, to me the concept of stratified sampling seems like the right idea, since TaxBrain users are still going to get breakdowns by income decile (or something similar). Therefore, I wonder if we should precompute these weights and ship them as part of Tax-Calculator. Then one could load an "approved" fraction like this:

   recs = Records(puf_df, fraction=0.02)

And then we would have a certain fixed number of allowed fractions (e.g. 2%, 5%, 10%, 25%). @MattHJensen, do you have thoughts here?
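
Internally, the validation might look something like this (purely a sketch; neither the fraction argument nor the weights-file naming is existing Tax-Calculator API):

    # Hypothetical helper, not actual Records code.
    ALLOWED_FRACTIONS = (0.02, 0.05, 0.10, 0.25)

    def resolve_weights_file(fraction):
        """Map an approved fraction to its precomputed weights file."""
        if fraction not in ALLOWED_FRACTIONS:
            raise ValueError('fraction must be one of '
                             '{}'.format(ALLOWED_FRACTIONS))
        return 'puf_weights_{:.0f}pct.csv'.format(fraction * 100)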

@feenberg
Contributor

A variety of fractions isn't really necessary. I run with 2% until the end, then do a final run with 100%. The answer always changes a little, but I have never been greatly disappointed. For years, when the 2% results were off, I would try with 100%, just in case it was the small sample that was causing the trouble. It never was. If there is trouble, it is always an unforeseen interaction among the tax provisions, which is why it is important to have lots of intermediate values available.

dan


@talumbau
Member Author

talumbau commented Aug 8, 2016

I believe this is now ready to merge. Any comments @martinholmer @MattHJensen @feenberg ?

@MattHJensen
Contributor

@talumbau, this looks good to me, although I haven't given it a deep review. Since it doesn't affect TC results, I'd be happy for it to be merged and then reviewed in more depth later. I do have a few follow-up questions that I hope don't hang things up:

Once this is incorporated on TB, should I expect to get the same result every time I run the same reform, or will there be some fluctuation from random samples? @feenberg, do you have a strong preference for the first iteration of this feature? My initial view is that we should start with what's easy.

I also wonder if we should add a regression test for the results of the 2% sample, since we will be displaying those on TB -- it could be the same tests as we have for the 100% sample. This would also let us track how close the 2% sample and 100% sample results are. Again, I don't think this needs to be in the first iteration though.

@talumbau
Member Author

talumbau commented Aug 9, 2016

Once this is incorporated on TB, should I expect to get the same result every time I run the same reform, or will there be some fluctuation from random samples? @feenberg, do you have a strong preference for the first iteration of this feature?

This is a good point. I will check. I think the desire would be to make it the same each time. This PR can't affect that, though, because it just matches up the weights (and multiplies them) after the sample has been chosen.

I also wonder if we should add a regression test for the results of the 2% sample, since we will be displaying those on TB -- it could be the same tests as we have for the 100% sample. This would also let us track how close the 2% sample and 100% sample results are.

Also a very good point. I think I can add something to one of the requires_puf tests with this. When do you want to do the release?

@MattHJensen
Contributor

TC is good to go for the release as soon as this is merged, and I think you can merge this whenever you'd like.

@talumbau
Member Author

talumbau commented Aug 9, 2016

@martinholmer @MattHJensen @feenberg I added a PUF-required test to assert on the sample data accuracy. I made the margins quite wide: with a 10% sample, I get a max relative difference in combined liability estimate of 0.6%, but the test will pass on any max relative difference up to 5%. We can tighten these tolerances at your discretion.
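
The shape of such a tolerance test (a sketch only; the names and estimate arrays are hypothetical, not the actual test added here):

    import numpy as np

    def check_sample_accuracy(full_estimates, sample_estimates, tol=0.05):
        """Assert the sampled run's combined-liability estimates are
        within `tol` (relative) of the full-data estimates."""
        rel_diff = np.abs(sample_estimates - full_estimates) / full_estimates
        assert rel_diff.max() < tol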

@MattHJensen
Contributor

I added a PUF-required test to assert on the sample data accuracy.

Thanks @talumbau.

I made the margins quite wide: with a 10% sample, I get a max relative difference in combined liability estimate of 0.6%,

Is the plan to use the 10% sample for the quick calculation on TB or a 2% sample? I think we should test whatever is used on TB.

@talumbau
Member Author

talumbau commented Aug 9, 2016

That's a good idea. I changed it to a 2% fraction. The max relative difference with a 2% sample is 2.45%, but I kept the pass criterion at less than 5%.

@talumbau talumbau merged commit 4fb4aba into PSLmodels:master Aug 10, 2016