Sample from weights when not reading full dataset #844
Conversation
- There are separate data sources for tax records and the weights associated with those records. This change allows one to read in a sample from the full set of tax records and then pull the corresponding weights from the full set of weights.
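A minimal sketch of that matching step, assuming both sources are CSV files whose rows line up one-to-one (`records.csv` and `weights.csv` are placeholder names, not the PR's actual paths):

```python
import pandas as pd

# Full data sources: one file of tax records, one file of per-record weights.
records = pd.read_csv("records.csv")   # placeholder path
weights = pd.read_csv("weights.csv")   # placeholder path; one row per record

# Draw a 2% sample of the records, then select the matching weight rows
# by shared index so each sampled record keeps its own weight.
sample = records.sample(frac=0.02, random_state=0)
sample_weights = weights.loc[sample.index]

assert len(sample) == len(sample_weights)
```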
Current coverage is 98.12% (diff: 100%)

@@            master    #844   diff @@
======================================
  Files           13      13
  Lines         1777    1816    +39
  Methods          0       0
  Messages         0       0
  Branches         0       0
======================================
+ Hits          1743    1782    +39
  Misses          34      34
  Partials         0       0
@talumbau, Just curious why the following import was added to records.py:
Ah, this was a leftover import from some debug printing I was doing. I'll remove it and push.
@talumbau, When (say) reading a 10% sample, are the sample weights going to be multiplied by 10? Or, when reading a 5% sample, multiplied by 20?
That is a good point. In fact, when running example dropq calculations with this feature, I am getting very low revenue estimates compared to the full dataset, and it is likely due to this issue (this is in support of a desired webapp feature, ospc-org/ospc.org#274). So, in order to get some kind of reasonable approximation of a true revenue estimate with 2% of the data, I think I would have to multiply the sample weights by the inverse of the sampling fraction (by 50 for a 2% sample).
Does that sound like a good way to implement this feature, @MattHJensen? Due to this additional work, perhaps the right thing to do is to offer a weights vector that does this correctly for some pre-determined fractions (e.g. 2%, 5%, 25% of the total data)?
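A short sketch of that inverse-fraction adjustment (variable names and file path are illustrative, not from the PR):

```python
import pandas as pd

frac = 0.02  # 2% sample

weights = pd.read_csv("weights.csv")                 # placeholder path
sample_weights = weights.sample(frac=frac, random_state=0)

# Multiply each sampled weight by 1/frac (here, 50) so that weighted
# totals computed from the sample approximate full-population totals.
adjusted_weights = sample_weights / frac
```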
@talumbau, The basic idea from statistics is this: if you are drawing a sample of size n from a population of size N, then each sampled unit stands in for N/n population units, so its weight should be multiplied by N/n (the inverse of the sampling fraction). Does that make sense? If you are going to allow TaxBrain users to do this sampling (in order to reduce execution time, I guess), you are going to have to be clear about the lack of accuracy inherent in a small sample.
Yes, that makes sense to me. My comment above was regarding the way that the web app presents statistics on the reforms. In order to have reasonably similar results for a quick calculation, my thinking was that we would need to consider each bin as its own population and then modify the weights as you suggest. So we would actually have populations P_1, ..., P_n, and we would have to make sure that we are sampling at some fraction f within each of the P_i, instead of just sampling at fraction f across the global population P of all the records. Does that idea seem like the right way to proceed?
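A sketch of that stratified approach, assuming the strata are income deciles and using pandas groupby (the `income` column name and file path are hypothetical):

```python
import pandas as pd

frac = 0.02  # sample at the same fraction within every stratum

records = pd.read_csv("records.csv")  # placeholder path

# Treat each income decile as its own population P_i and sample
# fraction f within it, rather than f across all records at once.
records["decile"] = pd.qcut(records["income"], 10, labels=False)
sample = (
    records.groupby("decile", group_keys=False)
           .apply(lambda g: g.sample(frac=frac, random_state=0))
)
```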
On Wed, 27 Jul 2016, T.J. Alumbaugh wrote:
In my experience, taking a 2% sample (selecting records where mod(n,50)=0) and multiplying the weights by 50 works quite well.
@talumbau, Let me correct the weight-adjustment formulas so that they work for a population where each observation (or filing unit) has a different weight. Earlier I said each sampled unit's weight should be multiplied by N/n, which is exactly right only when all the weights are equal.
The formula for the adjusted weight of sampled unit i is w_i* = w_i * (W / W_s), where W is the sum of the weights over the full population and W_s is the sum of the weights over the sample. Then the adjusted sample weights sum to W, so weighted aggregates from the sample are on the same scale as those from the full population. If you want to do stratified sampling, I think you can apply these concepts to each stratum.
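That correction, expressed as a short sketch (the file path and `WT` column name are placeholders, not the actual weights layout):

```python
import pandas as pd

weights = pd.read_csv("weights.csv")["WT"]   # placeholder path/column
sample_idx = weights.sample(frac=0.02, random_state=0).index

# w_i* = w_i * (W / W_s): rescale so the sampled weights sum to the
# full-population weight total W, preserving each unit's relative weight.
W = weights.sum()
W_s = weights.loc[sample_idx].sum()
adjusted = weights.loc[sample_idx] * (W / W_s)

assert abs(adjusted.sum() - W) < 1e-6 * W
```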
Yes, to me the concept of stratified sampling seems like the right idea, since TaxBrain users are still going to get breakdowns by decile of income (or something similar). Therefore, I wonder if we should precompute these weights and have them as part of Tax-Calculator. Then one could load an "approved" fraction like this:
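(The original snippet was not preserved; the following is a hypothetical reconstruction of the kind of call meant, with the file names and the `data`/`weights` arguments as assumptions, not the PR's actual API.)

```python
from taxcalc import Records

# Load a 2% sample alongside precomputed, pre-adjusted weights shipped
# with Tax-Calculator for that "approved" fraction (file names assumed).
recs = Records(data="puf_2pct_sample.csv", weights="weights_2pct.csv")
```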
And then we would have a certain fixed number of allowed fractions (e.g. 2%, 5%, 10%, 25%). @MattHJensen, do you have thoughts here?
On Wed, 27 Jul 2016, T.J. Alumbaugh wrote:
A variety of fractions isn't really necessary. I run with 2% till the end.
dan
I believe this is now ready to merge. Any comments, @martinholmer @MattHJensen @feenberg?
@talumbau, this looks good to me, although I haven't given it a deep review. Since it doesn't affect TC results, I'd be happy for it to be merged and then reviewed in more depth later. I do have a few follow-up questions that I hope don't hang things up: once this is incorporated on TB, should I expect to get the same result every time? I also wonder if we should add a regression test for the results of the 2% sample, since we will be displaying those on TB -- these could be the same tests as we have for the 100% sample. This would also allow us to track how close the 2% sample and 100% sample results are. Again, I don't think this needs to be in the first iteration, though.
This is a good point. I will check. I think the desire would be to make it the same each time. This PR can't affect that, though, because it is just matching up the weights (and multiplying them) after the choice has been made.
Also a very good point. I think I can add something to one of the existing test modules.
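On the determinism question, a common way to make the sample identical on every run is to fix the random seed; a sketch assuming pandas sampling (not necessarily what TB does):

```python
import pandas as pd

records = pd.read_csv("records.csv")  # placeholder path

# A fixed random_state makes .sample() deterministic, so repeated quick
# calculations would draw the same 2% subset every time.
sample = records.sample(frac=0.02, random_state=1234)
```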
TC is good to go for the release as soon as this is merged, and I think you can merge this whenever you'd like.
@martinholmer @MattHJensen @feenberg I added a PUF-required test to assert on the sample data accuracy. I made the margins quite wide: with a 10% sample, I get a max relative difference in combined liability estimate of 0.6%, but the test will pass on any max relative difference up to 5%. We can tighten these tolerances at your discretion.
Thanks @talumbau.
Is the plan to use the 10% sample for the quick calculation on TB or a 2% sample? I think we should test whatever is used on TB.
That's a good idea. I changed it to a 2% fraction. The max relative difference with a two percent sample is 2.45%, but I kept the pass criterion at less than 5%.
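A hedged sketch of what such a tolerance test can look like (the estimate values and helper function are illustrative; the PR's actual test lives in the Tax-Calculator test suite and requires the PUF):

```python
import numpy as np

def max_relative_difference(full, sampled):
    """Largest relative gap between full-data and sampled estimates."""
    full = np.asarray(full, dtype=float)
    sampled = np.asarray(sampled, dtype=float)
    return np.max(np.abs(sampled - full) / np.abs(full))

def test_two_percent_sample_accuracy():
    # Illustrative combined-liability estimates; in the real test these
    # come from full-data and 2%-sample Tax-Calculator runs.
    full_estimates = [2.40e12, 2.52e12, 2.61e12]
    sample_estimates = [2.45e12, 2.49e12, 2.64e12]
    # A 2.45% max difference was observed; the pass criterion is wider (5%).
    assert max_relative_difference(full_estimates, sample_estimates) < 0.05
```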