Improve handling of non-positive incomes in tables and graphs #1902

Merged
merged 8 commits into from
Mar 8, 2018

Conversation

martinholmer
Collaborator

@martinholmer martinholmer commented Mar 4, 2018

This pull request, which is built on #1901 and consists of commits 4aa34ce, 31bc38f, 84a453d, 4054ebf and 10b274a, attempts to resolve issue #1888. It does this by dividing the bottom decile in the distribution and difference tables into two subgroups: one containing filing units with negative or zero income (either expanded_income or AGI c00100) and the other containing filing units with positive income. The tables compute all statistics for both subgroups, leaving to Tax-Calculator users the decision about whether or not to show the statistics for the non-positive subgroup (only one of which is misleading).

The decile graph of percentage change in after-tax expanded income has been revised to allow the user to decide whether or not to show the percentage change for the bottom decile subgroup with non-positive income. The default option is to hide the percentage change of the non-positive subgroup and to scale the width of the positive subgroup's bar so that it is proportional to the weighted number of filing units in that subgroup.
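The bar-width scaling described above can be sketched as follows. This is an illustration only, not code from the pull request: the helper name `positive_bar_width` and the weights are hypothetical, and the sketch assumes the width is scaled relative to a full decile bar.

```python
def positive_bar_width(nonpos_weight, pos_weight, full_width=1.0):
    """Width of the bottom-decile bar when only the positive subgroup is shown.

    Hypothetical helper: the bar shrinks in proportion to the positive
    subgroup's share of the decile's weighted number of filing units.
    """
    decile_weight = nonpos_weight + pos_weight
    if decile_weight <= 0:
        return 0.0
    return full_width * pos_weight / decile_weight


# Illustrative weights in millions of filing units (made-up numbers).
width = positive_bar_width(nonpos_weight=4.0, pos_weight=12.0)
print(width)  # 0.75
```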

The approach taken in this pull request produces tables in which the components add up to the total, which is an essential logical requirement.
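A minimal sketch of the adding-up property, assuming a toy bottom decile of three filing units. The `Unit` class and helper names are hypothetical and are not taken from taxcalc/utils.py; only the split at income <= 0 versus income > 0 comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    weight: float    # sampling weight (s006 in the dump output)
    income: float    # expanded_income
    aftertax: float  # after-tax expanded income


def weighted_sum(units, attr):
    """Weighted sum of one attribute over a group of filing units."""
    return sum(u.weight * getattr(u, attr) for u in units)


def split_bottom_decile(units):
    """Split a decile into non-positive and positive income subgroups."""
    nonpos = [u for u in units if u.income <= 0]
    pos = [u for u in units if u.income > 0]
    return nonpos, pos


# Made-up bottom decile: one negative, one zero, one positive income unit.
bottom = [
    Unit(2.0, -500.0, -520.0),
    Unit(3.0, 0.0, 0.0),
    Unit(5.0, 800.0, 780.0),
]
nonpos, pos = split_bottom_decile(bottom)
decile_total = weighted_sum(bottom, "aftertax")
parts_total = weighted_sum(nonpos, "aftertax") + weighted_sum(pos, "aftertax")
# The two subgroup rows add up to the undivided decile row.
print(decile_total, parts_total)
```

Because every unit lands in exactly one subgroup, any weighted-sum statistic for the two rows adds up to the statistic for the whole decile.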

@codecov-io

codecov-io commented Mar 4, 2018

Codecov Report

Merging #1902 into master will not change coverage.
The diff coverage is 100%.


@@          Coverage Diff           @@
##           master   #1902   +/-   ##
======================================
  Coverage     100%    100%           
======================================
  Files          37      37           
  Lines        3371    3312   -59     
======================================
- Hits         3371    3312   -59
Impacted Files Coverage Δ
taxcalc/calculate.py 100% <100%> (ø) ⬆️
taxcalc/utils.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b20973...eb3ce9d. Read the comment docs.

@martinholmer martinholmer added ready and removed WIP labels Mar 4, 2018
@MaxGhenis
Contributor

Are incomes of zero included in the separate bucket to avoid dividing by zero? For this to be an issue in decile graphs, one would have to exclude benefits (set all weights to zero) and then the share of tax units with zero income would have to nearly triple (e.g. aging into the future with odd dynamic effects).

#1888 only concerns negative incomes (aside from my brief comment), which are most important because we don't think they're actually poor, and because other tax analysis groups give them special treatment. @codykallen's research did not show that other groups treat tax units with zero income differently, i.e. they're included as normal members of the bottom decile. It seems unlikely that they're not actually poor, since their business loss would have to be exactly zero, but it could be worth follow-up investigation if anyone's concerned.

If you want to offer flexibility for zero incomes, what do you think about splitting into two separate excluded groups, zero and negative?

@feenberg
Contributor

feenberg commented Mar 4, 2018 via email

@MaxGhenis
Contributor

MaxGhenis commented Mar 4, 2018

Summaries should be the ratio of averages.

@feenberg agreed which is why the entire bottom decile would have to have zero income in order for any zeros to be problematic. Tax units with zero expanded income currently make up ~0.7% of tax units when including benefits, or ~4% without.

This could still affect percentile plots though. Zeroing out benefits would create full percentiles with potentially infinite change. Is this the use case you're thinking of @martinholmer?

@martinholmer
Collaborator Author

@feenberg said in a comment on pull request #1902:

If there are any summary measures that are the average of ratios, that is probably a bad idea. Summaries should be the ratio of averages. That avoids dividing by zero, and it avoids giving excess weight to observations with near-zero denominators.

As the source code shows, Tax-Calculator never constructs table statistics as "the average of ratios" and always constructs them as "the ratio of averages [or weighted sums]". For example, the main "ratio" statistic is the difference table's percentage change in after-tax expanded income. This statistic is calculated by computing the subgroup's weighted sum of after-tax expanded income under the baseline and under the reform, and then using these two numbers to compute the percentage change for the subgroup. Tax-Calculator never computes the percentage change in after-tax expanded income for individual filing units.
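The difference between the two constructions can be illustrated with a small sketch. The function names and the two-unit data are hypothetical; the first function follows the weighted-sum approach described above, and the second is included only for contrast.

```python
def pct_change_ratio_of_sums(weights, base, reform):
    """Percentage change computed from subgroup weighted sums."""
    base_total = sum(w * b for w, b in zip(weights, base))
    reform_total = sum(w * r for w, r in zip(weights, reform))
    return 100.0 * (reform_total - base_total) / base_total


def pct_change_average_of_ratios(weights, base, reform):
    """Weighted average of unit-level percentage changes (for contrast only)."""
    total_weight = sum(weights)
    unit_changes = [100.0 * (r - b) / b for b, r in zip(base, reform)]
    return sum(w * c for w, c in zip(weights, unit_changes)) / total_weight


# Made-up subgroup: one unit with near-zero income, one with ordinary income.
weights = [1.0, 1.0]
base = [1.0, 1000.0]
reform = [2.0, 1000.0]

ratio_of_sums = pct_change_ratio_of_sums(weights, base, reform)
average_of_ratios = pct_change_average_of_ratios(weights, base, reform)
# The one-dollar gain is about 0.1% of the subgroup total, but the
# average of ratios reports 50% because the near-zero unit dominates.
print(ratio_of_sums, average_of_ratios)
```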

@martinholmer
Collaborator Author

martinholmer commented Mar 5, 2018

Let me provide some information about the weighted number of filing units with zero or negative expanded_income in order to inform the discussion of pull request #1902. First, I show the tabulation program, then use it to tabulate 2015 baseline policy dump results, and then offer some thoughts on the results.

These tabulations use the version of Tax-Calculator that is in pull request #1902, which includes the pension fix for non-filers in the newest puf.csv file described in #1892.

First, the SQL tabulation program:

$ head -50 neginc.sql
/*
DEFINITION OF EI:
    expanded_income = (
        e00200 +  # wage and salary income
        e00300 +  # taxable interest income
        e00400 +  # non-taxable interest income
        e00600 +  # dividends
        e00700 +  # state and local income tax refunds
        e00800 +  # alimony received
        e00900 +  # Sch C business net income/loss                           ***
        e01100 +  # capital gain distributions not reported on Sch D         ***
        e01200 +  # Form 4797 other net gain/loss                            ***
        e01400 +  # taxable IRA distributions
        e01500 +  # total pension and annuity income
        e02000 +  # Sch E total rental, ..., partnership, S-corp income/loss ***
        e02100 +  # Sch F farm net income/loss                               ***
        p22250 +  # Sch D: net short-term capital gain/loss                  ***
        p23250 +  # Sch D: net long-term capital gain/loss                   ***
        cmbtp +  # other AMT taxable income items from Form 6251
        0.5 * ptax_was +  # employer share of FICA taxes
        benefit_value_total +  # consumption value of all benefits received
        ubi  # total UBI benefit
*/
select "#m in total", round(sum(s006)*1e-6,3)
        from dump;
select "#m with EI=0", round(sum(s006)*1e-6,3)
        from dump where expanded_income = 0;
select "#m with EI<0", round(sum(s006)*1e-6,3)
        from dump where expanded_income < 0;
select "#m with EI<0 &  e00900<0", round(sum(s006)*1e-6,3)
        from dump where expanded_income < 0 and e00900 < 0;
select "#m with EI<0 &  e01200<0", round(sum(s006)*1e-6,3)
        from dump where expanded_income < 0 and e01200 < 0;
select "#m with EI<0 &  e02000<0", round(sum(s006)*1e-6,3)
        from dump where expanded_income < 0 and e02000 < 0;
select "#m with EI<0 &  e02100<0", round(sum(s006)*1e-6,3)
        from dump where expanded_income < 0 and e02100 < 0;
select "#m with EI<0 & CapGain<0", round(sum(s006)*1e-6,3)
        from dump where expanded_income < 0 and (p22250+p23250) < 0;
select "#m with EI<0 &   cmbtp<0", round(sum(s006)*1e-6,3)
        from dump where expanded_income < 0 and cmbtp < 0;
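For readers without a dump database at hand, the same kind of tabulation can be tried on a toy table using Python's standard sqlite3 module. This is a sketch under stated assumptions: the four rows are made up, the table carries only three of the dump columns, and the `millions` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table dump (s006 real, expanded_income real, e00900 real)"
)
# Made-up rows: (weight, expanded_income, Sch C net income/loss).
rows = [
    (2.0e6, 0.0, 0.0),
    (1.0e6, -5000.0, -7000.0),
    (0.5e6, -100.0, 0.0),
    (6.5e6, 40000.0, 0.0),
]
conn.executemany("insert into dump values (?, ?, ?)", rows)


def millions(where="1"):
    """Weighted count, in millions, of rows matching a SQL condition."""
    sql = f"select round(sum(s006)*1e-6, 3) from dump where {where}"
    return conn.execute(sql).fetchone()[0]


total = millions()
zero_ei = millions("expanded_income = 0")
neg_ei = millions("expanded_income < 0")
neg_ei_schc = millions("expanded_income < 0 and e00900 < 0")
# sum() over no matching rows is NULL, which Python receives as None
# and the sqlite3 command-line shell prints as an empty field.
no_match = millions("expanded_income > 1e9")
print(total, zero_ei, neg_ei, neg_ei_schc, no_match)
```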

Next are 2015 PUF results (with some hand calculations after the ==>):

$ tc puf.csv 2015 --sqldb
$ cat neginc.sql | sqlite3 puf-15-#-#-#.db
#m in total|164.306
#m with EI=0|3.2                  ==> 1.95%
#m with EI<0|1.254                ==> 0.76%
#m with EI<0 &  e00900<0|0.444
#m with EI<0 &  e01200<0|0.17
#m with EI<0 &  e02000<0|0.461
#m with EI<0 &  e02100<0|0.039
#m with EI<0 & CapGain<0|0.496
#m with EI<0 &   cmbtp<0|0.103

And now the 2015 CPS results (with some hand calculations after the ==>):

$ tc cps.csv 2015 --sqldb
$ cat neginc.sql | sqlite3 cps-15-#-#-#.db
#m in total|163.196
#m with EI=0|1.229                ==> 0.75%
#m with EI<0|0.087                ==> 0.05%
#m with EI<0 &  e00900<0|0.083
#m with EI<0 &  e01200<0|
#m with EI<0 &  e02000<0|
#m with EI<0 &  e02100<0|0.007
#m with EI<0 & CapGain<0|
#m with EI<0 &   cmbtp<0|
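The hand calculations after the ==> markers can be reproduced with a few lines of arithmetic. The `share_pct` helper is hypothetical; the inputs are the weighted counts (in millions) from the two tabulations above.

```python
def share_pct(part, total):
    """Share of the total, as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


puf_total, cps_total = 164.306, 163.196

puf_zero = share_pct(3.2, puf_total)
puf_neg = share_pct(1.254, puf_total)
cps_zero = share_pct(1.229, cps_total)
cps_neg = share_pct(0.087, cps_total)

print(puf_zero, puf_neg)  # 1.95 0.76
print(cps_zero, cps_neg)  # 0.75 0.05
```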

And finally here are some of my thoughts.

My understanding of the discussion in issue #1888 is that negative expanded_income is thought to be a poor indicator of some more sensible (but unmeasurable with our data) notion of income, and therefore, filing units with negative expanded_income should be somehow separated out from those with low positive expanded_income. There is no doubt that in some cases this is true. But the above tabulation results suggest to me that others with negative expanded_income are quite similar to those with low positive expanded_income. For example, slightly more than a third of those with negative expanded_income in the PUF data have negative Schedule C income. In the #1888 discussion much was made of large loss carryforwards, but surely these (Trump-like) cases are not doing this through an unincorporated Schedule C business, are they? And then there is the much smaller group whose negative expanded_income seems to be related to farming losses. Do we not think at least some of these farmers filing Schedule F are similar to those with low positive expanded_income?

Others may differ, but my conclusion is that it is impossible to tell with our data whether those with negative expanded_income are similar or not to those with low positive expanded_income. So, I don't think there is any solid economic argument one way or the other on this matter.

However, there is a practical reason to segregate those with negative expanded_income from those with low positive expanded_income. It has to do with the misleading results that can be generated for the key percentage change in after-tax expanded income statistic if a subgroup's baseline after-tax expanded income is negative (which is common among those with negative expanded_income). This issue was first raised by @MaxGhenis in #1806. It is this practical reason that is the rationale for this pull request #1902. Note that in the 2015 PUF results only 0.72 percent of those with negative expanded_income have positive after-tax expanded income; in the 2015 CPS data that fraction is zero (in fact, every filing unit with negative expanded_income has negative after-tax expanded income).
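A two-line example shows why a negative baseline makes the statistic misleading. The `pct_change` function and the dollar amounts are illustrative assumptions, not code or data from Tax-Calculator.

```python
def pct_change(base_total, reform_total):
    """Percentage change of a subgroup total relative to its baseline."""
    return 100.0 * (reform_total - base_total) / base_total


# Both subgroups gain 100 under the reform, but the sign of the
# statistic flips when the baseline total is negative.
positive_base = pct_change(1000.0, 1100.0)
negative_base = pct_change(-1000.0, -900.0)
print(positive_base, negative_base)  # 10.0 -10.0
```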

There is an additional set of questions about filing units with exactly zero expanded_income. You can see in the above results that the group with zero expanded_income is much larger than the group with negative expanded_income in both the PUF and CPS data. In this pull request they have also been segregated from those with small positive expanded_income and grouped with those with negative expanded_income. Why? For practical reasons. When we have many expanded_income subgroups (as in the graph of percentage change in after-tax expanded income by percentile), some of them can consist of all zeros, which may leave any ratio statistic undefined. This is the reason those percentiles are not shown in the graph of percentage change in after-tax expanded income by percentile. Those with zero expanded_income are treated the same way in this pull request: they are segregated away from those with low positive expanded_income and grouped with those with negative expanded_income for practical reasons.
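The all-zeros percentile problem can be seen in a toy population. This sketch assumes 1,000 equally weighted units of which 20 have zero income, uses income itself as a stand-in for baseline after-tax income, and assigns percentiles by cumulative-weight midpoint; the `assign_percentiles` helper is hypothetical and is not how taxcalc bins units.

```python
def assign_percentiles(units):
    """Assign (weight, income) units to percentiles 1..100 by cumulative weight."""
    ordered = sorted(units, key=lambda u: u[1])
    total_weight = sum(w for w, _ in ordered)
    cumulative = 0.0
    binned = []
    for weight, income in ordered:
        midpoint = cumulative + weight / 2.0
        pct = min(100, int(100.0 * midpoint / total_weight) + 1)
        binned.append((pct, weight, income))
        cumulative += weight
    return binned


# 20 zero-income units (2% of the weight) and 980 positive-income units.
units = [(1.0, 0.0)] * 20 + [(1.0, 100.0 * k) for k in range(1, 981)]
binned = assign_percentiles(units)

base_by_pct = {}
for pct, weight, income in binned:
    base_by_pct[pct] = base_by_pct.get(pct, 0.0) + weight * income

# Percentiles whose baseline total is zero have no defined percentage change.
all_zero = sorted(p for p, total in base_by_pct.items() if total == 0.0)
print(all_zero)  # [1, 2]
```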

From an economic point of view, it is difficult to determine with our data whether or not they are similar to those with low positive expanded_income. My guess is that many with zero expanded_income have missing (or incorrectly imputed) income amounts. But others with zero expanded_income probably really do have zero annual income but are supporting their consumption by drawing down their (perhaps considerable) assets. So, again we are back to the practical reasons for segregating them.

If you want to propose an alternative way of handling those with zero expanded_income in this pull request, you need to suggest a way to construct the graph of percentage change in after-tax expanded income by percentile in a way that is consistent with your proposal in this pull request.

@MaxGhenis
Contributor

My guess is that many with zero expanded_income have missing (or incorrectly imputed) income amounts.

Could be, but at 0.8% it's at least a similar order of magnitude to the Survey of Income and Program Participation, which estimated 0.2% in 2012 (see PSLmodels/C-TAM#61).

But others with zero expanded_income probably really do have zero annual income but are supporting their consumption by drawing down their (perhaps considerable) assets.

True but tc reports income, not consumption. As long as there's a clear distinction made, as there must be for this group, I think it's a reasonably legitimate data point. Their assets are also unlikely to be considerable since they don't report capital gains taxes, right?

If you want to propose an alternative way of handling those with zero expanded_income in this pull request, you need to suggest a way to construct the graph of percentage change in after-tax expanded income by percentile in a way that is consistent with your proposal in this pull request.

I'd suggest including them in buckets that also contain tax units with positive income, which would be all deciles and percentiles when including benefits. This part is consistent with other tax analysis groups. Buckets that contain no tax units with positive income (e.g. some percentiles when excluding benefits) would have an undefined % change, so would not be plotted.
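The suggestion above, leaving undefined buckets unplotted, might look like the following sketch. The function name, the bucket totals, and the use of None to mark an undefined value are all illustrative assumptions, and the sketch treats a zero baseline total as the undefined case.

```python
from typing import Optional


def pct_change_or_none(base_total: float, reform_total: float) -> Optional[float]:
    """Percentage change, or None when the baseline total is zero."""
    if base_total == 0:
        return None
    return 100.0 * (reform_total - base_total) / base_total


# Made-up (baseline, reform) after-tax income totals by bucket number.
buckets = {1: (0.0, 0.0), 2: (0.0, 50.0), 3: (2000.0, 2100.0)}

changes = {k: pct_change_or_none(b, r) for k, (b, r) in buckets.items()}
# Only buckets with a defined percentage change would be drawn.
plotted = {k: v for k, v in changes.items() if v is not None}
print(plotted)  # {3: 5.0}
```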

@feenberg
Contributor

feenberg commented Mar 5, 2018 via email

@martinholmer
Collaborator Author

Dan @feenberg asked in the discussion of pull request #1902:

In the PUF data, can you tell why the person with no income is filing a return? Is it EIC, withholding or AMT?

In the CPS data, are the taxpayers with no income their own household, or do they live in another's household? If they live in another's household, I would believe they have low lifetime income.

Below are some tabulations that answer some of your questions.

In the PUF data, 80% of those with zero expanded_income are filing units added to the PUF from the CPS, 60% of whom are single individuals. The puf.csv file has no information from the CPS about their living arrangements.

In the CPS data, 51% of those with zero expanded_income are single individuals, but there are many families.

It seems to me that these results suggest that not all of those with zero expanded_income are similar to those with low positive expanded_income. Such a conclusion is consistent with the practical reasons for segregating those with zero expanded_income from those with low positive expanded_income.

select "#m in total", round(sum(s006)*1e-6,3)
        from dump;

select "#m with EI=0", round(sum(s006)*1e-6,3)
        from dump where expanded_income = 0;

select "#m with EI=0 by filer", filer, round(sum(s006)*1e-6,3)
        from dump where expanded_income = 0
        group by filer;

select "#m with EI=0 by MARS", MARS, round(sum(s006)*1e-6,3)
        from dump where expanded_income = 0
        group by MARS;

select "#m with EI=0 by XTOT", XTOT, round(sum(s006)*1e-6,3)
        from dump where expanded_income = 0
        group by XTOT;

select "#m with EI=0 & MARS=1 & XTOT=1", round(sum(s006)*1e-6,3)
        from dump where expanded_income = 0 and MARS = 1 and XTOT = 1;

select "#m with EI=0 & MARS=1 & XTOT=1 by filer", filer, round(sum(s006)*1e-6,3)
        from dump where expanded_income = 0 and MARS = 1 and XTOT = 1
        group by filer;

/*
$ cat zeroinc.sql | sqlite3 puf-15-#-#-#.db
#m in total|164.306
#m with EI=0|3.2
#m with EI=0 by filer|0|2.578
#m with EI=0 by filer|1|0.621
#m with EI=0 by MARS|1|2.471
#m with EI=0 by MARS|2|0.586
#m with EI=0 by MARS|3|0.0
#m with EI=0 by MARS|4|0.142
#m with EI=0 by XTOT|0|0.082
#m with EI=0 by XTOT|1|2.109
#m with EI=0 by XTOT|2|0.629
#m with EI=0 by XTOT|3|0.228
#m with EI=0 by XTOT|4|0.092
#m with EI=0 by XTOT|5|0.047
#m with EI=0 by XTOT|6|0.009
#m with EI=0 by XTOT|7|0.003
#m with EI=0 & MARS=1 & XTOT=1|2.009
#m with EI=0 & MARS=1 & XTOT=1 by filer|0|1.554
#m with EI=0 & MARS=1 & XTOT=1 by filer|1|0.455

$ cat zeroinc.sql | sqlite3 cps-15-#-#-#.db
#m in total|163.196
#m with EI=0|1.229
#m with EI=0 by filer|0|0.611
#m with EI=0 by filer|1|0.619
#m with EI=0 by MARS|1|0.754
#m with EI=0 by MARS|2|0.467
#m with EI=0 by MARS|4|0.007
#m with EI=0 by XTOT|1|0.627
#m with EI=0 by XTOT|2|0.355
#m with EI=0 by XTOT|3|0.145
#m with EI=0 by XTOT|4|0.059
#m with EI=0 by XTOT|5|0.031
#m with EI=0 by XTOT|6|0.011
#m with EI=0 by XTOT|7|0.002
#m with EI=0 by XTOT|10|0.0
#m with EI=0 & MARS=1 & XTOT=1|0.627
#m with EI=0 & MARS=1 & XTOT=1 by filer|0|0.322
#m with EI=0 & MARS=1 & XTOT=1 by filer|1|0.305
*/
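The shares cited in the comment above follow from the tabulated counts. In this sketch the `pct` helper is hypothetical, and filer=0 is read as marking the units added to the PUF from the CPS, as the comment does.

```python
def pct(part, whole):
    """Share of the whole, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Weighted counts in millions, copied from the tabulation output.
puf_zero_total = 3.2
puf_zero_nonfilers = 2.578         # EI=0 and filer=0
puf_zero_single_nonfilers = 1.554  # EI=0, MARS=1, XTOT=1 and filer=0
cps_zero_total = 1.229
cps_zero_single = 0.627            # EI=0, MARS=1 and XTOT=1

puf_nonfiler_share = pct(puf_zero_nonfilers, puf_zero_total)
puf_single_share = pct(puf_zero_single_nonfilers, puf_zero_nonfilers)
cps_single_share = pct(cps_zero_single, cps_zero_total)
print(puf_nonfiler_share, puf_single_share, cps_single_share)  # 80.6 60.3 51.0
```

These values are consistent with the roughly 80%, 60% and 51% figures quoted in the comment.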

@martinholmer
Collaborator Author

@MaxGhenis said in the discussion of pull request #1902:

I'd suggest including them [i.e., those with zero income] in buckets that also contain tax units with positive income

But when grouping filing units by percentiles, this leaves several percentiles with a percentile total for baseline after-tax expanded income of zero. In those cases the percentage change in after-tax expanded income is not defined because of the attempted division by zero. For this practical reason, it seems reasonable to segregate those with zero expanded_income from those with low positive expanded_income, which is the approach taken in pull request #1902.

@MaxGhenis
Contributor

It seems to me that these results suggest that not all of those with zero expanded_income are similar to those with low positive expanded_income.

What in these results suggests that? Can you run the numbers for those with expanded income of, say, $1-100?

These tax units may be different in some way, but the question I think is whether they're miscategorized as being in the bottom decile. Negative income tax units are believed to not truly belong there (plus messing up the sign of %chg). Is there evidence that those with zero are actually richer than those with low positive income?

In those cases the percentage change in after-tax expanded income is not defined because of the attempted division by zero.

Understood, and those could be nulled out / not shown. If the tradeoff is nulling out a null value, vs. excluding them from buckets we believe them to belong to (unless other data shows otherwise) and deviating from other tax analysis groups, I don't see the big harm in nulling out the x/0 cases. And again, these x/0 cases don't actually exist in tc's current form.

@martinholmer
Collaborator Author

@MaxGhenis said in the discussion of pull request #1902:

And again, these x/0 cases don't actually exist in tc's current form.

I'm not sure what you mean by the "current form" of tc, but "these x/0" percentiles do exist in the version of tc that can be made (using conda.recipe/install_local_taxcalc_package.sh) with the tip of the master branch.

@MaxGhenis
Contributor

MaxGhenis commented Mar 7, 2018

"these x/0" percentiles do exist in the version of tc that can be made

Is that from PUF? Using CPS I see 0.73% of tax units having zero expanded income when advancing to 2018 (notebook).

@martinholmer
Collaborator Author

@MaxGhenis asked in the discussion of #1902:

"these x/0" percentiles do exist in the version of tc that can be made

Is that from PUF?

Yes. You need to read my earlier comment in this discussion.

@MaxGhenis
Contributor

Thanks, I see now that 1.95% of PUF tax units have zero expanded income, or 1.96% of those with nonnegative expanded income. So one or maybe two percentiles after aging would show up as null if they're included.
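The two shares, and the number of percentiles that would be filled entirely by zero-income units, can be checked as follows. The variable names are illustrative; the counts are the 2015 PUF figures from the earlier tabulation, so the sketch ignores any change from aging the data.

```python
# 2015 PUF weighted counts in millions, from the earlier tabulation.
total_units = 164.306
zero_units = 3.2
negative_units = 1.254

share_of_all = 100.0 * zero_units / total_units
share_of_nonnegative = 100.0 * zero_units / (total_units - negative_units)

# With negative-income units set aside, zero-income units fill the lowest
# percentiles; only whole percentiles of zeros have an undefined ratio.
full_zero_percentiles = int(share_of_nonnegative)

print(round(share_of_all, 2), round(share_of_nonnegative, 2))  # 1.95 1.96
print(full_zero_percentiles)  # 1
```

At 1.96% the zeros fill one percentile completely and nearly all of a second, which is why a small increase in their share would null out two.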

@MaxGhenis
Contributor

I just read through the PR code a bit more, and see that the parameter is named hide_negative_incomes. This suggests only negative incomes are hidden, not zeros. WDYT about renaming to hide_nonpositive_incomes?

@martinholmer
Collaborator Author

@MaxGhenis said:

I just read through the PR code a bit more, and see that the parameter is named hide_negative_incomes. This suggests only negative incomes are hidden, not zeros. WDYT about renaming to hide_nonpositive_incomes?

OK, that's a good suggestion.
