Update data #2599

Merged
merged 4 commits into from
Jul 17, 2021

Conversation

andersonfrailey
Collaborator

This PR updates all of the data in Tax-Calc to be up to date with TaxData release 0.3.0. It also extends the projections out to 2031. I had to make a few changes to the tests to get (almost) all of them passing:

  • The tolerance for closeness of the full and subsample PUF and CPS files had to be increased
  • The growth factor (AINTS) for e00300 is now > 1 in 2015 (a result of updating the SOI estimates) so the relative value test had to be flipped
  • Pretty much all of the reform files had new results, though the differences didn't look too large
  • The PUF now has a few new variables. See taxdata for more details there

There are still a couple of failing tests that I don't know how to fix. The first is in test_benefits.py:

    @pytest.mark.benefits
    def test_benefits(tests_path, cps_fullsample):
        """
        Test CPS benefits.
        """
        # pylint: disable=too-many-locals
        benefit_names = ['ssi', 'mcare', 'mcaid', 'snap', 'wic',
                         'tanf', 'vet', 'housing']
        # write benefits_actual.csv file
        recs = Records.cps_constructor(data=cps_fullsample)
        start_year = recs.current_year
        calc = Calculator(policy=Policy(), records=recs, verbose=False)
        assert calc.current_year == start_year
        year_list = list()
        bname_list = list()
        benamt_list = list()
        bencnt_list = list()
        benavg_list = list()
        for year in range(start_year, Policy.LAST_BUDGET_YEAR + 1):
            calc.advance_to_year(year)
            size = calc.array('XTOT')
            wght = calc.array('s006')
            # compute benefit aggregate amounts and head counts and average benefit
            # (head counts include all members of filing unit receiving a benefit,
            #  which means benavg is f.unit benefit amount divided by f.unit size)
            for bname in benefit_names:
                ben = calc.array('{}_ben'.format(bname))
                benamt = round((ben * wght).sum() * 1e-9, 3)
                bencnt = round((size[ben > 0] * wght[ben > 0]).sum() * 1e-6, 3)
>               benavg = round(benamt / bencnt, 1)
E               FloatingPointError: invalid value encountered in double_scalars

And the second is in testpolicy.py:

    def test_apply_cpi_offset(self):
        """
        Test applying the parameter_indexing_CPI_offset parameter
        without any other parameters.
        """
        pol1 = Policy()
        pol1.implement_reform(
            {"parameter_indexing_CPI_offset": {2021: -0.001}}
        )
    
        pol2 = Policy()
        pol2.adjust(
            {"parameter_indexing_CPI_offset": [
                {"year": 2021, "value": -0.001}
            ]}
        )
    
        cmp_policy_objs(pol1, pol2)
    
        pol0 = Policy()
        pol0.implement_reform({"parameter_indexing_CPI_offset": {2021: 0}})
    
        init_rates = pol0.inflation_rates()
        new_rates = pol2.inflation_rates()
    
        start_ix = 2021 - pol2.start_year
    
        exp_rates = copy.deepcopy(new_rates)
        exp_rates[start_ix:] -= pol2._parameter_indexing_CPI_offset[start_ix:]
        np.testing.assert_allclose(init_rates, exp_rates)
    
        # make sure values prior to 2021 were not affected.
        cmp_policy_objs(pol0, pol2, year_range=range(pol2.start_year, 2021))
    
        pol2.set_state(year=[2022, 2023])
>       np.testing.assert_equal(
            (pol2.EITC_c[1] / pol2.EITC_c[0] - 1).round(4),
            pol0.inflation_rates(year=2022) + (-0.001),
        )
E       AssertionError: 
E       Arrays are not equal
E       
E       Mismatched elements: 4 / 4 (100%)
E       Max absolute difference: 3.46944695e-18
E       Max relative difference: 1.75224594e-16
E        x: array([0.0198, 0.0198, 0.0198, 0.0198])
E        y: array(0.0198)

I think this last one is just an issue with array sizes.

Let me know if there are any questions or tips on fixing the failing tests.

cc @MattHJensen @jdebacker

@jdebacker
Member

@andersonfrailey Thanks for this PR!

I haven't run into these errors before, but they both look related to the data, so I'd start looking through the objects in these functions, to identify where things take on unexpected shapes/values.

@andersonfrailey
Collaborator Author

Figured out the test failures. The benefits test failed because benefits growth factors were missing from growfactors.csv for 2031 so all the benefit variables were replaced with NaN values in that year. I've opened a PR up in taxdata to fix this issue, so technically this PR will bring us to taxdata version 0.3.1 after I push the taxdata bug fix.
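
A gap like the one described above can be caught before the tests run. The following is only a minimal sketch under stated assumptions: the toy CSV text, the column name `ABEN_EXAMPLE`, and the helper `find_missing_factors` are all hypothetical and are not part of Tax-Calc or taxdata. It scans a growth-factor table for blank cells or absent years within a range.

```python
import csv
import io

# Toy stand-in for growfactors.csv. The column names and values are
# made up for illustration and are not taken from taxdata.
TOY_GROWFACTORS = """YEAR,AINTS,ABEN_EXAMPLE
2029,1.02,1.03
2030,1.02,1.03
2031,1.02,
"""


def find_missing_factors(csv_text, first_year, last_year):
    """Map each problem year to the columns that are blank in that year.

    A year with no row at all is reported as ['<row missing>'].
    (Hypothetical helper, not a Tax-Calc or taxdata function.)
    """
    rows = {int(r["YEAR"]): r for r in csv.DictReader(io.StringIO(csv_text))}
    problems = {}
    for year in range(first_year, last_year + 1):
        if year not in rows:
            problems[year] = ["<row missing>"]
            continue
        # A blank (or absent) cell is what later turns into NaN on load.
        blank = [name for name, value in rows[year].items()
                 if name != "YEAR" and (value or "").strip() == ""]
        if blank:
            problems[year] = blank
    return problems


problems = find_missing_factors(TOY_GROWFACTORS, 2029, 2031)
print(problems)
```

Running a check of this kind over the full projection window would flag the final year as soon as any factor is left empty.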

The policy test failure was a rounding issue. Here's the test currently:

    np.testing.assert_equal(
        (pol2.EITC_c[1] / pol2.EITC_c[0] - 1).round(4),
        pol0.inflation_rates(year=2022) + (-0.001),
    )

And the actual values for each element in that comparison:

    In [7]: pol0.inflation_rates(year=2022) + (-0.001)
    Out[7]: 0.019799999999999998
    In [9]: (pol2.EITC_c[1] / pol2.EITC_c[0] - 1).round(4)
    Out[9]: array([0.0198, 0.0198, 0.0198, 0.0198])

To fix the bug, I just need to round pol0.inflation_rates(year=2022) + (-0.001) to 4 decimal places, like we already round (pol2.EITC_c[1] / pol2.EITC_c[0] - 1). This is the test with rounding:

    np.testing.assert_equal(
        (pol2.EITC_c[1] / pol2.EITC_c[0] - 1).round(4),
        (pol0.inflation_rates(year=2022) + (-0.001)).round(4),
    )
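
The mismatch can be reproduced without Tax-Calc. The sketch below uses only the two float values printed in the IPython session above; the variable names are illustrative. It shows that exact equality fails, that rounding both sides to 4 decimal places (the approach taken in this PR) makes them equal, and that a tolerance-based comparison would be an alternative. The tolerance option is a suggestion, not what this PR does.

```python
import math

# Values copied from the IPython session above.
unrounded = 0.019799999999999998  # pol0.inflation_rates(year=2022) + (-0.001)
rounded = 0.0198                  # each element of the rounded EITC_c growth array

# The two floats are adjacent but not identical, so exact equality fails.
exact_match = unrounded == rounded

# Option 1 (the fix used in this PR): round both sides to 4 decimal places.
match_after_rounding = round(unrounded, 4) == rounded

# Option 2 (an alternative, not what this PR does): compare with a tolerance.
match_with_tolerance = math.isclose(unrounded, rounded, rel_tol=1e-9)

print(exact_match, match_after_rounding, match_with_tolerance)
# → False True True
```

Rounding both sides keeps the test strict about the fourth decimal place, while a tolerance would accept any difference below the chosen threshold.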

@codecov

codecov bot commented Jun 18, 2021

Codecov Report

Merging #2599 (3e38302) into master (7d96420) will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #2599   +/-   ##
=======================================
  Coverage   98.46%   98.46%           
=======================================
  Files          14       14           
  Lines        2611     2611           
=======================================
  Hits         2571     2571           
  Misses         40       40           
Flag        Coverage Δ
unittests   98.46% <100.00%> (ø)


Impacted Files        Coverage Δ
taxcalc/growdiff.py   100.00% <100.00%> (ø)
taxcalc/policy.py     100.00% <100.00%> (ø)

@jdebacker
Member

@andersonfrailey I wanted to check in on the status of this PR -- we are waiting on this to be updated to TaxData 0.3.1, correct?

@andersonfrailey
Collaborator Author

@jdebacker, the last commit updated it. It's ready to go now

@jdebacker
Member

@andersonfrailey What is the reason for making CPS-specific variables such as line and sequence numbers available for taxdata_puf?

@andersonfrailey
Collaborator Author

@jdebacker, adding those lets users find the individual CPS records that have been matched to the PUF in the raw CPS file. To date, most users have wanted the identifiers to link tax units from taxdata_cps to the raw CPS, but the TaxData refactor made adding identifiers to the PUF easy, so I figured it'd be better to have them available in the PUF as well.

@jdebacker
Member

@andersonfrailey Thanks for the previous reply. I understand now, and I think it makes sense to add those variables to the PUF. To be sure -- is the TaxData documentation clear in noting that these are statistically matched records and not direct matches of the same individual in the two datasets?

For the other variables now available in the PUF, such as housing_ben, tanf_ben, ssi_ben, vet_ben, wic_ben --- do you have results showing if aggregate amounts (using the PUF weights) match administrative totals?

@andersonfrailey
Collaborator Author

> To be sure -- is the TaxData documentation clear in noting that these are statistically matched records and not direct matches of the same individual in the two datasets?

Yes, we have documentation that explains that these are two different datasets created in two different ways.

> For the other variables now available in the PUF, such as housing_ben, tanf_ben, ssi_ben, vet_ben, wic_ben --- do you have results showing if aggregate amounts (using the PUF weights) match administrative totals?

I don't offhand, but I'll get those together and post them.

@andersonfrailey
Collaborator Author

Latest commit removes benefit variables from the PUF.

@jdebacker
Member

@andersonfrailey Thanks for the updates to this PR. It looks good to me and I'll plan to merge tomorrow after a final review unless there are suggestions otherwise.

cc @MattHJensen @Peter-Metz

@jdebacker
Member

Thank you for the contribution, @andersonfrailey. Merging.

@MattHJensen MattHJensen merged commit 72d8e91 into PSLmodels:master Jul 17, 2021
@jdebacker jdebacker mentioned this pull request Dec 6, 2022