Spousal age in the CPS file #225

Open
ernietedeschi opened this issue Jun 11, 2018 · 5 comments

@ernietedeschi

I'm matching raw CPS microdata from 2013-15 to the tc cps.csv file using the h_seq, ffpos, and a_lineno variables. I do a second round of matching off of the a_spouse variable in the CPS microdata to bring the same variables of interest into the tc cps.csv spousal records.
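A minimal sketch of this two-round match, using only the Python standard library. The tiny in-memory extracts are made up for illustration, `a_age` is assumed to be the age column in the microdata, and the spouse is assumed to share the head's `h_seq` and `ffpos`; none of these details are confirmed by the files themselves.

```python
import csv
import io

# Hypothetical miniature extracts; the real files have many more columns.
ASEC_CSV = """h_seq,ffpos,a_lineno,a_spouse,a_age
1,1,1,2,45
1,1,2,1,43
2,1,1,0,30
"""
TC_CSV = """RECID,h_seq,ffpos,a_lineno,MARS,age_head,age_spouse
10,1,1,1,2,45,3
11,2,1,1,1,30,0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def index_asec(rows):
    """Key each microdata person record by (h_seq, ffpos, a_lineno)."""
    return {(r["h_seq"], r["ffpos"], r["a_lineno"]): r for r in rows}


def match_unit(tc_row, asec):
    """Return (head_record, spouse_record); either may be None.

    Round 1 finds the head by the three identifiers.
    Round 2 follows the head's a_spouse line number, assuming
    (not verified) that the spouse has the same h_seq and ffpos.
    """
    head = asec.get((tc_row["h_seq"], tc_row["ffpos"], tc_row["a_lineno"]))
    spouse = None
    if head is not None and head["a_spouse"] != "0":
        spouse = asec.get((head["h_seq"], head["ffpos"], head["a_spouse"]))
    return head, spouse
```

Calling `match_unit` on each row of `read_rows(TC_CSV)` with `index_asec(read_rows(ASEC_CSV))` gives the head and spouse microdata records whose fields can then be compared with `age_head` and `age_spouse`.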

As a validation cross-check, I compared the age of each record in the CPS microdata to that in the tc cps.csv file.

For tax unit heads, the age match between the CPS microdata and the tc cps.csv is 100%.

For spouses, however, the match is only 97.8%.
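The match rates quoted here can be computed along the following lines. This is only a sketch: the function names are hypothetical, and the input is assumed to be a mapping from RECID to a pair of ages (the cps.csv age and the matched microdata age) built in an earlier matching step.

```python
def age_match_rate(pairs_by_recid):
    """Share of records whose cps.csv age equals the microdata age.

    pairs_by_recid maps RECID -> (age_in_cps_csv, age_in_microdata).
    Returns None when there is nothing to compare.
    """
    if not pairs_by_recid:
        return None
    hits = sum(1 for tc_age, cps_age in pairs_by_recid.values()
               if tc_age == cps_age)
    return hits / len(pairs_by_recid)


def age_mismatches(pairs_by_recid):
    """Return only the records whose two ages disagree, for inspection."""
    return {recid: ages for recid, ages in pairs_by_recid.items()
            if ages[0] != ages[1]}
```

Running `age_match_rate` separately on the head pairs and the spouse pairs gives the two percentages, and `age_mismatches` lists the records worth examining by hand.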

In looking at the underlying misses, the cps.csv file has some implausible values for spouse age.

For example, the spouse for RECID 292373 has an age of 3. In RECID 294599, the age is 7.

Some other cps.csv spouses have ages of 0 despite being present in the matched CPS record and the unit being correctly coded as MARS = 2 in the cps.csv. RECID 292658 is an example of this.

Most of the misses are plausible values in their own right but still differ from the matched CPS record, sometimes significantly.

Are these deviations intentional?

@andersonfrailey
Collaborator

Thanks for pointing this out, @evtedeschi3. The deviations are not intentional. My intuition says this is probably caused by the scripts misidentifying the spouse when creating the record and assigning the wrong age. I'll look into this more and see if I can find the problem.

@martinholmer
Contributor

martinholmer commented Aug 13, 2018

@andersonfrailey, I've checked the CPS spouse_age problems that @evtedeschi3 first identified in #225 using the newest CPS data. There are still problems. First I show my tabulations (by MARS) of the unzipped cps.csv.gz file from Tax-Calculator release 0.20.2, and then I offer some observations.

iMac:Tax-Calculator mrh$ ./csv_vars.sh cps.csv | grep -e age -e MARS
1 age_head
2 age_spouse
42 MARS

iMac:Tax-Calculator mrh$ awk -F, 'NR>1{t++;n[$42]++}END{for(i in n)print i,n[i];print t}' cps.csv
2 252988
4 22006
1 181471
456465

iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42!=2{n[$2]++;t++}END{for(i in n)print i,n[i];print t}' cps.csv
0 203477
203477

iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42==2{n[$2]++}END{for(i in n)print i,n[i]}' cps.csv | awk '{printf("%02d\t%d\n",$1,$2)}' | sort | head -20
00	8455
01	150
02	168
03	221
04	183
05	184
06	121
07	162
08	138
09	126
10	173
11	135
12	199
13	113
14	174
15	180
16	150
17	188
18	221
19	312

iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42==2{n[$2]++}END{for(i in n)print i,n[i]}' cps.csv | awk '{printf("%02d\t%d\n",$1,$2)}' | sort | tail -20
62	5196
63	4785
64	4482
65	5288
66	4782
67	4427
68	3638
69	3050
70	3202
71	3347
72	3045
73	2351
74	1815
75	1848
76	1704
77	1506
78	1362
79	1304
80	3975
85	2616

So, we can see that spouse_age is zero in all the filing units that are not MARS==2 (married filing jointly), which is as it should be. So far, so good. But when we tabulate the distribution of spouse_age for those with MARS==2, we see sensible counts for older ages, but not for younger ages. In particular, there are 8455 filing units with MARS==2 and spouse_age==0, and more than a few filing units with implausibly low values for spouse_age. For comparison, the lowest spouse_age value in the puf.csv data file for filing units with MARS==2 is 15 years old.
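For anyone who prefers Python to awk, the same tabulations can be sketched with the standard library. The column names `MARS` and `age_spouse` come from the csv_vars.sh output above; the function name is hypothetical, and the cutoff of 15 is an assumption borrowed from the puf.csv minimum, not an established rule.

```python
import csv
import io
from collections import Counter


def tabulate_spouse_age(lines, min_plausible_age=15):
    """Count filing units by MARS and summarize age_spouse for MARS==2.

    lines is any iterable of CSV lines (an open file works).
    Returns (mars_counts, zero_age_joint, low_age_joint), where the
    last two count MARS==2 units with age_spouse == 0 and with
    0 < age_spouse < min_plausible_age respectively.
    """
    mars_counts = Counter()
    joint_spouse_ages = Counter()
    for row in csv.DictReader(lines):
        mars = int(float(row["MARS"]))
        mars_counts[mars] += 1
        if mars == 2:
            joint_spouse_ages[int(float(row["age_spouse"]))] += 1
    zero_age_joint = joint_spouse_ages[0]
    low_age_joint = sum(n for age, n in joint_spouse_ages.items()
                        if 0 < age < min_plausible_age)
    return mars_counts, zero_age_joint, low_age_joint


# Made-up four-row sample, only to show the call.
SAMPLE = "age_head,age_spouse,MARS\n45,43,2\n50,0,2\n60,3,2\n30,0,1\n"
mars_counts, zero_age_joint, low_age_joint = tabulate_spouse_age(
    io.StringIO(SAMPLE))
```

On the real file the call would be `tabulate_spouse_age(open("cps.csv", newline=""))`.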

Now that you've successfully completed all the recent enhancements to the taxdata repo, it seems like fixing this CPS spouse_age problem should have a high priority. In particular, I think this CPS spouse_age problem needs to be fixed before we consider moving to a more recent CBO projection (as proposed in #180).

What are your thoughts on taxdata development? Are there any other things that need to be fixed?

@andersonfrailey
Collaborator

@martinholmer the next biggest step in taxdata development from my perspective is replacing the SAS code to make the CPS file with Python. I put that on hold the last couple of weeks to work on PUF development, but I've reached a point where I have almost everything written and have moved on to squishing bugs that result in major differences between the current CPS and what I get from the Python scripts.

I think it will be easier to solve this problem with spouse age when we have everything running in Python. I'd say it will be at least two or three more weeks before I'm ready to open a pull request though. Would you say this issue is high enough priority to try and fix it in the SAS code?

@ernietedeschi
Author

I assume the priority here is to get the code in Python first before worrying about integrating the 2016 and 2017 ASEC releases?

I've been correcting the spousal errors by just importing the age recorded in the CPS ASECs for those records using the household and family identifiers. That seems to work fine for the moment.
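A rough sketch of that correction, assuming the ASEC spouse ages have already been looked up and keyed by RECID. The function name, the dict-based row layout, and the RECID values in the usage note are all hypothetical.

```python
def patch_spouse_age(tc_rows, asec_spouse_age):
    """Overwrite age_spouse with the ASEC age for joint filing units.

    tc_rows: list of dicts with integer RECID, MARS, and age_spouse.
    asec_spouse_age: dict mapping RECID -> spouse age from the ASEC.
    Returns (patched_rows, number_of_rows_changed); the input rows
    are copied, not modified in place.
    """
    patched = []
    changed = 0
    for row in tc_rows:
        new_row = dict(row)
        asec_age = asec_spouse_age.get(row["RECID"])
        if (row["MARS"] == 2 and asec_age is not None
                and asec_age != row["age_spouse"]):
            new_row["age_spouse"] = asec_age
            changed += 1
        patched.append(new_row)
    return patched, changed
```

For example, `patch_spouse_age([{"RECID": 1, "MARS": 2, "age_spouse": 3}], {1: 43})` returns a copy of the row with `age_spouse` set to 43 and a change count of 1. Units without a matched ASEC spouse are left untouched.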

@martinholmer
Contributor

@andersonfrailey said:

I think it will be easier to solve this problem with spouse age [in the cps.csv.gz file] when we have everything running in Python. I'd say it will be at least two or three more weeks before I'm ready to open a pull request though. Would you say this issue is high enough priority to try and fix it in the SAS code?

No. And @evtedeschi3 seems to agree, which is far more important. So, let's wait for the Python CPS-creation code to be active and then solve the spouse_age problem identified in issue #225.
