Add --sqldb option to tc CLI #1312

Merged: 4 commits into PSLmodels:master from add-sqldb-option on Apr 26, 2017
martinholmer
Collaborator

This pull request adds a new output option, --sqldb, that writes the dump output to the dump table in a SQLite3 database. The option has two purposes: to reduce the size of the dumped output file and to make the dumped output easier to tabulate. The --dump option still generates a CSV-formatted file containing the dumped output, so users can still import the dumped output into a wide variety of data-analysis software packages (SAS, R, Stata, or any spreadsheet).
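To make the idea concrete, here is a minimal sketch of copying CSV-formatted dump rows into a table named dump using only Python's standard library. It is not the code in this pull request: the helper name `csv_to_sqlite`, the three sample rows, and the choice to store every column as REAL are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical miniature dump output; a real dump file has many more columns.
DUMP_CSV = """RECID,s006,iitax,payrolltax
1,100.0,2500.0,1500.0
2,250.0,0.0,800.0
3,50.0,12000.0,6000.0
"""


def csv_to_sqlite(csv_text, conn, table="dump"):
    """Copy CSV rows into a SQLite table with one REAL column per CSV column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row holds the variable names
    columns = ", ".join(f'"{name}" REAL' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        ([float(value) for value in row] for row in reader),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to get a .db file on disk
csv_to_sqlite(DUMP_CSV, conn)
row_count = conn.execute("SELECT count(*) FROM dump").fetchone()[0]
print(row_count)
```

Because the column names travel with the table, later queries can refer to variables by name rather than by column position.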

Here is an example that uses the new option:

$ tc puf.csv 2020 --dump --sqldb
You loaded data for 2009.
Tax-Calculator startup automatically extrapolated your data to 2020.

$ ls -l puf-20*
-rw-r--r--  1 mrh  staff  200983336 Apr 24 11:10 puf-20-#-#.csv
-rw-r--r--  1 mrh  staff  115200000 Apr 24 11:11 puf-20-#-#.db

$ cat tab.sql
select count(*),
       sum(s006)*1e-6,
       sum(iitax*s006)*1e-9,
       sum(payrolltax*s006)*1e-9
from dump;

$ sqlite3 puf-20-#-#.db <tab.sql
219814|177.802211679916|1876.83747004077|1224.89833083698

$ csvvars puf-20-#-#.csv | grep s006
140 s006
$ csvvars puf-20-#-#.csv | grep tax
51 iitax
. . .
154 payrolltax

$ awk -F, 'NR>1{n++; w=$140; sw+=w; sit+=w*$51; spt+=w*$154}
           END{print n, sw, sit, spt}' puf-20-#-#.csv
219814 1.77802e+08 1.87684e+12 1.2249e+12

Notice that the SQLite database file is about 57 percent of the size of the CSV-formatted file containing the exact same dumped output data. Notice also that tabulating the dump output in the database eliminates the need to know the column numbers of variables and opens up the rich set of tabulation capabilities built into the structured query language (SQL) select command. SQL is a declarative (rather than a procedural) language, so a user needs only to specify the nature of the desired table; the SQL database then works out the procedure that produces it.
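As an illustration of the declarative style, the sketch below tabulates weighted totals by group with a single select statement, run through Python's built-in sqlite3 module. The in-memory table and its three rows are made up for this example, and grouping by a MARS (filing status) column assumes that variable is present in the dump table.

```python
import sqlite3

# Build a tiny stand-in for the dump table (hypothetical rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dump (MARS INTEGER, s006 REAL, iitax REAL, payrolltax REAL)"
)
conn.executemany(
    "INSERT INTO dump VALUES (?, ?, ?, ?)",
    [
        (1, 100.0, 2000.0, 1000.0),
        (2, 200.0, 5000.0, 3000.0),
        (1, 300.0, 1000.0, 500.0),
    ],
)

# The query states what table is wanted; SQLite decides how to compute it.
query = """
SELECT MARS,
       count(*),
       sum(s006),
       sum(iitax * s006),
       sum(payrolltax * s006)
FROM dump
GROUP BY MARS
ORDER BY MARS
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

Producing the same breakdown with awk would require per-group accumulator arrays and the column numbers of every variable involved; here, adding the GROUP BY clause is the only change from an ungrouped tabulation.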

@MattHJensen @feenberg @Amy-Xu @andersonfrailey @GoFroggyRun @codykallen @zrisher

@martinholmer changed the title from "Add --sqldb option to tc CLI to Tax-Calculator" to "Add --sqldb option to tc CLI" on Apr 24, 2017

codecov-io commented Apr 24, 2017

Codecov Report

Merging #1312 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1312      +/-   ##
==========================================
+ Coverage   99.65%   99.65%   +<.01%     
==========================================
  Files          38       38              
  Lines        2878     2890      +12     
==========================================
+ Hits         2868     2880      +12     
  Misses         10       10

| Impacted Files | Coverage Δ |
|---|---|
| taxcalc/taxcalcio.py | 100% <100%> (ø) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cc83284...6dbc0cd.

@martinholmer martinholmer merged commit db0f1f0 into PSLmodels:master Apr 26, 2017
@MattHJensen MattHJensen mentioned this pull request Apr 26, 2017
@martinholmer martinholmer deleted the add-sqldb-option branch April 26, 2017 14:07