-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for multi-file Dominion format #569
Conversation
Starting with the March 2020 election, San Francisco has started publishing their Dominion CVR data as many small files named CvrExport_1.json, CvrExport_2.json, etc. rather than one big CvrExport.json file. To support this new format, first look for CvrExport.json, but if that doesn't exist, look for CvrExport_1.json. If that does exist, read all CvrExport_%d.json files in numerical sequence, until we hit one that doesn't exist.
Instead of logging for every file, log for every 50k records parsed.
The Dominion JSON format has a Sessions key at the top level that contains an array of objects. Each of these objects has a .Cards property that is typically an array of one object that corresponds to a CVR. But sometimes (~1-2% of the time) it's an array of multiple objects, each of which is a CVR. Our Dominion CVR parser was ignoring these, because it only ever looked at the first element of the array. Instead, iterate over this array, so that we also capture the CVRs stored in the second or third elements of these Cards arrays. Without this change, some votes are missing when parsing San Francisco's November 2020 data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @catrope. Thanks for the PR! I've been traveling so I apologize I'm late responding. Looks pretty good and it will be great to support the new SF format. It would be really good to have a regression test which parses the new format. Is it possible to add one? How large would the data be to use actual published SF cvr data?
We use google checkstyle plugin for IntelliJ to ensure consistent code "style". It's good practice to install that and run it before commiting anything. Now that I'm writing this I think we should add some documentation for contributors around this stuff.
Hi @catrope, we haven't forgotten about this PR! Are you interested in finishing it, or do you want us to update it? We're actually hoping to make use of this functionality so that we can use our application to verify some SF results. |
Hi! So sorry for forgetting about this last year. Yes, I am interested in finishing this, probably within the next week or so. |
The SF data is huge. The Nov 2020 election data was split over 24,940 JSON files, adding up to 4.2 GB. The Feb 2022 special election was "only" 1902 files totaling 200 MB, and that one didn't have a meaningful RCV contest on the ballot. Instead, I'll add tests with a manually created example of a multi-file election with only 3 files, including one where there are multiple Cards in one Session. |
This uses a selection of 4 files (out of 24,940) from the November 2020 election in San Francisco, some with one Card per Session, some with multiple Cards per Session. These files come from https://www.sfelections.org/results/20201103/data/20201201/CVR_Export_20201201091840.zip and were renamed as follows: - CvrExport_9231.json -> CvrExport_0.json - CvrExport_9232.json -> CvrExport_1.json - CvrExport_9233.json -> CvrExport_2.json - CvrExport_9234.json -> CvrExport_3.json - All other CvrExport_*.json files were removed
I've added a test case based on 4 files from the SF Nov 2020 election, 2 of which have the multi-Card format. Using only four files limited the damage to 3.6 MB. All pre-existing test data combined was 60 MB (now 64 MB with this addition). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya @catrope, thanks for working on this, and for adding the tests! Unfortunately, it looks like you were working off a fairly old version. Could you pull the latest changes from develop into your branch and sort out the merging? The biggest change that will affect you is the fact that we've moved all our test data into a separate repo. Check out PR #576, in particular here.
@catrope would you also be able to add an automated test that actually uses the data you provided in the process? |
@catrope - Code changes look solid and the new test is great. Thank you. As @HEdingfield mentions, we work off of I believe something like this should happen:
|
# Conflicts: # src/test/resources/network/brightspots/rcv/test_data
@HEdingfield I went ahead and merged the latest develop into this PR branch, including the new test files. Should be good to go but just wanted to get your eyeballs on this too :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
@catrope we finally pushed your work here. Thanks so much for your contribution!! We are also putting together a contributors guide which will hopefully help make things cleaner in the future: https://github.com/BrightSpots/rcv/blob/develop/contributor_guide.md |
Starting with the March 2020 election, San Francisco has started
publishing their Dominion CVR data as many small files named
CvrExport_1.json, CvrExport_2.json, etc. rather than one big
CvrExport.json file.
To support this new format, first look for CvrExport.json, but if that
doesn't exist, look for CvrExport_1.json. If that does exist, read all
CvrExport_%d.json files in numerical sequence, until we hit one that
doesn't exist.