Add support for multi-file Dominion format #569

catrope · 2021-05-18T01:31:58Z

Starting with the March 2020 election, San Francisco has started
publishing their Dominion CVR data as many small files named
CvrExport_1.json, CvrExport_2.json, etc. rather than one big
CvrExport.json file.

To support this new format, first look for CvrExport.json, but if that
doesn't exist, look for CvrExport_1.json. If that does exist, read all
CvrExport_%d.json files in numerical sequence, until we hit one that
doesn't exist.

v1.2.0

Starting with the March 2020 election, San Francisco has started publishing their Dominion CVR data as many small files named CvrExport_1.json, CvrExport_2.json, etc. rather than one big CvrExport.json file. To support this new format, first look for CvrExport.json, but if that doesn't exist, look for CvrExport_1.json. If that does exist, read all CvrExport_%d.json files in numerical sequence, until we hit one that doesn't exist.

Instead of logging for every file, log for every 50k records parsed.

The Dominion JSON format has a Sessions key at the top level that contains an array of objects. Each of these objects has a .Cards property that is typically an array of one object that corresponds to a CVR. But sometimes (~1-2% of the time) it's an array of multiple objects, each of which is a CVR. Our Dominion CVR parser was ignoring these, because it only ever looked at the first element of the array. Instead, iterate over this array, so that we also capture the CVRs stored in the second or third elements of these Cards arrays. Without this change, some votes are missing when parsing San Francisco's November 2020 data.

moldover

Hi @catrope. Thanks for the PR! I've been traveling so I apologize I'm late responding. Looks pretty good and it will be great to support the new SF format. It would be really good to have a regression test which parses the new format. Is it possible to add one? How large would the data be to use actual published SF cvr data?

We use google checkstyle plugin for IntelliJ to ensure consistent code "style". It's good practice to install that and run it before commiting anything. Now that I'm writing this I think we should add some documentation for contributors around this stuff.

src/main/java/network/brightspots/rcv/DominionCvrReader.java

tarheel · 2022-02-23T19:45:06Z

Hi @catrope, we haven't forgotten about this PR! Are you interested in finishing it, or do you want us to update it? We're actually hoping to make use of this functionality so that we can use our application to verify some SF results.

catrope · 2022-02-23T22:31:40Z

Hi! So sorry for forgetting about this last year. Yes, I am interested in finishing this, probably within the next week or so.

catrope · 2022-03-10T20:15:21Z

Looks pretty good and it will be great to support the new SF format. It would be really good to have a regression test which parses the new format. Is it possible to add one? How large would the data be to use actual published SF cvr data?

The SF data is huge. The Nov 2020 election data was split over 24,940 JSON files, adding up to 4.2 GB. The Feb 2022 special election was "only" 1902 files totaling 200 MB, and that one didn't have a meaningful RCV contest on the ballot.

Instead, I'll add tests with a manually created example of a multi-file election with only 3 files, including one where there are multiple Cards in one Session.

This uses a selection of 4 files (out of 24,940) from the November 2020 election in San Francisco, some with one Card per Session, some with multiple Cards per Session. These files come from https://www.sfelections.org/results/20201103/data/20201201/CVR_Export_20201201091840.zip and were renamed as follows: - CvrExport_9231.json -> CvrExport_0.json - CvrExport_9232.json -> CvrExport_1.json - CvrExport_9233.json -> CvrExport_2.json - CvrExport_9234.json -> CvrExport_3.json - All other CvrExport_*.json files were removed

catrope · 2022-03-10T23:14:59Z

I've added a test case based on 4 files from the SF Nov 2020 election, 2 of which have the multi-Card format. Using only four files limited the damage to 3.6 MB. All pre-existing test data combined was 60 MB (now 64 MB with this addition).

HEdingfield

Heya @catrope, thanks for working on this, and for adding the tests! Unfortunately, it looks like you were working off a fairly old version. Could you pull the latest changes from develop into your branch and sort out the merging? The biggest change that will affect you is the fact that we've moved all our test data into a separate repo. Check out PR #576, in particular here.

HEdingfield · 2022-03-12T21:30:35Z

@catrope would you also be able to add an automated test that actually uses the data you provided in the process?

moldover · 2022-03-12T23:20:52Z

@catrope - Code changes look solid and the new test is great. Thank you. As @HEdingfield mentions, we work off of develop branch now, and the latest develop code moves all test_data into a submodule -- in part so that we can accommodate larger data sets. But it also means more commits to get things where they need to be. But I can totally help with that if you'd like, or even if you just want to hand off what you've done that's cool too. Apologies for the confusion. We need to get better about communicating and onboarding. I think you are the first outside contributor!

I believe something like this should happen:

delete new test data from dominion-multi-file (make a copy first)
add test data to the test_data repo: https://github.com/artoonie/rcvtests via another PR which we can quickly approve
merge latest develop -> dominion-multi-file
pin test_data to your new
add lines into TabulatorTests.java to execute your test there
we approve and merge dominion-multi-file -> develop

# Conflicts: # src/test/resources/network/brightspots/rcv/test_data

moldover · 2022-07-11T14:59:07Z

@HEdingfield I went ahead and merged the latest develop into this PR branch, including the new test files. Should be good to go but just wanted to get your eyeballs on this too :)

HEdingfield

Awesome!

moldover · 2022-07-17T17:49:25Z

@catrope we finally pushed your work here. Thanks so much for your contribution!!

We are also putting together a contributors guide which will hopefully help make things cleaner in the future: https://github.com/BrightSpots/rcv/blob/develop/contributor_guide.md

HEdingfield and others added 3 commits December 12, 2020 20:27

Merge pull request BrightSpots#541 from BrightSpots/develop

d4ac2db

v1.2.0

Make multi-file Dominion progress logging quieter

5b5aa88

Instead of logging for every file, log for every 50k records parsed.

tarheel requested review from tarheel and moldover May 18, 2021 21:52

moldover reviewed May 30, 2021

View reviewed changes

moldover requested changes May 30, 2021

View reviewed changes

src/main/java/network/brightspots/rcv/DominionCvrReader.java Outdated Show resolved Hide resolved

src/main/java/network/brightspots/rcv/DominionCvrReader.java Show resolved Hide resolved

src/main/java/network/brightspots/rcv/DominionCvrReader.java Outdated Show resolved Hide resolved

HEdingfield changed the base branch from master to develop August 11, 2021 07:32

catrope added 2 commits March 10, 2022 12:03

Switch names of the parseCvrFile and gatherCvrsForContest methods

90d0ae2

Don't initialize two variables on the same line

64b76e5

catrope requested a review from moldover March 10, 2022 23:15

HEdingfield requested changes Mar 12, 2022

View reviewed changes

moldover added 3 commits July 11, 2022 07:21

remove test files before merging with new submodule

6ee20ac

Merge branch 'develop' into dominion-multi-file

523e2d6

# Conflicts: # src/test/resources/network/brightspots/rcv/test_data

add dominion multi file tests into new repo location

340c55b

moldover approved these changes Jul 11, 2022

View reviewed changes

moldover requested a review from HEdingfield July 11, 2022 14:57

Fix exception issue found by SpotBugs

d9ed6e5

moldover self-assigned this Jul 11, 2022

HEdingfield approved these changes Jul 17, 2022

View reviewed changes

moldover merged commit 8abc64f into BrightSpots:develop Jul 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for multi-file Dominion format #569

Add support for multi-file Dominion format #569

catrope commented May 18, 2021

moldover left a comment •

edited

Loading

tarheel commented Feb 23, 2022

catrope commented Feb 23, 2022

catrope commented Mar 10, 2022 •

edited

Loading

catrope commented Mar 10, 2022

HEdingfield left a comment

HEdingfield commented Mar 12, 2022

moldover commented Mar 12, 2022 •

edited

Loading

moldover commented Jul 11, 2022 •

edited

Loading

HEdingfield left a comment

moldover commented Jul 17, 2022 •

edited

Loading

Add support for multi-file Dominion format #569

Add support for multi-file Dominion format #569

Conversation

catrope commented May 18, 2021

moldover left a comment • edited Loading

Choose a reason for hiding this comment

tarheel commented Feb 23, 2022

catrope commented Feb 23, 2022

catrope commented Mar 10, 2022 • edited Loading

catrope commented Mar 10, 2022

HEdingfield left a comment

Choose a reason for hiding this comment

HEdingfield commented Mar 12, 2022

moldover commented Mar 12, 2022 • edited Loading

moldover commented Jul 11, 2022 • edited Loading

HEdingfield left a comment

Choose a reason for hiding this comment

moldover commented Jul 17, 2022 • edited Loading

moldover left a comment •

edited

Loading

catrope commented Mar 10, 2022 •

edited

Loading

moldover commented Mar 12, 2022 •

edited

Loading

moldover commented Jul 11, 2022 •

edited

Loading

moldover commented Jul 17, 2022 •

edited

Loading