Perform sanity check on scraped data. #5
Perhaps the simplest way to implement this is to perform the following checks: flag an error if down < previous down && !first_down, and flag an error if the spot on a 1st down is >= the spot on the previous 1st down minus 10 (unless there were penalties). ESPN PBP data also recaps each drive; it may help to cross-check our # of plays and yards against that summary.
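The two checks above could be sketched roughly as follows. This is a sketch under assumptions, not the project's actual code: the keys `down`, `spot`, `first_down`, and `penalty` are hypothetical field names, and `spot` is assumed to count yards remaining to the opponent's goal (so advancing the ball decreases it).

```python
def sanity_check_drive(plays):
    """Flag suspicious down/spot sequences within one drive.

    Assumes each play is a dict with hypothetical keys:
      'down'       -- 1 through 4
      'spot'       -- yard line as distance to the opponent's goal
      'first_down' -- True if this play gained a first down
      'penalty'    -- True if a penalty was enforced on this play
    Returns a list of (index, reason) tuples for rows that look wrong.
    """
    errors = []
    prev_first_down_spot = None
    for i, play in enumerate(plays):
        if i > 0:
            prev = plays[i - 1]
            # The down should only reset after a first down or a penalty.
            if (play['down'] < prev['down']
                    and not prev['first_down'] and not prev['penalty']):
                errors.append((i, 'down decreased without a first down'))
        if play['down'] == 1:
            # A fresh 1st & 10 should sit at least 10 yards closer to the
            # goal than the previous 1st-down spot, barring penalties.
            if (prev_first_down_spot is not None
                    and not play['penalty']
                    and play['spot'] > prev_first_down_spot - 10):
                errors.append((i, 'spot did not advance past previous 1st down'))
            prev_first_down_spot = play['spot']
    return errors
```

A drive where the down drops from 2nd back to 1st with no first down and no penalty would produce both error flags on that row.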
This will probably be the most difficult thing to fix. I like the initial logic you've come up with, though.
I think I can take care of this issue. Every time I ran into this, it was from duplicate data. I was messing around the other day, and a simple filter that removes exact duplicates from a drive was effective. I also tried to make sure the # of plays in the summary equals the number of plays we parse, but that will require changes to the way we count plays: for example, kicks, timeouts, and penalties are counted as plays by the current parser, but aren't in reality. Additionally, the drive summary is wrong just as often, so it's really just a guideline.
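A simple filter like the one described above might look like this. It is a minimal sketch, assuming each play is a dict of hashable values; it is not necessarily the filter that was actually committed.

```python
def remove_exact_duplicates(plays):
    """Drop exact duplicate rows from a drive, preserving order and
    keeping the first occurrence of each play."""
    seen = set()
    filtered = []
    for play in plays:
        # Sort items so the key is stable regardless of dict ordering.
        key = tuple(sorted(play.items()))
        if key not in seen:
            seen.add(key)
            filtered.append(play)
    return filtered
```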
Great, feel free to upload and merge if it's working.
I sent a pull request a few days ago. You should be able to pull this in.
Did you implement that filter in the parser? I merged a few days ago, but I can't put my finger on where it is.
I added two quick-and-dirty filters that don't require heavy logic, given that every error I saw in the ESPN data was due to duplicated data, not merely wrong data. I'm not sure how you want to handle it when we DO find an error. First: we compare the drive summary with the # of plays we have in the PBP data.
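The first filter, comparing the drive summary's play count with the parsed PBP rows, could be sketched like this. The `type` key and the `NON_PLAYS` labels are illustrative assumptions, not the project's actual schema, and per the thread a mismatch should be treated as a warning rather than a hard failure, since the summary itself is wrong just as often.

```python
# Play types the current parser counts but that aren't real plays.
# These labels are illustrative, not ESPN's actual type strings.
NON_PLAYS = {'kickoff', 'timeout', 'penalty'}

def real_play_count(plays):
    """Count plays the way a drive summary does, skipping non-plays."""
    return sum(1 for play in plays if play.get('type') not in NON_PLAYS)

def summary_disagrees(summary_plays, plays):
    """True when the summary's stated play count doesn't match the PBP
    rows. Treat this only as a warning: the summary is often wrong too."""
    return summary_plays != real_play_count(plays)
```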
And here, if we find an exact duplicate play at the same yard line and down, we alert. The only reasonable way this should EVER happen in a real game would be a huge loss, followed by a penalty returning the ball to the same spot with a repeated first down, followed by the exact same huge loss again, e.g. 1st & 10 @ the 30, sack for a loss of 20 yards, twice in a row. The chance of this actually happening is astronomically low. In hindsight, this does look like old code; I don't believe set() works in this case, because the play # differs between the duplicates. I'll verify later today.
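The fix for the set() problem mentioned above is to build the dedup key from every field except the play counter. This is a hedged sketch: `play_num` is a hypothetical name for that counter, not necessarily what the parser calls it.

```python
def find_duplicate_plays(plays, ignore=('play_num',)):
    """Return indices of plays that exactly repeat an earlier play once
    volatile fields are ignored. A plain set() over whole rows misses
    these duplicates because the play number differs between copies."""
    seen = set()
    duplicates = []
    for i, play in enumerate(plays):
        key = tuple((k, v) for k, v in sorted(play.items())
                    if k not in ignore)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates
```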
I'm going to change the data to match the format of a giant merged cfbstats table, which also matches the parsed NCAA data, so we can cross-compare.
ESPN gives bad data sometimes:
http://scores.espn.go.com/ncf/playbyplay?gameId=400547677
Games such as this one sometimes flood the feed with duplicate and out-of-order data.
We should implement a check to verify that values make sense;
e.g., an incomplete pass on 4th & 2 shouldn't result in the next play being 2nd & 14 @ the 28. That next play is clearly duplicated.
This may require cross checking with another site's PBP data.
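The 4th-down example above suggests one concrete consistency rule: a failed 4th-down play ends the possession, so the same offense cannot line up for a later down on the very next row. A minimal sketch, assuming hypothetical field names (`down`, `to_go`, `gained`, `offense`) rather than the parser's actual schema:

```python
def looks_like_duplicate_continuation(prev_play, next_play):
    """Detect the inconsistency from the example above: after a failed
    4th-down attempt, the ball changes hands, so the same offense
    appearing again on a non-1st down is almost certainly bad data."""
    failed_fourth = (prev_play['down'] == 4
                     and prev_play['gained'] < prev_play['to_go'])
    same_offense = next_play['offense'] == prev_play['offense']
    return failed_fourth and same_offense
```

The rule would need carve-outs for penalties that replay the down, but it catches the exact pattern described in this issue.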