Perform sanity check on scraped data. #5

Open
rsummers618 opened this issue Aug 6, 2015 · 8 comments

Comments

@rsummers618
Contributor

ESPN gives bad data sometimes:
http://scores.espn.go.com/ncf/playbyplay?gameId=400547677

Games such as this one are sometimes flooded with duplicate and out-of-order plays.

We should implement a check to verify that values make sense. For example, an incomplete pass on 4th & 2 shouldn't result in the next play being 2nd & 14 at the 28; that is clearly duplicated data.

This may require cross-checking against another site's PBP data.

@rsummers618
Contributor Author

Perhaps the simplest way to implement this is to perform the following checks:

if down < previous_down and not first_down:
    ERROR

if spot on a 1st down >= spot on the previous 1st down - 10 (unless there was a penalty):
    ERROR

if ERROR:
    for play in drive:
        # remove duplicate plays
ESPN PBP data also includes a drive recap; it may help to cross-check our number of plays and yards against it.
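
A rough sketch of what those checks might look like. The `down`, `spot`, and `penalty` attributes and the helper name are placeholders, not the parser's actual fields, and it assumes `spot` decreases as the offense advances:

    def sanity_check_drive(plays):
        """Return a list of (index, message) for plays that look wrong."""
        errors = []
        prev_down = None
        prev_first_down_spot = None

        for i, play in enumerate(plays):
            # A lower down than the previous play should only happen when a
            # new set of downs starts; otherwise the data is suspect.
            if prev_down is not None and play.down < prev_down and play.down != 1:
                errors.append((i, "down decreased without a first down"))

            if play.down == 1:
                # On a new 1st down the spot should be at least ~10 yards
                # closer than the previous 1st-down spot, unless a penalty
                # moved the ball.
                if (prev_first_down_spot is not None and not play.penalty
                        and play.spot >= prev_first_down_spot - 10):
                    errors.append((i, "1st-down spot did not advance"))
                prev_first_down_spot = play.spot

            prev_down = play.down

        return errors

Any drive that comes back with errors could then be run through the duplicate-removal pass.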

@DylanEustice
Contributor

This will probably be the most difficult thing to fix. I like the initial logic you've come up with, though.

@rsummers618
Contributor Author

I think I can take care of this issue. Every time I ran into this it was from duplicate data.

I was just messing around the other day, and a simple filter that removes exact duplicates from a drive was effective.

I also tried to make sure the number of plays in the drive summary equals the number of plays we parsed, but this will require changes to the way we count plays. For example, kicks, timeouts, and penalties are counted as plays in the current parser, but aren't in reality. Additionally, the drive summary is wrong just as often, so it's really just a guideline.

@DylanEustice
Contributor

Great, feel free to upload and merge if it's working.

@rsummers618
Contributor Author

I sent a pull request a few days ago. You should be able to pull it in.

@DylanEustice
Contributor

Did you implement that filter in the parser? I merged a few days ago, but I can't put my finger on where it is.

@rsummers618
Contributor Author

I added two quick and dirty filters that don't require heavy logic, given that every error in ESPN data that I saw was due to duplicated data rather than simply wrong data.

I'm not sure how you want to handle the case when we DO find an error.

First:

    play_count = 0
    for play in drive.Play_List:
        if play.Play_Type in ['PASS', 'RUSH', 'SACK']:
            play_count += 1
    if drive.Plays != play_count and drive.Plays > 0:
        print("Play number mismatch")

We compare the drive summary with the number of plays we have in the PBP data, just as a cross-check. The drive summary can be incorrect as well, though; that's why I made sure the summary has more than 0 plays.

    new_Play_List = list(set(drive.Play_List))
    if len(new_Play_List) != len(drive.Play_List):
        print("Duplicate items in drive possible error")

And here, if we find an exact duplicate play at the same yard line and down, we alert. The only reasonable time this should EVER happen in a game would be a huge loss, followed by a penalty that gets back to the same yard line with a first down, followed by the exact same huge loss again:

1 & 10 @ 30 - sack for loss of 20 yards.
2 & 30 @ 50 - defensive pass interference. 1st down at the 30
1 & 10 @ 30 - sack for loss of 20 yards AGAIN (This would be marked as duplicate)

The chance of this actually happening in a game is astronomically low.

In hindsight, this does look like old code; I don't believe set() works in this case, because the play number differs between otherwise-identical plays. I'll verify later today.
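
If set() turns out not to catch these because of the play number, one alternative (just a sketch; only Play_List and Play_Type appear in the code above, while Down, Distance, Spot, and Description are guesses at the attribute names) is to dedupe on a key that leaves the play number out:

    def remove_duplicate_plays(drive):
        """Drop plays that exactly repeat an earlier play in the drive,
        ignoring the play number."""
        seen = set()
        filtered = []
        for play in drive.Play_List:
            key = (play.Down, play.Distance, play.Spot,
                   play.Play_Type, play.Description)
            if key in seen:
                print("Dropping duplicate play: %s" % play.Description)
                continue
            seen.add(key)
            filtered.append(play)
        drive.Play_List = filtered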

@rsummers618
Contributor Author

I'm going to change the data to match the format of a giant merged cfbstats table, which also matches the parsed NCAA data, so we can cross-compare.
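
Once the formats line up, the cross-comparison could start as small as comparing per-drive play counts. This is only a sketch; the layout of the merged cfbstats table (a list of per-drive dicts with a 'plays' count) is assumed here, not taken from the actual data:

    def cross_check_play_counts(parsed_drives, cfbstats_drives):
        """Flag drives where our parsed play count disagrees with the
        cfbstats-formatted table. Assumes both lists are in game order."""
        mismatches = []
        for i, (ours, theirs) in enumerate(zip(parsed_drives, cfbstats_drives)):
            if len(ours.Play_List) != theirs['plays']:
                mismatches.append((i, len(ours.Play_List), theirs['plays']))
        return mismatches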
