Perform sanity check on scraped data. #5

Open
rsummers618 opened this issue Aug 6, 2015 · 8 comments

Comments

@rsummers618
Contributor

ESPN gives bad data sometimes:
http://scores.espn.go.com/ncf/playbyplay?gameId=400547677

Games such as this one are sometimes flooded with duplicate and out-of-order plays.

We should implement a check to verify that values make sense. For example, an incomplete pass on 4th & 2 shouldn't result in the next play being 2nd & 14 at the 28; that is clearly duplicated data.

This may require cross-checking against another site's PBP data.

@rsummers618
Contributor Author

Perhaps the simplest way to implement this is to perform the following checks:

if down < previous_down and not first_down:
    ERROR

if spot on a 1st down >= spot on the previous 1st down - 10 (unless there was a penalty):
    ERROR

if ERROR:
    for play in drive:
        # remove duplicate plays
ESPN PBP data also includes a drive recap; it may help to cross-check our number of plays and yards against it.
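
A rough sketch of what those checks might look like. The `down`, `spot`, and `penalty` attributes and the helper name are placeholders, not the parser's actual fields, and it assumes `spot` decreases as the offense advances:

    def sanity_check_drive(plays):
        """Return a list of (index, message) for plays that look wrong."""
        errors = []
        prev_down = None
        prev_first_down_spot = None

        for i, play in enumerate(plays):
            # A lower down than the previous play should only happen when a
            # new set of downs starts; otherwise the data is suspect.
            if prev_down is not None and play.down < prev_down and play.down != 1:
                errors.append((i, "down decreased without a first down"))

            if play.down == 1:
                # On a new 1st down the spot should be at least ~10 yards
                # closer than the previous 1st-down spot, unless a penalty
                # moved the ball.
                if (prev_first_down_spot is not None and not play.penalty
                        and play.spot >= prev_first_down_spot - 10):
                    errors.append((i, "1st-down spot did not advance"))
                prev_first_down_spot = play.spot

            prev_down = play.down

        return errors

Any drive that comes back with errors could then be run through the duplicate-removal pass.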

@DylanEustice
Contributor

This will probably be the most difficult thing to fix. I like the initial logic you've come up with, though.

@rsummers618
Contributor Author

I think I can take care of this issue. Every time I ran into this it was from duplicate data.

I was just messing around the other day, and a simple filter that removes exact duplicates from a drive was effective.

I also tried to make sure the number of plays in the drive summary equals the number of plays we parsed, but this will require changes to the way we count plays. For example, kicks, timeouts, and penalties are counted as plays in the current parser, but aren't in reality. Additionally, the drive summary is wrong just as often, so it's really just a guideline.

@DylanEustice
Contributor

Great, feel free to upload and merge if it's working.

@rsummers618
Contributor Author

I sent a pull request a few days ago. You should be able to pull it in.

@DylanEustice
Contributor

Did you implement that filter in the parser? I merged a few days ago, but I can't put my finger on where it is.

@rsummers618
Contributor Author

I added two quick and dirty filters that don't require heavy logic, given that every error in ESPN data that I saw was due to duplicated data rather than simply wrong data.

I'm not sure how you want to handle the case when we DO find an error.

First:

    play_count = 0
    for play in drive.Play_List:
        if play.Play_Type in ['PASS', 'RUSH', 'SACK']:
            play_count += 1
    if drive.Plays != play_count and drive.Plays > 0:
        print("Play number mismatch")

We compare the drive summary with the number of plays we have in the PBP data, just as a cross-check. The drive summary can be incorrect as well, though; that's why I made sure the summary has more than 0 plays.

    new_Play_List = list(set(drive.Play_List))
    if len(new_Play_List) != len(drive.Play_List):
        print("Duplicate items in drive possible error")

And here, if we find an exact duplicate play at the same yard line and down, we alert. The only reasonable time this should EVER happen in a game would be a huge loss, followed by a penalty that gets back to the same yard line with a first down, followed by the exact same huge loss again:

1 & 10 @ 30 - sack for loss of 20 yards.
2 & 30 @ 50 - defensive pass interference. 1st down at the 30
1 & 10 @ 30 - sack for loss of 20 yards AGAIN (This would be marked as duplicate)

The chance of this actually happening in a game is astronomically low.

In hindsight, this does look like old code; I don't believe set() works in this case, because the play number differs between otherwise-identical plays. I'll verify later today.
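
If set() turns out not to catch these because of the play number, one alternative (just a sketch; only Play_List and Play_Type appear in the code above, while Down, Distance, Spot, and Description are guesses at the attribute names) is to dedupe on a key that leaves the play number out:

    def remove_duplicate_plays(drive):
        """Drop plays that exactly repeat an earlier play in the drive,
        ignoring the play number."""
        seen = set()
        filtered = []
        for play in drive.Play_List:
            key = (play.Down, play.Distance, play.Spot,
                   play.Play_Type, play.Description)
            if key in seen:
                print("Dropping duplicate play: %s" % play.Description)
                continue
            seen.add(key)
            filtered.append(play)
        drive.Play_List = filtered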

@rsummers618
Contributor Author

I'm going to change the data to match the format of a giant merged cfbstats table, which also matches the parsed NCAA data, so we can cross-compare.
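
Once the formats line up, the cross-comparison could start as small as comparing per-drive play counts. This is only a sketch; the layout of the merged cfbstats table (a list of per-drive dicts with a 'plays' count) is assumed here, not taken from the actual data:

    def cross_check_play_counts(parsed_drives, cfbstats_drives):
        """Flag drives where our parsed play count disagrees with the
        cfbstats-formatted table. Assumes both lists are in game order."""
        mismatches = []
        for i, (ours, theirs) in enumerate(zip(parsed_drives, cfbstats_drives)):
            if len(ours.Play_List) != theirs['plays']:
                mismatches.append((i, len(ours.Play_List), theirs['plays']))
        return mismatches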
