Consider removing "visit with visit_id xxx was interrupted error" #672

Closed
birdsarah opened this issue May 30, 2020 · 0 comments · Fixed by #696

Comments

@birdsarah
Contributor

In a crawl I ran recently I got ~50k "Visit with visit_id xxx got interrupted" events. I was running a crazy crawl, so that was fine, but that's a lot of events.

I was able to use the Sentry API to download about 2,600 of the events (2,611 to be exact) and compare the visit_ids with the crawl_history and incomplete tables.

A markdown version of the notebook is below.

For the visits behind these ~2,600 events, none reached "ok" status for the "finalize" command, and none were in the incomplete table.

This all makes sense. And it's really exciting to see all this information and data captured in the crawl history table.

Since the crawl history table now appears to capture this info accurately, I'd like to propose removing the exception that propagates up to Sentry, because it's an exception that's already well handled.

It may be that folks find it useful to see this information streaming to Sentry. Just wanted to bring it up.
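If some people do want to keep the log line while others want it out of Sentry, one alternative to removing the exception is a client-side filter. The sketch below is only an illustration, not how OpenWPM handles this: `drop_interrupted_visits` is a hypothetical name, and the event fields it inspects (`message`, `logentry.formatted`, `logentry.message`) are assumptions about how the event is serialized. The `before_send` hook of `sentry_sdk.init` takes a callable with this `(event, hint)` shape and drops the event when it returns `None`.

```python
import re

# Matches both the rendered message and an unrendered template such as
# "Visit with visit_id %d got interrupted".
INTERRUPTED = re.compile(r"^Visit with visit_id \S+ got interrupted$")


def drop_interrupted_visits(event, hint=None):
    """Return None to drop the event, or the event itself to let it through.

    Hypothetical filter: the keys inspected here are assumptions about
    where the message text ends up in the event dict.
    """
    logentry = event.get("logentry") or {}
    candidates = (
        event.get("message"),
        logentry.get("formatted"),
        logentry.get("message"),
    )
    for text in candidates:
        if isinstance(text, str) and INTERRUPTED.match(text):
            return None
    return event


# Wiring it up would look roughly like this (not executed here):
#   sentry_sdk.init(dsn=..., before_send=drop_interrupted_visits)
```

With a filter like this the visit outcome is still recorded in the crawl history table; only the Sentry event is suppressed.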


from dask.distributed import Client

client = Client("tcp://127.0.0.1:38879")
client

Client

Cluster

  • Workers: 7
  • Cores: 28
  • Memory: 67.27 GB
import ast
import os
import dask.dataframe as dd
import domain_utils as du
import pandas as pd
events = pd.read_csv('events_df.csv.gz', index_col=0)
events.groupID.value_counts()
8432299    2611
Name: groupID, dtype: int64
visit_ids = events.entries.apply(lambda x: ast.literal_eval(x)[0]['data']['params'][0]) 

These visit ids are from 2,611 events that were downloaded in GatherSentryEvents. There were ~50k of these events, but for whatever reason I was only able to get these 2,611 from the API.

In this notebook we want to see what these visit ids correspond to: failed commands, incomplete visits, or something else?

DIR = '/home/bird/Data/S3/openwpm-data/2020-05-23_jsInstrumentationTests_api_sweep/derived_datasets/'
CRAWL_HISTORY = os.path.join(DIR, 'crawl_history_dd.parquet')
INCOMPLETE = os.path.join(DIR, 'incomplete_visits_dd.parquet')
df_inc = dd.read_parquet(INCOMPLETE)
df_crawl = dd.read_parquet(CRAWL_HISTORY)
incomplete_visit_ids = df_inc.visit_id.unique().compute()
sum([visit_id in incomplete_visit_ids for visit_id in visit_ids])
0
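One caveat on the membership test above: if `incomplete_visit_ids` is a pandas Series (which is what `unique().compute()` on a dask Series typically returns), then `visit_id in incomplete_visit_ids` checks the Series *index labels*, not its values, so it can report 0 even when there are matches. The sketch below shows the difference on made-up data; the two small Series are hypothetical stand-ins, not the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for the real data.
incomplete_visit_ids = pd.Series([4155575224537606, 2076258438298646])
visit_ids = pd.Series([4155575224537606, 8821906423776802])

# `in` on a Series tests the index (0, 1 here), so nothing matches.
matches_by_index = sum(v in incomplete_visit_ids for v in visit_ids)

# `isin` compares against the values, which is what the check intends.
matches_by_value = int(visit_ids.isin(incomplete_visit_ids).sum())

print(matches_by_index, matches_by_value)
```

On the real data, `visit_ids.isin(incomplete_visit_ids).sum()` would be the value-based version of the same check; whether it also gives 0 would need to be re-run.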
crawl_history_bad_events = df_crawl[df_crawl.visit_id.isin(visit_ids)].compute()
crawl_history_bad_events.command.value_counts()
<class 'automation.Commands.Types.InitializeCommand'>    656
<class 'automation.Commands.Types.GetCommand'>           656
<class 'automation.Commands.Types.FinalizeCommand'>       43
Name: command, dtype: int64
crawl_history_bad_events.groupby(['command', 'command_status']).count()
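| command | command_status | crawl_id | visit_id | arguments | retry_number | error | traceback |
|---|---|---|---|---|---|---|---|
| <class 'automation.Commands.Types.FinalizeCommand'> | timeout | 43 | 43 | 43 | 43 | 0 | 0 |
| <class 'automation.Commands.Types.GetCommand'> | neterror | 130 | 130 | 130 | 130 | 130 | 130 |
| <class 'automation.Commands.Types.GetCommand'> | ok | 43 | 43 | 43 | 43 | 0 | 0 |
| <class 'automation.Commands.Types.GetCommand'> | timeout | 483 | 483 | 483 | 483 | 0 | 0 |
| <class 'automation.Commands.Types.InitializeCommand'> | ok | 656 | 656 | 656 | 656 | 0 | 0 |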
events.message.values
array(['Visit with visit_id 4155575224537606 got interrupted',
       'Visit with visit_id 4155575224537606 got interrupted',
       'Visit with visit_id 2076258438298646 got interrupted', ...,
       'Visit with visit_id 4347570855194635 got interrupted',
       'Visit with visit_id 8821906423776802 got interrupted',
       'Visit with visit_id 8821906423776802 got interrupted'],
      dtype=object)
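The message array shows the same visit_id appearing more than once, which may be why 2,611 events map to only 656 visits in crawl_history. A minimal sketch of counting events per visit directly from the message text, assuming the message format shown above (the helper name `visit_ids_from_messages` and the sample list are illustrative):

```python
import re
from collections import Counter

PATTERN = re.compile(r"visit_id (\d+) got interrupted")


def visit_ids_from_messages(messages):
    """Extract the integer visit_id from each Sentry message that matches."""
    ids = []
    for message in messages:
        match = PATTERN.search(message)
        if match:
            ids.append(int(match.group(1)))
    return ids


# Sample taken from the first three messages shown above.
sample = [
    "Visit with visit_id 4155575224537606 got interrupted",
    "Visit with visit_id 4155575224537606 got interrupted",
    "Visit with visit_id 2076258438298646 got interrupted",
]

ids = visit_ids_from_messages(sample)
events_per_visit = Counter(ids)
print(len(ids), "events for", len(events_per_visit), "distinct visits")
```

Applied to `events.message`, the size of the resulting counter could be compared with the 656 visits found in crawl_history.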
@birdsarah birdsarah changed the title Consider removing "visit_id" was interrupted error Consider removing "visit with visit_id xxx was interrupted error" May 30, 2020