Restore functionality to resist temporary bad TED responses when parsing video pages #209

benoit74 · 2024-06-28T05:02:15Z

To retrieve video information, the TED scraper fetches the video page at a URL like https://ted.com/talks/franco_sacchi_a_tour_of_nollywood_nigeria_s_booming_film_industry?language=nl and looks for the __NEXT_DATA__ JSON inside the page, which contains, among other things, the localized title and description.

This is done in the extract_info_from_video_page function in scraper.py.
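For illustration, the lookup described above can be sketched with the standard library only. This is not the code from scraper.py: the helper name find_next_data, the regex-based parsing, and the sample pages are all assumptions made for this example, and the real scraper may well use a proper HTML parser instead.

```python
import json
import re

# Hypothetical helper, not the scraper's actual implementation.
# Matches <script id="__NEXT_DATA__" ...>{...}</script>; a regex is
# good enough for a sketch, a real HTML parser is more robust.
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def find_next_data(html: str):
    """Return the parsed __NEXT_DATA__ JSON, or None if the page lacks it."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        # This is the situation described in this issue: the page came
        # back without the script tag, so there is nothing to parse.
        return None
    return json.loads(match.group(1))


# Made-up sample pages; the JSON shape is an assumption for the demo.
good_page = (
    '<html><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"title": "x"}}</script></html>'
)
bad_page = "<html><body>temporary error</body></html>"
```

Returning None explicitly when the tag is missing lets the caller decide what to do with a bad response, instead of failing later on an attribute access.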

We currently have a few recipes intermittently failing with the error: An error occurred: 'NoneType' object has no attribute 'string'.

Looking at the HTML content, there is no __NEXT_DATA__ JSON inside the page.

Loading the page again on my machine, the __NEXT_DATA__ JSON is there.

So clearly the scraper should be more resilient to intermittent bad responses from the TED server.

This was indeed the case in 2.10.0, where extract_info_from_video_page had retry logic; it got dropped in https://github.com/openzim/ted/pull/130/files when adapting to the new DOM.

I think we should just restore this functionality by again pausing 5 secs and trying again up to 5 times, just like in 2.10.0.
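A minimal sketch of what such a retry loop could look like, under stated assumptions: this is neither the 2.10.0 code nor the eventual fix. The names extract_with_retry and MissingNextDataError are hypothetical, "up to 5 times" is read here as 5 attempts in total, and fetch/extract are injected callables so the loop can be exercised without network access.

```python
import time


class MissingNextDataError(Exception):
    """Raised when every attempt returned a page without usable data."""


def extract_with_retry(fetch, extract, attempts=5, pause_seconds=5,
                       sleep=time.sleep):
    """Call extract(fetch()) until it returns something other than None.

    fetch: callable returning the page content (hypothetical).
    extract: callable returning parsed data, or None on a bad page.
    sleep: injectable so tests do not have to wait for real.
    """
    for attempt in range(1, attempts + 1):
        data = extract(fetch())
        if data is not None:
            return data
        # Pause only between attempts, not after the last one.
        if attempt < attempts:
            sleep(pause_seconds)
    raise MissingNextDataError(
        f"no usable data after {attempts} attempts"
    )


# Simulated run: two bad responses, then a good one.
pages = iter([None, None, "ok"])
recorded_pauses = []
result = extract_with_retry(
    fetch=lambda: next(pages),
    extract=lambda page: page,
    sleep=recorded_pauses.append,
)
```

Raising a dedicated exception after the last attempt keeps a persistent failure visible in logs, rather than surfacing as an unrelated AttributeError further down.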

benoit74 · 2024-07-01T06:34:58Z

Moving this to 3.1.0: it is mostly straightforward to implement and seems to randomly impact about 5-10% of the recipes.

benoit74 · 2024-07-10T12:36:52Z

Since we currently have no plan for when we will be able to work on 3.1.0, and since this bug makes the success of https://farm.openzim.org/recipes/ted_topic_all mostly impossible, I'm going to make a patch release 3.0.3.

benoit74 added the bug and good first issue labels Jun 28, 2024

benoit74 added this to the 3.1.0 milestone Jul 1, 2024

benoit74 mentioned this issue Jul 10, 2024

Add retry logic with detailled logs to extraction of video data #214

Merged

benoit74 modified the milestones: 3.1.0, 3.0.3 Jul 10, 2024

benoit74 closed this as completed in #214 Jul 10, 2024