skip download of speaker image if url is "-" #228

elfkuzco · 2024-10-10T13:34:21Z

Rationale

Sometimes, speaker image URLs are invalid and causes the scraper to throw errors after attempting to fetch the resource. This PR ensures the hostname of the URLs are at least valid before attempting to fetch the resource.
This PR resolves #224

Changes

Add a helper function for validating URIs are valid
Ensure resource passes validator logic before attempting to fetch resource

rgaudin · 2024-10-11T08:08:03Z

@benoit74 the tests are failing due to failure to upload to codecov. I think we need to fix that and also not fail on such scenario, as we did on another scraper (youtube?)

benoit74

Shouldn't we enhance the tests to also check for:

bare URL like just -, www.host.com (without the https://)
absolute URLs like //www.host.com
At least one valid URL with file name and query string (here we have only domain names)

For some I'm not sure if these URLs should be valid or not, tbc if rest of the code is capable to handle such a URL, and I'm not sure that we might receive these as speaker URL, but I'm sure it should be clear what would happen in such a case.

I will fix the codecov issue as mentioned by rgaudin: add proper credentials and make codecov results informational.

benoit74 · 2024-10-11T10:17:50Z

Edit: codecov results are already informational, see https://github.com/openzim/ted/blob/main/codecov.yml ; here CI is failing because upload failed. Which is a good thing I think, we want the upload to be done.

codecov · 2024-10-11T10:18:16Z

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 5.00%. Comparing base (19bd9c0) to head (88c69df).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/ted2zim/scraper.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##            main    #228      +/-   ##
========================================
- Coverage   5.00%   5.00%   -0.01%     
========================================
  Files          8       8              
  Lines       1098    1100       +2     
  Branches     238     239       +1     
========================================
  Hits          55      55              
- Misses      1042    1044       +2     
  Partials       1       1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rgaudin · 2024-10-11T10:46:23Z

Edit: codecov results are already informational, see https://github.com/openzim/ted/blob/main/codecov.yml ; here CI is failing because upload failed. Which is a good thing I think, we want the upload to be done.

OK, maybe the repo is not properly configured then? I remember we have to add the TOKEN for some a while ago

benoit74 · 2024-10-11T11:54:29Z

OK, maybe the repo is not properly configured then? I remember we have to add the TOKEN for some a while ago

I've updated the CODECOV_TOKEN, it was only 4 months old so I'm fairly surprised it suddenly stopped working. Maybe it was just a glitch somewhere. Anyway, now it worked fine.

elfkuzco · 2024-10-11T15:26:00Z

Shouldn't we enhance the tests to also check for:
* bare URL like just `-`, `www.host.com` (without the `https://`)

* absolute URLs like `//www.host.com`

* At least one valid URL with file name and query string (here we have only domain names)
For some I'm not sure if these URLs should be valid or not, tbc if rest of the code is capable to handle such a URL, and I'm not sure that we might receive these as speaker URL, but I'm sure it should be clear what would happen in such a case.

I ran some tests locally with sample urls and for bare urls like www.host.com (without a scheme), it fails because the urlparse only recognizes a netloc only if it is properly introduced by //. It assumes the url is a relative URL and places it in the path component of the result. Absolute urls like //www.host.com work as expected.

I didn't want to consider query string and other parts seeing as the input is coming from an external source (not user input). Thus, I decided to restrict it to checking the validity of the domain names because the error was about a name resolution failure (due to invalid domain name). It's also why I named the function is_valid_uri as it checks only URI and not the full URL.

I could refactor it into a more robust url matcher though. What do you think?

EDIT: Or would it be better to use a third-party library to validate this as url regex matching can get complicated. Noticed mine doesn't cover IPv6 urls although I don't think any of the speaker images match that pattern.

benoit74 · 2024-10-14T12:42:26Z

I just rebased your branch to use latest codecov action, not sure it was the problem around failing CI, but it might be.

benoit74 · 2024-10-14T12:42:38Z

I didn't want to consider query string and other parts seeing as the input is coming from an external source (not user input). Thus, I decided to restrict it to checking the validity of the domain names because the error was about a name resolution failure (due to invalid domain name). It's also why I named the function is_valid_uri as it checks only URI and not the full URL.

I'm sorry but this makes no sense to me.

First because a URI is more generic than a URL, which is necessarily online (where a URI could be found in an XML document for instance and is not online), so I don't get the relation of naming it is_valid_uri and the fact that we check only hostname in this code. Second because the speaker URL we are dealing with is necessarily not only a hostname, but a full URL to a picture. And even if there is no querystring right now (not even sure about that), I'm 100% sure we might get some in the future. And I don't want to see this break in the future.

If checking this URL for validity is too complex, I don't mind to just ignore - URL, this is the only occurrence we have ATM. And I don't want to make TED codebase too complex / depending on too many libraries just for "what-if" situations. The idea of checking URL for validity was just a suggestion, for the case where it would be cheap.

elfkuzco · 2024-10-15T11:57:37Z

If checking this URL for validity is too complex, I don't mind to just ignore - URL, this is the only occurrence we have ATM. And I don't want to make TED codebase too complex / depending on too many libraries just for "what-if" situations. The idea of checking URL for validity was just a suggestion, for the case where it would be cheap.

I looked up existing URL matchers and I think the simple approach of checking against - works fine. I refactored to match against it and cleared up the old code. Added a #TODO comment in case one wants to read it in the future. What do you think?
Given the scraper works fine without this and it's just a logging error, I'm okay with closing the PR without merging.

rgaudin · 2024-10-15T13:46:30Z

I think this is more appropriate indeed: to only focus on the existing issue. I suggest you remove the TODO comment as it has little value and makes some QA checker complain.

benoit74

LGTM, thank you ! I indeed prefer this simple thing for now.

rgaudin requested a review from benoit74 October 11, 2024 08:07

benoit74 reviewed Oct 11, 2024

View reviewed changes

elfkuzco requested a review from benoit74 October 14, 2024 09:21

benoit74 force-pushed the ignore-invalid-download-urls branch from d8bf387 to 17ad543 Compare October 14, 2024 12:41

skip download of speaker image if url is "-"

88c69df

elfkuzco force-pushed the ignore-invalid-download-urls branch from 80f2c09 to 88c69df Compare October 15, 2024 15:40

elfkuzco changed the title ~~skip download of speaker image if url is invalid~~ skip download of speaker image if url is "-" Oct 15, 2024

benoit74 approved these changes Oct 16, 2024

View reviewed changes

benoit74 merged commit 4acb963 into openzim:main Oct 16, 2024
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

skip download of speaker image if url is "-" #228

skip download of speaker image if url is "-" #228

elfkuzco commented Oct 10, 2024

rgaudin commented Oct 11, 2024

benoit74 left a comment

benoit74 commented Oct 11, 2024

codecov bot commented Oct 11, 2024 •

edited

Loading

rgaudin commented Oct 11, 2024

benoit74 commented Oct 11, 2024

elfkuzco commented Oct 11, 2024 •

edited

Loading

benoit74 commented Oct 14, 2024

benoit74 commented Oct 14, 2024

elfkuzco commented Oct 15, 2024

rgaudin commented Oct 15, 2024

benoit74 left a comment

skip download of speaker image if url is "-" #228

skip download of speaker image if url is "-" #228

Conversation

elfkuzco commented Oct 10, 2024

Rationale

Changes

rgaudin commented Oct 11, 2024

benoit74 left a comment

Choose a reason for hiding this comment

benoit74 commented Oct 11, 2024

codecov bot commented Oct 11, 2024 • edited Loading

Codecov Report

rgaudin commented Oct 11, 2024

benoit74 commented Oct 11, 2024

elfkuzco commented Oct 11, 2024 • edited Loading

benoit74 commented Oct 14, 2024

benoit74 commented Oct 14, 2024

elfkuzco commented Oct 15, 2024

rgaudin commented Oct 15, 2024

benoit74 left a comment

Choose a reason for hiding this comment

codecov bot commented Oct 11, 2024 •

edited

Loading

elfkuzco commented Oct 11, 2024 •

edited

Loading