Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip download of speaker image if url is "-" #228

Merged
merged 1 commit into from
Oct 16, 2024

Conversation

elfkuzco
Copy link
Contributor

Rationale

Sometimes, speaker image URLs are invalid and causes the scraper to throw errors after attempting to fetch the resource. This PR ensures the hostname of the URLs are at least valid before attempting to fetch the resource.
This PR resolves #224

Changes

  • Add a helper function for validating URIs are valid
  • Ensure resource passes validator logic before attempting to fetch resource

@rgaudin rgaudin requested a review from benoit74 October 11, 2024 08:07
@rgaudin
Copy link
Member

rgaudin commented Oct 11, 2024

@benoit74 the tests are failing due to failure to upload to codecov. I think we need to fix that and also not fail on such scenario, as we did on another scraper (youtube?)

Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we enhance the tests to also check for:

  • bare URL like just -, www.host.com (without the https://)
  • absolute URLs like //www.host.com
  • At least one valid URL with file name and query string (here we have only domain names)

For some I'm not sure if these URLs should be valid or not, tbc if rest of the code is capable to handle such a URL, and I'm not sure that we might receive these as speaker URL, but I'm sure it should be clear what would happen in such a case.

I will fix the codecov issue as mentioned by rgaudin: add proper credentials and make codecov results informational.

@benoit74
Copy link
Collaborator

Edit: codecov results are already informational, see https://github.com/openzim/ted/blob/main/codecov.yml ; here CI is failing because upload failed. Which is a good thing I think, we want the upload to be done.

Copy link

codecov bot commented Oct 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 5.00%. Comparing base (19bd9c0) to head (88c69df).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/ted2zim/scraper.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #228      +/-   ##
========================================
- Coverage   5.00%   5.00%   -0.01%     
========================================
  Files          8       8              
  Lines       1098    1100       +2     
  Branches     238     239       +1     
========================================
  Hits          55      55              
- Misses      1042    1044       +2     
  Partials       1       1              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rgaudin
Copy link
Member

rgaudin commented Oct 11, 2024

Edit: codecov results are already informational, see https://github.com/openzim/ted/blob/main/codecov.yml ; here CI is failing because upload failed. Which is a good thing I think, we want the upload to be done.

OK, maybe the repo is not properly configured then? I remember we have to add the TOKEN for some a while ago

@benoit74
Copy link
Collaborator

OK, maybe the repo is not properly configured then? I remember we have to add the TOKEN for some a while ago

I've updated the CODECOV_TOKEN, it was only 4 months old so I'm fairly surprised it suddenly stopped working. Maybe it was just a glitch somewhere. Anyway, now it worked fine.

@elfkuzco
Copy link
Contributor Author

elfkuzco commented Oct 11, 2024

Shouldn't we enhance the tests to also check for:

* bare URL like just `-`, `www.host.com` (without the `https://`)

* absolute URLs like `//www.host.com`

* At least one valid URL with file name and query string (here we have only domain names)

For some I'm not sure if these URLs should be valid or not, tbc if rest of the code is capable to handle such a URL, and I'm not sure that we might receive these as speaker URL, but I'm sure it should be clear what would happen in such a case.

I ran some tests locally with sample urls and for bare urls like www.host.com (without a scheme), it fails because the urlparse only recognizes a netloc only if it is properly introduced by //. It assumes the url is a relative URL and places it in the path component of the result. Absolute urls like //www.host.com work as expected.

I didn't want to consider query string and other parts seeing as the input is coming from an external source (not user input). Thus, I decided to restrict it to checking the validity of the domain names because the error was about a name resolution failure (due to invalid domain name). It's also why I named the function is_valid_uri as it checks only URI and not the full URL.

I could refactor it into a more robust url matcher though. What do you think?

EDIT: Or would it be better to use a third-party library to validate this as url regex matching can get complicated. Noticed mine doesn't cover IPv6 urls although I don't think any of the speaker images match that pattern.

@benoit74
Copy link
Collaborator

I just rebased your branch to use latest codecov action, not sure it was the problem around failing CI, but it might be.

@benoit74
Copy link
Collaborator

I didn't want to consider query string and other parts seeing as the input is coming from an external source (not user input). Thus, I decided to restrict it to checking the validity of the domain names because the error was about a name resolution failure (due to invalid domain name). It's also why I named the function is_valid_uri as it checks only URI and not the full URL.

I'm sorry but this makes no sense to me.

First because a URI is more generic than a URL, which is necessarily online (where a URI could be found in an XML document for instance and is not online), so I don't get the relation of naming it is_valid_uri and the fact that we check only hostname in this code. Second because the speaker URL we are dealing with is necessarily not only a hostname, but a full URL to a picture. And even if there is no querystring right now (not even sure about that), I'm 100% sure we might get some in the future. And I don't want to see this break in the future.

If checking this URL for validity is too complex, I don't mind to just ignore - URL, this is the only occurrence we have ATM. And I don't want to make TED codebase too complex / depending on too many libraries just for "what-if" situations. The idea of checking URL for validity was just a suggestion, for the case where it would be cheap.

@elfkuzco
Copy link
Contributor Author

If checking this URL for validity is too complex, I don't mind to just ignore - URL, this is the only occurrence we have ATM. And I don't want to make TED codebase too complex / depending on too many libraries just for "what-if" situations. The idea of checking URL for validity was just a suggestion, for the case where it would be cheap.

I looked up existing URL matchers and I think the simple approach of checking against - works fine. I refactored to match against it and cleared up the old code. Added a #TODO comment in case one wants to read it in the future. What do you think?
Given the scraper works fine without this and it's just a logging error, I'm okay with closing the PR without merging.

@rgaudin
Copy link
Member

rgaudin commented Oct 15, 2024

I think this is more appropriate indeed: to only focus on the existing issue. I suggest you remove the TODO comment as it has little value and makes some QA checker complain.

@elfkuzco elfkuzco changed the title skip download of speaker image if url is invalid skip download of speaker image if url is "-" Oct 15, 2024
Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you ! I indeed prefer this simple thing for now.

@benoit74 benoit74 merged commit 4acb963 into openzim:main Oct 16, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not try to download inappropriate speaker image URL
3 participants