Add capability to blacklist some websites and redirect them to library / Github issue #124

Open: wants to merge 2 commits into base: main

@benoit74 (Collaborator) commented Feb 28, 2025

Fix #28
Fix #33

Changes

  • add a blacklist of URLs we do not want to process on zimit, with details (when possible) about what the users should do next

Details

  • the blacklist is stored in a blacklist.json file in the repo so that it is both simpler and subject to code reviews to catch misconfigurations (as requested by @kelson42)
  • blocking based on the blacklist is done at zimit-frontend API level (i.e. it is not possible to bypass this)
  • the blacklist is based on very basic case-insensitive matching of the URL (i.e. if the configured host is present in the URL, then it's a match) and leads to 5 distinct situations, detailed below. The case-insensitive matching might lead to a few false positives, but this is deemed acceptable: these are very rare edge cases which are probably not worth any effort.
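
The matching rule described above could be sketched roughly as follows. This is only an illustration, not the code from this PR: the helper name `find_blacklisted_host` and the sample hosts are assumptions, and the real entries live in blacklist.json.

```python
from typing import Iterable, Optional


def find_blacklisted_host(url: str, hosts: Iterable[str]) -> Optional[str]:
    """Return the first configured host found in the URL, ignoring case.

    Hypothetical helper: a plain substring test, as described in the PR.
    A host such as "wikipedia.org" therefore also matches
    "https://notwikipedia.org/", one of the rare false positives
    that the PR considers acceptable.
    """
    lowered = url.lower()
    for host in hosts:
        if host.lower() in lowered:
            return host
    return None


# Sample hosts made up for this example.
SAMPLE_HOSTS = ["wikipedia.org", "youtube.com"]

print(find_blacklisted_host("https://EN.Wikipedia.org/wiki/Kiwix", SAMPLE_HOSTS))
print(find_blacklisted_host("https://example.com/", SAMPLE_HOSTS))
```

Running such a check in the API request handler, rather than only in the UI, is what would make the block impossible to bypass from the client side.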

already_zimed

We already have a ZIM for the URL (e.g. devdocs, freecodecamp, libretexts, wikipedia) and we want to redirect the user to the library.

Note that for websites covered by WP1, it is possible to add the WP1 hint as below (it is not shown by default). The link goes to https://wp1.openzim.org/#/selections/simple

The library link is configured in the blacklist.json

image

forbid_or_copyrighted_by_website_owner

We know there is a copyright or similar issue with this website (e.g. wikihow).

Screenshot 2025-02-28 at 09 34 59

too_big_partially_already_zimed

We cannot make a ZIM of such a big site; we have a dedicated scraper and already publish a few ZIMs (e.g. youtube).

Screenshot 2025-02-28 at 09 35 14

Note that the scraper URL is optional (if it is not configured, the last sentence is not shown).

scraper_needed

This website cannot be zimmed with zimit, and we have a pending scraper request.

Screenshot 2025-02-28 at 09 34 11

not_possible_with_zimit

This website is known to be impossible to ZIM with zimit

Screenshot 2025-02-28 at 09 35 47
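
To make the five situations above more concrete, here is a minimal sketch of how entries could be modelled and validated when loading the configuration. The five reason names come from this PR; everything else (the field names `host`, `library_url`, `wp1_hint`, `scraper_url`, the JSON layout, and the placeholder URL) is an assumption for illustration and may not match the actual blacklist.json.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Reason(str, Enum):
    """The five situations listed in this PR."""

    ALREADY_ZIMED = "already_zimed"
    FORBID_OR_COPYRIGHTED_BY_WEBSITE_OWNER = "forbid_or_copyrighted_by_website_owner"
    TOO_BIG_PARTIALLY_ALREADY_ZIMED = "too_big_partially_already_zimed"
    SCRAPER_NEEDED = "scraper_needed"
    NOT_POSSIBLE_WITH_ZIMIT = "not_possible_with_zimit"


@dataclass
class BlacklistEntry:
    # All field names are assumptions made for this sketch.
    host: str
    reason: Reason
    library_url: Optional[str] = None  # used by already_zimed
    wp1_hint: bool = False  # the WP1 hint is not shown by default
    scraper_url: Optional[str] = None  # optional, see too_big_partially_already_zimed


def parse_entries(raw_json: str) -> List[BlacklistEntry]:
    """Parse a JSON list of entries; an unknown reason raises ValueError."""
    entries = []
    for item in json.loads(raw_json):
        entries.append(
            BlacklistEntry(
                host=item["host"],
                reason=Reason(item["reason"]),
                library_url=item.get("library_url"),
                wp1_hint=item.get("wp1_hint", False),
                scraper_url=item.get("scraper_url"),
            )
        )
    return entries


# Hypothetical sample; the library URL is a placeholder.
SAMPLE = """
[
  {"host": "wikipedia.org", "reason": "already_zimed",
   "library_url": "https://library.example/wikipedia", "wp1_hint": true},
  {"host": "youtube.com", "reason": "too_big_partially_already_zimed"}
]
"""

for entry in parse_entries(SAMPLE):
    print(entry.host, entry.reason.value, entry.scraper_url)
```

Rejecting unknown reasons at load time is one way to catch misconfigurations early, in addition to the code review that the file is already subject to.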

Remarks:

  • the blacklist configured in this PR covers all simple situations
    • it covers a bit more than 80% of the configured recipes in the Zimfarm;
    • what is missing is mwoffliner recipes linked to "isolated" mediawiki instances + all zimit recipes + maybe a few nautilus ones
    • for all these cases we need one rule in the blacklist + one library URL linking directly to the corresponding ZIM, and it is really a pain to configure, not to mention maintain
    • I considered that covering 80% of the recipes is an acceptable first step and probably covers most of the situations we do not want to see anymore on zimit.kiwix.org
    • it is anyway only a matter of configuration / tooling
  • the previous point means that we do not cover at all websites which are deemed to fail quickly
    • I don't consider this a problem, because websites which are deemed to fail quickly are not really a big pain in terms of resource consumption
    • not adding these individual cases will simplify maintenance (in most cases we have no clue about when a given website is going to work again)
    • websites which are known to be bad but take time to fail can be added in the aftermath of this PR
  • all strings are ready for i18n in TranslateWiki
  • there is some code duplication between the various Blacklistxxx.vue components, but it was deemed simpler to maintain than complex if/then/else conditions

Flow

Screen.Recording.2025-02-28.at.09.32.44.mov

codecov bot commented Feb 28, 2025

Codecov Report

Attention: Patch coverage is 37.50000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 56.58%. Comparing base (f77b192) to head (c58426e).
Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
api/src/zimitfrontend/routes/requests.py | 0.00% | 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #124      +/-   ##
==========================================
- Coverage   56.66%   56.58%   -0.08%     
==========================================
  Files          12       12              
  Lines         533      539       +6     
  Branches       77       78       +1     
==========================================
+ Hits          302      305       +3     
- Misses        229      232       +3     
  Partials        2        2              


@benoit74 benoit74 marked this pull request as ready for review February 28, 2025 10:53
@benoit74 benoit74 requested a review from rgaudin February 28, 2025 10:53
@benoit74 (Collaborator, Author) commented

@Popolechien @kelson42 feedback is of course welcome, since this PR also contains quite significant "UX" parts

@rgaudin (Member) commented Feb 28, 2025

I disagree with the copyright stuff. zimit is for individual copies; we are not publishing the ZIMs. We should not go down this road IMO.

If wikihow can't be zimed because of technical protections, then it should be in the impossible category.

@Popolechien (Contributor) commented Feb 28, 2025

I agree that indicating that Kiwix has been explicitly asked not to provide a zim is basically saying that we can be strongarmed into things. I'd rather stay vague. Wikihow for instance is at this stage a copyright issue for me.

@benoit74 (Collaborator, Author) commented

Not my call, feel free to suggest proper wording / configuration. I'm not particularly attached to this copyright thing at all.

@Popolechien (Contributor) commented

Ok, I don't really know how to input my changes into an existing commit, but I'd change forbid_or_copyrighted_by_website_owner from "This website is protected by copyrights etc." to "This website is protected by copyright. If you are the website's owner and would like to make it available, feel free to reach out to hello@kiwix.org"

@benoit74 (Collaborator, Author) commented Mar 3, 2025

Ok, I don't really know how to input my changes into an existing commit, but I'd change [...]

I will take care of this. Regarding contact, are we sure we want to publish the contact email in plain text? (This is usually a good way to attract spam.) If so, it needs to be done everywhere on the website (we have many places where we just say "contact us").
