Zimit2: Allow deduplication of entries #199

Open
benoit74 opened this issue Mar 1, 2024 · 3 comments
Labels
bug, enhancement, question
Milestone

Comments

Collaborator

benoit74 commented Mar 1, 2024

It looks like Zimcheck is complaining about quality issues in most (all?) Zimit2 files.

It already did so for Zimit1, but maybe it is time to address the problems.

The first obvious problem is that lots of content is duplicated inside the ZIM due to different URLs leading to the same content. I think this could be pretty easily addressed (even if it clearly means additional processing to deduplicate).

{
    "check": "redundant",
    "level": "WARNING",
    "message": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png and solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
    "path1": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
    "path2": "solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png"
},
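A warning like the one above suggests one possible approach: hash each entry's content while building the ZIM, store the bytes only once, and point later paths with the same content at the first one. The sketch below is only an illustration under that assumption. `plan_entries` is a hypothetical helper, not part of zimit, scraperlib or libzim, and the image bytes are a placeholder.

```python
import hashlib


def plan_entries(items):
    """Split (path, content) pairs into stored entries and aliases.

    Hypothetical helper: `stored` maps a path to its bytes, `aliases`
    maps a duplicate path to the first path seen with the same content.
    """
    first_path_by_digest = {}
    stored = {}
    aliases = {}
    for path, content in items:
        digest = hashlib.sha256(content).hexdigest()
        original = first_path_by_digest.get(digest)
        if original is None:
            # First time this content is seen: keep the bytes.
            first_path_by_digest[digest] = path
            stored[path] = content
        else:
            # Same bytes already stored: only record a pointer to them.
            aliases[path] = original
    return stored, aliases


# Placeholder bytes standing in for the real image.
png = b"\x89PNG placeholder bytes"
base = "solar.lowtechmagazine.com/"
tail = ("2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/"
        "images/dithers/aerial-ropeway-colour-2_dithered.png")

stored, aliases = plan_entries([(base + tail, png), (base + "fr/" + tail, png)])
```

Here the `/fr/` path ends up in `aliases` and only one copy of the bytes is kept. The cost is hashing every entry, which is the additional processing mentioned above.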

For a website like solar.lowtechmagazine.com, which is available in multiple languages, it could even make a significant difference in final file size. I am not sure compression manages to cancel out duplicated content like this well; at least some people say it cannot, e.g. https://superuser.com/a/479083.
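To illustrate the compression doubt, here is a small experiment with Python's `zlib`. It is chosen only because it ships with Python and has a 32 KiB window; it is not the compressor used for ZIM files, so this shows the general effect rather than what actually happens inside a ZIM. The random bytes are an assumed stand-in for an already-compressed image.

```python
import os
import zlib

# Stand-in for an incompressible asset, larger than DEFLATE's 32 KiB window.
asset = os.urandom(100_000)

single = len(zlib.compress(asset, 9))
# The same asset stored twice, as when two URLs lead to the same content.
doubled = len(zlib.compress(asset + asset, 9))

print(single, doubled)
```

With a window this small, the second copy is too far from the first to be referenced, so the doubled input compresses to roughly twice the size of the single one. A compressor with a longer window could behave differently, and duplicates that are compressed separately never see each other at all.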

benoit74 added the bug label Mar 1, 2024
Member

rgaudin commented Mar 1, 2024

The new alias might be of help

kelson42 changed the title from "Zimit2: Fix zimcheck issues" to "Zimit2: Allow deduplication of entries" Mar 6, 2024
Contributor

kelson42 commented Mar 6, 2024

To me, Zimcheck "warnings" are not a priority to address, particularly right now. IMO this should be a feature request and be descoped from the "Zimit2" project.

A solution for this deduplication feature was proposed years ago at the scraperlib level.

kelson42 added the question and enhancement labels Mar 6, 2024
Collaborator Author

benoit74 commented Mar 7, 2024

Addressing all Zimcheck "warnings" may not be a priority, but from my PoV we could consider avoiding the creation of artificially big ZIMs. I do not mind if we de-scope this.

I don't know why someone proposed a PR to fix openzim/python-scraperlib#33 but never finished the job!

I'm joking, of course; I was probably very tired or angry at someone else that day. I intend to finish this PR to fix this zimit2 issue; it was not that far from being OK.

benoit74 added this to the 2.1.0 milestone May 17, 2024
benoit74 modified the milestones: 2.1.0, 2.2.0 Jun 18, 2024
benoit74 modified the milestones: 2.2.0, later Aug 5, 2024