Alien content inside ZIM file #172

Closed
Leirda01 opened this issue Nov 12, 2022 · 5 comments
@Leirda01

As suggested here (fr), I opened this issue because, in some cases, the zim file returned by the https://youzim.it service is bigger than the original website size.

Here's an example with https://www.libraryofjuggling.com as a reference:

$ # What is the original website size?
$ wget -qr https://www.libraryofjuggling.com/
$ du -sh ./www.libraryofjuggling.com/
182M    www.libraryofjuggling.com/
$
$ # What if we use the zimwriterfs command?
$ zimwriterfs --version
zim-tools 3.1.2

libzim 8.0.1
+ libzstd 1.5.2
+ liblzma 5.2.7
+ libxapian 1.4.20
+ libicu 72.1.0
$ zimwriterfs \
    --welcome="Home.html" \
    --illustration="jugglebanner.jpg" \
    --language="eng" \
    --title="Library of Juggling" \
    --description="An attempt to list all of the popular (and perhaps not so popular) juggling tricks" \
    --creator="libraryofjuggling@gmail.com" \
    --publisher="Kiwix" \
    ./www.libraryofjuggling.com/ ./libraryofjuggling.zim
Resolve redirect
set index
$ du -sh ./libraryofjuggling.zim
178M    ./libraryofjuggling.zim
$
$ # What is the size of the archive I retrieved with the You Zim It service?
$ du -sh ./www.libraryofjuggling.com_5812d0d1.zim
194M    ./www.libraryofjuggling.com_5812d0d1.zim

Here's a summary for this website:

| Provenance  | Size (M) | Ratio |
|-------------|----------|-------|
| Website     | 182      | 1     |
| zimwriterfs | 178      | 0.97  |
| You Zim It  | 194      | 1.06  |
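
The ratios in the summary can be recomputed from the `du -sh` figures in the transcript. This is only a small sketch; the dictionary and function names are made up for illustration. Note that it rounds to the nearest hundredth, so the last digit can differ from the table, which appears to truncate instead.

```python
# Sizes in M, as reported by `du -sh` in the transcript above.
sizes = {"Website": 182, "zimwriterfs": 178, "You Zim It": 194}


def ratios(sizes, reference="Website"):
    """Return each size divided by the reference size, rounded to 2 decimals."""
    ref = sizes[reference]
    return {name: round(size / ref, 2) for name, size in sizes.items()}


result = ratios(sizes)
for name, ratio in result.items():
    print(f"{name}: {ratio}")
```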

As you can see, the difference is not a big deal here, but what if it happens with bigger websites?
What may be the cause(s)?
How can I help?

@kelson42 kelson42 self-assigned this Nov 26, 2022
@kelson42
Contributor

Perfectly possible, in particular if there is a lot of text. ZIM files embed a full-text search index.

@mgautierfr
Contributor

Yes, but the ZIM file created with zimwriterfs also contains the index, and it is smaller than the "website".

The youzimit ZIM file is bigger, so it either contains something else (headers?) or is less compressed.

@kelson42
Contributor

kelson42 commented Mar 8, 2023

Yes, the ZIM file also contains dedicated ZIM articles for the HTTP headers.

@mgautierfr
Contributor

The 182 MB of source content includes 178 MB of GIFs, which are not compressed.
So both zimwriterfs and "you zim it" can compress only about 3.5 MB of data.

zimwriterfs does a pretty decent job here: it has to keep at least 178 MB of uncompressed data, and the full file size is 178 MB (we lose some information here, as we don't know the exact sizes, only an approximation in MB).

But with "you zim it", it seems that 16 MB of data is added. Dumping the "you zim it" ZIM file shows a directory A/edgedl.mo.gvt1.com/ (alongside ajax.googleapis.com and update.googleapis.com) which is about 16 MB. It seems to contain some kind of Google Chrome extension.
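
To locate this kind of extra content in a dump, one option is to total the file sizes under each top-level directory. The sketch below assumes the ZIM has already been dumped to a local directory (for instance with a tool such as zimdump); the function names and the dump path are hypothetical.

```python
import os


def top_level_sizes(root):
    """Map each immediate subdirectory of `root` to the total bytes of files under it."""
    totals = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue  # only directories (one per host) are of interest here
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                # lstat so that symlinks are measured themselves, not followed
                total += os.lstat(os.path.join(dirpath, name)).st_size
        totals[entry.name] = total
    return totals


def report(root):
    """Print the subdirectories of `root`, largest first, with sizes in MB."""
    ranked = sorted(top_level_sizes(root).items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ranked:
        print(f"{size / 1_000_000:8.1f} MB  {name}")
```

Calling `report("dump/A")` on a dump directory (the path is a placeholder) would list the hosts by size, which makes a directory such as A/edgedl.mo.gvt1.com/ stand out.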

The only two references to google in the source are:

  • <script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script> in Tricks/3balltricks/3balljongliertricks
  • <a href="https://www.google.com/search?rlz=1C1CHNY_enUS399US399&amp;sourceid=chrome&amp;ie=UTF-8&amp;q=Learning+to+Juggle"> in Tricks/3balltricks/Cascade.html
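
A search like the one behind this list can be sketched as a plain text scan over a directory tree. This is an illustration only, not necessarily how the references above were found; the function name and parameters are assumptions.

```python
import os


def find_references(root, needle, extensions=None):
    """Return (relative_path, line_number, line) for each line containing `needle`.

    The match is case-insensitive. If `extensions` is given (a tuple such as
    (".html",)), only files with those suffixes are scanned; otherwise all files are.
    """
    needle = needle.lower()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if extensions is not None and not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on non-UTF-8 or binary files
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if needle in line.lower():
                        hits.append((os.path.relpath(path, root), lineno, line.strip()))
    return hits
```

Scanning all files matters here because some entries, like Tricks/3balltricks/3balljongliertricks, have no file extension; passing `extensions=(".html",)` would speed things up on a mirror dominated by large GIFs but would skip such files.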

I couldn't find the article Tricks/3balltricks/3balljongliertricks in the source; the content stored in the ZIM file is the 404 page returned when you try to access https://www.libraryofjuggling.com/Tricks/3balltricks/3balljongliertricks. The link to this URL is in https://www.libraryofjuggling.com/Tricks/3balltricks/FakeColumns.html ("billi billert (demonstration)").

The second one is the link for the "Many Others ..." in https://www.libraryofjuggling.com/Tricks/3balltricks/Cascade.html
I think it should be considered an external link and should not be included in the ZIM file.
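
The "external link" distinction can be sketched as a host comparison after resolving the link against the page URL. This is a minimal illustration with a hypothetical helper name, not how zimit or its crawler actually decides what to include; it only compares host names and treats anything without the page's host (including non-HTTP schemes) as external.

```python
from urllib.parse import urljoin, urlparse


def is_external(link, page_url):
    """True when `link`, resolved against `page_url`, points to a different host."""
    target = urlparse(urljoin(page_url, link))
    page = urlparse(page_url)
    return target.hostname != page.hostname
```

With `page_url` set to https://www.libraryofjuggling.com/Tricks/3balltricks/Cascade.html, the Google search link is classified as external while a relative link such as `FakeColumns.html` is not. A protocol-relative script URL like `//ajax.googleapis.com/...` is also on another host, although a crawler may still want to keep page resources that are needed for rendering.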

So it seems that "you zim it" is a bit too greedy here. I'm moving the issue to the zimit repository.

@mgautierfr mgautierfr transferred this issue from openzim/libzim Mar 8, 2023
@kelson42 kelson42 removed their assignment Mar 8, 2023
@rgaudin
Member

rgaudin commented Mar 10, 2023

Those files are actually Google Chrome update data. I could easily reproduce so it's not youzim.it specific.

How did those get into the ZIM? I'm not sure, but I believe it's just a matter of how long the crawl lasts. If the browser has enough time to phone home and download, then you'll find those files in the ZIM, as checking other youzim.it files confirms.

If the crawl is too short, you don't get them.

It's possible that some URL triggers it but I don't think so and it's difficult to test (you need a large enough source so it spends enough time but you need to control all URLs as well).

This is clearly not content that we want, and we certainly don't want to update Chrome inside a running container. I'm opening a ticket upstream and have also fixed it in our zimit image in 6324b7c.
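
To check whether an existing ZIM picked up this kind of browser traffic, one could compare the hosts found in its entry paths against the hosts that were meant to be crawled. The sketch below is not the fix from 6324b7c (whose content is not shown in this thread); it assumes entry paths of the form `A/<host>/...` as seen in the dump above, and the function name and parameters are hypothetical.

```python
def alien_hosts(entry_paths, expected_hosts, namespace="A"):
    """Return the sorted hosts found under `namespace/` that are not in `expected_hosts`."""
    found = set()
    for path in entry_paths:
        parts = path.split("/")
        # Expect at least "<namespace>/<host>"; skip other namespaces and empty hosts.
        if len(parts) >= 2 and parts[0] == namespace and parts[1]:
            found.add(parts[1])
    return sorted(found - set(expected_hosts))
```

Given a list of entry paths from a ZIM and the set `{"www.libraryofjuggling.com"}`, the result would list any other host present in the file, which is where directories like edgedl.mo.gvt1.com would show up.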

@rgaudin rgaudin changed the title from "The zim file created by https://youzim.it/ might be bigger than the original website" to "Alien content inside ZIM file" Mar 10, 2023