Alien content inside ZIM file #172
Comments
Perfectly possible, in particular if there is a lot of text. A ZIM file embeds a full-text search index.
Yes, but the ZIM file created with zimwriterfs contains the index too, and it is smaller than the "website". The youzim.it ZIM file is bigger, so it either contains something else (headers?) or is less compressed.
Yes, the ZIM file contains dedicated ZIM articles for the HTTP headers as well.
The 182MB of source content includes 178MB of GIFs, which are not compressed. zimwriterfs does a pretty decent job here: it has to keep at least 178MB of uncompressed data, and the full size is 178MB (we lose some information here, as we only know an approximation in MB and not the exact size). But with "you zim it", it seems that 16MB of data is added. Dumping the "you zim it" ZIM file, there is a directory The only two references to google in the source are:
I haven't found the article The second one is the link for "Many Others ..." in https://www.libraryofjuggling.com/Tricks/3balltricks/Cascade.html So it seems that "you zim it" is a bit too greedy here. I'm moving the issue to the zimit repository.
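To see where the extra megabytes sit in a dumped ZIM, the per-directory totals can be computed with a few lines of Python. This is only a minimal sketch, assuming the ZIM has already been extracted to a local directory (for example with zimdump); the function names `size_by_top_dir` and `format_report` are made up for this illustration and are not part of any ZIM tool.

```python
import os
from collections import defaultdict


def size_by_top_dir(root):
    """Return {top-level directory name: total bytes} for all files under root.

    Files sitting directly in root are grouped under ".".
    """
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so nothing is counted twice
            totals[top] += os.path.getsize(path)
    return dict(totals)


def format_report(totals):
    """Return one line per directory, largest first, with sizes in MB."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{size / 1e6:10.2f} MB  {name}" for name, size in ordered]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python sizes.py path/to/dumped_zim
    for line in format_report(size_by_top_dir(sys.argv[1])):
        print(line)
```

Running it on the dump of each ZIM (zimwriterfs and "you zim it") and comparing the two reports should show which directory accounts for the difference.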
Those files are actually Google Chrome update data. I could easily reproduce this, so it's not youzim.it specific. How did those get into the ZIM? I'm not sure, but I believe it's just a matter of how long the crawl lasts. If the browser has enough time to phone home and download, then you'll find those files in the ZIM, as checking other youzim.it files confirms. If the crawl is too short, you don't get them. It's possible that some URL triggers it, but I don't think so, and it's difficult to test (you need a source large enough that the crawl spends enough time on it, but you also need to control all of its URLs). This is clearly not content that we want, and we certainly don't want to update Chrome inside a running container. I'm opening a ticket upstream, but I also fixed it on our zimit image in 6324b7c.
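One way to spot this kind of alien content in other ZIM files is to list the captured URLs whose host does not belong to the crawled site. The sketch below is only an illustration of that idea and is not what 6324b7c does; `is_on_site` and `find_alien_urls` are hypothetical names, and the off-site URL in the usage example is invented.

```python
from urllib.parse import urlsplit


def is_on_site(url, site_host):
    """True if url's host is site_host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    site_host = site_host.lower()
    return host == site_host or host.endswith("." + site_host)


def find_alien_urls(urls, site_host):
    """Return the URLs that were captured from outside site_host."""
    return [u for u in urls if not is_on_site(u, site_host)]


if __name__ == "__main__":
    captured = [
        "https://www.libraryofjuggling.com/Tricks/3balltricks/Cascade.html",
        # Invented URL standing in for browser update traffic.
        "https://updates.example.invalid/component.crx",
    ]
    for url in find_alien_urls(captured, "libraryofjuggling.com"):
        print(url)
```

A site can legitimately load assets from other hosts, so the result is a list of candidates to review rather than a list of entries to delete.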
As suggested here (fr), I opened this issue because, in some cases, the zim file returned by the https://youzim.it service is bigger than the original website size.
Here's an example with https://www.libraryofjuggling.com as a reference:
Here's a summary for this website:
zimwriterfs
As you can see, the difference is not a big deal here, but what if it happens on some bigger websites?
What may be the cause(s)?
How can I help?