New ZIM request: Marxist Internet Archive #311
@rgaudin is this size doable with the Zimit scraper?
@RavanJAltaie size is not a problem. But we have to ask ourselves whether it makes sense. Maybe splitting it into subsets would be a better approach.
Size does matter. Scraping over a TB off a third-party website is resource-intensive, for us and for them. zimit is an uncontrolled environment and we don't have tools in place to monitor what's going on. What happens if it's not over in two weeks? A month? Two months? Is it just slow, or did we enter a loop? The FAQ has some very interesting information:
Also, there are mirrors. @TutanotaDeletedMyEmail, you indicate you're downloading the whole archive. What method are you using? As indicated in the request, English is in the root while other languages are prefixed, so an English recipe should carefully exclude all those prefixes. A better first step would of course be to try a smaller language first.
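The "English in the root, other languages prefixed" exclusion described above can be sketched as a simple URL filter. This is only an illustration: the prefix list below is a hypothetical sample (the site hosts works in roughly 80 languages), and a real recipe would need the complete list.

```python
import re

# Hypothetical sample of language directory prefixes on marxists.org;
# a real English recipe would need the full list of non-English prefixes.
LANGUAGE_PREFIXES = ["deutsch", "francais", "espanol", "italiano"]

# English content lives in the site root (e.g. /archive/, /history/),
# so a URL is kept unless its first path segment is a language prefix.
EXCLUDE_RE = re.compile(
    r"^https?://(?:www\.)?marxists\.org/(?:%s)(?:/|$)"
    % "|".join(LANGUAGE_PREFIXES)
)

def keep_for_english_zim(url: str) -> bool:
    """Return True if the URL belongs to the English (root) content."""
    return EXCLUDE_RE.match(url) is None
```

In a zimit recipe the same idea would be expressed as an exclude regex rather than a Python function; the pattern is the part that carries over.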
@rgaudin On our side, is anything known to technically stop us from scraping over 1 TB? I have in mind an example of 1.3 TB which worked.
Not sure I fully understand the question, but creating, storing and uploading a TB-large ZIM file is possible, yes. I think you're referring to the manioc.org ZIM.
@benoit74 shall we tag this issue as upstream?
It is not upstream at all; developers won't be able to make this issue progress anyhow, since there is no technical issue (as far as we know, at least). To make this issue progress, we only need a content committee decision. Maybe a discussion must happen between @kelson42 and @Popolechien regarding whether we want to support this or not. One side remark: while the whole website is maybe over 1 TB, many languages are supported, so I would definitely recommend creating one ZIM per language, so that each file is smaller, as already suggested by @kelson42.
If the content decision is whether we should produce a separate ZIM file for each language rather than a massive multilingual thing that nobody can download, the answer is yes (even if generating a 1 TB+ ZIM file as a show of force might have some value). In fact, I am coming around to the idea that we probably don't want to serve multilingual content at all (barring a very few exceptions).
manioc.org ZIM already serves this purpose
I think I agree. Multilingual should be the exception. But maybe reality will contradict us.
So to conclude: can we proceed with creating one ZIM file per language for the Marxist Internet Archive? @benoit74
It is not my decision, but again I do not see any technical issue if you decide to proceed.
I've created three recipes, for the English, French and German languages. Once the resulting files are complete and good to go, I'll create the other languages.
The German version succeeded; I checked the file and pushed it to the library. The French version succeeded as well, and the file is around 23 GB, but the links are not working; I tried changing the scope type twice but got the same result. The English version succeeded but the file size is 775 KB, so apparently no content was scraped; I tried changing the scope type there as well, with the same result.
I will have a look
Which scope values did you test (and why)? You need to be more specific if you don't want me to advise you to do things you've already tested.
This makes no sense; you do not want all content on the website, only the part corresponding to the current language. This is why French is 23 GB: it looks like it scraped website content in many other languages. I will have a look at why French and English are not working as expected.
The FR version is not working properly due to an upstream limitation, see openzim/zimit#319 (this limitation is in fact linked to a limitation in browsertrix crawler). Regarding EN, I've developed an exclude regex to filter out all other languages plus admin pages. I've also updated the title. I also noted that you did not configure any delay between pages, while the website FAQ explicitly states: "Note that you must limit your download to reasonable rates (request interval ca. 500ms — 1 second)." This must be fixed; I recommend adding a delay. I also noted that we did not answer the question regarding copyrighted content at all. There is no easy way to detect this with the current scraper. Did we try to contact them to ask for permission? I also noted that we are currently ZIMing from the main webserver.
Edit: my bad, the delay can only be expressed in seconds, and only integer values are allowed, so I suggest using 1 second.
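The pacing requested by the MIA FAQ can be sketched as a small rate limiter. This is an illustration of the principle only, not how zimit or browsertrix-crawler implement their delay internally.

```python
import time

class RequestPacer:
    """Enforce a minimum interval between requests. The MIA FAQ asks
    for ca. 500 ms - 1 s between requests; since the recipe delay only
    accepts whole seconds, 1 second is the value used here."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough so that at least min_interval seconds
        separate consecutive calls, then record the current time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A crawler loop would call `pacer.wait()` immediately before each HTTP request, keeping the load on the origin server to at most about one request per second.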
The test using the mirror is very conclusive; ZIM size is essentially identical (5.61 GB in both cases, I didn't look at more precision):
Task duration is even smaller on the mirror (3h 50m) than on the main website (4h 2min to 4h 8min on recent runs). I've relaunched the test recipe with the 1-second delay to confirm the impact is acceptable.
Adding the delay has an acceptable impact. I've updated the recipes accordingly.
The ES version was produced successfully with Zimit 2.
Sounds acceptable. I'll wait for the EN files to check them, then will create more languages.
The EN recipe just failed due to an upstream bug (openzim/warc2zim#331), but I can already say that the archive is going to be about 300 GB. Marking this as upstream for now.
The total MIA is around 288 GB, but that includes works in 80 languages, so the English would be a fraction of that (especially just the HTML content). I only really want the English, which, unlike other languages, doesn't have its content grouped into its own folder (i.e. if you wanted the German works they're all under marxists.org/deutsch/, while English content is in folders in the root such as marxists.org/archive/ and marxists.org/history/), so I don't know to what extent that makes it more difficult.
Also, there is some copyrighted content on MIA, but they've been given explicit permission by the publishers to host it. It may be possible to exclude pages that are copyrighted, because those pages explicitly mention copyright information (here's an example: https://www.marxists.org/archive/guevara/1967/che-reader/index.htm), while works licensed under Creative Commons don't.
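The heuristic suggested here could be sketched as a page-text check. Both marker patterns below are assumptions about typical wording and would need validation against real MIA pages before being used in a recipe.

```python
import re

# Assumed marker for publisher copyright notices (wording varies on
# real pages; this pattern is a hypothetical starting point).
COPYRIGHT_RE = re.compile(r"copyright\s*(?:©|\(c\)|\d{4})", re.IGNORECASE)

# Assumed marker for Creative Commons licensed works.
CC_RE = re.compile(r"creative\s+commons", re.IGNORECASE)

def looks_copyrighted(page_html: str) -> bool:
    """Flag a page as potentially copyrighted if it carries a copyright
    notice and does not mention a Creative Commons license."""
    return bool(COPYRIGHT_RE.search(page_html)) and not CC_RE.search(page_html)
```

Pages flagged this way could then be fed into the recipe's exclude list, though a manual review pass would still be prudent given how loose the markers are.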
Edit: I must've misread the website. The total size of the archive is over a TB (and still downloading).