New ZIM request: Marxist Internet Archive #311
@rgaudin is this size doable with the Zimit scraper?
@RavanJAltaie size is not a problem. But we have to ask ourselves whether it makes sense. Maybe splitting it into subsets would be a better approach.
Size does matter. Scraping over a TB off a third-party website is resource-intensive, for us and for them. zimit is an uncontrolled environment and we don't have tools in place to monitor what's going on. What happens if it's not over in two weeks? A month? Two months? Is it just slow, or did we enter a loop? The FAQ has some very interesting information:
Also, there are mirrors. @TutanotaDeletedMyEmail, you indicate you're downloading the whole archive. What method are you using? As indicated in the request, English is in the root while other languages are prefixed, so an English recipe should carefully exclude all those prefixes. A better first step would of course be to try a smaller language first.
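The "English in the root, other languages prefixed" exclusion described above can be sketched as a simple URL filter. This is only an illustration: the prefix list below is a hypothetical sample (the site hosts works in roughly 80 languages), and a real recipe would need the complete list.

```python
import re

# Hypothetical sample of language directory prefixes on marxists.org;
# a real English recipe would need the full list of non-English prefixes.
LANGUAGE_PREFIXES = ["deutsch", "francais", "espanol", "italiano"]

# English content lives in the site root (e.g. /archive/, /history/),
# so a URL is kept unless its first path segment is a language prefix.
EXCLUDE_RE = re.compile(
    r"^https?://(?:www\.)?marxists\.org/(?:%s)(?:/|$)"
    % "|".join(LANGUAGE_PREFIXES)
)

def keep_for_english_zim(url: str) -> bool:
    """Return True if the URL belongs to the English (root) content."""
    return EXCLUDE_RE.match(url) is None
```

In a zimit recipe the same idea would be expressed as an exclude regex rather than a Python function; the pattern is the part that carries over.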
@rgaudin On our side, is anything known to technically stop us from scraping over 1 TB? I have in mind an example of 1.3 TB which worked.
Not sure I fully understand the question, but creating, storing and uploading a TB-large ZIM file is possible, yes. I think you're referring to the manioc.org ZIM.
@benoit74 shall we tag this issue as upstream?
It is not upstream at all; developers won't be able to make this issue progress anyhow, since there is no technical issue (as far as we know, at least). To make this issue progress, we only need a content committee decision. Maybe a discussion must happen between @kelson42 and @Popolechien regarding whether we want to support this or not. One side remark: while the whole website is maybe over 1 TB, many languages are supported, so I would definitely recommend creating one ZIM per language, so that each file is smaller, as already suggested by @kelson42.
If the content decision is whether we should produce a separate ZIM file for each language rather than a massive multilingual thing that nobody can download, the answer is yes (even if generating a 1 TB+ ZIM file as a show of force might have some value). In fact, I am coming around to the idea that we probably don't want to serve multilingual content at all (barring a very few exceptions).
manioc.org ZIM already serves this purpose
I think I agree. Multilingual should be the exception. But maybe reality will contradict us.
So to conclude: can we proceed with creating one ZIM file per language for the Marxist Internet Archive? @benoit74
It is not my decision, but again I do not see any technical issue if you decide to proceed.
I've created three recipes, for the English, French and German languages. Once the resulting files are complete and good to go, I'll create the other languages.
The German version succeeded; I checked the file and pushed it to the library. The French version succeeded as well, and the file is around 23 GB, but the links are not working; I tried changing the scope type twice but got the same result. The English version succeeded but the file size is 775 KB, so apparently no content was scraped; I tried changing the scope type there as well, with the same result.
I will have a look
Which scope values did you test (and why)? You need to be more specific if you don't want me to advise you to do things you've already tested.
This makes no sense; you do not want all content on the website, only the part corresponding to the current language. This is why French is 23 GB: it looks like it scraped website content in many other languages. I will have a look at why French and English are not working as expected.
The FR version is not working properly due to an upstream limitation, see openzim/zimit#319 (this limitation is in fact linked to a limitation in browsertrix crawler). Regarding EN, I've developed an exclude regex to filter out all other languages plus admin pages. I've also updated the title. I also noted that you did not configure any delay between pages, while the website FAQ explicitly states: "Note that you must limit your download to reasonable rates (request interval ca. 500ms — 1 second)." This must be fixed; I recommend adding a delay. I also noted that we did not answer the question regarding copyrighted content at all. There is no easy way to detect this with the current scraper. Did we try to contact them to ask for permission? I also noted that we are currently ZIMing from the main webserver.
Edit: my bad, the delay can only be expressed in seconds, and only integer values are allowed, so I suggest using 1 second.
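The pacing requested by the MIA FAQ can be sketched as a small rate limiter. This is an illustration of the principle only, not how zimit or browsertrix-crawler implement their delay internally.

```python
import time

class RequestPacer:
    """Enforce a minimum interval between requests. The MIA FAQ asks
    for ca. 500 ms - 1 s between requests; since the recipe delay only
    accepts whole seconds, 1 second is the value used here."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough so that at least min_interval seconds
        separate consecutive calls, then record the current time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A crawler loop would call `pacer.wait()` immediately before each HTTP request, keeping the load on the origin server to at most about one request per second.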
The test using the mirror is very conclusive; ZIM size is essentially identical (5.61 GB in both cases, I didn't look at more precision):
Task duration is even smaller on the mirror (3h 50m) than on the main website (4h 2min to 4h 8min on recent runs). I've relaunched the test recipe with the 1-second delay to confirm the impact is acceptable.
Adding the delay has an acceptable impact. I've updated the recipes accordingly.
The ES version was produced successfully with Zimit 2.
Sounds acceptable. I'll wait for the EN files to check them, then will create more languages.
The EN recipe just failed due to an upstream bug (openzim/warc2zim#331), but I can already say that the archive is going to be about 300 GB. Marking this as upstream for now.
The total MIA is around 288 GB, but that includes works in 80 languages, so the English would be a fraction of that (especially just the HTML content). I only really want the English, which, unlike other languages, doesn't have its content grouped into its own folder (i.e. if you wanted the German works they're all under marxists.org/deutsch/, while English content is in folders in the root such as marxists.org/archive/ and marxists.org/history/), so I don't know to what extent that makes it more difficult.
Also, there is some copyrighted content on MIA, but they've been given explicit permission by the publishers to host it. It may be possible to exclude pages that are copyrighted, because those pages explicitly mention copyright information (here's an example: https://www.marxists.org/archive/guevara/1967/che-reader/index.htm), while works licensed under Creative Commons don't.
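The heuristic suggested here could be sketched as a page-text check. Both marker patterns below are assumptions about typical wording and would need validation against real MIA pages before being used in a recipe.

```python
import re

# Assumed marker for publisher copyright notices (wording varies on
# real pages; this pattern is a hypothetical starting point).
COPYRIGHT_RE = re.compile(r"copyright\s*(?:©|\(c\)|\d{4})", re.IGNORECASE)

# Assumed marker for Creative Commons licensed works.
CC_RE = re.compile(r"creative\s+commons", re.IGNORECASE)

def looks_copyrighted(page_html: str) -> bool:
    """Flag a page as potentially copyrighted if it carries a copyright
    notice and does not mention a Creative Commons license."""
    return bool(COPYRIGHT_RE.search(page_html)) and not CC_RE.search(page_html)
```

Pages flagged this way could then be fed into the recipe's exclude list, though a manual review pass would still be prudent given how loose the markers are.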
Edit: I must've misread the website. The total size of the archive is over a TB (and still downloading).