New ZIM request: Marxist Internet Archive #311

Open
TutanotaDeletedMyEmail opened this issue Dec 16, 2020 · 23 comments
Assignees
Labels
General Information Encyclopedia-like Upstream For tickets which are waiting for an upstream modification (typically scrapper or target website) Zimit

Comments

@TutanotaDeletedMyEmail

TutanotaDeletedMyEmail commented Dec 16, 2020

  • Website URL: https://marxists.org
  • License: mostly CC-by-sa; some copyrighted content used with permission
  • Desired ZIM Title: Marxist Internet Archive
  • Desired ZIM Description: Works by Marxists or relevant to Marxism
  • Desired ZIM Icon –png (URL or attach one): marx-eng 250x250
  • Language (ISO 639-3): eng
  • Is this a MediaWiki?: no

The total MIA is around 288GB, but that includes works in 80 languages, so the English content would be a fraction of that (especially if only the HTML content is counted). I only really want the English content which, unlike other languages, doesn't have its content grouped into its own folder (i.e. if you wanted the German works, they're all under marxists.org/deutsch/, while English content is in folders in the root such as marxists.org/archive/ and marxists.org/history/), so I don't know to what extent that makes it more difficult.

Also, there is some copyrighted content on MIA, but they've been given explicit permission by the publishers to host it. It may be possible to exclude pages that are copyrighted, because those pages explicitly mention copyright information (here's an example: https://www.marxists.org/archive/guevara/1967/che-reader/index.htm) while works licensed under Creative Commons don't.
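As a rough illustration of the exclusion idea above, here is a minimal sketch of a per-page heuristic that flags pages mentioning copyright. The marker list and the sample HTML strings are assumptions made for illustration; the actual wording used on MIA pages has not been checked, and this is not something the scraper does today.

```python
import re

# Hypothetical markers: the exact wording used on MIA pages is an assumption.
COPYRIGHT_MARKERS = re.compile(r"\bcopyright\b|©|&copy;", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def looks_copyrighted(html: str) -> bool:
    """Return True if the visible text of a page mentions copyright."""
    text = TAG.sub(" ", html)  # crude tag stripping, enough for a heuristic
    return COPYRIGHT_MARKERS.search(text) is not None


# Made-up sample pages, not real MIA content.
restricted = "<p>Copyright: Example Press. Used with permission.</p>"
free = "<p>Licensed under Creative Commons Attribution-ShareAlike 2.0.</p>"

print(looks_copyrighted(restricted))
print(looks_copyrighted(free))
```

A heuristic like this would produce false positives on pages that merely discuss copyright, so it would only be a starting point for a manual review.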

Edit: I must've misread the website. The total size of the archive is over a TB (and still downloading).

@RavanJAltaie
Contributor

@rgaudin is this size doable with the Zimit scraper?

@kelson42
Collaborator

@RavanJAltaie size is not a problem. But we have to ask ourselves if it makes sense. Maybe splitting it into subsets would be a better approach.

@rgaudin
Member

rgaudin commented Sep 19, 2023

Size does matter. Scraping over a TB off a third-party website is resource-intensive for us and for them. zimit is an uncontrolled environment and we don't have tools in place to monitor what's going on. What happens if it's not over in two weeks? A month? Two months? Is it just slow or did we enter a loop?

The MIA FAQ has some very interesting information:

  • the whole archive is over 715GB (October 2021) and growing
  • we discourage attempts to download the whole archive. Instead, you should consider which parts of it you really need
  • you must limit your download to reasonable rates (request interval of roughly 500ms to 1 second)

Also, there are mirrors.

@TutanotaDeletedMyEmail, you indicate you're downloading the whole archive. What method are you using?

As indicated in the request, English is in the root while other languages are prefixed. An English recipe should carefully exclude all those prefixes.

Of course, a better first step would be to try a smaller language.

@kelson42
Collaborator

kelson42 commented Sep 19, 2023

@rgaudin Is there anything known on our side that technically stops us from scraping over 1TB? I have in mind an example of 1.3TB which worked.

@rgaudin
Member

rgaudin commented Sep 20, 2023

Not sure I fully understand the question, but creating, storing and uploading a TB-sized ZIM file is possible, yes. I think you're referring to the manioc.org ZIM.

@RavanJAltaie
Contributor

@benoit74 shall we tag this issue as upstream?

@benoit74
Contributor

It is not upstream at all: developers won't be able to make this issue progress in any way, since there is no technical issue (as far as we know, at least).

To make this issue progress, we only need a content committee decision. Maybe a discussion must happen between @kelson42 and @Popolechien regarding whether we want to support this or not.

One side remark: while the whole website may be over 1TB, many languages are supported, so I would definitely recommend creating one ZIM per language so that each file is smaller, as already suggested by @kelson42.

@Popolechien
Collaborator

If the content decision is whether we should produce a separate zim file for each language rather than a massive multilingual thing that nobody can download, the answer is yes (even if generating a 1TB+ zim file as a show-of-force might have some value).

In fact, I am coming around to the idea that we probably don't want to serve multilingual content at all (barring a very few exceptions).

@rgaudin
Member

rgaudin commented Mar 14, 2024

generating a 1TB+ zim file as a show-of-force might have some value

The manioc.org ZIM already serves this purpose.

we probably don't want to serve multilingual content at all (barring a very few exceptions)

I think I agree. Multilingual ZIMs should be the exception. But maybe reality will contradict us.

@RavanJAltaie
Contributor

So to conclude this, can we proceed with creating one ZIM file per language for the Marxist Internet Archive? @benoit74

@benoit74
Contributor

It is not my decision, but again I do not see any technical issue if you decide to proceed.

@RavanJAltaie
Contributor

I've created 3 recipes, for the English, French and German languages:
https://farm.openzim.org/recipes/marxists.org_de_all
https://farm.openzim.org/recipes/marxists.org_en_all
https://farm.openzim.org/recipes/marxists.org_fr_all

Once the resulting files are complete and good to go, I'll create the other languages.

@RavanJAltaie
Contributor

The German version succeeded; I checked the file and pushed it to the library:
https://library.kiwix.org/viewer#marxists.org_de_all_2024-04

The French version succeeded as well and the file size is around 23 GB, but the links are not working. I tried changing the scope type twice but got the same result:
https://dev.library.kiwix.org/viewer#marxists.org_fr_all_2024-04

The English version succeeded but the file size is 775 KB, so apparently the content wasn't scraped. I tried changing the scope type as well, with the same result:
https://dev.library.kiwix.org/viewer#marxists.org_en_all_2024-04
@benoit74 any idea?

@benoit74
Contributor

benoit74 commented Apr 9, 2024

I will have a look

I tried to change the scope type twice

Which scope values did you test (and why)? You need to be more specific if you don't want me to advise you to do things you've already tested.

@RavanJAltaie
Contributor

Which scope values did you test (and why)?

I tried domain and host, as I wanted the offliner to scrape all the content inside the website.
I did the Spanish version as well; it succeeded, I tested it and pushed it to the library.
https://dev.library.kiwix.org/viewer#marxists.org_es_all_2024-04

@benoit74
Contributor

I tried domain and host as I wanted the offliner to scrape all the content inside the website.

This makes no sense: you do not want all the content inside the website, only the part corresponding to the current language.

This is why French is 23GB: it looks like it scraped website content in many other languages.

I will have a look at why French and English are not working as expected with the prefix scope (which is the correct one).
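To make the difference between the scope types concrete, here is a deliberately simplified model of what "prefix", "host" and "domain" scoping mean. The function name `in_scope` and its logic are illustrative assumptions, not the crawler's actual implementation.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed: str, scope_type: str) -> bool:
    """Simplified model of crawl scoping (hypothetical helper, not zimit code)."""
    url_host = urlsplit(url).hostname or ""
    seed_host = urlsplit(seed).hostname or ""
    if scope_type == "prefix":
        # Everything under the seed's directory, e.g. .../francais/
        prefix = seed[: seed.rfind("/") + 1]
        return url.startswith(prefix)
    if scope_type == "host":
        return url_host == seed_host
    if scope_type == "domain":
        # Naive base-domain handling: only strips a leading "www."
        base = seed_host[4:] if seed_host.startswith("www.") else seed_host
        return url_host == base or url_host.endswith("." + base)
    raise ValueError(f"unknown scope type: {scope_type}")


seed = "https://www.marxists.org/francais/index.htm"
german_page = "https://www.marxists.org/deutsch/index.htm"

for scope in ("prefix", "host", "domain"):
    print(scope, in_scope(german_page, seed, scope))
```

In this model, a German page is out of scope for a French seed only with the prefix scope; with host or domain it is crawled as well, which is consistent with the oversized French file.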

@benoit74
Contributor

The FR version is not working properly due to an upstream limitation, see openzim/zimit#319 (this limitation is in fact linked to a limitation in browsertrix crawler).

Regarding EN, I've developed an exclude regex to filter out all other languages + admin pages:

marxists.org\/(?:admin|afrikaans|amharic|arabic|asamiya|azerbaycan|bangla|bulgarsky|burmese|catala|cestina|chinese|czech|dansk|deutsch|eesti|ellinika|espanol|esperanto|euskara|farsi|francais|georgian|gondi|gujarati|hausa|hayeren|hebrew|hindi|indonesia|isizulu|islenska|italiano|kannada|kazak|kernowek|kiswahili|korean|kurdi|kyrgyzcha|lietuviu|magyar|makedonski|malagasy|malayalam|marathi|mongolian|nederlands|nepali|nihon|norsk|occitan|odia|ozbekcha|pashto|polski|portugues|punjabi|quechua|romana|russkij|sakha|sardu|shqip|sindhi|sinhala|slovak|slovenian|somali|srpshrva|suomi|svenska|tagalog|tamil|telugu|thai|tibetan|tojiki|turkce|ukrainian|urdu|uyghur|vietnamese|yiddish|xlang)\/

It's pretty fragile, since any new language added to the MIA will need to be added to this exclusion expression, but it's the only solution since English resources are not stored in a specific subfolder.
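A quick way to sanity-check an exclusion expression of this shape is to run it against a few known URLs. The sketch below uses only a small subset of the prefixes, and it assumes the crawler tests the regex anywhere in the URL (search semantics); `build_exclude_regex` and `is_excluded` are hypothetical helper names.

```python
import re

# Subset of the language/admin prefixes, for illustration only.
PREFIXES = ["admin", "deutsch", "espanol", "francais", "xlang"]


def build_exclude_regex(prefixes):
    """Build a pattern of the same shape as the one used in the recipe."""
    return r"marxists.org\/(?:" + "|".join(prefixes) + r")\/"


EXCLUDE = re.compile(build_exclude_regex(PREFIXES))


def is_excluded(url: str) -> bool:
    # Assumption: the pattern may match anywhere in the URL.
    return EXCLUDE.search(url) is not None


samples = [
    "https://www.marxists.org/deutsch/index.htm",       # other language
    "https://www.marxists.org/archive/marx/index.htm",  # English, in the root
    "https://www.marxists.org/history/index.htm",       # English, in the root
]
for url in samples:
    print(url, "excluded" if is_excluded(url) else "kept")
```

Generating the pattern from a list keeps the recipe easier to update when a new language folder appears, although the list itself still has to be maintained by hand.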

I've also updated the title from "Marxists.org" to "The Marxists Internet Archive" and the description from "An all-volunteer, non-profit library" (pretty vague) to "MIA, an all-volunteer non-profit library of Marxists authors, books, history, …". If this suits you, I suggest reflecting these changes in the other languages (translated and adapted to match length constraints, of course).

I also noted that you did not configure any delay between pages, while the website FAQ explicitly asks to limit downloads to reasonable rates (request interval of roughly 500ms to 1 second). This must be fixed; I recommend using 500ms to not slow down the recipe too much.

I also noted that we did not answer the question regarding copyrighted content at all. There is no easy way to detect this with the current scraper. Did we try to contact them to ask for permission?

I also noted that we are currently ZIMing from the main webserver, www.marxists.org. I strongly suggest that we at least try to create a ZIM from a mirror. Since many of our workers are located in the US, I cloned the "DE" recipe to use marxists.incn.su instead of www.marxists.org and publish to DEV: https://farm.openzim.org/recipes/marxists.org_de_all-tests_de_marxists.incn.su. The recipe should be deleted if the result is OK; the names are really bad.

@benoit74
Contributor

Edit: my bad, the delay can only be expressed in seconds and only integer values are allowed, so I suggest using 1 second.
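The reasoning here is simply rounding the requested minimum interval up to the next whole second. A minimal sketch, where `delay_seconds` is a hypothetical helper name and not a zimit parameter:

```python
import math


def delay_seconds(min_interval_ms: int) -> int:
    """Smallest whole-second delay that respects a minimum interval in ms."""
    return max(1, math.ceil(min_interval_ms / 1000))


# The FAQ asks for roughly 500ms to 1 second between requests.
print(delay_seconds(500))
print(delay_seconds(1000))
```

Rounding up rather than down keeps the crawl within the rate the site asks for, at the cost of a somewhat slower scrape.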

@benoit74
Contributor

The test using the mirror is very conclusive: the ZIM size is mostly identical (5.61G in both cases, I didn't look at it with more precision).

Task duration is even shorter on the mirror (3h 50m) than on the main website (4h 2min to 4h 8min on recent runs).

I've relaunched the test recipe with the 1 sec delay to confirm the impact is acceptable.

@benoit74
Contributor

Adding the delay has an acceptable impact on de, going from 4h to 4h 40mins for the scrape.
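For reference, the overhead quoted above can be worked out as follows. The page-count estimate at the end rests on a strong assumption that is not confirmed anywhere in this thread: that the 1 second delay is purely additive and applied once per page.

```python
def to_minutes(hours: int, minutes: int = 0) -> int:
    return hours * 60 + minutes


baseline_min = to_minutes(4)        # scrape without delay
with_delay_min = to_minutes(4, 40)  # scrape with a 1 second delay

extra_min = with_delay_min - baseline_min
relative_overhead = extra_min / baseline_min

# Assumption: delay is purely additive and applied once per page load.
delay_s = 1
implied_page_loads = (extra_min * 60) // delay_s

print(f"extra time: {extra_min} min")
print(f"relative overhead: {relative_overhead:.1%}")
print(f"implied page loads (rough, under the stated assumption): {implied_page_loads}")
```

The same arithmetic can be used to guess the cost of the delay on larger languages before launching them, keeping in mind that the estimate is only as good as the additive assumption.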

I've updated recipes to:

  • use the mirror
  • add delay between all pages

@benoit74
Contributor

The ES version was produced successfully with Zimit2.
I just restarted the EN version recipe because I've updated the configuration to use the mirror but forgot to update the exclude parameter as well 🙈

@RavanJAltaie
Contributor

The ES version was produced successfully with Zimit2. I just restarted the EN version recipe because I've updated the configuration to use the mirror but forgot to update the exclude parameter as well 🙈

Sounds acceptable. I'll wait for the EN files to check them, then I will create more languages.
French will stay tagged as upstream for now.

@RavanJAltaie RavanJAltaie self-assigned this Jun 26, 2024
@benoit74
Contributor

The EN recipe just failed due to an upstream bug (openzim/warc2zim#331), but I can already say that the archive is going to be about 300G. Marking this as upstream for now.

@benoit74 benoit74 added the Upstream For tickets which are waiting for an upstream modification (typically scrapper or target website) label Jun 27, 2024

6 participants