Create a new platform for shamela.ws #1023

Closed
benoit74 opened this issue Oct 7, 2024 · 3 comments · Fixed by #1032

Comments

@benoit74
Collaborator

benoit74 commented Oct 7, 2024

For openzim/zim-requests#1172, we are going to have 40 recipes, one per category on shamela.ws, to keep recipe durations and ZIM sizes practical.

However, there is only one upstream server, so we need to ensure that we do not run more than one task (out of these 40 recipes) per worker, or probably even only one task on the whole platform, to be fair to their servers.

Shall we create a new shamela.ws platform that we would assign manually to the corresponding recipes?
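
To make the constraint above concrete, here is a minimal sketch of a per-platform concurrency cap. It is not the scheduler's actual code: the `pick_runnable` function, the task dictionaries, and the `PLATFORM_LIMITS` mapping are all hypothetical names chosen for illustration. It only shows the idea that a task tagged with a capped platform is skipped while another task of the same platform is already running.

```python
from collections import Counter

# Hypothetical cap: at most one running task for the "shamela.ws" platform.
PLATFORM_LIMITS = {"shamela.ws": 1}


def pick_runnable(pending, running, limits=PLATFORM_LIMITS):
    """Return the pending tasks that can start without exceeding a platform cap.

    Tasks are plain dicts with a "platform" key (None when no platform is set).
    """
    # Count tasks already in flight, per platform.
    in_flight = Counter(task["platform"] for task in running)
    started = []
    for task in pending:
        platform = task["platform"]
        cap = limits.get(platform)  # None means the platform is not capped
        if cap is not None and in_flight[platform] >= cap:
            continue  # cap reached, leave this task pending
        in_flight[platform] += 1
        started.append(task)
    return started


pending = [
    {"name": "shamela-cat-1", "platform": "shamela.ws"},
    {"name": "shamela-cat-2", "platform": "shamela.ws"},
    {"name": "other-recipe", "platform": None},
]
runnable_when_idle = pick_runnable(pending, running=[])
runnable_when_busy = pick_runnable(pending, running=[pending[0]])
```

With nothing running, only the first shamela.ws task and the unrelated recipe are started; while one shamela.ws task is running, only the unrelated recipe is.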

@rgaudin
Member

rgaudin commented Oct 7, 2024

I am torn. On one hand, it has virtually no cost for us, so I'd be in favor; on the other hand, it's just 40 concurrent crawls, each accessing a different part of the website (with little overlap), so just setting a reasonable delay should do it.
We can discuss it today.
I see there's a contact email, and they've gone to great lengths to make this available with Android/iOS/Windows software, so they seem keen on distributing it offline widely. We could simply ask.

@benoit74
Collaborator Author

benoit74 commented Oct 7, 2024

The main reason why we are creating 40 recipes is that in total there are about 10M links to explore, and we are using the zimit scraper. I already had to set worker: 4 to parallelize the recipe, and even with this level of parallelism we need about 3 months to grab the 10M links. I would prefer to be fair to their server and not run multiple recipes in parallel.

But since, as you found, there is already another offline version based on apps, maybe it is worth asking them about other ways to access their content so that we can create a custom scraper.
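
As a rough back-of-the-envelope check on the figures quoted above, the sketch below derives the request rate they imply. Two things are assumptions and not from the discussion: "about 3 months" is taken as 90 days, and the 10M links are assumed to be split evenly across the 40 recipes (the real categories are probably uneven).

```python
# Approximate figures quoted in the comment above.
TOTAL_LINKS = 10_000_000  # ~10M links to explore
WORKERS = 4               # parallel workers set on the recipe
RECIPES = 40              # one recipe per category

# Assumption: "about 3 months" is treated as 90 days.
DURATION_DAYS = 90
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate implied by fetching every link once over the whole period.
links_per_second = TOTAL_LINKS / (DURATION_DAYS * SECONDS_PER_DAY)
links_per_second_per_worker = links_per_second / WORKERS

# Assumption: links are evenly distributed across recipes.
links_per_recipe = TOTAL_LINKS / RECIPES
days_per_recipe = DURATION_DAYS / RECIPES

print(f"overall rate:     {links_per_second:.2f} links/s")
print(f"per-worker rate:  {links_per_second_per_worker:.2f} links/s")
print(f"links per recipe: {links_per_recipe:,.0f}")
print(f"days per recipe:  {days_per_recipe:.2f}")
```

Under these assumptions a single recipe running at a time already averages a bit more than one request per second against the upstream server, so running several of the 40 recipes concurrently would multiply that load accordingly.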

@benoit74
Collaborator Author

benoit74 commented Oct 8, 2024

Discussed yesterday in the team meeting: we agreed that we need to create a new platform.
