Link crawler parallelization is hampered by session locks #193

Closed
1 of 2 tasks
brendanheywood opened this issue Feb 3, 2025 · 0 comments

Comments

@brendanheywood
Contributor

brendanheywood commented Feb 3, 2025

When we spawn N adhoc tasks to crawl the site in parallel, they all still use the same session cookie:

https://github.com/catalyst/moodle-tool_crawler/blob/MOODLE_310_STABLE/classes/robot/crawler.php#L1092

Because requests that share a session are serialized by the session lock, most of the time a worker spends waiting for its request to finish is really spent waiting for some other crawler process to finish. If there are N adhoc tasks crawling, each should have its own independent session cookie.

  • Proposal: give both tool_crawler_crawl() and crawler() an optional param called 'worker', passed through from the adhoc task's custom data, so that each worker has its own cookie jar file.

  • On top of this, I think it's safe, and better, to move the cookie jar file to local temp. It doesn't matter if it gets thrown away and rebuilt semi-regularly.
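
The two points above could be sketched roughly as follows. This is an illustration of the idea only, written in Python rather than the plugin's PHP. The function names, the `tool_crawler_cookies` file name and the shape of the `worker` value are all assumptions, not the plugin's actual code.

```python
import os
import tempfile
from http.cookiejar import MozillaCookieJar


def cookie_jar_path(worker=None, base_dir=None):
    """Return a cookie jar path in local temp storage, unique per worker.

    `worker` is a hypothetical identifier passed through from the adhoc
    task's custom data. None falls back to a single shared jar.
    """
    base_dir = base_dir or tempfile.gettempdir()
    suffix = "" if worker is None else f"_{worker}"
    # The file name is an assumption made for this sketch.
    return os.path.join(base_dir, f"tool_crawler_cookies{suffix}.txt")


def load_cookie_jar(worker=None, base_dir=None):
    """Open the worker's cookie jar, starting empty if it cannot be read."""
    jar = MozillaCookieJar(cookie_jar_path(worker, base_dir))
    try:
        jar.load(ignore_discard=True, ignore_expires=True)
    except OSError:
        # Missing or unreadable jar (e.g. temp was purged): start fresh.
        # The worker simply logs in again and rebuilds its session.
        pass
    return jar
```

Since every worker reads and writes a different file, each one ends up with its own session and no longer queues behind the others' session lock. A purged temp directory only costs one extra login per worker.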

Implementing this should increase the crawl rate, probably by around a factor of 10.

@brendanheywood brendanheywood changed the title Link crawler parallization is hampered by session locks Link crawler parallelization is hampered by session locks Feb 3, 2025