Die if the upstream server is not reachable #142
@kevinmcmurtrie commented on Feb 25, 2021, 7:31 AM UTC: It happened again: https://farm.openzim.org/pipeline/97fbc40e1f235962f06ae206/debug I disabled the gutenberg scraper on pixelmemory. |
@kelson42 commented on Feb 25, 2021, 9:46 AM UTC:
What makes you say that aleph.gutenberg.org is refusing TCP/IP connections? |
@kevinmcmurtrie commented on Feb 25, 2021, 5:04 PM UTC:
Over 1 million
The scraper is retrying at an extremely high rate that is going to trigger automated denial-of-service detection. 1141028 connection attempts were made and refused in a short period of time. Retries need to be throttled to a sane rate. |
@eshellman Has the configuration of aleph.gutenberg.org changed? Do you have an idea what could be done? |
The service was repaired on the 21st; these error logs seem to be from the 20th. @kevinmcmurtrie Could you verify that there is still a problem? |
@eshellman Thx, I plan to close the ticket as the problem does not occur anymore. |
Was retry throttling added or are you just hoping that aleph.gutenberg.org doesn't go down any more? I'm concerned about triggering DoS detection at my own ISP so I'm not going to run it as long as it potentially has this issue. |
The questions are:
|
If it could wait 2 seconds between retries of socket errors, that would be enough for me to let it run again. |
That seems like a good idea. |
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
If slowing down is good enough, then it would be great to implement it. |
If it requires adding an external dependency, that's not a problem. |
zimscraperlib does the file download, in download.py (def save_large_file). It uses wget with parameters that make it retry 5 times, including if connections are refused:
If we remove --retry-connrefused then refused connections will only be tried once instead of 5 times, so the problem will remain but be only 20% as big. There is another parameter, --waitretry=5, which makes wget wait between retries of a failed download, backing off up to 5 seconds. |
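For reference, a minimal sketch of what a wget-based helper with the flags mentioned above could look like; this is an illustration only, not the exact zimscraperlib `save_large_file` code:

```python
# Hypothetical sketch of a save_large_file-style helper built on wget,
# using the flags discussed above. Not the actual zimscraperlib code.
import pathlib
import subprocess


def save_large_file(url: str, fpath: pathlib.Path) -> None:
    """Download `url` to `fpath` with wget, retrying up to 5 times."""
    subprocess.run(
        [
            "wget",
            "--tries=5",            # give up after 5 attempts
            "--retry-connrefused",  # also retry when the connection is refused
            "--waitretry=5",        # wait up to 5 seconds between retries
            "--continue",           # resume partial downloads
            "-O", str(fpath),
            url,
        ],
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```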
Also, we could insert some delays in gutenberg/download.py; a hypothetical sketch of that kind of change follows. |
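The original before/after snippets were not preserved in this thread, so here is a hypothetical illustration of adding a pause between retries of a failing download; `download_book`, the retry loop, and the use of `requests` are assumptions, not the scraper's actual code:

```python
# Hypothetical illustration of the proposed change: pause between retries
# instead of hammering the host. Names and structure are assumptions.
import time

import requests

RETRY_DELAY_SECONDS = 2  # e.g. the 2-second pause suggested earlier in the thread


def download_book(url: str, retries: int = 3) -> bytes:
    last_error = None
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.RequestException as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(RETRY_DELAY_SECONDS)  # throttle the next attempt
    raise last_error
```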
I've made a PR on python-scraperlib, someone please take a look :) |
I think |
Could you develop why |
It looks like the logs expire. The scraper had the content list but the host with the content http://aleph.gutenberg.org/ was immediately refusing connections.
The scraper was making connection attempts at a very high rate. It's important to me that Kiwix scrapers never look like a DoS attack or a bad bot. When AT&T (my host), Akamai (edge cache), Cloudflare (edge cache), and various IDS see activity like this, it causes my network to end up in private blocklists. These private blocklists are a pain in the ass to debug and resolve. Stuff just stops working.
It's OK if the scraper retries a lot and then quits, or retries slowly. It just can't retry at full throttle until it has gone through the whole content list.
Project Gutenberg had a similar issue recently where they accidentally set the wrong IPv6 DNS address. Luckily it didn't route to anything, so it was a very slow failure. |
This is very weird because it does not match my understanding of the wget behavior. Either the documentation is not up-to-date with the code, something weird is happening, or I just don't get it right (and my favorite option is the third one).
I'm totally aligned with your requirements, even if I cannot guarantee it will always be feasible / work as expected. For the iFixit scraper I implemented the following behavior for every API call (but not everywhere; there are a few web-crawling calls, and those caused you an issue as well, as far as I remember):
- every API call is wrapped in a "backoff" annotation, with some errors causing it to give up immediately
- the number of items in error is compared to the count of all items to retrieve, and if the ratio is too high, the scraper is stopped
This could probably be done quite easily for the Gutenberg scraper as well; a sketch of the pattern follows below. I will try to reproduce the current behavior locally to get a better understanding and ensure that my changes really improve the situation.
|
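A minimal sketch of the pattern described above, assuming the `backoff` library and a hypothetical `fetch_item` call with a failure-ratio check; it is not the actual iFixit or Gutenberg scraper code:

```python
# Sketch: retry individual calls with backoff, but abort the whole run
# when too many items fail. Names and thresholds are assumptions.
import backoff
import requests


class TooManyFailures(Exception):
    """Raised when the failure ratio suggests the whole run is compromised."""


def give_up(exc: Exception) -> bool:
    # Some errors (e.g. HTTP 404) are not worth retrying at all.
    return (
        isinstance(exc, requests.HTTPError)
        and exc.response is not None
        and exc.response.status_code == 404
    )


@backoff.on_exception(
    backoff.expo,               # exponential backoff between attempts
    requests.RequestException,  # retry on network / HTTP errors
    max_tries=5,
    giveup=give_up,             # give up immediately on some errors
)
def fetch_item(url: str) -> bytes:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def scrape(urls: list[str], max_failure_ratio: float = 0.1) -> None:
    failures = 0
    for index, url in enumerate(urls, start=1):
        try:
            fetch_item(url)
        except requests.RequestException:
            failures += 1
        # Stop the run if too large a share of items keeps failing.
        if index > 20 and failures / index > max_failure_ratio:
            raise TooManyFailures(f"{failures}/{index} items failed; aborting run")
```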
@benoit74 I believe the problem is not in the calls themselves but in the fact that they are treated independently, blindly. We are using a single target host and we have more than 130K resources for English alone. If we only look at a request's response from its own resource's perspective (just to retry it), then we'll still hit the same failing server for as many resources as we need, times the retry policy... Code should understand that the run is compromised by the server's current status and halt itself. |
@rgaudin Yes, good point! I have been experimenting with tracking overall success and failure of multiple download attempts in a dictionary. The downloads happen in concurrent threads so it is necessary to share a variable between them. Here is my code: rimu@fae7d9a. What do you think? |
@rimu just a quick note: although we are using CPython and know that the GIL will serialize access to the dict (so it will not be corrupted), if we update the same key from two threads the result is undefined unless we synchronize the threads. |
Really, really, really, really, really, really, shouldn't be using Python for multi-threaded work. I've seen companies try to use multi-threaded Python at large scales and it's a never-ending disaster because the language itself doesn't support it and external libraries can only pretend to support it. Go or Java would be a far better choice if you want fairly high-level features and an architecture that was always designed for elegant multi-threading. You'd no longer need Redis or DNS caches. You'd no longer have problems with cloud cache latency, deadlocks, races, GIL locks, spinloops, etc. I'd even pitch in coding effort if it's Java. |
@eriol yes, I wondered if that would be a problem. In this case I don't think it matters if the counts of successes or failures are a little wrong sometimes, we're not flying to the moon - the only thing that will happen is a few extra requests before the script terminates. But I can easily add some locking if necessary. @kevinmcmurtrie I haven't added any threading to the architecture, it was already there when I arrived (and working well, presumably). All that is new is the sharing of a new variable between threads. |
@kevinmcmurtrie, your input is important and duly noted; it's not the first time you're sharing this with us. While it's a highly important point for our process, changing the stack is not a light decision, and this input would be one among a multitude of other criteria and constraints. So I'm moving this out of the scope of this particular ticket and into the general openZIM strategy. On @rimu's code, I think the approach we discussed here would be:
The key is to keep a shared state (using a lock) so that we can gracefully shut the process down, removing the temp stuff we've created; a sketch of this follows below. |
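A minimal sketch of that shared-state idea, assuming the downloads run in a thread pool; `FailureTracker`, `download_one`, and the failure threshold are illustrative assumptions, not the scraper's actual code:

```python
# Sketch: lock-protected shared counters plus an abort flag, so workers
# can stop gracefully once the run looks compromised. Names are assumptions.
import threading
from concurrent.futures import ThreadPoolExecutor


class FailureTracker:
    """Counts successes/failures across threads and flags a compromised run."""

    def __init__(self, max_failures: int = 100):
        self._lock = threading.Lock()
        self.successes = 0
        self.failures = 0
        self.max_failures = max_failures
        self.abort = threading.Event()

    def record(self, success: bool) -> None:
        with self._lock:  # avoid racy read-modify-write on the counters
            if success:
                self.successes += 1
            else:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.abort.set()  # tell every worker to stop gracefully


def download_one(url: str, tracker: FailureTracker) -> None:
    if tracker.abort.is_set():
        return  # the run is compromised; skip remaining work
    try:
        ...  # perform the actual download here
        tracker.record(success=True)
    except Exception:
        tracker.record(success=False)


def run(urls: list[str]) -> None:
    tracker = FailureTracker()
    with ThreadPoolExecutor(max_workers=4) as pool:
        pool.map(lambda u: download_one(u, tracker), urls)
    if tracker.abort.is_set():
        # clean up temporary files here before exiting, as suggested above
        raise SystemExit("too many failures; upstream server looks unreachable")
```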
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
@kevinmcmurtrie commented on Feb 15, 2021, 5:27 AM UTC:
Problem
Scraper was rapidly hitting http://aleph.gutenberg.org while it was refusing TCP/IP connections. This resembles a DoS attack and could result in clients being blacklisted.
https://farm.openzim.org/pipeline/806bba557688d39ae7189206/debug
Reproducing steps
N/A.
This issue was moved by kelson42 from openzim/zimfarm#607.