Die if the upstream server is not reachable #142
@kevinmcmurtrie commented on Feb 25, 2021, 7:31 AM UTC: It happened again: https://farm.openzim.org/pipeline/97fbc40e1f235962f06ae206/debug I disabled the gutenberg scraper on pixelmemory. |
@kelson42 commented on Feb 25, 2021, 9:46 AM UTC:
What makes you say that aleph.gutenberg.org is refusing TCP/IP connections? |
@kevinmcmurtrie commented on Feb 25, 2021, 5:04 PM UTC:
Over 1 million
The scraper is retrying at an extremely high rate that is going to trigger automated denial-of-service detection. 1141028 connection attempts were made and refused in a short period of time. Retries need to be throttled to a sane rate. |
@eshellman Has the configuration of aleph.gutenberg.org changed? Do you have an idea what could be done? |
The service was repaired on the 21st; these error logs seem to be from the 20th. @kevinmcmurtrie Could you verify that there is still a problem? |
@eshellman Thx, I plan to close the ticket as the problem does not occur anymore. |
Was retry throttling added or are you just hoping that aleph.gutenberg.org doesn't go down any more? I'm concerned about triggering DoS detection at my own ISP so I'm not going to run it as long as it potentially has this issue. |
The questions are:
|
If it could wait 2 seconds between retries of socket errors, that would be enough for me to let it run again. |
That seems like a good idea. |
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
If slowing down is good enough, then it would be great to implement it. |
If it requires adding an external dependency, that's not a problem. |
zimscraperlib does the file download, in download.py (def save_large_file). It uses wget with parameters that make it retry 5 times, including if connections are refused:
If we remove --retry-connrefused then refused connections will only be tried once instead of 5 times, so the problem will remain but be only 20% as big. There is another parameter, --waitretry=5, which makes wget wait between retries of a failed download, backing off up to 5 seconds. |
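For reference, a minimal sketch of what a wget-based helper with the flags mentioned above could look like; this is an illustration only, not the exact zimscraperlib `save_large_file` code:

```python
# Hypothetical sketch of a save_large_file-style helper built on wget,
# using the flags discussed above. Not the actual zimscraperlib code.
import pathlib
import subprocess


def save_large_file(url: str, fpath: pathlib.Path) -> None:
    """Download `url` to `fpath` with wget, retrying up to 5 times."""
    subprocess.run(
        [
            "wget",
            "--tries=5",            # give up after 5 attempts
            "--retry-connrefused",  # also retry when the connection is refused
            "--waitretry=5",        # wait up to 5 seconds between retries
            "--continue",           # resume partial downloads
            "-O", str(fpath),
            url,
        ],
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```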
Also, we could insert some delays in gutenberg/download.py; a hypothetical sketch of that kind of change follows. |
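The original before/after snippets were not preserved in this thread, so here is a hypothetical illustration of adding a pause between retries of a failing download; `download_book`, the retry loop, and the use of `requests` are assumptions, not the scraper's actual code:

```python
# Hypothetical illustration of the proposed change: pause between retries
# instead of hammering the host. Names and structure are assumptions.
import time

import requests

RETRY_DELAY_SECONDS = 2  # e.g. the 2-second pause suggested earlier in the thread


def download_book(url: str, retries: int = 3) -> bytes:
    last_error = None
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.RequestException as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(RETRY_DELAY_SECONDS)  # throttle the next attempt
    raise last_error
```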
I've made a PR on python-scraperlib, someone please take a look :) |
I think |
Could you develop why |
It looks like the logs expire. The scraper had the content list but the host with the content http://aleph.gutenberg.org/ was immediately refusing connections.
The scraper was making connection attempts at a very high rate. It's important to me that Kiwix scrapers never look like a DoS attack or a bad bot. When AT&T (my host), Akamai (edge cache), Cloudflare (edge cache), and various IDS see activity like this, it causes my network to end up in private blocklists. These private blocklists are a pain in the ass to debug and resolve. Stuff just stops working.
It's OK if the scraper retries a lot and then quits, or retries slowly. It just can't retry at full throttle until it has gone through the whole content list.
Project Gutenberg had a similar issue recently where they accidentally set the wrong IPv6 DNS address. Luckily it didn't route to anything, so it was a very slow failure. |
This is very weird because it does not match my understanding of the wget behavior. Either the documentation is not up-to-date with the code, something weird is happening, or I just don't get it right (and my favorite option is the third one).
I'm totally aligned with your requirements, even if I cannot guarantee it will always be feasible / work as expected. For the iFixit scraper I implemented the following behavior for every API call (but not everywhere; there are a few web-crawling calls, and those caused you an issue as well, as far as I remember):
- every API call is wrapped in a "backoff" annotation, with some errors causing it to give up immediately
- the number of items in error is compared to the count of all items to retrieve, and if the ratio is too high, the scraper is stopped
This could probably be done quite easily for the Gutenberg scraper as well; a sketch of the pattern follows below. I will try to reproduce the current behavior locally to get a better understanding and ensure that my changes really improve the situation.
|
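A minimal sketch of the pattern described above, assuming the `backoff` library and a hypothetical `fetch_item` call with a failure-ratio check; it is not the actual iFixit or Gutenberg scraper code:

```python
# Sketch: retry individual calls with backoff, but abort the whole run
# when too many items fail. Names and thresholds are assumptions.
import backoff
import requests


class TooManyFailures(Exception):
    """Raised when the failure ratio suggests the whole run is compromised."""


def give_up(exc: Exception) -> bool:
    # Some errors (e.g. HTTP 404) are not worth retrying at all.
    return (
        isinstance(exc, requests.HTTPError)
        and exc.response is not None
        and exc.response.status_code == 404
    )


@backoff.on_exception(
    backoff.expo,               # exponential backoff between attempts
    requests.RequestException,  # retry on network / HTTP errors
    max_tries=5,
    giveup=give_up,             # give up immediately on some errors
)
def fetch_item(url: str) -> bytes:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def scrape(urls: list[str], max_failure_ratio: float = 0.1) -> None:
    failures = 0
    for index, url in enumerate(urls, start=1):
        try:
            fetch_item(url)
        except requests.RequestException:
            failures += 1
        # Stop the run if too large a share of items keeps failing.
        if index > 20 and failures / index > max_failure_ratio:
            raise TooManyFailures(f"{failures}/{index} items failed; aborting run")
```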
@benoit74 I believe the problem is not in the calls themselves but in the fact that they are treated independently, blindly. We are using a single target host and we have more than 130K resources for English alone. If we only look at a request's response from its own resource's perspective (just to retry it), then we'll still hit the same failing server for as many resources as we need, times the retry policy... Code should understand that the run is compromised by the server's current status and halt itself. |
@rgaudin Yes, good point! I have been experimenting with tracking overall success and failure of multiple download attempts in a dictionary. The downloads happen in concurrent threads so it is necessary to share a variable between them. Here is my code: rimu@fae7d9a. What do you think? |
@rimu just a quick note: although we are using CPython and know that the GIL will serialize access to the dict (so it will not be corrupted), if we update the same key from two threads the result is undefined unless we synchronize the threads. |
Really, really, really, really, really, really, shouldn't be using Python for multi-threaded work. I've seen companies try to use multi-threaded Python at large scales and it's a never-ending disaster because the language itself doesn't support it and external libraries can only pretend to support it. Go or Java would be a far better choice if you want fairly high-level features and an architecture that was always designed for elegant multi-threading. You'd no longer need Redis or DNS caches. You'd no longer have problems with cloud cache latency, deadlocks, races, GIL locks, spinloops, etc. I'd even pitch in coding effort if it's Java. |
@eriol yes, I wondered if that would be a problem. In this case I don't think it matters if the counts of successes or failures are a little wrong sometimes, we're not flying to the moon - the only thing that will happen is a few extra requests before the script terminates. But I can easily add some locking if necessary. @kevinmcmurtrie I haven't added any threading to the architecture, it was already there when I arrived (and working well, presumably). All that is new is the sharing of a new variable between threads. |
@kevinmcmurtrie, your input is important and duly noted; it's not the first time you're sharing this with us. While it's a highly important point for our process, changing the stack is not a light decision, and this input would be one among a multitude of other criteria and constraints. So I'm moving this out of the scope of this particular ticket and into the general openZIM strategy. On @rimu's code, I think the approach we discussed here would be:
The key is to keep a shared state (using a lock) so that we can gracefully shut the process down, removing the temp stuff we've created; a sketch of this follows below. |
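A minimal sketch of that shared-state idea, assuming the downloads run in a thread pool; `FailureTracker`, `download_one`, and the failure threshold are illustrative assumptions, not the scraper's actual code:

```python
# Sketch: lock-protected shared counters plus an abort flag, so workers
# can stop gracefully once the run looks compromised. Names are assumptions.
import threading
from concurrent.futures import ThreadPoolExecutor


class FailureTracker:
    """Counts successes/failures across threads and flags a compromised run."""

    def __init__(self, max_failures: int = 100):
        self._lock = threading.Lock()
        self.successes = 0
        self.failures = 0
        self.max_failures = max_failures
        self.abort = threading.Event()

    def record(self, success: bool) -> None:
        with self._lock:  # avoid racy read-modify-write on the counters
            if success:
                self.successes += 1
            else:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.abort.set()  # tell every worker to stop gracefully


def download_one(url: str, tracker: FailureTracker) -> None:
    if tracker.abort.is_set():
        return  # the run is compromised; skip remaining work
    try:
        ...  # perform the actual download here
        tracker.record(success=True)
    except Exception:
        tracker.record(success=False)


def run(urls: list[str]) -> None:
    tracker = FailureTracker()
    with ThreadPoolExecutor(max_workers=4) as pool:
        pool.map(lambda u: download_one(u, tracker), urls)
    if tracker.abort.is_set():
        # clean up temporary files here before exiting, as suggested above
        raise SystemExit("too many failures; upstream server looks unreachable")
```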
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
@kevinmcmurtrie commented on Feb 15, 2021, 5:27 AM UTC:
Problem
Scraper was rapidly hitting http://aleph.gutenberg.org while it was refusing TCP/IP connections. This resembles a DoS attack and could result in clients being blacklisted.
https://farm.openzim.org/pipeline/806bba557688d39ae7189206/debug
Reproducing steps
N/A.
This issue was moved by kelson42 from openzim/zimfarm#607.