Fix exception in task executor where task fails but doesn't set request_context.request_end #1864

ty-elastic · 2024-06-28T18:56:44Z

Under "high" rally load (~10Gbit/s using the elastic/logs track with 1000 bulk indexing clients), I can consistently get Rally to crash within some minutes. Looking at the logs, it seems to happen when the target ES momentarily cannot accommodate the load (e.g., returns 429s). The logs prior to the exception will be like:

2024-06-26 21:45:12,264 ActorAddr-(T|:41695)/PID:252988 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has failed for 1 times in a row, putting on 1 second timeout
2024-06-26 21:45:12,790 ActorAddr-(T|:40017)/PID:252978 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:12,798 ActorAddr-(T|:46313)/PID:252969 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:12,853 ActorAddr-(T|:41695)/PID:252988 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:12,861 ActorAddr-(T|:41695)/PID:252988 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:13,373 ActorAddr-(T|:46313)/PID:252969 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:13,645 ActorAddr-(T|:37579)/PID:252985 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has failed for 1 times in a row, putting on 1 second timeout
2024-06-26 21:45:13,680 ActorAddr-(T|:43345)/PID:252987 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has failed for 1 times in a row, putting on 1 second timeout

thereafter, Rally will stop with an exception:

TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
esrally.driver.driver ERROR Could not execute schedule

Looking at a stack trace, this exception is generated because:

  File "/rally/esrally/driver/driver.py", line 1948, in __call__
    service_time = request_end - request_start
                   ~~~~~~~~~~~~^~~~~~~~~~~~~~~

where request_end is unexpectedly None.

Looking at the code, I think the intent is that request_context.request_end should always be set, even on error scenarios. Perhaps there is some race condition with the HTTP request lib, though, where for some reason on_request_end() is never called, leaving request_context.request_end unset.

This PR doesn't try to address the root cause (why is request_context.request_end unset), but rather acknowledges that it is unset (only in already acknowledged error scenarios, i.e., request_meta_data["success"] == False), and sets it. My assumption is that because request_meta_data["success"] == False and because it can be False at this point (with request_end set to something), that some other code down the line will handle counting the error and what we need to focus on here is just not stopping overall execution (due to an unplanned exception) on an otherwise non-fatal error..

rally_log.txt
rally_config.txt

ty-elastic · 2024-06-28T19:24:10Z

alternatively, you could perhaps set request_context.request_end the first time you set request_context.request_start, since obviously the intent is that request_end can be continually updated.

ty-elastic · 2024-06-28T19:44:34Z

nm. this is correctly addressed here, I believe.

ty-elastic · 2024-07-01T12:31:06Z

fixed in #1861

ensure request_end is always set

8726994

ty-elastic force-pushed the ty-elastic/rally-fix branch from 9dd836c to 8726994 Compare June 28, 2024 19:20

ty-elastic mentioned this pull request Jun 28, 2024

Restore time measurement on request end #1861

Merged

ty-elastic closed this Jul 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix exception in task executor where task fails but doesn't set request_context.request_end #1864

Fix exception in task executor where task fails but doesn't set request_context.request_end #1864

ty-elastic commented Jun 28, 2024 •

edited

Loading

ty-elastic commented Jun 28, 2024 •

edited

Loading

ty-elastic commented Jun 28, 2024 •

edited

Loading

ty-elastic commented Jul 1, 2024

Fix exception in task executor where task fails but doesn't set request_context.request_end #1864

Fix exception in task executor where task fails but doesn't set request_context.request_end #1864

Conversation

ty-elastic commented Jun 28, 2024 • edited Loading

ty-elastic commented Jun 28, 2024 • edited Loading

ty-elastic commented Jun 28, 2024 • edited Loading

ty-elastic commented Jul 1, 2024

ty-elastic commented Jun 28, 2024 •

edited

Loading

ty-elastic commented Jun 28, 2024 •

edited

Loading

ty-elastic commented Jun 28, 2024 •

edited

Loading