Not reporting failure events for extension install/enable/disable #1396

vrdmr · 2018-11-14T22:02:24Z

The extension commands are run by the launch_command call in enable(self)/disable(self)/install(self).

If it throws an ExtensionError (which happens when the return code is non-zero or some other issue occurs), the error is propagated up, e.g. launch_command -> enable -> handle_enable -> handle_ext_handler(), where the except ExtensionError as e clause catches it and passes it to the handle_handle_ext_handler_error method.

Today handle_handle_ext_handler_error does not call report_event for all failures, only when get_artifact_error_state.is_triggered() is true. This is not the right behavior: it should be sending events for all the exceptions, not only when get_artifact_error_state.is_triggered() is true.
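To make the difference concrete, here is a minimal, self-contained sketch of the call chain and the two reporting behaviors. The names launch_command, enable, handle_enable, handle_ext_handler, ExtensionError, report_event and is_triggered come from the description above; every body, signature and the ErrorState class are assumptions made for illustration and are not the agent's actual code.

```python
class ExtensionError(Exception):
    """Stand-in for the agent's exception (assumed shape: message plus code)."""
    def __init__(self, msg, code=-1):
        super().__init__(msg)
        self.code = code


class ErrorState:
    """Hypothetical stand-in for get_artifact_error_state."""
    def __init__(self, triggered=False):
        self._triggered = triggered

    def is_triggered(self):
        return self._triggered


reported = []  # collects events locally instead of sending telemetry


def report_event(op, message):
    reported.append((op, message))


def launch_command(cmd, return_code):
    # A non-zero return code surfaces as an ExtensionError.
    if return_code != 0:
        raise ExtensionError(
            "{0} exited with {1}".format(cmd, return_code), code=return_code)


def enable(return_code):
    launch_command("enable", return_code)


def handle_enable(return_code):
    enable(return_code)


def handle_error_gated(e, state):
    # Behavior described in this issue: report only when the state is triggered.
    if state.is_triggered():
        report_event("enable", str(e))


def handle_error_always(e):
    # Requested behavior: report every failure.
    report_event("enable", str(e))


def handle_ext_handler(return_code, state, gated):
    try:
        handle_enable(return_code)
    except ExtensionError as e:
        if gated:
            handle_error_gated(e, state)
        else:
            handle_error_always(e)
```

With a state that is not triggered, the gated handler drops the failure silently, while the ungated handler records one event per failed operation.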

jasonzio · 2018-11-15T01:03:46Z

The intention behind the "is_triggered" thing is to avoid flooding the logs when things go bad. The idea was for the event to start out "triggered" and be "reset" the first time it's seen; periodically, all events were reset (in some other thread) to "triggered". If the initial state is erroneously set (i.e. to the "reset" state) or if the periodic reset (a) never happens or (b) actually sets the trigger state incorrectly, then that's the bug.
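A rough sketch of the intended mechanism as described in this comment, not of the agent's actual error-state class: the gate starts "triggered", flips to "reset" the first time the event is seen, and a periodic job re-arms it. The class name TriggerGate and its methods are hypothetical; the lock is there only because the comment says the periodic reset runs in another thread.

```python
import threading


class TriggerGate:
    """Hypothetical rate-limit gate: report once, then stay quiet until re-armed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._triggered = True  # initial state must be "triggered"

    def should_report(self):
        # The first caller after arming gets True and resets the gate.
        with self._lock:
            if self._triggered:
                self._triggered = False
                return True
            return False

    def rearm(self):
        # Called periodically (assumed: from a separate thread) to allow
        # one more report.
        with self._lock:
            self._triggered = True
```

If the gate were constructed with _triggered set to False, or if rearm were never called, should_report would keep returning False and no event would ever be sent, which matches the failure modes listed above.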

You'll want to be very careful of the event volume you see after you've made this change. You may need to revert it in a hurry.

vrdmr · 2018-11-15T07:53:18Z

+@hglkrijger, @boumenot - If I am missing some context, please let me know.

@jasonzio: The current issue is that we have not been sending any telemetry at all since 2.2.30+ for any extension operation failure. I understand that the "is_triggered" approach was a way to introduce a leaky-bucket style of sending telemetry, limiting the telemetry volume. But with the timer-based approach, we miss the real failures in the data. Today, when looking for any operation (install/enable/etc.) failures for extensions in our telemetry, we don't find any.

Based on my investigation, starting with agent 2.2.31, we stopped getting failures for almost all extension operations, which is why we are reverting to the approach that was in v2.2.26 (before #1182 went in).

We are ready to take a hit on the volume for now rather than miss genuine extension failures (caught and reported here). There could be buggy extensions whose operations fail and get stuck in a loop (sending the events many times), but those would be genuine extension errors which need to be reported (in the future we can do a send_once kind of approach).
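One possible reading of the "send_once" idea, sketched under assumptions: remember which (extension, operation, message) combinations have already been reported and skip exact repeats, so a looping extension produces one event per distinct failure. The SendOnce class and its key choice are hypothetical and do not exist in the agent.

```python
class SendOnce:
    """Hypothetical de-duplicator: each distinct failure is reported one time."""

    def __init__(self):
        self._seen = set()

    def should_send(self, extension, operation, message):
        # Key choice is an assumption; a real design might also include
        # the extension version or the exit code.
        key = (extension, operation, message)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Unlike the timer-based gate, this never suppresses the first occurrence of a failure, at the cost of keeping the set of seen keys in memory.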

PS: Even weirder is that there are no failures in the RM table, which we need to fix as well.

FYI: @narrieta, @roiyz-msft, @GaneshMSAzure.

boumenot · 2018-11-15T17:49:56Z

The intent was to avoid transient errors that eventually self-mitigate. Are you concerned that you are not seeing these transient errors, or do you believe these are non-transient and should be captured?

vrdmr self-assigned this Nov 14, 2018

vrdmr added this to the v2.2.34 milestone Nov 14, 2018

vrdmr mentioned this issue Nov 15, 2018

Send events when extensions fail to complete operation #1397

Merged

vrdmr assigned narrieta, GaneshMSAzure and roiyz-msft Nov 15, 2018

vrdmr mentioned this issue Nov 17, 2018

Release 2.2.34 of the agent #1400

Merged

vrdmr closed this as completed Nov 19, 2018
