-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not reporting failure events for extension install/enable/disable #1396
Comments
The intention behind the "is_triggered" thing is to avoid flooding the logs when things go bad. The idea was for the event to start out "triggered" and be "reset" the first time it's seen; periodically, all events were reset (in some other thread) to "triggered". If the initial state is erroneously set (i.e. to the "reset" state) or if the periodic reset (a) never happens or (b) actually sets the trigger state incorrectly, then that's the bug. You'll want to be very careful of the event volume you see after you've made this change. You may need to revert it in a hurry. |
+@hglkrijger, @boumenot - If I am missing some context, please let me know. @jasonzio: The current issue is that we have not been sending any telemetry at all since 2.2.30+ for any extension operation failure. I understand that the “is_triggered” approach was a way to introduce leaky bucket way to send telemetry - limiting the telemetry volume. But with the timer-based approach, we miss the real failures in data. Today, when looking for any operation (install/enable/etc.) failures for extensions in our telemetry, we don't find any. Based on my investigation, starting agent 2.2.31, we stopped getting failures for any extension operation (mostly all of it) and and which is why we are reverting back to the approach which was in the v2.2.26 (before #1182 went in). We are ready to take a hit on the volume currently than miss genuine extension failure (caught and reported here). There could be buggy extensions which could have operations failing, and could get stuck in a loop (send the events many times), but they would be genuine extension errors which need to be reported (we can in future do a send_once kind of approach). PS: Even weird is that there are no failures in the RM table. Which we do need to fix as well. FYI: @narrieta, @roiyz-msft, @GaneshMSAzure. |
The intent was to avoid transient errors that eventually self-mitigate. Are you concerned you are not seeing these transient errors, or do you believe these are non-transient and should be captured. |
The extension commands are run in
launch_command
call inenable(self)
/disable(self)
/install(self)
.If it throws an
ExtensionError
(Which is when return code is non-0 or some other issue), the error is propogated up - eg.launch_command
->enable
->handle_enable
->handle_ext_handler()
, where theexcept ExtensionError as e
catches it, and gives it to thehandle_handle_ext_handler_error
methodToday
handle_handle_ext_handler_error
does not reportreport_event
for all the failures - only whenget_artifact_error_state.is_triggered()
. This is not the right behavior - It should be sneding events for all the exceptions and not only when theget_artifact_error_state.is_triggered()
.The text was updated successfully, but these errors were encountered: