Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not reporting failure events for extension install/enable/disable #1396

Closed
vrdmr opened this issue Nov 14, 2018 · 3 comments
Closed

Not reporting failure events for extension install/enable/disable #1396

vrdmr opened this issue Nov 14, 2018 · 3 comments
Assignees
Milestone

Comments

@vrdmr
Copy link
Member

vrdmr commented Nov 14, 2018

The extension commands are run in launch_command call in enable(self)/disable(self)/install(self).

If it throws an ExtensionError (Which is when return code is non-0 or some other issue), the error is propogated up - eg. launch_command ->enable ->handle_enable -> handle_ext_handler(), where the except ExtensionError as e catches it, and gives it to the handle_handle_ext_handler_error method

Today handle_handle_ext_handler_error does not report report_event for all the failures - only when get_artifact_error_state.is_triggered(). This is not the right behavior - It should be sneding events for all the exceptions and not only when the get_artifact_error_state.is_triggered().

@vrdmr vrdmr self-assigned this Nov 14, 2018
@vrdmr vrdmr added this to the v2.2.34 milestone Nov 14, 2018
@jasonzio
Copy link
Member

The intention behind the "is_triggered" thing is to avoid flooding the logs when things go bad. The idea was for the event to start out "triggered" and be "reset" the first time it's seen; periodically, all events were reset (in some other thread) to "triggered". If the initial state is erroneously set (i.e. to the "reset" state) or if the periodic reset (a) never happens or (b) actually sets the trigger state incorrectly, then that's the bug.

You'll want to be very careful of the event volume you see after you've made this change. You may need to revert it in a hurry.

@vrdmr
Copy link
Member Author

vrdmr commented Nov 15, 2018

+@hglkrijger, @boumenot - If I am missing some context, please let me know.

@jasonzio: The current issue is that we have not been sending any telemetry at all since 2.2.30+ for any extension operation failure. I understand that the “is_triggered” approach was a way to introduce leaky bucket way to send telemetry - limiting the telemetry volume. But with the timer-based approach, we miss the real failures in data. Today, when looking for any operation (install/enable/etc.) failures for extensions in our telemetry, we don't find any.

Based on my investigation, starting agent 2.2.31, we stopped getting failures for any extension operation (mostly all of it) and and which is why we are reverting back to the approach which was in the v2.2.26 (before #1182 went in).

We are ready to take a hit on the volume currently than miss genuine extension failure (caught and reported here). There could be buggy extensions which could have operations failing, and could get stuck in a loop (send the events many times), but they would be genuine extension errors which need to be reported (we can in future do a send_once kind of approach).

PS: Even weird is that there are no failures in the RM table. Which we do need to fix as well.

FYI: @narrieta, @roiyz-msft, @GaneshMSAzure.

@boumenot
Copy link
Member

The intent was to avoid transient errors that eventually self-mitigate. Are you concerned you are not seeing these transient errors, or do you believe these are non-transient and should be captured.

@vrdmr vrdmr mentioned this issue Nov 17, 2018
6 tasks
@vrdmr vrdmr closed this as completed Nov 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants