Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

o365input: Restart after fatal error #21258

Merged
merged 3 commits into from
Sep 29, 2020

Conversation

adriansr
Copy link
Contributor

@adriansr adriansr commented Sep 23, 2020

What does this PR do?

Updates o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error.

This enables the input to be more resilient against errors.

Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes.

Why is it important?

Some users are reporting that the o365 module stops ingesting events after some days. In all cases it's been observed that the input terminated at some point due to errors contacting the Azure authentication server to refresh a token.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Make sure that there's no better way, i.e. some way to make an input v2 restart automatically when it returns an error.

How to test this PR locally

Testing the case of token refresh errors is difficult as they are refreshed once every ~12h. But the behavior can be tested by starting Filebeat without an internet connection.

Update the o365input to restart the input after a fatal error is
encountered, for example an authentication token refresh error or a parsing
error.

This enables the input to be more resilent against transient errors.

Before this patch, the input would index an error document and terminate.
Now it will index an error and restart after a fixed timeout of 5 minutes.
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 23, 2020
@adriansr adriansr requested review from urso and a team September 23, 2020 15:42
@adriansr adriansr added bug needs_backport PR is waiting to be backported to other branches. review Team:SIEM labels Sep 23, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/siem (Team:SIEM)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 23, 2020
@adriansr
Copy link
Contributor Author

@urso can you have a quick look? Is there a better way to "restart" an input?

@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 23, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Started by user Adrian Serrano, Replayed #2]

  • Start Time: 2020-09-24T09:50:23.500+0000

  • Duration: 58 min 45 sec

Test stats 🧪

Test Results
Failed 0
Passed 2501
Skipped 388
Total 2889

@27bharath
Copy link

27bharath commented Sep 27, 2020

Thanks for taking this and for resolving. Can you please let me know any plans when it will be roll out to new version

@adriansr adriansr merged commit 8716d98 into elastic:master Sep 29, 2020
@adriansr
Copy link
Contributor Author

@27bharath this fix will be available in the next versions, 7.9.3 and 7.10.0

v1v added a commit to v1v/beats that referenced this pull request Sep 29, 2020
* upstream/master:
  feat: prepare release pipelines (elastic#21238)
  Add IP validation to Security module (elastic#21325)
  Fixes for new 7.10 rsa2elk datasets (elastic#21240)
  o365input: Restart after fatal error (elastic#21258)
  Fix panic in cgroups monitoring (elastic#21355)
  Handle multiple upstreams in ingress-controller (elastic#21215)
  [CI] Fix runbld when workspace does not exist (elastic#21350)
  [Filebeat] Fix checkpoint (elastic#21344)
  [CI] Archive build reasons (elastic#21347)
  Add dashboard for pubsub metricset in googlecloud module (elastic#21326)
  [Elastic Agent] Allow embedding of certificate (elastic#21179)
  Adds a default for failure_cache.min_ttl (elastic#21085)
  [libbeat] Disk queue implementation (elastic#21176)
@adriansr adriansr added v7.10.0 and removed needs_backport PR is waiting to be backported to other branches. labels Sep 29, 2020
adriansr added a commit to adriansr/beats that referenced this pull request Sep 29, 2020
Update the o365input to restart the input after a fatal error is
encountered, for example an authentication token refresh error or a parsing
error.

This enables the input to be more resilient against transient errors.

Before this patch, the input would index an error document and terminate.
Now it will index an error and restart after a fixed timeout of 5 minutes.

(cherry picked from commit 8716d98)
adriansr added a commit that referenced this pull request Sep 30, 2020
Update the o365input to restart the input after a fatal error is
encountered, for example an authentication token refresh error or a parsing
error.

This enables the input to be more resilient against transient errors.

Before this patch, the input would index an error document and terminate.
Now it will index an error and restart after a fixed timeout of 5 minutes.

(cherry picked from commit 8716d98)
adriansr added a commit to adriansr/beats that referenced this pull request Oct 5, 2020
Update the o365input to restart the input after a fatal error is
encountered, for example an authentication token refresh error or a parsing
error.

This enables the input to be more resilient against transient errors.

Before this patch, the input would index an error document and terminate.
Now it will index an error and restart after a fixed timeout of 5 minutes.

(cherry picked from commit 8716d98)
adriansr added a commit that referenced this pull request Oct 6, 2020
Update the o365input to restart the input after a fatal error is
encountered, for example an authentication token refresh error or a parsing
error.

This enables the input to be more resilient against transient errors.

Before this patch, the input would index an error document and terminate.
Now it will index an error and restart after a fixed timeout of 5 minutes.

(cherry picked from commit 8716d98)
publisher.Publish(event, nil)
ctx.Logger.Errorf("Input failed: %v", err)
ctx.Logger.Infof("Restarting in %v", failureRetryInterval)
time.Sleep(failureRetryInterval)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can block proper filebeat shutdown for 5min. Better use timed.Wait that is cancellable from go-concert.

adriansr added a commit to adriansr/beats that referenced this pull request Oct 7, 2020
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.
@adriansr adriansr mentioned this pull request Oct 7, 2020
2 tasks
adriansr added a commit that referenced this pull request Oct 9, 2020
PR #21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.
adriansr added a commit to adriansr/beats that referenced this pull request Oct 9, 2020
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.

(cherry picked from commit 1abe97b)
adriansr added a commit to adriansr/beats that referenced this pull request Oct 9, 2020
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.

(cherry picked from commit 1abe97b)
adriansr added a commit to adriansr/beats that referenced this pull request Oct 9, 2020
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.

(cherry picked from commit 1abe97b)
adriansr added a commit that referenced this pull request Oct 13, 2020
PR #21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.

(cherry picked from commit 1abe97b)
adriansr added a commit that referenced this pull request Oct 13, 2020
PR #21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.

(cherry picked from commit 1abe97b)
adriansr added a commit that referenced this pull request Oct 13, 2020
PR #21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.

(cherry picked from commit 1abe97b)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
elastic#21386)

Update the o365input to restart the input after a fatal error is
encountered, for example an authentication token refresh error or a parsing
error.

This enables the input to be more resilient against transient errors.

Before this patch, the input would index an error document and terminate.
Now it will index an error and restart after a fixed timeout of 5 minutes.

(cherry picked from commit c723c1e)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't
stop working once a fatal error was found. This updates the restart delay to
use a cancellation-context-aware method so that the input doesn't block
Filebeat termination.

(cherry picked from commit f2ab428)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants