[filebeat][azure-blob-storage] - Fix concurrency issues and flakey tests. #35983
Comments
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
…ue (#36124)

## Type of change
- Bug

## What does this PR do?
This PR fixes the concurrency issues present in the azure blob storage input and the flaky tests issue.

## Why is it important?
Concurrent ops were failing at scale and this fix addresses that issue.

## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- ~~[ ] I have made corresponding changes to the documentation~~
- ~~[ ] I have made corresponding changes to the default configuration files~~
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

## Related issues
- Relates #35983
@ShourieG Thanks for the fix, facing the issue currently.
@sebastienminne It should be available with the release of 8.10 on September 15th
Not sure the fix is effective; it still seems impossible to have more than one max_worker. Getting this stack:

goroutine 60 [running]: goroutine 1 [chan receive, 3 minutes]: goroutine 22 [chan receive]: goroutine 10 [select]: goroutine 16 [IO wait, 3 minutes]: goroutine 82 [chan receive, 3 minutes]: goroutine 57 [select]: goroutine 58 [select]: goroutine 43 [chan receive, 3 minutes]: goroutine 59 [select]: goroutine 65 [select, 3 minutes]: goroutine 152 [chan receive]: goroutine 61 [select]: goroutine 62 [select]: goroutine 64 [syscall, 3 minutes]: goroutine 66 [select, 3 minutes]: goroutine 67 [select, 3 minutes]: goroutine 68 [select, 3 minutes]: goroutine 69 [select, 3 minutes]: goroutine 70 [select, 3 minutes]: goroutine 71 [select, 3 minutes]: goroutine 72 [select, 3 minutes]: goroutine 73 [select, 3 minutes]: goroutine 74 [semacquire, 3 minutes]: goroutine 75 [chan receive, 3 minutes]: goroutine 76 [chan send]: goroutine 77 [chan send]: goroutine 78 [select, 3 minutes]: goroutine 79 [chan receive, 3 minutes]: goroutine 150 [chan receive]: goroutine 148 [chan receive]: goroutine 149 [chan receive]: goroutine 147 [select, 3 minutes]: goroutine 151 [chan receive, 3 minutes]: goroutine 166 [chan receive]: goroutine 141 [chan receive]: goroutine 155 [chan receive]: goroutine 156 [chan receive]: goroutine 157 [chan receive]: goroutine 158 [chan receive]: goroutine 159 [chan receive]: goroutine 160 [chan receive]: goroutine 161 [chan receive]: goroutine 178 [chan receive]: goroutine 179 [chan receive]: goroutine 180 [chan receive]: goroutine 181 [chan receive]: goroutine 182 [chan receive]: goroutine 183 [chan receive]: goroutine 184 [chan receive]: goroutine 185 [chan receive]: goroutine 186 [chan receive]: goroutine 187 [chan receive]: goroutine 188 [chan receive]: goroutine 189 [chan receive]: goroutine 190 [chan receive]: goroutine 191 [chan receive]: goroutine 192 [chan receive]: goroutine 193 [chan receive]: goroutine 194 [chan receive]: goroutine 195 [chan receive]: goroutine 196 [chan receive]: goroutine 197 [chan receive]: goroutine 198 [chan receive]: goroutine 199 [chan receive]: goroutine 200 [chan receive]: goroutine 201 [chan receive]: goroutine 202 [chan receive]: goroutine 2718 [runnable]: goroutine 2705 [select]: goroutine 2752 [select]: goroutine 2779 [chan receive]: goroutine 2511 [select]: goroutine 2704 [IO wait]: goroutine 2510 [IO wait]: goroutine 2776 [sync.Mutex.Lock]: goroutine 2807 [chan receive]: goroutine 2751 [select]: goroutine 2778 [select]:
@sebastienminne I don't see any specific error or panic in the stack trace though. What exactly seems to be the issue, or am I missing something?
@sebastienminne Are you getting this stack trace every time you use a worker count > 1? Is filebeat crashing immediately? Could you share your config?
@ShourieG This happens when workers > 1, sometimes at startup, sometimes a few minutes after start.

filebeat.inputs:
output.kafka:
What I can observe as well is an incredible number of 'listBlobs' operations (43 M a day), which costs a lot of money.
Sorry, I just noticed the first line of the error wasn't in the stack trace: fatal error: concurrent map iteration and map write
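For context, Go's runtime aborts the whole process with exactly this fatal error when one goroutine ranges over a map while another goroutine writes to it without synchronization; unlike a panic, it cannot be recovered. Below is a minimal, self-contained sketch of the error class and the usual mutex guard; the names are illustrative and this is not the filebeat source:

```go
package main

import "sync"

// blobState mimics shared input state, e.g. a map of blob names to offsets.
type blobState struct {
	mu      sync.Mutex
	offsets map[string]int64
}

func main() {
	s := &blobState{offsets: make(map[string]int64)}

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer: updates the map as blobs are processed.
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			s.mu.Lock()
			s.offsets["blob"] = int64(i)
			s.mu.Unlock()
		}
	}()

	// Reader: iterates the map, e.g. to checkpoint progress.
	// Removing the Lock/Unlock pairs in both goroutines reproduces:
	//   fatal error: concurrent map iteration and map write
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			s.mu.Lock()
			for range s.offsets {
			}
			s.mu.Unlock()
		}
	}()

	wg.Wait()
}
```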
@sebastienminne, extremely weird... because with 8.10 we even introduced concurrency tests that work with even 2000 workers after the fix. Could you clear your registry, do a clean stack update, and try once more? If it still occurs after that, could you check if it's happening when using multiple containers or a single container?
@ShourieG the test was made after cleaning the registry
@sebastienminne I was finally able to reproduce the issue locally, and the problem lies in how partial saves are done at the moment. Even with mutex locks, some resources are still being accessed concurrently, which should not be possible. Either way, in 8.11 partial saves in their current form are being removed, which will solve this issue for good.
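A hedged sketch of the kind of pitfall described above (the names and structure are illustrative, not the actual input code): a mutex exists, but it only guards handing out a reference to the live map, so a later partial save iterates that map while workers keep writing to it. Copying under the lock before serializing removes the race:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// checkpoint mimics per-blob progress that workers update while a
// background goroutine periodically persists ("partially saves") it.
type checkpoint struct {
	mu       sync.Mutex
	progress map[string]int64
}

// snapshotUnsafe returns the live map. The lock is released before the
// caller iterates it, so a concurrent writer can still trigger
// "concurrent map iteration and map write" inside json.Marshal.
func (c *checkpoint) snapshotUnsafe() map[string]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.progress
}

// snapshotSafe copies the map while holding the lock, so serialization
// afterwards never touches memory that workers are still mutating.
func (c *checkpoint) snapshotSafe() map[string]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]int64, len(c.progress))
	for k, v := range c.progress {
		out[k] = v
	}
	return out
}

// save serializes the safe copy rather than the live map.
func (c *checkpoint) save() ([]byte, error) {
	return json.Marshal(c.snapshotSafe())
}

func main() {
	c := &checkpoint{progress: map[string]int64{"blob-1": 42}}
	if b, err := c.save(); err == nil {
		fmt.Println(string(b))
	}
}
```

The same shape of bug would explain why the trace above ends in a fatal error even though locks are present: the lock has to cover the whole iteration, not just the handoff of the map reference.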
@ShourieG thanks for investigating, I'll give 8.11 a try
There are a couple of concurrency issues inside the azure blob storage input, along with some flaky tests. This issue describes what needs fixing.
Related Issues: #35126