-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix broadcast of large messages #4669
Fix broadcast of large messages #4669
Conversation
Signed-off-by: Mariana Franco <marfram@amazon.com>
@pracucci Could you take a look on this one since you worked on the change that is going to be reverted? There is probably some work that need to be done on the KeyValuePair.Unmarshal() method before this can be enabled again. |
According to https://github.com/grafana/dskit/blob/01ce9286d7d52c7ddcfd0a63ef42d240ea4454d2/go.mod#L43 I think we also need to add similar I'm trying to apply hashicorp/memberlist@8d2a27a in my local repo and run |
After applying hashicorp/memberlist@8d2a27a and running We shall discuss path forward. |
I agree the current code in Cortex is not working. There are two ways you can handle it:
That change is not enough. In github.com/grafana/memberlist we did more changes to properly fix it. You can diff github.com/grafana/memberlist with github.com/hashicorp/memberlist to see the actual differences. |
@pracucci I will try to replace the memberlist implementation with github.com/grafana/memberlist, but I was thinking... if these large messages are causing pods that don't have this memberlist change to crash, should we also add this behind some feature flag so it can be enabled in two steps? Otherwise, this will probably cause problems during the cortex upgrade. Btw, we have not see any OOM since we deployed this change on our load test environment, but we will keep it running for a couple of days more before we can confirm that this fixes #4668 |
That's a valid concern. I think we should add it behind feature flag to allow for a smoother migration.
We've experienced the same in the past when testing memberlist. It's caused by decoding of corrupted packets. It makes sense that this PR would fix it (or switching memberlist to github.com/grafana/memberlist). |
Thanks @pracucci After trying with For release 1.12 I am thinking to just replace the hashcorp memberlist with grafana memberlist, because there was point in time where large message works, and we should keep it like that for release 1.12, I am less inclined to revert large message support for release 1.12. With that said, I am open to other thoughts. |
I'm fine to move to |
Signed-off-by: Mariana Franco <marfram@amazon.com>
…arge-messages option Signed-off-by: Mariana Franco <marfram@amazon.com>
e37350c
to
554feff
Compare
Signed-off-by: Mariana Franco <marfram@amazon.com>
@@ -3,8 +3,9 @@ | |||
## master / unreleased | |||
|
|||
* [FEATURE] Ruler: Add `external_labels` option to tag all alerts with a given set of labels. | |||
* [FEATURE] Memberlist: Add `-memberlist.enable-broadcast-of-large-messages` option to enable broadcast of messages with more than 64KB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should release this as part of 1.12, so can you kindly:
- Make this PR's target to be the release-1.12 branch instead of master; We will merge from release branch to master later on.
- Put this line under the 1.12.0 change log.
Thank you so much for working on this feature flag,
@@ -3965,6 +3965,11 @@ The `memberlist_config` configures the Gossip memberlist. | |||
# CLI flag: -memberlist.message-history-buffer-bytes | |||
[message_history_buffer_bytes: <int> | default = 0] | |||
|
|||
# Enable the broadcast of messages with more than 64KB. This can be safely |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to briefly mention why one might not want to enable this "feature"?
@alvinlin123 @pracucci I updated this PR to replace the memberlist implementation with github.com/grafana/memberlist and also added a new option to allow for smoother migrations. |
Side note: I opened a PR to upstream the fix hashicorp/memberlist#260. |
I think I said something wrong. When I replied a couple of days ago I didn't remember the exact memberlist fix (I remember I fixed, but not the details). Today I opened a PR to upstream that fix, so I went through the code changes once again. The fix is only on the sender side, so I don't think we need any CLI flag to conditionally enable it. |
Great! I created a new PR targeting the release-1.12 branch that only replaces the memberlist implementation: #4671 |
What this PR does:
Reverting grafana/dskit#85 as pods are failing to unmarshal large messages causing
failed to unmarshal received KV Pair
errors. This can be verified by increasing the number of cortex instances on theTestSingleBinaryWithMemberlistScaling
integration test.I tried to update cortex to include grafana/memberlist#1 but the problem persisted.
This also seems to be related to pods getting OOM killed: #4668
Which issue(s) this PR fixes:
Fixes #4668 (still need to be confirmed).
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]