Fix broadcast of large messages #4669

marianafranco · 2022-03-08T19:57:17Z

What this PR does:
Reverting grafana/dskit#85 because pods are failing to unmarshal large messages, causing `failed to unmarshal received KV Pair` errors. This can be verified by increasing the number of Cortex instances in the `TestSingleBinaryWithMemberlistScaling` integration test.

I tried to update cortex to include grafana/memberlist#1 but the problem persisted.

This also seems to be related to pods getting OOM killed: #4668

Which issue(s) this PR fixes:
Fixes #4668 (still needs to be confirmed).

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Mariana Franco <marfram@amazon.com>

marianafranco · 2022-03-08T20:56:08Z

@pracucci Could you take a look at this one, since you worked on the change that is going to be reverted? There is probably some work that needs to be done on the `KeyValuePair.Unmarshal()` method before this can be enabled again.

alvinlin123 · 2022-03-09T02:10:37Z

According to https://github.com/grafana/dskit/blob/01ce9286d7d52c7ddcfd0a63ef42d240ea4454d2/go.mod#L43

I think we also need to add a similar replace directive to Cortex's go.mod, so that Cortex picks up this change: hashicorp/memberlist@8d2a27a

I'm applying hashicorp/memberlist@8d2a27a in my local repo and running the `TestSingleBinaryWithMemberlistScaling` integration test to check whether the `failed to unmarshal received KV Pair` errors still appear.

alvinlin123 · 2022-03-09T04:08:54Z

After applying hashicorp/memberlist@8d2a27a and running the `TestSingleBinaryWithMemberlistScaling` integration test, I am still seeing `failed to unmarshal received KV Pair`.

We should discuss the path forward.

pracucci · 2022-03-09T07:36:43Z

I agree the current code in Cortex is not working. There are two ways you can handle it:

* Revert the change (as done in this PR)
* Replace the memberlist implementation with github.com/grafana/memberlist

> After applying hashicorp/memberlist@8d2a27a and running the `TestSingleBinaryWithMemberlistScaling` integration test, I am still seeing `failed to unmarshal received KV Pair`.

That change is not enough. In github.com/grafana/memberlist we made more changes to properly fix it. You can diff github.com/grafana/memberlist against github.com/hashicorp/memberlist to see the actual differences.

marianafranco · 2022-03-09T13:12:43Z

@pracucci I will try to replace the memberlist implementation with github.com/grafana/memberlist, but I was thinking: if these large messages cause pods that don't have this memberlist change to crash, should we also put this behind a feature flag so it can be enabled in two steps? Otherwise, this will probably cause problems during the Cortex upgrade.

By the way, we have not seen any OOMs since we deployed this change on our load test environment, but we will keep it running for a couple more days before we can confirm that this fixes #4668.

pracucci · 2022-03-09T13:57:23Z

> should we also put this behind a feature flag so it can be enabled in two steps?

That's a valid concern. I think we should put it behind a feature flag to allow for a smoother migration.

> By the way, we have not seen any OOMs since we deployed this change on our load test environment, but we will keep it running for a couple more days before we can confirm that this fixes #4668.

We've experienced the same in the past when testing memberlist. It's caused by decoding of corrupted packets. It makes sense that this PR would fix it (or switching memberlist to github.com/grafana/memberlist).
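The thread does not spell out how decoding corrupted packets leads to OOM kills. One generic way this can happen in length-prefixed protocols is that a corrupted length field makes the receiver allocate a huge buffer. The following is a minimal sketch of that failure mode and a bounds check that avoids it. It is not the memberlist code: the frame layout, `maxFrameSize`, `readFrame`, `encodeFrame`, and `decodeFrame` are all hypothetical names chosen for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// maxFrameSize is a hypothetical upper bound on a single frame (1 MiB).
const maxFrameSize = 1 << 20

var errFrameTooLarge = errors.New("frame length exceeds limit")

// readFrame reads a 4-byte big-endian length prefix followed by that many
// payload bytes. The length is validated BEFORE allocating, so a corrupted
// prefix cannot trigger a multi-gigabyte allocation.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrameSize {
		return nil, errFrameTooLarge
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// encodeFrame prepends the 4-byte length prefix to payload.
func encodeFrame(payload []byte) []byte {
	out := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint32(out, uint32(len(payload)))
	copy(out[4:], payload)
	return out
}

// decodeFrame is a convenience wrapper around readFrame for byte slices.
func decodeFrame(b []byte) ([]byte, error) {
	return readFrame(bytes.NewReader(b))
}

func main() {
	// A well-formed frame round-trips.
	payload, err := decodeFrame(encodeFrame([]byte("hello")))
	fmt.Println(string(payload), err)

	// A corrupted prefix claiming ~4 GiB is rejected instead of allocated.
	_, err = decodeFrame([]byte{0xff, 0xff, 0xff, 0xff})
	fmt.Println(err)
}
```

Without the `n > maxFrameSize` check, the second call would attempt to allocate roughly 4 GiB before discovering that the payload is missing.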

alvinlin123 · 2022-03-10T09:04:24Z

Thanks @pracucci. After trying with `replace github.com/hashicorp/memberlist v0.2.4 => github.com/grafana/memberlist v0.2.5-0.20211201083710-c7bc8e9df94b`, the integration test no longer prints the unmarshal error. Whether or not the OOM issue persists is another story :)

For release 1.12, I am thinking of just replacing the hashicorp memberlist with the grafana memberlist: there was a point in time when large messages worked, and we should keep it that way for release 1.12, so I am less inclined to revert large message support. That said, I am open to other thoughts.

pracucci · 2022-03-10T09:08:24Z

I'm fine with moving to github.com/grafana/memberlist: our OOM issues have been solved by using it.

alvinlin123 · 2022-03-11T01:51:00Z

CHANGELOG.md

@@ -3,8 +3,9 @@
 ## master / unreleased

 * [FEATURE] Ruler: Add `external_labels` option to tag all alerts with a given set of labels.
+* [FEATURE] Memberlist: Add `-memberlist.enable-broadcast-of-large-messages` option to enable broadcast of messages with more than 64KB.


We should release this as part of 1.12, so can you kindly:

* Change this PR's target to the release-1.12 branch instead of master; we will merge from the release branch to master later on.
* Put this line under the 1.12.0 changelog.

Thank you so much for working on this feature flag.

alvinlin123 · 2022-03-11T01:52:17Z

docs/configuration/config-file-reference.md

@@ -3965,6 +3965,11 @@ The `memberlist_config` configures the Gossip memberlist.
 # CLI flag: -memberlist.message-history-buffer-bytes
 [message_history_buffer_bytes: <int> | default = 0]

+# Enable the broadcast of messages with more than 64KB. This can be safely


Do we need to briefly mention why one might not want to enable this "feature"?

marianafranco · 2022-03-11T02:02:05Z

@alvinlin123 @pracucci I updated this PR to replace the memberlist implementation with github.com/grafana/memberlist and also added a new option to allow for smoother migrations.
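For readers who want a picture of what such an option could look like, here is a minimal sketch of a boolean flag gating the broadcast of messages larger than 64KB. Only the flag name and the 64KB threshold come from this thread; the `Config` struct, `RegisterFlags`, `shouldBroadcast`, `parseArgs`, and the default value of `false` are assumptions made for illustration and are not taken from the PR's diff.

```go
package main

import (
	"flag"
	"fmt"
)

// maxBroadcastSize is the 64KB threshold mentioned in the changelog entry.
const maxBroadcastSize = 64 * 1024

// Config is a hypothetical, simplified stand-in for the memberlist KV config.
type Config struct {
	EnableBroadcastOfLargeMessages bool
}

// RegisterFlags registers the option on the given flag set.
// The default of false is an assumption, chosen so that the feature
// could be enabled in a second rollout step.
func (c *Config) RegisterFlags(f *flag.FlagSet) {
	f.BoolVar(&c.EnableBroadcastOfLargeMessages,
		"memberlist.enable-broadcast-of-large-messages", false,
		"Enable the broadcast of messages with more than 64KB.")
}

// shouldBroadcast reports whether a message of msgSize bytes may be
// broadcast: messages up to 64KB always, larger ones only when enabled.
func (c *Config) shouldBroadcast(msgSize int) bool {
	if msgSize <= maxBroadcastSize {
		return true
	}
	return c.EnableBroadcastOfLargeMessages
}

// parseArgs builds a Config from command-line style arguments.
func parseArgs(args []string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	cfg.RegisterFlags(fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseArgs([]string{"-memberlist.enable-broadcast-of-large-messages=true"})
	if err != nil {
		panic(err)
	}
	// A 100KB message is only broadcast because the flag is set.
	fmt.Println(cfg.shouldBroadcast(100 * 1024))
}
```

With a gate like this, an operator could first roll out the new binary everywhere with the flag off, and only then turn it on, which is the two-step migration discussed above.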

pracucci · 2022-03-11T14:05:56Z

Side note: I opened a PR to upstream the fix: hashicorp/memberlist#260.

pracucci · 2022-03-11T14:09:17Z

> > should we also put this behind a feature flag so it can be enabled in two steps?
>
> That's a valid concern. I think we should put it behind a feature flag to allow for a smoother migration.

I think I said something wrong.

When I replied a couple of days ago, I didn't remember the exact memberlist fix (I remembered that I had fixed it, but not the details). Today I opened a PR to upstream that fix, so I went through the code changes once again. The fix is only on the sender side, so I don't think we need any CLI flag to conditionally enable it.

marianafranco · 2022-03-11T20:05:19Z

> The fix is only on the sender side, so I don't think we need any CLI flag to conditionally enable it.

Great! I created a new PR targeting the release-1.12 branch that only replaces the memberlist implementation: #4671

Revert change to allow broadcast of large messages

0911a1f

Signed-off-by: Mariana Franco <marfram@amazon.com>

pull-request-size bot added the size/S label Mar 8, 2022

marianafranco changed the title ~~Revert change to allow broadcast of large messages~~ Fix broadcast of large messages Mar 10, 2022

Fix broadcast of large messages + add new option to enable it

74d5580

Signed-off-by: Mariana Franco <marfram@amazon.com>

pull-request-size bot added size/M and removed size/S labels Mar 11, 2022

Update changelog to mention the new -memberlist.enable-broadcast-of-large-messages option

554feff

Signed-off-by: Mariana Franco <marfram@amazon.com>

marianafranco force-pushed the revert-broadcast-large-messages branch from e37350c to 554feff Compare March 11, 2022 01:01

go mod tidy

089f776

Signed-off-by: Mariana Franco <marfram@amazon.com>

alvinlin123 reviewed Mar 11, 2022

Fix docs

7f52a75

Signed-off-by: Mariana Franco <marfram@amazon.com>

marianafranco mentioned this pull request Mar 11, 2022

Fix broadcast of messages larger than 64KB #4671

Merged

3 tasks

marianafranco closed this Mar 11, 2022