Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-18601: Assume a baseline of 3.3 for server protocol versions #18845

Open
wants to merge 24 commits into
base: trunk
Choose a base branch
from

Conversation

ijuma
Copy link
Member

@ijuma ijuma commented Feb 9, 2025

3.3.0 was the first KRaft release that was deemed production-ready and also when KIP-778 (KRaft to KRaft upgrades) landed. Given that, it's reasonable for 4.x to only support upgrades from 3.3.0 or newer.

Noteworthy changes:

  1. AlterPartition no longer includes topic names, which makes it possible to simplify AlterParitionManager logic.
  2. Metadata versions older than IBP_3_3_IV3 have been removed and IBP_3_3_IV3 is now the minimum version.
  3. MINIMUM_BOOTSTRAP_VERSION has been removed.
  4. Removed isLeaderRecoverySupported, isNoOpsRecordSupported, isKRaftSupported, isBrokerRegistrationChangeRecordSupported and isInControlledShutdownStateSupported - these are always true now. Also removed related conditional code.
  5. A number of tests were not useful anymore and have been removed.
  6. ...

TODO:

  1. Fix failing tests
  2. Update upgrade notes
  3. Explain the changes in more detail

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ijuma ijuma requested a review from Copilot February 9, 2025 17:04
@github-actions github-actions bot added core Kafka Broker tools kraft storage Pull requests that target the storage module clients labels Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 61 out of 78 changed files in this pull request and generated no comments.

Files not reviewed (17)
  • clients/src/main/resources/common/message/AlterPartitionRequest.json: Language not supported
  • clients/src/main/resources/common/message/AlterPartitionResponse.json: Language not supported
  • core/src/main/scala/kafka/cluster/Partition.scala: Language not supported
  • core/src/main/scala/kafka/server/AlterPartitionManager.scala: Language not supported
  • core/src/main/scala/kafka/server/ControllerRegistrationManager.scala: Language not supported
  • core/src/main/scala/kafka/server/KafkaConfig.scala: Language not supported
  • core/src/main/scala/kafka/tools/StorageTool.scala: Language not supported
  • core/src/main/scala/kafka/tools/TestRaftServer.scala: Language not supported
  • core/src/test/scala/integration/kafka/coordinator/transaction/ProducerIntegrationTest.scala: Language not supported
  • core/src/test/scala/integration/kafka/server/KRaftClusterTest.scala: Language not supported
  • core/src/test/scala/integration/kafka/server/MetadataVersionIntegrationTest.scala: Language not supported
  • core/src/test/scala/kafka/server/RemoteLeaderEndPointTest.scala: Language not supported
  • core/src/test/scala/unit/kafka/cluster/PartitionLockTest.scala: Language not supported
  • core/src/test/scala/unit/kafka/cluster/PartitionTest.scala: Language not supported
  • core/src/test/scala/unit/kafka/server/AbstractApiVersionsRequestTest.scala: Language not supported
  • core/src/test/scala/unit/kafka/server/AlterPartitionManagerTest.scala: Language not supported
  • core/src/test/scala/unit/kafka/server/BrokerRegistrationRequestTest.scala: Language not supported
Comments suppressed due to low confidence (3)

clients/src/test/java/org/apache/kafka/common/requests/AlterPartitionRequestTest.java:46

  • The removal of the topicName field might lead to incomplete test coverage if any tests rely on it. Verify if topicName is essential for any test cases.
TopicData topicData = new TopicData().setTopicId(topicId);

clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java:1705

  • The removal of the 'topicName' field may cause unintended side effects if the 'topicName' is required elsewhere in the codebase. Please verify if 'topicName' is needed.
.setTopicId(Uuid.randomUuid())

clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java:1723

  • The removal of the 'topicName' field may cause unintended side effects if the 'topicName' is required elsewhere in the codebase. Please verify if 'topicName' is needed.
.setTopicId(Uuid.randomUuid())
@ijuma ijuma force-pushed the kafka-18601-assume-baseline-3.3-for-server branch from 3643409 to a7fbe00 Compare February 10, 2025 03:49
@ijuma ijuma force-pushed the kafka-18601-assume-baseline-3.3-for-server branch 3 times, most recently from 4b19a3d to 890c3b2 Compare February 10, 2025 05:52
@@ -53,7 +53,8 @@ default void print(MetadataNodePrinter printer) {
for (String name : names) {
printer.enterNode(name);
MetadataNode child = child(name);
child.print(printer);
if (child != null)
child.print(printer);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes a NPE I noticed while debugging a failing test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In reading the code of child, I notice that it seems the AclsImageByIdNode is misplaced.

    private static final Map<String, Function<MetadataImage, MetadataNode>> CHILDREN = Map.of(
         ...
         // AclsImageByIdNode should be replaced by AclsImageNode
        AclsImageNode.NAME, image -> new AclsImageByIdNode(image.acls()),
        ScramImageNode.NAME, image -> new ScramImageNode(image.scram()),
        DelegationTokenImageNode.NAME, image -> new DelegationTokenImageNode(image.delegationTokens())
    );

AclsImageNode.NAME, image -> new AclsImageByIdNode(image.acls()),

and AclsImage#toString has similar issue.

    @Override
    public String toString() {
        // AclsImageByIdNode should be replaced by AclsImageNode
        return new AclsImageByIdNode(this).stringify();
    }

https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/image/AclsImage.java#L78

@ijuma @mumrah WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Shall we file a separate JIRA ticket for that? Perhaps we can wait for @mumrah to confirm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it's definitely a bug, but I don't think it relates to this PR. The *Node classes are used by the metadata shell and some toString methods, so they are just informational. I think the exact side effect is that acls would appear at the wrong level of the metadata shell "tree".

I suggest we file a separate issue and fix it for 4.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijuma @mumrah thanks for your response. open https://issues.apache.org/jira/browse/KAFKA-18803 for it

* Test downgrading to a MetadataVersion that doesn't support FeatureLevelRecord.
*/
@Test
public void testPremodernVersion() {
Copy link
Member Author

@ijuma ijuma Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support these anymore, so remove it.

* Test downgrading to a MetadataVersion that doesn't support inControlledShutdown.
*/
@Test
public void testPreControlledShutdownStateVersion() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now always have the controlled shutdown state.

@ijuma ijuma force-pushed the kafka-18601-assume-baseline-3.3-for-server branch from 890c3b2 to e654cdb Compare February 10, 2025 06:06
@ijuma ijuma force-pushed the kafka-18601-assume-baseline-3.3-for-server branch from 9622ce9 to 86d1053 Compare February 10, 2025 20:54
}

@Test
public void testKRaftVersions() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This did not seem to add much value after the changes in this PR.

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijuma thanks for this patch. overall LGTM. some minor comments remains.

…ssume-baseline-3.3-for-server

* apache-github/trunk:
  KAFKA-18366 Remove KafkaConfig.interBrokerProtocolVersion (apache#18820)
  KAFKA-18658 add import control for examples module (apache#18812)
  MINOR: Java version and TLS documentation improvements (apache#18822)
  KAFKA-18743 Remove leader.imbalance.per.broker.percentage as it is not supported by Kraft (apache#18821)
  KAFKA-18225 ClientQuotaCallback#updateClusterMetadata is unsupported by kraft (apache#18196)
  KAFKA-18441: Remove flaky tag on KafkaAdminClientTest#testAdminClientApisAuthenticationFailure (apache#18847)
  MINOR: fix KStream#to incorrect javadoc (apache#18838)
  KAFKA-18763: changed the assertion statement for acknowledgements to include only successful acks (apache#18846)
  MINOR: Accept specifying consumer group assignors by their short names (apache#18832)
@ijuma
Copy link
Member Author

ijuma commented Feb 11, 2025

@junrao Thanks for the review. I addressed your comments and/or replied within the relevant thread. I also pushed a couple of commits for changes I had in progress: (1) Use MINIMUM_VERSION in more places (to reduce future changes) and (2) Set topic id more consistently in PartitionTest. Finally, I also merged from trunk hoping it fixes some of the test failures.

Let's see how the test results look after this.

Copy link
Contributor

@m1a2st m1a2st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ijuma for this PR, left some commments

@ijuma
Copy link
Member Author

ijuma commented Feb 11, 2025

Besides the open discussion threads, there are two additional issues to consider/resolve:

  1. Some tests in PlaintextAdminIntegrationTestare failing in the CI runs, but not locally. Unclear why that's the case.
  2. FeaturesDelta.replay reads feature levels from the log and converts them to MetadataVersion. It's possible (but unlikely) that someone ended up with 3.3-IV0, 3.3-IV1, 3.3-IV2 in the log if using an unreleased version of Apache Kafka 3.3. We should consider automatically converting to 3.3-IV3 in such cases.

The failures have the following error:

org.apache.kafka.server.fault.FaultHandlerException: quorumTestHarnessFaultHandler: Broker configuration does not support the cluster MetadataVersion: requirement failed: Multiple log directories (aka JBOD) are not supported in the current MetadataVersion 3.3-IV3. Need 3.7-IV2 or higher

Even though there is a log message just before saying:

Formatting metadata directory /tmp/kafka-2205696658360646817 with metadata.version 4.0-IV3.

Copy link
Contributor

@junrao junrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijuma : Thanks for the updated PR. Made a pass of all files. A few more comments.

singletonMap(MetadataVersion.FEATURE_NAME, MetadataVersion.IBP_3_2_IV0.featureLevel()),
singletonMap(MetadataVersion.FEATURE_NAME, FeatureUpdate.UpgradeType.UNSAFE_DOWNGRADE),
true));
setFeatureLevel(MetadataVersion.IBP_3_4_IV0.featureLevel()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use MetadataVersion.MINIMUM_VERSION?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously checking that the metadata version would change and using minimum would take it backwards. But I did notice an issue where we are actually using the same version versus a newer one - fixing that.

@@ -163,7 +163,7 @@ public void testEmpty() {
assertFalse(new FeaturesImage(Collections.singletonMap("foo", (short) 1),
FeaturesImage.EMPTY.metadataVersion()).isEmpty());
assertFalse(new FeaturesImage(FeaturesImage.EMPTY.finalizedVersions(),
MetadataVersion.IBP_3_3_IV0).isEmpty());
MetadataVersion.IBP_3_4_IV0).isEmpty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this can be any MV, could we just use MetadataVersion.MINIMUM_VERSION?

Copy link
Member Author

@ijuma ijuma Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That causes the test to fail. This is related to your question of whether FeatuesImage.isEmpty should be changed not to include metadataVersion.equals(MetadataVersion.MINIMUM_VERSION).

Copy link
Contributor

@junrao junrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there are a few references to 3.0 in docs/ops.html that need to be changed.

Feature: metadata.version SupportedMinVersion: 3.0-IV1 SupportedMaxVersion: 3.9-IV0 FinalizedVersionLevel: 3.9-IV0 Epoch: 5

Copy link
Contributor

@m1a2st m1a2st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for @ijuma update, left some comments

// We automatically fallback to IBP_3_3_IV3 in that case. We use explicit versions instead of `MINIMUM_VERSION` because
// we want to force an explicit decision if we change `MetadataVersion.MINIMUM_VERSION` in the future.
if (record.featureLevel() >= MINIMUM_PERSISTED_FEATURE_LEVEL && record.featureLevel() <= MetadataVersion.IBP_3_3_IV3.featureLevel())
metadataVersionChange = MetadataVersion.IBP_3_3_IV3;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junrao @cmccabe Thoughts on this fallback? It seems like quite a rare edge case (since it requires usage of a pre-release version and for the older metadata version not to be compacted away if I understand correctly), but it seems fair to never fail while reading the log and the fallback code is pretty simple.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only support upgrading from officially released 3.3 to 4.0, we could probably just fail MV 3_3_IV0 to 3_3_IV2 as other unsupported MVs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it guaranteed that we won't have a record with the older metadata version in the log in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee, but the expected MV is IBP_3_3_IV3 and above. If this is not the case, it seems easier to just fail instead of changing it silently.

Copy link
Contributor

@junrao junrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijuma : Thanks for the updated PR. A few more comments.


private static final MetadataVersionChange CHANGE_3_3_IV0_TO_3_0_IV1 =
new MetadataVersionChange(IBP_3_3_IV0, IBP_3_0_IV1);
private static final MetadataVersionChange CHANGE_3_6_IV0_TO_3_3_IV3 =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change this to use MetadataVersion.MINIMUM_VERSION and MetadataVersion.latestProduction() too?

// We automatically fallback to IBP_3_3_IV3 in that case. We use explicit versions instead of `MINIMUM_VERSION` because
// we want to force an explicit decision if we change `MetadataVersion.MINIMUM_VERSION` in the future.
if (record.featureLevel() >= MINIMUM_PERSISTED_FEATURE_LEVEL && record.featureLevel() <= MetadataVersion.IBP_3_3_IV3.featureLevel())
metadataVersionChange = MetadataVersion.IBP_3_3_IV3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only support upgrading from officially released 3.3 to 4.0, we could probably just fail MV 3_3_IV0 to 3_3_IV2 as other unsupported MVs.

}
return ControllerResult.of(records, false);
},
() -> ControllerResult.of(List.of(new ApiMessageAndVersion(new NoOpRecord(), (short) 0)), false),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc above on if the MetadataVersion supports it. needs to be changed too.

@@ -56,7 +53,7 @@ public FeaturesImage(

public boolean isEmpty() {
return finalizedVersions.isEmpty() &&
metadataVersion.equals(MetadataVersion.MINIMUM_KRAFT_VERSION);
metadataVersion.equals(MetadataVersion.MINIMUM_VERSION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isEmpty() is only used in MetadataLoaderTest. So, we can address this in a followup jira. But I am not sure if isEmpty() will ever be false given that metadata version is required in the metadata log and probably the snapshot.

Copy link
Contributor

@m1a2st m1a2st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for @ijuma update, one more comment

@github-actions github-actions bot added the build Gradle build or GitHub Actions label Feb 13, 2025
@ijuma ijuma force-pushed the kafka-18601-assume-baseline-3.3-for-server branch 2 times, most recently from a5d0b49 to 65a7166 Compare February 13, 2025 16:56
@dajac dajac added the Blocker This pull request is identified as solving a blocker for a release. label Feb 14, 2025
@ijuma ijuma force-pushed the kafka-18601-assume-baseline-3.3-for-server branch from b6eabaa to 70667a4 Compare February 14, 2025 15:57
…ssume-baseline-3.3-for-server

* apache-github/trunk: (36 commits)
  MINOR: Add KIP-848's metric to the doc (apache#18890)
  KAFKA-18772 Define share group config defaults for Docker (apache#18899)
  KAFKA-18733: Updating share group metrics (1/N) (apache#18826)
  KAFKA-18634: Fix ELR metadata version issues (apache#18680)
  KAFKA-18298 Fix flaky testConsumerGroupsDeprecatedConsumerGroupState and testConsumerGroups in PlaintextAdminIntegrationTest (apache#18513)
  MINOR: Marking testVerifyFetchAndCloseImplicit flaky (apache#18893)
  MINOR: Adjust javadoc to reflect the correct status of standby task TopicPartition (apache#18892)
  KAFKA-17182: Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer (apache#18795)
  KAFKA-16720: Support multiple groups in DescribeShareGroupOffsets RPC (apache#18834)
  KAFKA-18654[2/2]: Transction V2 retry add partitions on the server side when handling produce request. (apache#18810)
  MINOR: fix warn log message in Kafka Streams (apache#18878)
  KAFKA-18776: Fix flaky coordinator disconnect test & fix log level (apache#18866)
  KAFKA-17298: Update upgrade notes for 4.0 KIP-848 (apache#18756)
  MINOR bump setup-gradle (apache#18879)
  MINOR: Run javadoc as part of check task (apache#18863)
  MINOR: Updated share partition manager tests to close and other fixes (apache#18862)
  MINOR: Fix typo in ClusterControlManager (apache#18886)
  KAFKA-18728 Move ListOffsetsPartitionStatus to server module (apache#18807)
  MINOR: cleanup KStream JavaDocs (14/N) - stream-globalTable-left-join (apache#18867)
  MINOR: cleanup KStream JavaDocs (13/N) - stream-stream-outer-join (apache#18865)
  ...
…ssume-baseline-3.3-for-server

* apache-github/trunk:
  MINOR: Add release notes for Transactions Server Side Defense (KIP-890) (apache#18896)
  MINOR: TransactionManager logs the epoch bump less frequently. (apache#18895)
  MINOR: Mark IBP_4_0_IV3 as production ready! (apache#18902)
@@ -100,7 +100,7 @@ public boolean hasSeenRecord() {
*/
public final void resetToImage(MetadataImage image) {
this.image = image;
this.hasSeenRecord = true;
this.hasSeenRecord = !image.isEmpty();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an existing issue that somehow became easier to trigger due to these changes. Thanks to @mumrah for figuring out that this was the reason for the remaining failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker This pull request is identified as solving a blocker for a release. build Gradle build or GitHub Actions clients core Kafka Broker kraft performance storage Pull requests that target the storage module tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants