Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fatal error "out of order sequence number" makes idempotent producer not functional anymore #3584

Closed
7 tasks done
kenneth-jia opened this issue Oct 19, 2021 · 1 comment
Closed
7 tasks done

Comments

@kenneth-jia
Copy link
Contributor

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Description

We met some fatal error with idempotence producer after kafka cluster was down, -- the producer would not be able to send any more message from then on!
The problem still exists even with latest librdkafka release (v1.8.2) -- thus not the same problem with #3577

Here's the log while error occurred,

[2021-10-19 18:47:34.885129]EMERG KafkaProducer[producer-140691259498496] FATAL | [thrd:127.0.0.1:29092/bootstrap]: Fatal error: Broker: Broker received an out of order sequence number: ProduceRequest for topicForTest [0] with 1 message(s) failed due to sequence desynchronization with broker 1 (PID{Id:8000,Epoch:0}, base seq 0, idemp state change 1383467ms ago, last partition error NOT_LEADER_FOR_PARTITION (actions Refresh,MsgNotPersisted, base seq 0..0, base msgid 1, 107ms ago)
Met error: Broker: Broker received an out of order sequence number [45] fatal | ProduceRequest for topicForTest [0] with 1 message(s) failed due to sequence desynchronization with broker 1 (PID{Id:8000,Epoch:0}, base seq 0, idemp state change 1383467ms ago, last partition error NOT_LEADER_FOR_PARTITION (actions Refresh,MsgNotPersisted, base seq 0..0, base msgid 1, 107ms ago)
Exception thrown by producer: 2021-10-19 18:47:34.885264: Broker: Broker received an out of order sequence number [45] (/home/winner/Repo/modern-cpp-kafka-3/include/./kafka/KafkaProducer.h:435)

(However, it seems no related log from broker side)

How to reproduce

  1. Start the Kafka cluster (v2.8.0)
  2. Start an idempotent producer (with librdkafka v1.8.2), send a few messages and then wait
  3. Kill the zookeeper and Kafka brokers processes
  4. Wait ~20 min
  5. Re-start the Kafka cluster
  6. Try to send a message with the previous producer, and probably fail with the "out-of-sequence-number" fatal error, -- the producer is not functional anymore.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version: 1.8.2
  • Apache Kafka version: kafka_2.13-2.8.0
  • librdkafka client configuration: enable.idempotence=true
  • Operating system: ubuntu20.04 x64
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@Ralfers
Copy link

Ralfers commented Sep 29, 2022

Some of my own investigations lead me to believe that this is an issue with librdkafka. I did the same steps to reproduce as listed above, making 8 successful producer messages before the Kafka broker downtime.

I've attached a producer log snippet (producer_log_1.txt) and i also found this single error message on the kafka broker:
[2022-09-29 18:07:26,016] ERROR [ReplicaManager broker=0] Error processing append operation on partition <truncated>-1 (kafka.server.ReplicaManager) org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producer 5000 at offset 25 in partition<truncated>-1: 0 (incoming seq. number), 8 (current end sequence number)

After the ~20 minutes of downtime, it seems that librdkafka triggered some internal timer, indicating that the PID should be invalidated on the next producer message. When producing after the downtime, in the log snippet, you can see that librdkafka managed to change the PID to 5000 and assume that the message sequence number has been reset to 1:
changed PID{Invalid} -> PID{Id:5000,Epoch:0} with base MsgId 1.

However, the PID=5000 was also assigned to this producer before the downtime (not shown in the log snippets), and, in the Kafka broker error log above, you can clearly see that the broker still expected to continue the sequence numbers from 8 instead of 0 for the PID=5000.

So librdkafka, presumably, using some internal timer, has kept the old PID, but reset the sequence number while the broker has retained the original value after the downtime. I'm not really familiar with the librdkafka source code, so i'm struggling to follow this flow there.

Edit: When the Kafka broker downtime is less (~1 minute), then librdkafka successfully reconnects, obtaining the same PID and continuing from the sequence number it left off.

Some digging through the source code makes me believe that it falls into this if statement when it shouldn't, leading to a call to rd_kafka_toppar_reset_base_msgid, which resets the MsgId.

The only real way it can be true with the same PID is if the epoch's are different: https://github.com/edenhill/librdkafka/blob/158e83aea9416f7bd665c9429a262378f6b65ec1/src/rdkafka_proto.h#L586-L589

emasab added a commit that referenced this issue Sep 18, 2023
@emasab emasab closed this as completed Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants