Lots of error messages like: Cancelling subscription, and marking self as failed
with Invalid StartingSequenceNumber
since upgrading to 2.0
#391
Comments
That should not normally be happening. It indicates that the sequence number being used isn't for the shard. Is it consistently reporting that the sequence number is for the same other shard, e.g. is 301 always reported as 349?
If it fails it means you have invalid sequence numbers in your lease table. If it doesn't fail it would indicate that somehow the KCL is getting sequence numbers mixed up internally. |
If I run It's consistent with the mismatch for a specific kinesis stream across multiple application restarts - but it's not consistent between environments. We have another 80-shard kinesis stream that is in a different aws account running the same code that has mismatches between I dug through our log aggregation a bit more to identify the count per each unique ecs instance ID and also get the vpc IP address of the ECS host (though the IP address part shouldn't matter at all) It shows that some of the time the lease complaining happens to just one specific instance of the application, but other times multiple instances will have that same complaint. Here are the results from scanning over our logs for the past 7 days on two different AWS accounts using the same code with two similar kinesis streams (both 80 shards) From our first environment / aws account
From our second environment / aws account
One setting I have enabled in our scheduler is:
I can turn that off if you think it might be involved - though it seems like that should only be involved at startup |
You're not using the same application name for different streams? |
No, we have only one application using the KCL and it's only reading from the one stream. (Eventually we may add more applications processing this stream) We have this same set up running in two different accounts - and there is a different application name (and kinesis stream) used on each of the aws accounts running this. |
Skip shard sync won't cause this sort of issue. What version of Java are you using? The stack traces look like a newer JVM. A newer version of Java shouldn't cause something like this, but it's at least one dimension. Does the same behavior happen if you run the 1.9.1 version of the KCL against the stream? |
We're using Java 1.10. I could try to swap it out with 1.9 or 1.8 if that would help. We saw some issues in 1.9.1 where it seemed like the client binary was crashing / restarting at odd points which is part of what motivated us to go to 2.0 - the error messages we saw on 1.9.1 were more along the lines of:
but they weren't nearly as frequent. I think there were some other errors in our logs for clients prior to 2.0 about how the kcl processor was shutting down. I'll have to dig around more to find the specific log messages |
If trying out Java 1.8 is easy that would be the best. I don't know why Java 1.10 would be causing issues, and I can look at testing it in the future. The error you're seeing can occur on 2.0 as well. It means the worker/scheduler no longer holds the lease. The 1.x version of the KCL unfortunately overloaded the
This was confusing so 2.0 broke it out to two different methods:
|
An update on this issue. We haven't had a chance to deploy with Java 1.8 yet, but yesterday we did reshard one of the streams. The error was gone this morning, but looking at the logs the error went away around six or seven hours after we resharded, so it seems unlikely that resharding was what made the error go away. I checked the logs on our other account, where we did not reshard, and on that account the error also went away without us making any changes on our side. On the account that we didn't reshard, it went away a few hours before it went away on the other account. Maybe something changed on the AWS side with regard to those streams? We certainly didn't change anything on our side yesterday. |
Another update: I re-sharded that account that we didn't reshard yesterday and then the errors came back on that account but not on the other one. We're investigating some issues with messages not getting through our pipeline and this particular issue is our prime suspect. Looking at the shard-level metrics in cloudwatch, the shard that gets complained about ("It encodes X, while it was used in a call to a shard with Y") - shard Y shows it has incoming bytes but no outgoing bytes. Every other shard has both incoming and outgoing bytes. At the point this error has happened, what mitigation strategy do we have? I'm assuming it wouldn't have advanced any checkpointing forward - if we kill / restart any instance that shows those errors would another one get that lease and read those records? |
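To check whether a given shard is always paired with the same wrong shard, the warnings can be tallied by (encoded shard, called shard) pair. Below is a minimal sketch, assuming the log lines contain the wording quoted above with full `shardId-…` identifiers. The class name `ShardMismatchTally`, the regex, and the sample lines are illustrative only and are not part of the KCL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ShardMismatchTally {
    // Assumed message shape, based on the wording quoted in this thread.
    private static final Pattern MISMATCH = Pattern.compile(
            "It encodes (shardId-\\d+), while it was used in a call to a shard with (shardId-\\d+)");

    /** Counts how often each "encoded -> used" shard pair appears in the given log lines. */
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = MISMATCH.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1) + " -> " + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines; real input would come from the log aggregation export.
        List<String> sample = Arrays.asList(
                "WARN It encodes shardId-000000000438, while it was used in a call to a shard with shardId-000000000488",
                "INFO unrelated line",
                "WARN It encodes shardId-000000000438, while it was used in a call to a shard with shardId-000000000488");
        tally(sample).forEach((pair, n) -> System.out.println(pair + ": " + n));
    }
}
```

If the same pair dominates the tally across restarts, that points at a stored value (such as the lease table entry) rather than a transient mix-up.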
Are there any messages emitted from |
Seeing the same thing in our application since updating to use KCL 2.0. We have multiple shards that have data coming in, but none coming out. When we redeploy to our nodes, and the application is restarted, some of those shards will begin reading again from the last checkpoint. However, the issue will often reappear on different shards. Here is a graph showing one of the shards not reading: Below is the error we are seeing from the FanOutRecordsPublisher:
|
Just opened a PR that will add some logging around the initialization phase. I will merge the PR in a bit, if possible could you try out the new version. If you do you need to enable debug logging for the following classes:
This will log the sequence number returned from DynamoDB, handled by the InitializeTask, and provided to the FanOutRecordsPublisher for initialization. |
My short term workaround was going to be adding a |
Are you comfortable building from the source or do you need a release to be available in Central? |
I can build from source and put it into my internal artifact server, no big deal |
The change is now in master, if you can build it. You will need to enable debug logging for the classes I listed above. This should log the sequence number retrieved from DynamoDB, and used to initialize the You should see something like:
When the lease is first acquired. If the problem is occurring at that point you should get the error messages shortly afterwards. Thanks for your help in tracking this down. |
@pfifer would it be possible to get this available through Central? |
We normally don't publish snapshots to Central, but I can investigate doing so tomorrow. Otherwise it will need to be a full release, and it would take a little bit more time. |
Here are some logs: (Note that the stacktraces might be in a weird order - I exported the logs from my log aggregation tool and it puts them in reverse chronological order - so I used a tail -r on the exported logs to get them back in normal chronological order but that probably did strange things to multi-line messages like stack traces.)
|
The value stored in the lease table is a sequence number for shard 438, and not for 488. We would need to find the last successful checkpoint that stored the sequence number to figure out who stored it. Do you run the checkpoint on the processing thread, or do you execute the checkpoint on another thread? I'm looking to see if we can add some additional validation to checkpointing. |
If you can, check whether that is the sequence number in the DynamoDB table. If it isn't, it makes me think the lease state in the KCL is getting mixed up. |
I'm not explicitly calling it on another thread, inside my
it calls: checkpointIfNecessary(processRecordsInput.checkpointer(), record); passing in the checkpointer from the supplied into this method:

    private void checkpointIfNecessary(RecordProcessorCheckpointer checkpointer, KinesisClientRecord record) {
        if (System.currentTimeMillis() > nextCheckpointTimeInMillis) {
            checkpoint(checkpointer, record);
            nextCheckpointTimeInMillis = System.currentTimeMillis() + CHECKPOINT_INTERVAL_MILLIS;
        }
    }

    /**
     * Checkpoint with retries.
     *
     * @param checkpointer checkpointer supplied with the current batch of records
     * @param record record whose sequence number is checkpointed
     */
    private void checkpoint(RecordProcessorCheckpointer checkpointer, KinesisClientRecord record) {
        log.info("Checkpointing shard {} at sequenceNumber {}, subSequenceNumber {}", kinesisShardId, record.sequenceNumber(), record.subSequenceNumber());
        for (int i = 0; i < NUM_RETRIES; i++) {
            try {
                checkpointer.checkpoint(record.sequenceNumber(), record.subSequenceNumber());
                break;
            } catch (ShutdownException se) {
                // Ignore checkpoint if the processor instance has been shutdown (fail over).
                log.info("Caught shutdown exception, skipping checkpoint.", se);
                break;
            } catch (ThrottlingException e) {
                // Backoff and re-attempt checkpoint upon transient failures
                if (i >= (NUM_RETRIES - 1)) {
                    log.error("Checkpoint failed after " + (i + 1) + " attempts.", e);
                    break;
                } else {
                    log.info("Transient issue when checkpointing - attempt " + (i + 1) + " of "
                            + NUM_RETRIES, e);
                }
            } catch (InvalidStateException e) {
                // This usually indicates an issue with the DynamoDB table (check for table, provisioned IOPS).
                log.error("Cannot save checkpoint to the DynamoDB table used by the Amazon Kinesis Client Library.", e);
                break;
            }
            try {
                Thread.sleep(BACKOFF_TIME_IN_MILLIS);
            } catch (InterruptedException e) {
                log.debug("Interrupted sleep", e);
            }
        }
    }

The biggest difference between the checkpointing that I have there and the sample is I'm explicitly passing in a sequence number to the checkpointer - but the checkpointer I'm using came from The code also does checkpoints at the shardEnd / shutdownRequested points:

    @Override
    public void shardEnded(ShardEndedInput shardEndedInput) {
        try {
            log.info("Shard ended");
            shardEndedInput.checkpointer().checkpoint();
        } catch (ShutdownException | InvalidStateException e) {
            log.error("Error checkpointing after shard ended input", e);
        }
    }

    @Override
    public void shutdownRequested(ShutdownRequestedInput shutdownRequestedInput) {
        try {
            shutdownRequestedInput.checkpointer().checkpoint();
        } catch (ShutdownException | InvalidStateException e) {
            log.error("Error checkpointing after shutdown was requested", e);
        }
    }
|
In dynamo for that shardId-000000000488 I have:
{
  "checkpoint": {
    "S": "49588160296506554238249968084872546083315933702822828898"
  },
  "checkpointSubSequenceNumber": {
    "N": "0"
  },
  "leaseCounter": {
    "N": "1637"
  },
  "leaseKey": {
    "S": "shardId-000000000488"
  },
  "leaseOwner": {
    "S": "fa7c2ba8-8eca-4a6b-b7fc-d132afa3e4cc"
  },
  "ownerSwitchesSinceCheckpoint": {
    "N": "5"
  },
  "parentShardId": {
    "SS": [
      "shardId-000000000330"
    ]
  }
}
which matches what was in that line in the logs earlier today:
(nothing has checkpointed it to a newer value even though the app is still running / processing things) |
That sequence number is for shard 438. That means it did checkpoint on the wrong sequence number. I'm not sure how that happened at this time. I'm working on a change that will validate sequence numbers before allowing a checkpoint to occur. If you happened to log the sequence number when you checkpoint that would help to track this down. Concurrently I'm looking to add a validator that should prevent/detect checkpointing with a sequence number that isn't part of the shard. |
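An application can also keep a rough guard of its own around checkpoint calls. The sketch below is one hypothetical way to do that: each record processor remembers the range of sequence numbers it has actually been handed and refuses to checkpoint anything outside that range. `SequenceNumberGuard` and its methods are invented for illustration and are not KCL classes. The only assumption is that sequence numbers are decimal strings that increase within a shard. It cannot tell which shard a number encodes, so it would not catch records from another shard that were actually delivered to the processor.

```java
import java.math.BigInteger;

final class SequenceNumberGuard {
    private BigInteger lowestSeen;
    private BigInteger highestSeen;

    /** Call for every record handed to this processor. */
    synchronized void observe(String sequenceNumber) {
        BigInteger n = new BigInteger(sequenceNumber);
        if (lowestSeen == null || n.compareTo(lowestSeen) < 0) {
            lowestSeen = n;
        }
        if (highestSeen == null || n.compareTo(highestSeen) > 0) {
            highestSeen = n;
        }
    }

    /** True only if the number lies within the range this processor has observed. */
    synchronized boolean mayCheckpoint(String sequenceNumber) {
        if (lowestSeen == null) {
            return false; // nothing observed yet, so nothing is safe to checkpoint
        }
        BigInteger n = new BigInteger(sequenceNumber);
        return n.compareTo(lowestSeen) >= 0 && n.compareTo(highestSeen) <= 0;
    }

    public static void main(String[] args) {
        SequenceNumberGuard guard = new SequenceNumberGuard();
        // Sequence number taken from the lease table entry shown earlier in this thread.
        String seen = "49588160296506554238249968084872546083315933702822828898";
        guard.observe(seen);
        System.out.println(guard.mayCheckpoint(seen));
        System.out.println(guard.mayCheckpoint("1"));
    }
}
```

A processor would call `observe` for each record in a batch and check `mayCheckpoint` (logging loudly on refusal) before handing a sequence number to the checkpointer.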
I do have logs when it was checkpointing - first showing it checkpointing 488 with that sequence number and then showing it checkpointing 438 with the same sequence number:
then later:
followed by:
It seems that's the same thread (shardProcessor-0065) but with different shardIds. At first this seemed odd to me, but looking at the logs that thread seems to use several different ShardProcessors - and checkpoints other shards like:
|
FWIW, we recreated our stack last night and restarted the KCL 2.0 application for the stream. Everything was working as expected until we performed another deployment and the application was restarted again. Shortly after restart, we saw warnings and errors similar to what @ryangardner is seeing:
|
Yes, I'm going to start looking at it right now. Edit: Adjusted grammar, since apparently I'm still tired. |
…cribed in awslabs/amazon-kinesis-client#391" This reverts commit 1b2458d.
The SDK team has released 2.0.6 of the SDK, which I've been testing since last night. We will be doing some additional validation today, and a release of the KCL should be coming soon. |
We've just released version 2.0.3 of the Amazon Kinesis Client which should solve this issue. That release includes the AWS SDK 2.0.6 release, along with other fixes for this issue. Can you try it out, and let us know if you're still seeing issues? |
@pfifer I will push out a release tonight or tomorrow. Is the following configuration still needed/recommended for the Kinesis Async Client?
|
@athielen2 Using If you need to configure the |
@pfifer been running around 36 hours and have not seen any errors dealing with invalid sequence numbers. Thank you for your work on the issue |
I'm still seeing this message with version 2.0.3: {"level":"ERROR","logger_name":"software.amazon.kinesis.lifecycle.ShardConsumer","message":"shardId-000000000000: Last request was dispatched at 2018-10-11T16:24:46.671Z, but no response as of 2018-10-11T16:25:21.902Z (PT35.231S). Cancelling subscription, and restarting.","logger_timestamp":"2018-10-11T16:25:21.902Z"} |
@vanhoale That indicates that data didn't arrive after 35 seconds, so the next question is did something go wrong ahead of it. Couple of questions:
|
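The `PT35.231S` in that message is an ISO-8601 duration: the gap between the time the last request was dispatched and the time of the check. A small sketch with `java.time` reproduces it from the two timestamps in the log line above. The class name `RequestGap` is only for illustration.

```java
import java.time.Duration;
import java.time.Instant;

final class RequestGap {
    /** Parses two ISO-8601 instants and returns the time elapsed between them. */
    static Duration between(String dispatchedAt, String checkedAt) {
        return Duration.between(Instant.parse(dispatchedAt), Instant.parse(checkedAt));
    }

    public static void main(String[] args) {
        Duration gap = between("2018-10-11T16:24:46.671Z", "2018-10-11T16:25:21.902Z");
        System.out.println(gap);            // prints PT35.231S
        System.out.println(gap.toMillis()); // prints 35231
    }
}
```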
@pfifer there is no message from FanOutRecordsPublisher before the message I posted, and I'm running with KCL 2.0.3 and AWS SDK 2.0.6 |
I also got those messages: {"level":"ERROR","logger_name":"com.amazonaws.services.kinesis.producer.LogInputStreamReader","message":"[2018-10-11 12:57:59.222613] [0x00001336][0x000070000c266000] [error] AWS Log: ERRORCurl returned error code 28","logger_timestamp":"2018-10-11T17:57:59.222Z"} |
never mind, that is kinesis producer |
Hi folks, interestingly I stumbled on to this post myself as I am also experiencing a similar issue as described above... Pulling down 2.0.5 of KCL to test tonight. The issue occurs several hours in to running. Unlike what is/was described above, I use the Enhanced Fan out of KCL with only 1 shard... Throughout its run, I get these errors sporadically: 18/11/15 23:26:38 WARN channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. But, about 8 hours in to running I get a series of errors like so; 18/11/15 23:26:38 WARN channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. Going to try 2.0.5 tonight as I need to plan for an outage. My KCL configuration is rather simple, it just follows the KCL 1.0 to 2.0 migration guide in that it has; new Scheduler( Any tips or tricks much appreciated. Chris |
Hi @pfifer , I'm on
and also
and finally:
|
The version of the underlying AWS SDK is also important. 2.0.5 transitively brings in the correct version with the fix. Which version of the 2.0 AWS SDK are you using?
…On Fri, Nov 16, 2018, 4:57 PM Majid Fatemian ***@***.***> wrote:
Hi @pfifer <https://github.com/pfifer> , I'm on 2.0.5 and I still get
lots of those warnings in my tests.
I have 4 shards in my stream and my application is running in 2 ECS
instances benefiting from enhanced fanout.
{
"timestamp": "2018-11-16T21:15:11.016Z",
"logger": "software.amazon.kinesis.lifecycle.ShardConsumer",
"exceptionClass": "software.amazon.kinesis.retrieval.RetryableRetrievalException",
"stackTrace": "software.amazon.kinesis.retrieval.RetryableRetrievalException: ReadTimeout
at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.errorOccurred(FanOutRecordsPublisher.java: 142)
at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.access$700(FanOutRecordsPublisher.java: 51)
at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher$RecordFlow.exceptionOccurred(FanOutRecordsPublisher.java: 516)
at software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient.lambda$subscribeToShard$1(DefaultKinesisAsyncClient.java: 2102)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java: 760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java: 736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java: 474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java: 1977)
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java: 155)
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java: 121)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java: 822)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java: 797)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java: 474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java: 1977)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$Completable.lambda$completeExceptionally$1(MakeAsyncHttpRequestStage.java: 208)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java: 1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 624)
at java.lang.Thread.run(Thread.java: 748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java: 97)
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java: 143)
... 9 more
Caused by: io.netty.handler.timeout.ReadTimeoutException",
"thread": "ShardRecordProcessor-0002",
"exceptionMessage": "ReadTimeout",
"message": "ShardConsumer: shardId-000000000003: onError(). Cancelling subscription, and marking self as failed.",
"logLevel": "WARN"
}
and also
{
"timestamp": "2018-11-16T21:19:03.243Z",
"logger": "software.amazon.awssdk.http.nio.netty.internal.RunnableRequest",
"exceptionClass": "io.netty.channel.ConnectTimeoutException",
"stackTrace": "io.netty.channel.ConnectTimeoutException: connection timed out: monitoring.us-east-1.amazonaws.com/ip.ip.ip.ip:443
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at java.lang.Thread.run(Thread.java:748)",
"thread": "aws-java-sdk-NettyEventLoop-2-0",
"exceptionMessage": "connection timed out: monitoring.us-east-1.amazonaws.com/ip.ip.ip.ip:443",
"message": "RunnableRequest: Failed to create connection to https://monitoring.us-east-1.amazonaws.com",
"logLevel": "ERROR"
}
and finally:
{
"timestamp": "2018-11-16T21:16:08.397Z",
"logger": "software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher",
"thread": "ShardRecordProcessor-0001",
"message": "FanOutRecordsPublisher: shardId-000000000003: (FanOutRecordsPublisher/Subscription#request) - Rejected an attempt to request(6), because subscribers don't match.",
"logLevel": "WARN"
}
|
Upgrading KCL to 2.0.5 and the AWS libraries (to preview 13) seems to have removed most of the errors since rebooting...
Except for about 10 or so of the following errors: |
@ryangardner That is correct. I'm using AWS SDK Bundle 2.0.6:
|
@cdfleischmann, I'm experiencing the same issue here: aws/aws-sdk-java-v2#1039 |
I'm getting cancelled subscriptions from 500 errors followed seconds later by rate limit errors then Scheduler Thread dies and I stop consuming. Setup: Dependencies: dependencies {
implementation 'org.slf4j:slf4j-api:1.7.25'
implementation 'software.amazon.awssdk:s3:2.4.0'
implementation 'software.amazon.kinesis:amazon-kinesis-client:2.1.0'
implementation 'joda-time:joda-time:2.10.1'
implementation 'io.projectreactor:reactor-core'
implementation 'ch.qos.logback:logback-classic:1.2.3'
dependencyManagement {
imports {
mavenBom 'software.amazon.awssdk:bom:2.4.0'
mavenBom 'io.projectreactor:reactor-bom:Californium-SR4'
}
}
} Logged Errors/Warnings: 21:18:44.260 [aws-java-sdk-NettyEventLoop-1-2] WARN s.a.k.r.f.FanOutRecordsPublisher - shardId-000000000000: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2019-02-05T21:15:46.446Z id: shardId-000000000000-12 -- software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: e2b1fe5b-c084-8615-b662-0119421a89d8) software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: e2b1fe5b-c084-8615-b662-0119421a89d8)
21:18:46.781 [ShardRecordProcessor-0001] WARN s.a.k.lifecycle.ShardConsumer - shardId-000000000000: onError(). Cancelling subscription, and marking self as failed. Caused by: software.amazon.awssdk.services.kinesis.model.LimitExceededException: Rate exceeded for consumer arn:<ARN REDACTED> and shard shardId-000000000000 (Service: Kinesis, Status Code: 400, Request ID: d8c8448d-855f-b67b-8c1b-bb7b07c1b9b6)
at software.amazon.awssdk.services.kinesis.model.LimitExceededException$BuilderImpl.build(LimitExceededException.java:118)
at software.amazon.awssdk.services.kinesis.model.LimitExceededException$BuilderImpl.build(LimitExceededException.java:78)
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.unmarshall(AwsJsonProtocolErrorUnmarshaller.java:86)
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.handle(AwsJsonProtocolErrorUnmarshaller.java:62)
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.handle(AwsJsonProtocolErrorUnmarshaller.java:41)
at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter.lambda$prepare$0(SyncResponseHandlerAdapter.java:85)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter$BaosSubscriber.onComplete(SyncResponseHandlerAdapter.java:127)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:164)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.access$700(ResponseHandler.java:64)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:274)
.... truncated ... Caused by: software.amazon.awssdk.services.kinesis.model.LimitExceededException: Rate exceeded for consumer arn:<ARN REDACTED> and shard shardId-000000000000 (Service: Kinesis, Status Code: 400, Request ID: d8c8448d-855f-b67b-8c1b-bb7b07c1b9b6)
at software.amazon.awssdk.services.kinesis.model.LimitExceededException$BuilderImpl.build(LimitExceededException.java:118)
at software.amazon.awssdk.services.kinesis.model.LimitExceededException$BuilderImpl.build(LimitExceededException.java:78)
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.unmarshall(AwsJsonProtocolErrorUnmarshaller.java:86)
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.handle(AwsJsonProtocolErrorUnmarshaller.java:62)
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.handle(AwsJsonProtocolErrorUnmarshaller.java:41)
at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter.lambda$prepare$0(SyncResponseHandlerAdapter.java:85)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter$BaosSubscriber.onComplete(SyncResponseHandlerAdapter.java:127)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:164)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.access$700(ResponseHandler.java:64)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:274)
... truncated ... What is the recommended way to reliably handle issues like this for production code? I have to restart my service when this happens, which is currently in a testing environment. Never had issues like this before trying enhanced fanout with new AWS SDK 2.0. |
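One stopgap mentioned earlier in this thread is simply restarting the consumer when it stops. A rough sketch of doing that in-process follows: a loop that rebuilds and reruns a task when it throws, with a growing delay between attempts. Everything here (`RestartingRunner`, the `Supplier<Runnable>` factory, the limits) is hypothetical. In a KCL application the factory would have to build a fresh scheduler each time, and nothing in this thread confirms that an in-process restart clears the failure described above. It also only reacts to a task that throws, not to one that returns quietly.

```java
import java.util.function.Supplier;

final class RestartingRunner {
    /**
     * Runs tasks from the factory until one returns normally or maxRestarts is used up.
     * Returns the number of restarts that were needed.
     */
    static int runWithRestarts(Supplier<Runnable> factory, int maxRestarts, long initialBackoffMillis) {
        long backoff = initialBackoffMillis;
        int restarts = 0;
        while (true) {
            try {
                factory.get().run(); // blocks until the task finishes or throws
                return restarts;
            } catch (RuntimeException e) {
                if (restarts >= maxRestarts) {
                    throw e; // give up and surface the last failure
                }
                restarts++;
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
                backoff = Math.min(backoff * 2, 60_000L); // double the delay, capped at one minute
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in task that fails twice and then completes.
        int restarts = runWithRestarts(() -> () -> {
            if (++calls[0] <= 2) {
                throw new IllegalStateException("simulated failure");
            }
        }, 5, 10L);
        System.out.println("restarts: " + restarts);
    }
}
```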
The original issue has been resolved with 2.0.3 version of the KCL. Resolving the issue. Please feel free to open new issues for newer bugs seen. |
Release 2.2.2 has a fix for Invalid StartingSequenceNumber. Continue reading for other fixes in this release.
|
In the last three hours, I've had these errors / warnings with the following combinations of encoded shardId's and the shard ID it claims it was used with:
(all shard IDs have the format shardId-000000000301, so I'm omitting the first part)
My stream has not been resharded recently and has 80 shards.
Question: Is this something that happens normally as part of the KCL client behavior, or does this indicate something I should be concerned about? (And if it is considered to be normal behavior, can the logs be changed to something other than warn-level, without logging the full stack trace?)
If this is something I should be worried about - what might be causing this?