Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuning AWS Rust SDK HTTP request settings for latency-aware Amazon DynamoDB applications #558

Closed
ymwjbxxq opened this issue Jun 19, 2022 · 20 comments
Assignees
Labels
bug This issue is a bug.

Comments

@ymwjbxxq
Copy link

ymwjbxxq commented Jun 19, 2022

Describe the bug

As per discussion:
#557

The DynamoDB SDK does not retry based on the with_max_attempts, but its exits at the first failure.

The workaround is to put the queries inside a custom code:

let mut retry_count = 1;
while retry_count < 5 {
     let result = RunQuery().await;
     if let Ok(output) = result {
                //do something or break
      } else {
          retry_count += 1;
      }
}

Expected Behavior

If I set a timeout at 30ms and 5 retries, for example

I expect the Lambda Duration to be between > 30ms and 150ms based on retries and not exiting immediately.

Current Behavior

The SDK does not retry at all. I think it is for all the clients and not just DynamoDB

Reproduction Steps

set the following configuration:

let timeout_config = timeout::Config::new()
        .with_api_timeouts(
            timeout::Api::new()
                // This timeout acts at the "Request to a service" level. When the SDK makes a request to a
                // service, that "request" can contain several HTTP requests. This way, you can retry
                // failures that are likely spurious, or refresh credentials.
                .with_call_timeout(TriState::Set(Duration::from_millis(30)))
                // This timeout acts at the "HTTP request" level and sets a separate timeout for each
                // HTTP request made as part of a "service request."
                .with_call_attempt_timeout(TriState::Set(Duration::from_millis(30))),
        )
        .with_http_timeouts(
            timeout::Http::new()
                // A limit on the amount of time an application takes to attempt to read the first byte over
                // an established, open connection after a write request.
                // Also known as the "time to first byte" timeout.
                .with_read_timeout(TriState::Set(Duration::from_millis(10)))
                // A time limit for completing the connect-handshake. The time starts when
                // making an initial connect attempt on a socket.
                .with_connect_timeout(TriState::Set(Duration::from_millis(10))),
        );
    let config = aws_config::from_env()
        .retry_config(RetryConfig::new().with_max_attempts(10))
        .timeout_config(timeout_config)
        .load()
        .await;

    let dynamodb_client = aws_sdk_dynamodb::Client::new(&config);

you will get this message:
request has timed out: API call (single attempt) timeout occurred after 30ms.

But you will notice that the Lambda Duration is 32ms, so the SDK did not retry.

Possible Solution

Implement retries with exponential backoff like the other AWS-SDK

Additional Information/Context

Libs:

aws-sdk-dynamodb = "0.13.0"
lambda_http = "0.5.0"
lambda_runtime = "0.5.1"
rustc 1.61.0 

Version

cargo tree | grep aws- ├── aws-config v0.13.0 │   ├── aws-http v0.13.0 │   │   ├── aws-smithy-http v0.43.0 │   │   │   ├── aws-smithy-types v0.43.0 │   │   ├── aws-smithy-types v0.43.0 (*) │   │   ├── aws-types v0.13.0 │   │   │   ├── aws-smithy-async v0.43.0 │   │   │   ├── aws-smithy-client v0.43.0 │   │   │   │   ├── aws-smithy-async v0.43.0 (*) │   │   │   │   ├── aws-smithy-http v0.43.0 (*) │   │   │   │   ├── aws-smithy-http-tower v0.43.0 │   │   │   │   │   ├── aws-smithy-http v0.43.0 (*) │   │   │   │   ├── aws-smithy-protocol-test v0.43.0 │   │   │   │   ├── aws-smithy-types v0.43.0 (*) │   │   │   ├── aws-smithy-http v0.43.0 (*) │   │   │   ├── aws-smithy-types v0.43.0 (*) │   ├── aws-sdk-sso v0.13.0 │   │   ├── aws-endpoint v0.13.0 │   │   │   ├── aws-smithy-http v0.43.0 (*) │   │   │   ├── aws-types v0.13.0 (*) │   │   ├── aws-http v0.13.0 (*) │   │   ├── aws-sig-auth v0.13.0 │   │   │   ├── aws-sigv4 v0.13.0 │   │   │   │   ├── aws-smithy-http v0.43.0 (*) │   │   │   ├── aws-smithy-http v0.43.0 (*) │   │   │   ├── aws-types v0.13.0 (*) │   │   ├── aws-smithy-async v0.43.0 (*) │   │   ├── aws-smithy-client v0.43.0 (*) │   │   ├── aws-smithy-http v0.43.0 (*) │   │   ├── aws-smithy-http-tower v0.43.0 (*) │   │   ├── aws-smithy-json v0.43.0 │   │   │   └── aws-smithy-types v0.43.0 (*) │   │   ├── aws-smithy-types v0.43.0 (*) │   │   ├── aws-types v0.13.0 (*) │   ├── aws-sdk-sts v0.13.0 │   │   ├── aws-endpoint v0.13.0 (*) │   │   ├── aws-http v0.13.0 (*) │   │   ├── aws-sig-auth v0.13.0 (*) │   │   ├── aws-smithy-async v0.43.0 (*) │   │   ├── aws-smithy-client v0.43.0 (*) │   │   ├── aws-smithy-http v0.43.0 (*) │   │   ├── aws-smithy-http-tower v0.43.0 (*) │   │   ├── aws-smithy-query v0.43.0 │   │   │   ├── aws-smithy-types v0.43.0 (*) │   │   ├── aws-smithy-types v0.43.0 (*) │   │   ├── aws-smithy-xml v0.43.0 │   │   ├── aws-types v0.13.0 (*) │   ├── aws-smithy-async v0.43.0 (*) │   ├── aws-smithy-client v0.43.0 (*) │   ├── aws-smithy-http v0.43.0 (*) │   ├── aws-smithy-http-tower v0.43.0 (*) │   ├── aws-smithy-json v0.43.0 (*) │   ├── aws-smithy-types v0.43.0 (*) │   ├── aws-types v0.13.0 (*) ├── aws-sdk-cloudfront v0.13.0 │   ├── aws-endpoint v0.13.0 (*) │   ├── aws-http v0.13.0 (*) │   ├── aws-sig-auth v0.13.0 (*) │   ├── aws-smithy-async v0.43.0 (*) │   ├── aws-smithy-client v0.43.0 (*) │   ├── aws-smithy-http v0.43.0 (*) │   ├── aws-smithy-http-tower v0.43.0 (*) │   ├── aws-smithy-types v0.43.0 (*) │   ├── aws-smithy-xml v0.43.0 (*) │   ├── aws-types v0.13.0 (*) ├── aws-sdk-dynamodb v0.13.0 │   ├── aws-endpoint v0.13.0 (*) │   ├── aws-http v0.13.0 (*) │   ├── aws-sig-auth v0.13.0 (*) │   ├── aws-smithy-async v0.43.0 (*) │   ├── aws-smithy-client v0.43.0 (*) │   ├── aws-smithy-http v0.43.0 (*) │   ├── aws-smithy-http-tower v0.43.0 (*) │   ├── aws-smithy-json v0.43.0 (*) │   ├── aws-smithy-types v0.43.0 (*) │   ├── aws-types v0.13.0 (*) ├── aws-sdk-sqs v0.13.0 │   ├── aws-endpoint v0.13.0 (*) │   ├── aws-http v0.13.0 (*) │   ├── aws-sig-auth v0.13.0 (*) │   ├── aws-smithy-async v0.43.0 (*) │   ├── aws-smithy-client v0.43.0 (*) │   ├── aws-smithy-http v0.43.0 (*) │   ├── aws-smithy-http-tower v0.43.0 (*) │   ├── aws-smithy-query v0.43.0 (*) │   ├── aws-smithy-types v0.43.0 (*) │   ├── aws-smithy-xml v0.43.0 (*) │   ├── aws-types v0.13.0 (*) ├── aws-smithy-http v0.43.0 (*) ├── aws-smithy-types v0.43.0 (*) ├── aws-types v0.13.0 (*) ├── aws-smithy-client v0.43.0 (*)

Environment details (OS name and version, etc.)

Mac

Logs

ERROR lambda_runtime: SdkError("request has timed out: API call (single attempt) timeout occurred after 30ms")

@ymwjbxxq ymwjbxxq added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 19, 2022
@rcoh
Copy link
Contributor

rcoh commented Jun 20, 2022

interesting! what operation are you running and what error are you seeing? The SDK doesn't consider every error to be retryable, it's possible that that is the issue? Seeing more logs would also be helpful, could you include the output of aws_smithy_client=trace?

@rcoh rcoh removed the needs-triage This issue or PR still needs to be triaged. label Jun 20, 2022
@ymwjbxxq
Copy link
Author

interesting! what operation are you running and what error are you seeing? The SDK doesn't consider every error to be retryable, it's possible that that is the issue? Seeing more logs would also be helpful, could you include the output of aws_smithy_client=trace?

I am using a GetItem and GetBatchItem.

Where should I put aws_smithy_client=trace? It is a Lambda :)

I intercept the DynamoDB error in this way:

impl<E> From<SdkError<E>> for ApplicationError
where
    E: error::Error,
{
    fn from(value: SdkError<E>) -> ApplicationError {
        ApplicationError::SdkError(format!("{}", value))
    }
}

So the message that is coming out is:
"request has timed out: API call (single attempt) timeout occurred after XXms"

@rcoh
Copy link
Contributor

rcoh commented Jun 20, 2022

ah I see—timeouts are not currently considered retryable. Let me follow up and see if that's considered correct or not. Generally we are very hesitant to retry something that could introduce widespread service degradation by increasing the load.

@ymwjbxxq
Copy link
Author

ah, I see—timeouts are not currently considered retryable. Let me follow up and see if that's considered correct or not. Generally, we are very hesitant to retry something that could introduce widespread service degradation by increasing the load.

Other SDKs do it like GO, Node I think all supported retries. Could you tell me what kind of retries are supported?

@rcoh
Copy link
Contributor

rcoh commented Jun 20, 2022

The SDK will retry if DynamoDB, e.g. responds with something like ThrottlingError–the difference here is that you aren't getting an error from dynamo, you're making a client side timeout and immediately trying again

@ymwjbxxq
Copy link
Author

ymwjbxxq commented Jun 20, 2022

The best practice from AWS is to set a lower timeout to force the SDK to retry to avoid degradation in performance that you are getting out of the box using the default timeout.

It is one of the principal for getting better performance.

Many SDK for other runtime have all default timeouts and retries.
The SDK timeout is often a 2minutes, which is a degradation of performance. AWS in fact advice to set a lower timeout and let the SDK retry

Node:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Config.html
Golang:
https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/retries-timeouts/

I am sure this retry works out of the box. I used it all the time :)

I am expecting that timeout is also included because it is pretty logical...AWS network has variable latency, and I know that DynamoDB should respond under the 20ms, so instead of waiting 300ms, I force the timeout for the HTTP request that SDK does to get a quick response.

I think this will not bring any surprise to anyone, but maybe if you guys are afraid maybe, you can add new:
.retry_config specifically for Timeout case or something to include it.

Currently, you do not timeout any call. Instead, you just let it run until maybe the Lambda is timing out (this will hide the real problem), and so this is worse than not having anything.
For example, as default, you can setup 2 minutes and 3 retries like the other SDK, and as a user, I can setup my client.

In node, for example, is like this:

const options: ServiceConfigurationOptions = {
  region: process.env.AWS_REGION,
  maxRetries: 10,
  httpOptions: {
    timeout: 100, //The number of milliseconds a request can take before automatically being terminated. Defaults to two minutes (120000).
  },
};

@rcoh
Copy link
Contributor

rcoh commented Jun 20, 2022

you're right—I think this is exactly how the Java SDK behaves today as well. This is definitely a bug, and one I hope to get a fix out for shortly

@rcoh rcoh moved this to In Progress in AWS Rust SDK Public Roadmap Jun 21, 2022
@rcoh rcoh added feature-request A feature should be added or improved. bug This issue is a bug. and removed bug This issue is a bug. feature-request A feature should be added or improved. labels Jun 23, 2022
@rcoh rcoh closed this as completed Jun 23, 2022
Repository owner moved this from In Progress to Done in AWS Rust SDK Public Roadmap Jun 23, 2022
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@ymwjbxxq
Copy link
Author

ymwjbxxq commented Jun 26, 2022

@rcoh

I just tested it out and still failing (aws-sdk-dynamodb = "0.14.0")
ERROR lambda_runtime: SdkError("request has timed out: API call (all attempts including retries) timeout occurred after 30ms")

Notice the Lambda Duration:
REPORT RequestId: ebb193e5-9609-4579-864e-1638cc4610e4 Duration: 32.26 ms

@rcoh
Copy link
Contributor

rcoh commented Jun 26, 2022

I think you need to remove the "with call timeout and only keep the call attempt" timeout

The intention is that the timeout on the entire operation including timeouts is not retryable–this is a hard limit including retries

@rcoh
Copy link
Contributor

rcoh commented Jun 26, 2022

(You'll note the error message has changed as well vs. the original)

We could improve these errors though to clearly describe the action you'd need to take

@ymwjbxxq
Copy link
Author

ymwjbxxq commented Jun 27, 2022

Hi @rcoh,

after a good 9h of testing, I can tell this:

  • the fix works

But the overall results are much worse now because, for example, it retries 10 times each 30ms.

I have my doubts about the with_call_timeout. You said that this is t the limit and it is not retrievable.

The documentation said:

// This timeout acts at the "Request to a service" level. When the SDK requests to a
// service, that "request" can contain several HTTP requests. This way, **you can retry**
// failures that are likely spurious or refresh credentials.
with_call_timeout 

If I have a BatchOperation like a Get, it can retry ten times for each Get, but I would like to set if several HTTP requests (with_call_timeout) timeout overall at 100ms, retry.

It seems to me that you should implement the retries also for with_call_timeout.

Moreover, there is no way to track the retries now. It would be helpful to enable some parameters DEBUG: true to see logs for the retries. For example:

  • retry 1
  • retry 2

I would be curious to know how many retries when I have high latency there are and the timestamp of the request because I think Retries with exponential backoff is also not implemented in the SDK

@jdisanti
Copy link
Contributor

Retry information is currently logged via the tracing library. If you have a subscriber at the debug level, you should be able to observe this event occurring when there is a retry: https://github.com/awslabs/smithy-rs/blob/a0539e20b069a7de021c84521d8f3c7ba098ad6d/rust-runtime/aws-smithy-client/src/retry.rs#L317-L322

An easy way to observe this is by using the tracing-subscriber with the fmt feature enabled, adding tracing_subscriber::fmt::init(); at the beginning of your program, and setting the RUST_LOG environment variable to debug.

The call timeout is intended to be a non-retryable timeout for the overall operation. Basically, it means, "stop everything after this time no matter what." The attempt timeout is intended to be retried. There may be several attempts for a single call.

Retry with exponential backoff is implemented, and I've observed it working in a real world project (I wasn't using timeouts, though).

But the overall results are much worse now because, for example, it retries 10 times each 30ms.

Can you explain what you mean by this a little more?

@ymwjbxxq
Copy link
Author

ymwjbxxq commented Jun 27, 2022

Retry information is currently logged via the tracing library. If you have a subscriber at the debug level, you should be able to observe this event occurring when there is a retry: https://github.com/awslabs/smithy-rs/blob/a0539e20b069a7de021c84521d8f3c7ba098ad6d/rust-runtime/aws-smithy-client/src/retry.rs#L317-L322.

An easy way to observe this is by using the tracing-subscriber with the fmt feature enabled, adding tracing_subscriber::fmt::init(); at the beginning of your program, and setting the RUST_LOG environment variable to debug.

The call timeout is intended to be a non-retryable timeout for the overall operation. Basically, it means, "stop everything after this time no matter what." The attempt timeout is intended to be retried. There may be several attempts for a single call.

Retry with exponential backoff is implemented, and I've observed it working in a real world project (I wasn't using timeouts, though).

But the overall results are much worse now because, for example, it retries 10 times each 30ms.

Can you explain what you mean by this a little more?

First of thank you for the answer.

I had already RUST_LOG in my Lambda, so good to know that I can do the debug. I will let it run and see if I will see something in CW

tracing_subscriber::fmt()
        .with_ansi(false)
        .without_time()
        .init();

The idea behind setting the timeout is to avoid DynamoDB to run for seconds to return something (bytes)

So without the with_call_timeout when (often) DynamoDB does not respond quickly, I saw the duration of Lambda going so high, like > 5s, to load bytes from a query made with PK and SK.

The idea behind it is for me to return as fast as possible, and so I need the with_call_timeout, and in fact, I end up having:

let timeout_config = timeout::Config::new()
       .with_api_timeouts(
           timeout::Api::new()
               .with_call_timeout(TriState::Set(Duration::from_millis(95)))
               .with_call_attempt_timeout(TriState::Set(Duration::from_millis(30))),
       )
       .with_http_timeouts(
           timeout::Http::new()
               .with_read_timeout(TriState::Set(Duration::from_millis(10)))
               .with_connect_timeout(TriState::Set(Duration::from_millis(10))),
       );
   let config = aws_config::from_env()
       .retry_config(RetryConfig::new().with_max_attempts(3))
       .timeout_config(timeout_config)
       .load()
       .await;

After the 95ms mark timeout and I manually retry (while loop of 3), this is faster 285ms (max) overall than letting DynamoDB run until it wants.

From the high of my ignorance, I would say that all the timeouts functions should be retriable. Still, if you said the opposite and this is the exact behaviour of the other AWS SDK, I will accept it, and I would suggest putting better comments or examples so that this would be clear to the next one.

@rcoh
Copy link
Contributor

rcoh commented Jun 27, 2022

I expect that the difference is your retry is possibly without backoff?

@ymwjbxxq
Copy link
Author

I tested in 2 ways:

  • without backoff - sometimes works, and when it does not, I fall back in a different region
  • with backoff - most of the time, works

For the backoff I use thread::sleep, and I multiply retry_count * 20ms

@ymwjbxxq
Copy link
Author

@jdisanti

I am sorry to keep going, but I am unsure if retries are happening or if the logs in debug mode are present.

#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing_subscriber::fmt()
        .with_ansi(false)
        .without_time()
        .init();

    let timeout_config = timeout::Config::new()
        .with_api_timeouts(
            timeout::Api::new()
                .with_call_timeout(TriState::Set(Duration::from_millis(95)))
                .with_call_attempt_timeout(TriState::Set(Duration::from_millis(30))),
        )
        .with_http_timeouts(
            timeout::Http::new()
                .with_read_timeout(TriState::Set(Duration::from_millis(10)))
                .with_connect_timeout(TriState::Set(Duration::from_millis(10))),
        );
    let config = aws_config::from_env()
        .retry_config(RetryConfig::new().with_max_attempts(3))
        .timeout_config(timeout_config)
        .load()
        .await;
 ...

In my Environment variables I have:

RUST_LOG: debug
RUST_BACKTRACE: 1

Inside CW, I see:

INFO send_operation{operation="GetItem" service="dynamodb"}:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: loaded credentials provider=Environment
ERROR lambda_runtime: SdkError("request has timed out: API call (all attempts including retries) timeout occurred after 95ms")
REPORT RequestId: xxxx   Duration: 98.26 ms....

I infer that the debug logs for the retries are not showing up.

GetItem operation perhaps worked:

  • attempt timeout 30ms
  • retry 3

This result is 90ms, and usually, the Lambda run in 8ms, so the total 98ms duration match unless it is one of the BatchGetItem operations, but the INFO logs do not tell.

Do I miss something, or I have found something else?

@ymwjbxxq
Copy link
Author

ymwjbxxq commented Jun 28, 2022

@jdisanti @rcoh
With the Lambda RUST_LOG, env variable does not do anything.

To see the debugs this is the correct way
tracing_subscriber::fmt()
.with_ansi(false)
.with_max_level(tracing_subscriber::filter::LevelFilter::DEBUG)
.init();

@jdisanti
Copy link
Contributor

Ah, I keep forgetting they moved the environment variable functionality into a separate crate feature. Sorry about that.

It sounds like you're seeing the retries occur now?

@ymwjbxxq
Copy link
Author

So for Lambda is you want to write somewhere the correct configuration is:
tracing-subscriber = { version = "0.3", features = ["fmt", "json"] }

#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing_subscriber::fmt()
        .with_ansi(false)
        .without_time()
        .with_max_level(tracing_subscriber::filter::LevelFilter::DEBUG)
        .init();

RUST_LOG is not needed at all.

gstreamer-github pushed a commit to sdroege/gst-plugin-rs that referenced this issue Sep 13, 2022
When call_timeout is triggered, request will fail
irrespective of the retry setting. call_timeout define
max time request can take along with retry.
It can be solved by either setting call_timeout to
retry * call_attempt_timeout or not setting the call_timeout.

As per thread call_attempt and rety setting is enough.
awslabs/aws-sdk-rust#558
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug.
Projects
Archived in project
Development

No branches or pull requests

3 participants