Race condition in AWS image writer? #1244

Closed
jasonrichardsmith opened this issue Jan 25, 2022 · 9 comments · Fixed by #1268

@jasonrichardsmith

When writing large images (Python 3) to AWS, image creation seems to hit a race condition and just hangs at "Creating Snapshot".
I added a panic here:

https://github.com/nanovms/ops/blob/master/aws/aws_image.go#L242

And received this message:
{
BlockData: buffer(%!p(aws.ReaderSeekerCloser={0xc010b83ad0})),
BlockIndex: 507,
Checksum: "B4VNL+8pega6gWheZgwzLeNtXRjVRpJ9MNqtbX/aFUE=",
ChecksumAlgorithm: "SHA256",
DataLength: 524288,
SnapshotId: "snap-05a19a721f34cc460"
}

{
RespMetadata: {
StatusCode: 400,
RequestID: "b157f530-aafe-42c8-b71a-c1fb915adf41"
},
Message_: "Failed to read block data",
Reason: "INVALID_PARAMETER_VALUE"
}
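
For reference, the request dump above carries a 524288-byte block together with a Base64-encoded SHA-256 checksum. A minimal sketch of how such a checksum can be computed for one block, using only the standard library; `blockSize` and `blockChecksum` are illustrative names and are not taken from the ops source:

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// blockSize matches the DataLength in the request dump (512 KiB).
const blockSize = 524288

// blockChecksum returns the Base64-encoded SHA-256 digest of one block.
// Hypothetical helper, for illustration only.
func blockChecksum(block []byte) string {
	sum := sha256.Sum256(block)
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	block := make([]byte, blockSize) // an all-zero block as sample data
	fmt.Println(len(block), blockChecksum(block))
}
```

If the bytes sent do not match the checksum and length declared in the request, the request is expected to be rejected, so this is one thing to rule out when debugging a failing block.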

When the goroutine here is removed:
https://github.com/nanovms/ops/blob/master/aws/aws_image.go#L242

image creation appears to be much slower, but it works. The last block in either case is 507, but with the goroutine in place it never finishes.

I used the python 3.8.6 ops-example for my tests.

@francescolavra
Member

I noticed this issue too. Apparently the EBS backend doesn't like it when there is a large number of concurrent "PutSnapshotBlock" requests during snapshot creation. In these cases, the backend responds with a "Failed to read block data" error for one or more random blocks, and the snapshot creation process gets stuck.
I tried adding a retry for errored blocks, but that doesn't seem to help.
As you mentioned, removing the goroutine as in the diff below solves the issue.

diff --git a/aws/aws_image.go b/aws/aws_image.go
index 3f0ecde..292bb63 100644
--- a/aws/aws_image.go
+++ b/aws/aws_image.go
@@ -196,7 +196,7 @@ func (p *AWS) createSnapshot(imagePath string) (snapshotID string, err error) {
 
                wg.Add(1)
 
-               go p.writeToBlock(putSnapshotBlockInput, &wg, awsErrors)
+               p.writeToBlock(putSnapshotBlockInput, &wg, awsErrors)
 
                blockIndex++
        }

@eyberg
Contributor

eyberg commented Jan 25, 2022

@jasonrichardsmith how large an image are we talking? I think that, without adding anything, the base python3.6.8 is ~233 MB.

It might be good to figure out how many requests fail when done concurrently vs. serially, because it'd be great to keep as much of that concurrency as possible.

@jasonrichardsmith
Copy link
Author

I am not sure about the size.

ops image create --package python_3.8.6 -i opsexample -c config.json --show-debug -t aws -n

I basically ran that against this example:
https://github.com/nanovms/ops-examples/tree/master/python/python3.8

eyberg added the aws label Jan 25, 2022
@eyberg
Contributor

eyberg commented Jan 27, 2022

so there are 2 issues here: 1) the race and 2) hitting RequestThrottledException from the PutSnapshotBlock call;

for the latter we should send the max or near the max number of requests that we can and then enqueue the rest to wait for completion https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html#http-api-quotas
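
The idea of keeping a fixed number of requests in flight and making the rest wait can be sketched with a buffered channel used as a counting semaphore. This is only a sketch under stated assumptions: `uploadAll`, `fakeUpload` and the limit of 8 are hypothetical, the real quota is not given in this thread, and `fakeUpload` merely stands in for a PutSnapshotBlock call.

```go
package main

import (
	"fmt"
	"sync"
)

// uploadAll calls upload for block indexes 0..n-1 with at most maxInFlight
// calls running at once. It returns the indexes whose upload failed.
func uploadAll(n, maxInFlight int, upload func(index int) error) []int {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed []int
	)
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while maxInFlight uploads are running
		wg.Add(1)
		go func(index int) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for a queued block
			if err := upload(index); err != nil {
				mu.Lock()
				failed = append(failed, index)
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return failed
}

// fakeUpload is a stand-in for a real block upload; it fails on block 507.
func fakeUpload(index int) error {
	if index == 507 {
		return fmt.Errorf("block %d: failed to read block data (simulated)", index)
	}
	return nil
}

func main() {
	fmt.Println("failed blocks:", uploadAll(1000, 8, fakeUpload))
}
```

Setting `maxInFlight` to 1 gives serial behavior, so varying that one value is a simple way to compare failure counts between serial and concurrent uploads.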

@eyberg
Contributor

eyberg commented Jan 27, 2022

#1253

@eyberg
Contributor

eyberg commented Feb 3, 2022

@jasonrichardsmith && @francescolavra it sounds like there is a default retry mechanism in the client https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/client/default_retryer.go - with the race fix #1253 are you still seeing this behavior or did it go away?
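
For readers unfamiliar with the retry idea mentioned above, here is a minimal standard-library sketch of retrying an operation with exponential backoff. It is not the SDK's default retryer and does not reproduce its behavior; `retry` and `flaky` are hypothetical names, and `flaky` only simulates a throttled request. Note that a retry for errored blocks was already reported earlier in this thread as not helping, so this illustrates the mechanism rather than a fix.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls op up to attempts times (attempts should be at least 1),
// doubling the delay after each failure. It returns nil on the first
// success, or the last error wrapped with the attempt count.
func retry(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// flaky returns an operation that fails the given number of times and then
// succeeds. It stands in for a request that is throttled at first.
func flaky(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("RequestThrottledException (simulated)")
		}
		return nil
	}
}

func main() {
	err := retry(5, 10*time.Millisecond, flaky(2))
	fmt.Println("result:", err)
}
```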

@ponokys
Contributor

ponokys commented Feb 3, 2022

@francescolavra
Member

I'm still seeing the issue: using Ops version 0.1.29, I tried uploading a 54 MB image 3 times, but all 3 attempts failed:

francesco@ubuntu:~/Documents/NanoVMs/go$ ops image create -n -t aws -c config-aws.json gtest1
  96% |██████████████████████████████████████  |  [3m12s:8s]ValidationException: Failed to read block data
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: "2e0641c3-d70d-48aa-b516-3c09de220194"
  },
  Message_: "Failed to read block data",
  Reason: "INVALID_PARAMETER_VALUE"
}
francesco@ubuntu:~/Documents/NanoVMs/go$ ops image create -n -t aws -c config-aws.json gtest1
  88% |███████████████████████████████████     |  [2m56s:24s]RequestError: send request failed
caused by: Put "https://ebs.us-west-1.amazonaws.com/snapshots/snap-0f24668e581d1f96b/blocks/88": dial tcp: lookup ebs.us-west-1.amazonaws.com on 127.0.1.1:53: read udp 127.0.0.1:60181->127.0.1.1:53: i/o timeout
francesco@ubuntu:~/Documents/NanoVMs/go$ ops image create -n -t aws -c config-aws.json gtest1
  94% |█████████████████████████████████████   |  [3m8s:12s]ValidationException: Failed to read block data
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: "6c3360fa-5501-443c-94f8-d148c73ba5ed"
  },
  Message_: "Failed to read block data",
  Reason: "INVALID_PARAMETER_VALUE"
}

The errors in the first and third attempts are the same as the ones I was seeing before #1253, while the error in the second attempt is something I hadn't seen before. In all 3 cases I ended up with a snapshot stuck in "Creating" status.

@eyberg
Contributor

eyberg commented Feb 4, 2022

ok i'm replicating this as well

➜  zz ops image create -c config.json -t aws zz
  11% |████                                    |  [22s:2m58s]RequestError: send request failed
caused by: Put "https://ebs.us-west-1.amazonaws.com/snapshots/snap-0a4d96930532d0a5d/blocks/792": write tcp 192.168.1.65:53180->3.101.160.241:443: write: no buffer space available
➜  zz cat config.json
{
  "BaseVolumeSz": "500m",
  "CloudConfig" :{
        "Zone": "us-west-1",
        "BucketName":"bucket"
    }
}
➜  zz cat main.go
package main

import (
        "fmt"
)

func main() {
        fmt.Println("test")
}
➜
