Race condition in AWS image writer? #1244
Comments
I noticed this issue too; apparently the EBS backend doesn't like it when there is a large number of concurrent "PutSnapshotBlock" requests during snapshot creation. In these cases, the backend responds with a "Failed to read block data" error for one or more random blocks, and the snapshot creation process gets stuck.
@jasonrichardsmith how large an image are we talking about? I think that, without adding anything, the base python3.6.8 package is ~233 MB. It might be good to figure out the number of requests that fail when done concurrently vs. serially, because it'd be great to keep as much of that concurrency as possible.
I am not sure about the size. I basically ran

ops image create --package python_3.8.6 -i opsexample -c config.json --show-debug -t aws -n

against this example:
So there are 2 issues here: 1) the race, and 2) hitting RequestThrottledException from the PutSnapshotBlock call. For the latter, we should send the max (or near the max) number of requests that we can and then enqueue the rest to wait for completion: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html#http-api-quotas
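For illustration only, here is a minimal sketch of that kind of bounded concurrency, assuming the aws-sdk-go v1 ebs client; maxInFlight, putBlocks, and the blocks map are hypothetical names chosen for the example, not code from ops:

```go
package snapshot

import (
	"bytes"
	"crypto/sha256"
	"encoding/base64"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ebs"
)

// maxInFlight is an illustrative cap on concurrent PutSnapshotBlock calls,
// chosen to stay under the throttling limit; it is not a documented AWS quota.
const maxInFlight = 64

// putBlocks uploads every block, allowing at most maxInFlight requests to be
// outstanding at once; the remaining blocks wait on the semaphore channel.
func putBlocks(svc *ebs.EBS, snapshotID string, blocks map[int64][]byte) error {
	sem := make(chan struct{}, maxInFlight)
	errCh := make(chan error, len(blocks))
	var wg sync.WaitGroup

	for idx, data := range blocks {
		sem <- struct{}{} // blocks once maxInFlight requests are in flight
		wg.Add(1)
		go func(idx int64, data []byte) {
			defer wg.Done()
			defer func() { <-sem }()

			sum := sha256.Sum256(data)
			_, err := svc.PutSnapshotBlock(&ebs.PutSnapshotBlockInput{
				SnapshotId:        aws.String(snapshotID),
				BlockIndex:        aws.Int64(idx),
				BlockData:         bytes.NewReader(data),
				DataLength:        aws.Int64(int64(len(data))),
				Checksum:          aws.String(base64.StdEncoding.EncodeToString(sum[:])),
				ChecksumAlgorithm: aws.String("SHA256"),
			})
			if err != nil {
				errCh <- err
			}
		}(idx, data)
	}

	wg.Wait()
	close(errCh)
	return <-errCh // nil when no goroutine reported an error
}
```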
@jasonrichardsmith && @francescolavra it sounds like there is a default retry mechanism in the client: https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/client/default_retryer.go - with the race fix #1253, are you still seeing this behavior or did it go away?
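For reference, a minimal sketch of raising that retry ceiling when constructing the client, assuming the aws-sdk-go v1 session API; newEBSClient and the parameter values are illustrative, not taken from ops:

```go
package snapshot

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ebs"
)

// newEBSClient builds an EBS client whose default retryer retries failed
// (including throttled) calls up to maxRetries times with backoff.
// The region and retry count are illustrative values.
func newEBSClient(region string, maxRetries int) *ebs.EBS {
	sess := session.Must(session.NewSession(
		aws.NewConfig().WithRegion(region).WithMaxRetries(maxRetries),
	))
	return ebs.New(sess)
}
```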
I'm still seeing the issue: using Ops version 0.1.29, I tried uploading a 54 MB image 3 times, but all 3 attempts failed:
The errors at the first and third attempts are the same as the ones I was seeing before #1253, while the error at the second attempt is something I hadn't seen before. In all 3 cases I ended up with a snapshot stuck in "Creating" status.
OK, I'm replicating this as well.
When writing large images (Python 3) to AWS, image creation seems to be having a race issue and just hangs at "Creating Snapshot".
I added a panic here:
https://github.com/nanovms/ops/blob/master/aws/aws_image.go#L242
And received this message:
{
BlockData: buffer(%!p(aws.ReaderSeekerCloser={0xc010b83ad0})),
BlockIndex: 507,
Checksum: "B4VNL+8pega6gWheZgwzLeNtXRjVRpJ9MNqtbX/aFUE=",
ChecksumAlgorithm: "SHA256",
DataLength: 524288,
SnapshotId: "snap-05a19a721f34cc460"
}
{
RespMetadata: {
StatusCode: 400,
RequestID: "b157f530-aafe-42c8-b71a-c1fb915adf41"
},
Message_: "Failed to read block data",
Reason: "INVALID_PARAMETER_VALUE"
}
When the goroutine here is removed:
https://github.com/nanovms/ops/blob/master/aws/aws_image.go#L242
image creation appears to be much slower, but it works. The last block index is 507 in both cases, but with the goroutine the upload never finishes.
I used the python 3.8.6 ops-example for my tests.
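For comparison, removing the goroutine effectively reduces the upload to a serial loop along these lines (a sketch with hypothetical names, not the actual ops code); each block is written and checked before the next, which is slower but surfaces errors instead of leaving the snapshot stuck:

```go
package snapshot

import (
	"bytes"
	"crypto/sha256"
	"encoding/base64"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ebs"
)

// putBlocksSerial writes one block at a time and returns on the first
// error instead of leaving the snapshot stuck in "Creating".
func putBlocksSerial(svc *ebs.EBS, snapshotID string, blocks map[int64][]byte) error {
	for idx, data := range blocks {
		sum := sha256.Sum256(data)
		if _, err := svc.PutSnapshotBlock(&ebs.PutSnapshotBlockInput{
			SnapshotId:        aws.String(snapshotID),
			BlockIndex:        aws.Int64(idx),
			BlockData:         bytes.NewReader(data),
			DataLength:        aws.Int64(int64(len(data))),
			Checksum:          aws.String(base64.StdEncoding.EncodeToString(sum[:])),
			ChecksumAlgorithm: aws.String("SHA256"),
		}); err != nil {
			return err
		}
	}
	return nil
}
```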