Race condition in AWS image writer? #1244

jasonrichardsmith · 2022-01-25T10:17:26Z

When writing large images (Python 3) to AWS, image creation seems to hit a race condition and just hangs at "Creating Snapshot".
I added a panic here:

https://github.com/nanovms/ops/blob/master/aws/aws_image.go#L242

And received this message:
{
BlockData: buffer(%!p(aws.ReaderSeekerCloser={0xc010b83ad0})),
BlockIndex: 507,
Checksum: "B4VNL+8pega6gWheZgwzLeNtXRjVRpJ9MNqtbX/aFUE=",
ChecksumAlgorithm: "SHA256",
DataLength: 524288,
SnapshotId: "snap-05a19a721f34cc460"
}

{
RespMetadata: {
StatusCode: 400,
RequestID: "b157f530-aafe-42c8-b71a-c1fb915adf41"
},
Message_: "Failed to read block data",
Reason: "INVALID_PARAMETER_VALUE"
}
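
For readers unfamiliar with the request dump above: the Checksum field is a Base64-encoded SHA-256 digest of the block data, and DataLength is the 524288-byte (512 KiB) block size. The following is a minimal sketch of how such a checksum can be computed. The names blockSize and blockChecksum are made up for illustration and are not taken from the ops source, and the all-zero block is only sample input.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// blockSize matches the DataLength in the request dump above (512 KiB).
const blockSize = 524288

// blockChecksum returns the Base64-encoded SHA-256 digest of one block.
// (Hypothetical helper, not the function used in ops.)
func blockChecksum(block []byte) string {
	sum := sha256.Sum256(block)
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	block := make([]byte, blockSize) // all-zero block, for illustration only
	fmt.Println(len(block), blockChecksum(block))
}
```

The digest has to be computed over exactly the bytes that are sent as BlockData, so the buffer should not be modified or reused between computing the checksum and sending the request.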

When the goroutine here is removed:
https://github.com/nanovms/ops/blob/master/aws/aws_image.go#L242

image creation appears to be much slower but works. The last block in either case is 507, but with the goroutine it never finishes.

I used the python 3.8.6 ops-example for my tests.

francescolavra · 2022-01-25T10:38:32Z

I noticed this issue too. Apparently the EBS backend doesn't like it when there is a large number of concurrent "PutSnapshotBlock" requests during a snapshot creation. In these cases, the backend responds with a "Failed to read block data" error for one or more random blocks, and the snapshot creation process gets stuck.
I tried adding a retry for errored blocks, but that doesn't seem to help.
As you mentioned, removing the goroutine as in the below diff solves the issue.

diff --git a/aws/aws_image.go b/aws/aws_image.go
index 3f0ecde..292bb63 100644
--- a/aws/aws_image.go
+++ b/aws/aws_image.go
@@ -196,7 +196,7 @@ func (p *AWS) createSnapshot(imagePath string) (snapshotID string, err error) {
 
                wg.Add(1)
 
-               go p.writeToBlock(putSnapshotBlockInput, &wg, awsErrors)
+               p.writeToBlock(putSnapshotBlockInput, &wg, awsErrors)
 
                blockIndex++
        }

eyberg · 2022-01-25T16:00:34Z

@jasonrichardsmith how large of an image are we talking? - i think without adding anything that base python3.6.8 is ~233 meg

it might be good to figure out the number of requests that fails when done concurrently vs serial cause it'd be great to keep as much of that concurrency as possible

jasonrichardsmith · 2022-01-25T16:31:27Z

I am not sure about the size.

ops image create --package python_3.8.6 -i opsexample -c config.json --show-debug -t aws -n

I basically ran that against this example:
https://github.com/nanovms/ops-examples/tree/master/python/python3.8

eyberg · 2022-01-27T23:04:01Z

so there are 2 issues here: 1) the race and 2) hitting RequestThrottledException from the PutSnapshotBlock call;

for the latter we should send the max or near the max number of requests that we can and then enqueue the rest to wait for completion https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html#http-api-quotas

eyberg · 2022-01-27T23:04:25Z

#1253

eyberg · 2022-02-03T19:15:14Z

@jasonrichardsmith && @francescolavra it sounds like there is a default retry mechanism in the client https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/client/default_retryer.go - with the race fix #1253 are you still seeing this behavior or did it go away?

ponokys · 2022-02-03T19:42:49Z

Test case with a 388.8 MB image (733 blocks):
The race fix in #1253 ("fix race condtion for unbuffer channel") fixed this problem.

francescolavra · 2022-02-04T09:26:47Z

I'm still seeing the issue: using Ops version 0.1.29, I tried uploading a 54 MB image 3 times, but all 3 attempts failed:

francesco@ubuntu:~/Documents/NanoVMs/go$ ops image create -n -t aws -c config-aws.json gtest1
  96% |██████████████████████████████████████  |  [3m12s:8s]ValidationException: Failed to read block data
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: "2e0641c3-d70d-48aa-b516-3c09de220194"
  },
  Message_: "Failed to read block data",
  Reason: "INVALID_PARAMETER_VALUE"
}
francesco@ubuntu:~/Documents/NanoVMs/go$ ops image create -n -t aws -c config-aws.json gtest1
  88% |███████████████████████████████████     |  [2m56s:24s]RequestError: send request failed
caused by: Put "https://ebs.us-west-1.amazonaws.com/snapshots/snap-0f24668e581d1f96b/blocks/88": dial tcp: lookup ebs.us-west-1.amazonaws.com on 127.0.1.1:53: read udp 127.0.0.1:60181->127.0.1.1:53: i/o timeout
francesco@ubuntu:~/Documents/NanoVMs/go$ ops image create -n -t aws -c config-aws.json gtest1
  94% |█████████████████████████████████████   |  [3m8s:12s]ValidationException: Failed to read block data
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: "6c3360fa-5501-443c-94f8-d148c73ba5ed"
  },
  Message_: "Failed to read block data",
  Reason: "INVALID_PARAMETER_VALUE"
}

The errors at the first and third attempt are the same as the ones I was seeing before #1253, while the error at the second attempt is something I hadn't seen before. In all 3 cases I ended up with a snapshot stuck in "Creating" status.

eyberg · 2022-02-04T16:55:41Z

ok i'm replicating this as well

➜  zz ops image create -c config.json -t aws zz
  11% |████                                    |  [22s:2m58s]RequestError: send request failed
caused by: Put "https://ebs.us-west-1.amazonaws.com/snapshots/snap-0a4d96930532d0a5d/blocks/792": write tcp 192.168.1.65:53180->3.101.160.241:443: write: no buffer space available
➜  zz cat config.json
{
  "BaseVolumeSz": "500m",
  "CloudConfig" :{
        "Zone": "us-west-1",
        "BucketName":"bucket"
    }
}
➜  zz cat main.go
package main

import (
        "fmt"
)

func main() {
        fmt.Println("test")
}
➜

eyberg added the aws label Jan 25, 2022

ponokys mentioned this issue Feb 8, 2022 in "putsnapshot to EBS handler" #1268 (merged)

eyberg closed this as completed in #1268 Feb 13, 2022
