
Source Index storage authentication failure #2974

Open · jeffschwMSFT opened this issue May 30, 2024 · 14 comments

@jeffschwMSFT (Member) commented May 30, 2024

Build

https://dev.azure.com/dnceng/internal/_build/results?buildId=2462759

Status: 403 (This request is not authorized to perform this operation.)
ErrorCode: AuthorizationFailure

Content:
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthorizationFailure</Code><Message>This request is not authorized to perform this operation.
RequestId:9c862d9a-501e-0093-2a0e-b2337f000000
Time:2024-05-29T21:26:58.8871758Z</Message></Error>

Headers:
Server: Microsoft-HTTPAPI/2.0
x-ms-request-id: 9c862d9a-501e-0093-2a0e-b2337f000000
x-ms-client-request-id: f924d565-53ef-40a6-9d9a-c7b2ecd56976
x-ms-error-code: AuthorizationFailure
Date: Wed, 29 May 2024 21:26:58 GMT
Content-Length: 246
Content-Type: application/xml

   at Azure.Storage.Blobs.BlockBlobRestClient.UploadAsync(Int64 contentLength, Stream body, Nullable`1 timeout, Byte[] transactionalContentMD5, String blobContentType, String blobContentEncoding, String blobContentLanguage, Byte[] blobContentMD5, String blobCacheControl, IDictionary`2 metadata, String leaseId, String blobContentDisposition, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 tier, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, String blobTagsString, Nullable`1 immutabilityPolicyExpiry, Nullable`1 immutabilityPolicyMode, Nullable`1 legalHold, Byte[] transactionalContentCrc64, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.UploadInternal(Stream content, BlobHttpHeaders blobHttpHeaders, IDictionary`2 metadata, IDictionary`2 tags, BlobRequestConditions conditions, Nullable`1 accessTier, BlobImmutabilityPolicy immutabilityPolicy, Nullable`1 legalHold, IProgress`1 progressHandler, UploadTransferValidationOptions transferValidationOverride, String operationName, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.<>c__DisplayClass65_0.<<GetPartitionedUploaderBehaviors>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Storage.PartitionedUploader`2.UploadInternal(Stream content, Nullable`1 expectedContentLength, TServiceSpecificData args, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.BlobClient.StagedUploadInternal(Stream content, BlobUploadOptions options, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.BlobClient.UploadAsync(Stream content)
   at UploadIndexStage1.Program.Main(String[] args) in D:\a\_work\1\s\src\UploadIndexStage1\Program.cs:line 130
   at UploadIndexStage1.Program.Main(String[] args) in D:\a\_work\1\s\src\UploadIndexStage1\Program.cs:line 131
   at UploadIndexStage1.Program.<Main>(String[] args)
##[error]Cmd.exe exited with code '57005'.

Build leg reported

No response

Pull Request

No response

Known issue core information

Fill out the known issue JSON section by following the step-by-step documentation on how to create a known issue

{
    "ErrorMessage": "at UploadIndexStage1.Program.Main",
    "BuildRetry": false,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
}

@dotnet/dnceng

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Additional information about the issue reported

No response

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng/internal/_build/results?buildId=2462759
Error message validated: [at UploadIndexStage1.Program.Main]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 5/30/2024 4:07:44 PM UTC

Report

Summary

24-Hour Hit Count: 0
7-Day Hit Count: 0
1-Month Count: 0
@riarenas (Member)

@directhex was working on this.

@directhex (Member)

I'm gonna have to give a big ol' shrug emoji here. According to the Azure activity log, no config changes have been made to the storage account in the last week. Subsequent and prior builds were fine, e.g. https://dev.azure.com/dnceng/internal/_build/results?buildId=2463257&view=logs&j=316d5c15-0c50-544e-8051-e6b14a1ab674&t=657dbd0b-e93c-5e59-4ba8-77889f010efc

Looking at it, it seems the common element is that failed builds take longer than 30 seconds to do the upload. Possibly we're only getting an incredibly short lease, and it's expiring?

https://dev.azure.com/dnceng/internal/_build/results?buildId=2462759&view=logs&j=316d5c15-0c50-544e-8051-e6b14a1ab674&t=657dbd0b-e93c-5e59-4ba8-77889f010efc took 36s
https://dev.azure.com/dnceng/internal/_build/results?buildId=2463071&view=logs&j=316d5c15-0c50-544e-8051-e6b14a1ab674&t=657dbd0b-e93c-5e59-4ba8-77889f010efc took 31s

The "good" builds are in the low twenties of seconds.

@lewing (Member) commented Jun 12, 2024

Should we change BuildRetry to true for this? If a retry is likely to succeed, it would help lessen the impact.
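Assuming the known-issue JSON in the issue body is the only thing that needs to change, the flip being proposed here would presumably look like this (a sketch of the proposal, not a committed change):

```json
{
    "ErrorMessage": "at UploadIndexStage1.Program.Main",
    "BuildRetry": true,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
}
```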

@directhex (Member)

IMHO adding either a retry, or a "mark failure as success so the build as a whole succeeds" is fine - absolute worst case is stale indices if an index upload fails.

@ericstj (Member) commented Jun 13, 2024

Looking at it, it seems the common element is failed builds take longer than 30 seconds to do the upload.

Sounds like a solid theory. @directhex maybe we can refresh our access token right before we upload, after we've tar'ed everything?
https://learn.microsoft.com/en-us/cli/azure/authenticate-azure-cli#refresh-tokens

https://github.com/dotnet/source-indexer/blob/4207666f77be68741745420ea6576ca28baf4742/src/UploadIndexStage1/Program.cs#L130C37-L130C48

Alternatively, we split out the step that creates the tarball so that we can log in (or refresh) separately, right before we upload.
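The reordering being suggested could look roughly like the following (a sketch only, not the actual Program.cs; `CreateTarball`, `account`, and `container` are illustrative placeholders): do the slow tar step first, then construct the credential and client immediately before the upload so the token is freshly acquired.

```csharp
using Azure.Identity;
using Azure.Storage.Blobs;

// Sketch (not the actual UploadIndexStage1 code): run the slow tar step
// first, then acquire credentials immediately before the upload so the
// access token is as fresh as possible when the request goes out.
string tarballPath = CreateTarball(sourceDir);   // hypothetical slow step

var blobClient = new BlobClient(
    new Uri($"https://{account}.blob.core.windows.net/{container}/index.tar.gz"),
    new DefaultAzureCredential());               // token acquired here, just-in-time

await using FileStream stream = File.OpenRead(tarballPath);
await blobClient.UploadAsync(stream, overwrite: true);
```

This doesn't shorten the upload itself, but it removes the tar'ing time from the window between token acquisition and the storage request.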

Alternatively/additionally, we might also be able to speed up the tar'ing phase of this by replacing SharpZipLib with the APIs @carlossanlop added, which I would expect to be faster since they use our zlib under the hood, whereas SharpZipLib has a non-vectorized managed implementation.
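The built-in replacement being referred to is presumably System.Formats.Tar (available since .NET 7). A minimal sketch of producing a .tar.gz with it, with illustrative paths:

```csharp
using System.Formats.Tar;
using System.IO.Compression;

// Sketch: create a .tar.gz with the built-in System.Formats.Tar APIs
// instead of SharpZipLib. Paths are illustrative placeholders.
await using FileStream output = File.Create("index.tar.gz");
await using var gzip = new GZipStream(output, CompressionLevel.Optimal);
await TarFile.CreateFromDirectoryAsync("index/", gzip, includeBaseDirectory: false);
```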

@ericstj (Member) commented Jun 24, 2024

Noting here that @directhex is working on a fix for this but I'm unable to assign this issue to him.

@directhex (Member)

This might accidentally be fixed by dotnet/arcade#14912

@ericstj (Member) commented Jul 16, 2024

Seems like the hit count here is high enough to warrant us taking steps to mitigate it, @directhex. What do you suggest we do - add more logging, make a change to how we auth, retries, something else?

@riarenas (Member)

A shot in the dark here: did we add firewall rules to this storage account's network access? There were a lot of issues with the instructions on how to do that, since they didn't account for all possible IPs, and there's an entire scenario where firewall rules don't work for traffic from within the same region, so that could explain why it seems flaky.

@directhex (Member)

We did add the firewall rules that were recommended at one point, and yes, I think it's feasible that it's a firewall issue.

We don't have functional replacement guidance on network isolation for storage containers yet.

@riarenas (Member) commented Jul 18, 2024

Would it be worth reverting those changes while we wait for the updated guidance and for the NSP features to light up? Whether or not it helps with reliability, it would still give us some data.

@directhex (Member)

OK I've flipped the switch to "Public network access: Enabled from all networks" on the storage container in Azure.
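That portal switch-flip corresponds to roughly the following CLI invocation (a sketch; the account and resource-group names are placeholders, not the real dnceng resources):

```shell
# Sketch: re-enable public network access from all networks on the
# storage account. Names below are illustrative placeholders.
az storage account update \
  --name sourceindexstorage \
  --resource-group source-index-rg \
  --public-network-access Enabled \
  --default-action Allow
```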

@directhex (Member)

I guess I can't trust the 24-hour figure from yesterday, due to the Azure outage.

@riarenas (Member)

Well, we're down to 0 hits in the last 7 days, so it seems to have worked?
