[query] Unhandled transient error for GCS 503s #13937

Closed
daniel-goldstein opened this issue Oct 27, 2023 · 2 comments · Fixed by #14022

What happened?

The GCS client library throws `StorageException: Unknown Error` on HTTP 503 responses, resulting in the stack trace below. Such a transient error should be retried gracefully.
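
For illustration, the kind of retry behavior requested here can be sketched as follows. This is a minimal Python sketch, not Hail's `retryTransientErrors` (which the stack trace below shows is already wrapping the failing `close`). `HttpError`, `retry_transient`, and the set of statuses treated as transient are all hypothetical names and assumptions made for this example.

```python
import random
import time


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


# Assumption: the statuses this sketch treats as transient.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def retry_transient(op, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call op(), retrying with jittered exponential backoff on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except HttpError as err:
            # Re-raise permanent errors, and transient ones once attempts run out.
            if err.status not in TRANSIENT_STATUSES or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay * random.uniform(0.5, 1.0))


# Usage: an upload that returns 503 twice before succeeding.
attempts = []


def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise HttpError(503)
    return "ok"


result = retry_transient(flaky_upload, sleep=lambda _: None)
```

A retry wrapper like this can only help if the error that reaches it is recognizable as transient; in the log below, the 503 ultimately surfaces as a `NullPointerException`.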

Version

0.2.124

Relevant log output

hail.utils.java.FatalError: NullPointerException: null

Java stack trace:
is.hail.relocated.com.google.cloud.storage.StorageException: Unknown Error
	|> PUT https://storage.googleapis.com/upload/storage/v1/b/aou_analysis/o?name=250k/data/utils/aou_variant_qc_250k.ht/index/part-57205-e0113aa0-c1e8-43fc-af14-ccb68d989bd5.idx/index&uploadType=resumable&upload_id=ABPtcPrw7n_weAuHvL4cEyCdL-JKVVX-HaG7fnwAjTgRn4Uxm0JdIcWYasCHyuvK36Fc1UgVJkDC8kvlFgWcDkBcEy-_jxjQZpEFxJb2W8gLRkOavA
	|> content-range: bytes 0-50129/50130
	|> x-goog-gcs-idempotency-token: 5e36e53c-5dce-4690-844b-2cfd6f553861
	|  
	|< HTTP/1.1 503 Service Unavailable
	|< content-length: 0
	|< content-type: text/plain; charset=utf-8
	|< x-guploader-uploadid: ABPtcPrw7n_weAuHvL4cEyCdL-JKVVX-HaG7fnwAjTgRn4Uxm0JdIcWYasCHyuvK36Fc1UgVJkDC8kvlFgWcDkBcEy-_jxjQZpEFxJb2W8gLRkOavA
	|  
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSessionFailureScenario.toStorageException(JsonResumableSessionFailureScenario.java:185)
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSessionFailureScenario.toStorageException(JsonResumableSessionFailureScenario.java:117)
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSessionFailureScenario.toStorageException(JsonResumableSessionFailureScenario.java:106)
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSessionPutTask.call(JsonResumableSessionPutTask.java:224)
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSession.lambda$put$0(JsonResumableSession.java:81)
	at is.hail.relocated.com.google.cloud.storage.Retrying.lambda$run$0(Retrying.java:102)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:99)
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSession.put(JsonResumableSession.java:68)
	at is.hail.relocated.com.google.cloud.storage.ApiaryUnbufferedWritableByteChannel.internalWrite(ApiaryUnbufferedWritableByteChannel.java:114)
	at is.hail.relocated.com.google.cloud.storage.ApiaryUnbufferedWritableByteChannel.writeAndClose(ApiaryUnbufferedWritableByteChannel.java:65)
	at is.hail.relocated.com.google.cloud.storage.UnbufferedWritableByteChannelSession$UnbufferedWritableByteChannel.writeAndClose(UnbufferedWritableByteChannelSession.java:40)
	at is.hail.relocated.com.google.cloud.storage.DefaultBufferedWritableByteChannel.close(DefaultBufferedWritableByteChannel.java:167)
	at is.hail.relocated.com.google.cloud.storage.StorageByteChannels$SynchronizedBufferedWritableByteChannel.close(StorageByteChannels.java:119)
	at is.hail.relocated.com.google.cloud.storage.StorageException.wrapIOException(StorageException.java:179)
	at is.hail.relocated.com.google.cloud.storage.BaseStorageWriteChannel.close(BaseStorageWriteChannel.java:84)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$close$2(GoogleStorageFS.scala:312)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.doHandlingRequesterPays(GoogleStorageFS.scala:282)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$close$1(GoogleStorageFS.scala:312)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.close(GoogleStorageFS.scala:310)
	at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
	at is.hail.utils.richUtils.ByteTrackingOutputStream.close(ByteTrackingOutputStream.scala:23)
	at is.hail.io.index.IndexWriterUtils.close(IndexWriter.scala:225)
	at __C1756collect_distributed_array_table_native_writer.apply_region99_120(Unknown Source)
	at __C1756collect_distributed_array_table_native_writer.apply_region5_223(Unknown Source)
	at __C1756collect_distributed_array_table_native_writer.apply(Unknown Source)
	at __C1756collect_distributed_array_table_native_writer.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$16(BackendUtils.scala:91)
	at is.hail.utils.package$.using(package.scala:657)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$15(BackendUtils.scala:90)
	at is.hail.backend.service.Worker$.$anonfun$main$9(Worker.scala:172)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.backend.service.Worker$.$anonfun$main$8(Worker.scala:171)
	at is.hail.utils.package$.using(package.scala:657)
	at is.hail.backend.service.Worker$.main(Worker.scala:169)
	at is.hail.backend.service.Main$.main(Main.scala:14)
	at is.hail.backend.service.Main.main(Main.scala)
	at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

java.lang.NullPointerException: null
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSessionPutTask.call(JsonResumableSessionPutTask.java:201)
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSession.lambda$put$0(JsonResumableSession.java:81)
	at is.hail.relocated.com.google.cloud.storage.Retrying.lambda$run$0(Retrying.java:102)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at is.hail.relocated.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at is.hail.relocated.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at is.hail.relocated.com.google.cloud.storage.Retrying.run(Retrying.java:99)
	at is.hail.relocated.com.google.cloud.storage.JsonResumableSession.put(JsonResumableSession.java:68)
	at is.hail.relocated.com.google.cloud.storage.ApiaryUnbufferedWritableByteChannel.internalWrite(ApiaryUnbufferedWritableByteChannel.java:114)
	at is.hail.relocated.com.google.cloud.storage.ApiaryUnbufferedWritableByteChannel.writeAndClose(ApiaryUnbufferedWritableByteChannel.java:65)
	at is.hail.relocated.com.google.cloud.storage.UnbufferedWritableByteChannelSession$UnbufferedWritableByteChannel.writeAndClose(UnbufferedWritableByteChannelSession.java:40)
	at is.hail.relocated.com.google.cloud.storage.DefaultBufferedWritableByteChannel.close(DefaultBufferedWritableByteChannel.java:167)
	at is.hail.relocated.com.google.cloud.storage.StorageByteChannels$SynchronizedBufferedWritableByteChannel.close(StorageByteChannels.java:119)
	at is.hail.relocated.com.google.cloud.storage.StorageException.wrapIOException(StorageException.java:179)
	at is.hail.relocated.com.google.cloud.storage.BaseStorageWriteChannel.close(BaseStorageWriteChannel.java:84)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$close$2(GoogleStorageFS.scala:312)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.doHandlingRequesterPays(GoogleStorageFS.scala:282)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.$anonfun$close$1(GoogleStorageFS.scala:312)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.io.fs.GoogleStorageFS$$anon$2.close(GoogleStorageFS.scala:310)
	at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
	at is.hail.utils.richUtils.ByteTrackingOutputStream.close(ByteTrackingOutputStream.scala:23)
	at is.hail.io.index.IndexWriterUtils.close(IndexWriter.scala:225)
	at __C1756collect_distributed_array_table_native_writer.apply_region99_120(Unknown Source)
	at __C1756collect_distributed_array_table_native_writer.apply_region5_223(Unknown Source)
	at __C1756collect_distributed_array_table_native_writer.apply(Unknown Source)
	at __C1756collect_distributed_array_table_native_writer.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$16(BackendUtils.scala:91)
	at is.hail.utils.package$.using(package.scala:657)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$15(BackendUtils.scala:90)
	at is.hail.backend.service.Worker$.$anonfun$main$9(Worker.scala:172)
	at is.hail.services.package$.retryTransientErrors(package.scala:182)
	at is.hail.backend.service.Worker$.$anonfun$main$8(Worker.scala:171)
	at is.hail.utils.package$.using(package.scala:657)
	at is.hail.backend.service.Worker$.main(Worker.scala:169)
	at is.hail.backend.service.Main$.main(Main.scala:14)
	at is.hail.backend.service.Main.main(Main.scala)
	at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at is.hail.JVMEntryway$1.run(JVMEntryway.java:119)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)




Hail version: 0.2.125-6e6f46797aed
Error summary: NullPointerException: null

danking commented Nov 21, 2023

This is a bug in the Google Cloud Storage Java client library. It was introduced in 2.25.0 by googleapis/java-storage@4c2f44e and fixed in 2.29.1 by googleapis/java-storage@9b4bb82.


danking commented Nov 21, 2023

The fix is to update google-cloud-storage to 2.29.1.
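
As a rough aid for checking whether a pinned google-cloud-storage version falls in the range described above (introduced in 2.25.0, fixed in 2.29.1), here is a small Python sketch. The helper names `parse_version` and `is_affected` are hypothetical, and the sketch assumes plain `major.minor.patch` version strings with no qualifiers.

```python
def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of ints for ordered comparison."""
    return tuple(int(part) for part in text.split("."))


# Version bounds taken from the comment above.
INTRODUCED = parse_version("2.25.0")
FIXED = parse_version("2.29.1")


def is_affected(version):
    """True if `version` is at or after the introduction and before the fix."""
    return INTRODUCED <= parse_version(version) < FIXED


# Usage: check a few candidate versions.
checks = {v: is_affected(v) for v in ["2.24.5", "2.25.0", "2.29.0", "2.29.1"]}
```

Tuples of integers compare element by element, so `2.9.0` correctly sorts before `2.25.0`, which a plain string comparison would get wrong.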

danking pushed a commit to danking/hail that referenced this issue Nov 21, 2023
CHANGELOG: Fix hail-is#13937 caused by faulty library code in the Google Cloud Storage API Java client library.
danking added a commit that referenced this issue Nov 21, 2023
danking pushed a commit to danking/hail that referenced this issue Dec 7, 2023
CHANGELOG: Fix hail-is#13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected".

This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The google-cloud-storage Java library
has a bug, present since at least 2.29.0, in which incorrect data was returned
(googleapis/java-storage#2301). The issue seems related to its use of multiple intermediate
ByteBuffers. As far as I can tell, this is what could happen:

1. If there's no channel, open a new channel with the current position.
2. Read *some* data from the input ByteChannel into an intermediate ByteBuffer.
3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
4. The exception bubbles up to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers hold data but the `position` hasn't been updated. When
we recreate the channel, we jump to the wrong position and re-read some data. Luckily for us,
between Zstd and our assertions, this usually crashes the program instead of silently returning bad
data.

This is the third bug we have found in Google's Cloud Storage Java library. The previous two:

1. hail-is#13721
2. hail-is#13937

Be forewarned: the next time we see bizarre networking or data corruption issues, check if updating
google-cloud-storage fixes the problem.
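
The four-step failure described in the commit message above can be sketched with a toy model. This Python sketch is not the library's code: `FlakyChannel`, `TransientReadError`, `read_all`, and the two-buffers-per-batch layout are hypothetical, chosen only to show how a stale `position` causes bytes to be re-read after a retry.

```python
class TransientReadError(Exception):
    """Hypothetical stand-in for a retryable network error."""


class FlakyChannel:
    """Reads `data` from offset `start`; optionally fails on one read call."""

    def __init__(self, data, start, fail_on_call=None):
        self.data = data
        self.offset = start
        self.calls = 0
        self.fail_on_call = fail_on_call

    def read(self, n):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise TransientReadError("connection reset")
        chunk = self.data[self.offset:self.offset + n]
        self.offset += len(chunk)
        return chunk


def read_all(data, chunk_size, advance_per_buffer):
    """Read all of `data`, reopening the channel after a transient error.

    advance_per_buffer=False models the bug: `position` is only updated
    after a whole batch of two intermediate buffers has been filled.
    """
    delivered = bytearray()  # bytes handed on to the caller
    position = 0             # where a reopened channel will start
    inject_failure = True    # fail once, on the second read of the first channel
    channel = None
    while position < len(data):
        if channel is None:
            # Step 1: open a new channel at the current position.
            fail_on = 2 if inject_failure else None
            channel = FlakyChannel(data, position, fail_on_call=fail_on)
            inject_failure = False
        pending = 0  # bytes buffered in this batch, not yet counted in position
        try:
            for _ in range(2):
                chunk = channel.read(chunk_size)  # steps 2 and 3
                delivered += chunk
                if advance_per_buffer:
                    position += len(chunk)
                else:
                    pending += len(chunk)
            position += pending
        except TransientReadError:
            # Step 4: free the channel and loop back to step 1.
            channel = None
    return bytes(delivered)


correct = read_all(b"abcdefgh", 2, advance_per_buffer=True)
buggy = read_all(b"abcdefgh", 2, advance_per_buffer=False)
```

In this model the correct variant yields `b"abcdefgh"`, while the buggy variant yields `b"ababcdefgh"`: the first two bytes were already buffered when the error hit, but the reopened channel starts again at offset 0 and delivers them a second time.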
danking added a commit that referenced this issue Dec 7, 2023
danking pushed a commit to danking/hail that referenced this issue Dec 16, 2023