Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] fix #13998 #14057

Merged
merged 1 commit into from
Nov 30, 2023
Merged

[query] fix #13998 #14057

merged 1 commit into from
Nov 30, 2023

Conversation

danking
Copy link
Contributor

@danking danking commented Nov 30, 2023

CHANGELOG: Fix #13998 which appeared in 0.2.58 and prevented reading from a networked filesystem mounted within the filesystem of the worker node for certain pipelines (those that did not trigger "lowering").

We use the IndexReader in PartitionNativeIntervalReader, PartitionNativeReaderIndexed, and PartitionZippedIndexedNativeReader.

  1. PartitionNativeIntervalReader is only used by query_table.

  2. PartitionNativeReaderIndexed is only used by IndexedRVDSpec2.readTableStage which is used by TableNativeReader when there is a new partitioner.

  3. PartitionZippedIndexedNativeReader is only sued by AbstractRVDSpec.readZippedLowered when there is a new partitioner.

Two is for tables, three is for matrix tables. In readZippedLowered we explicitly drop the file protocol:

        val absPathLeft = removeFileProtocol(pathLeft)
        val absPathRight = removeFileProtocol(pathRight)

We have done this, by various names, since this lowered code path was added. I added removeFileProtocol because stripping the protocol in Query-on-Batch prevented the reading and writing of gs:// URIs, the only URIs I could read in QoB. uriPath (the function whose use I replaced with removeFileProtocol) was added by Cotton a very long time ago. It seems he added it so that he could use HDFS to generate a temporary file path on the local filesystem but pass the file path to binary tools that know nothing of HDFS and file:// URIs.

#9522 added the lowered code path and thus introduced this bug. It attempted to mirror the extant code in readIndexedPartitions which does not strip any protocols from the path.

This has gone undetected because we never try to read data through the OS's filesystem. We always use gs://, Azure, or s3:// because we do not test in environments that have a networked file system mounted in the OS's filesystem. To replicate this bug (and add a test for it), we would need a cluster with a lustre file system (or another networked filesystem). This would be a fairly large lift. The fix is trivial: just never intentionally strip the protocol!

CHANGELOG: Fix hail-is#13998 which appeared in 0.2.58 and prevented reading from a networked filesystem mounted within the filesystem of the worker node for certain pipelines (those that did not trigger "lowering").

We use the IndexReader in `PartitionNativeIntervalReader`, `PartitionNativeReaderIndexed`, and
`PartitionZippedIndexedNativeReader`.

1. `PartitionNativeIntervalReader` is only used by `query_table`.

2. `PartitionNativeReaderIndexed` is only used by `IndexedRVDSpec2.readTableStage` which is used by
   `TableNativeReader` when there is a new partitioner.

3. `PartitionZippedIndexedNativeReader` is only sued by `AbstractRVDSpec.readZippedLowered` when
   there is a new partitioner.

Two is for tables, three is for matrix tables. In `readZippedLowered` we explicitly [drop the file
protocol](https://github.com/hail-is/hail/blob/1dedf3c63f9aabf1b6ce538165360056f82f76e4/hail/src/main/scala/is/hail/rvd/AbstractRVDSpec.scala#L154-L155):

```
        val absPathLeft = removeFileProtocol(pathLeft)
        val absPathRight = removeFileProtocol(pathRight)
```

We have done this, by various names, since this lowered code path was added. I added
`removeFileProtocol` because stripping the protocol in Query-on-Batch prevented the reading and
writing of gs:// URIs, the only URIs I could read in QoB. `uriPath` (the function whose use I
replaced with `removeFileProtocol`) was added by Cotton [a very long time
ago](hail-is@92a9936). It seems he
added it so that he could use HDFS to generate a temporary file path on the local filesystem but
pass the file path to binary tools that know nothing of HDFS and file:// URIs.

[`readIndexedPartitions`](https://github.com/hail-is/hail/blob/2b0aded9206849252b453dd80710cea8d2156793/hail/src/main/scala/is/hail/HailContext.scala#L421-L440)
which *does not* strip any protocols from the path.

This has gone undetected because we never try to read data through the OS's filesystem. We always
use gs://, Azure, or s3:// because we do not test in environments that have a networked file system
mounted in the OS's filesystem. To replicate this bug (and add a test for it), we would need a
cluster with a lustre file system (or another networked filesystem). This would be a fairly large
lift. The fix is trivial: just never intentionally strip the protocol!
@danking danking merged commit 1f11d2d into hail-is:main Nov 30, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[query] hail does not preserve schemes on URLs when using the HadoopFS in the SparkBackend
2 participants