Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] fix #13998 #14057

Merged
merged 1 commit into from
Nov 30, 2023
Merged

[query] fix #13998 #14057

merged 1 commit into from
Nov 30, 2023

Commits on Nov 30, 2023

  1. [query] fix hail-is#13998

    CHANGELOG: Fix hail-is#13998 which appeared in 0.2.58 and prevented reading from a networked filesystem mounted within the filesystem of the worker node for certain pipelines (those that did not trigger "lowering").
    
    We use the IndexReader in `PartitionNativeIntervalReader`, `PartitionNativeReaderIndexed`, and
    `PartitionZippedIndexedNativeReader`.
    
    1. `PartitionNativeIntervalReader` is only used by `query_table`.
    
    2. `PartitionNativeReaderIndexed` is only used by `IndexedRVDSpec2.readTableStage` which is used by
       `TableNativeReader` when there is a new partitioner.
    
    3. `PartitionZippedIndexedNativeReader` is only sued by `AbstractRVDSpec.readZippedLowered` when
       there is a new partitioner.
    
    Two is for tables, three is for matrix tables. In `readZippedLowered` we explicitly [drop the file
    protocol](https://github.com/hail-is/hail/blob/1dedf3c63f9aabf1b6ce538165360056f82f76e4/hail/src/main/scala/is/hail/rvd/AbstractRVDSpec.scala#L154-L155):
    
    ```
            val absPathLeft = removeFileProtocol(pathLeft)
            val absPathRight = removeFileProtocol(pathRight)
    ```
    
    We have done this, by various names, since this lowered code path was added. I added
    `removeFileProtocol` because stripping the protocol in Query-on-Batch prevented the reading and
    writing of gs:// URIs, the only URIs I could read in QoB. `uriPath` (the function whose use I
    replaced with `removeFileProtocol`) was added by Cotton [a very long time
    ago](hail-is@92a9936). It seems he
    added it so that he could use HDFS to generate a temporary file path on the local filesystem but
    pass the file path to binary tools that know nothing of HDFS and file:// URIs.
    
    [`readIndexedPartitions`](https://github.com/hail-is/hail/blob/2b0aded9206849252b453dd80710cea8d2156793/hail/src/main/scala/is/hail/HailContext.scala#L421-L440)
    which *does not* strip any protocols from the path.
    
    This has gone undetected because we never try to read data through the OS's filesystem. We always
    use gs://, Azure, or s3:// because we do not test in environments that have a networked file system
    mounted in the OS's filesystem. To replicate this bug (and add a test for it), we would need a
    cluster with a lustre file system (or another networked filesystem). This would be a fairly large
    lift. The fix is trivial: just never intentionally strip the protocol!
    Dan King committed Nov 30, 2023
    Configuration menu
    Copy the full SHA
    8abc83f View commit details
    Browse the repository at this point in the history