[fs] support ls on a bucket #14167

Closed
@danking danking commented Jan 17, 2024

I just ran into this trying to probe the bge-neale bucket.

Teaches `hfs.ls('gs://bucket/')` to list the files and directories at the top level of the bucket.

On `main`, that command fails because this line of `_ls_no_glob` raises:

```python3
maybe_sb_and_t, maybe_contents = await asyncio.gather(
    self._size_bytes_and_time_modified_or_none(path), ls_as_dir()
)
```

In particular, `statfile` raises a cloud-specific, esoteric error about a malformed URL or empty
object names:

```python3
async def _size_bytes_and_time_modified_or_none(self, path: str) -> Optional[Tuple[int, float]]:
    try:
        # Hadoop semantics: creation time is used if the object has no notion of last modification time.
        file_status = await self.afs.statfile(path)
        return (await file_status.size(), file_status.time_modified().timestamp())
    except FileNotFoundError:
        return None
```

I decided to add a sub-class of `FileNotFoundError` which is self-describing: `IsABucketError`.

I changed most methods to raise that error when given a bucket URL. The two interesting cases:

1. `isdir`. This raises an error, but I could also see it returning `True`: a bucket is like a
   directory whose path/name is empty.

2. `isfile`. This returns `False`, but I could also see it raising an error. Returning `False` just
   seems convenient: we know the bucket is not a file, so we should say so.
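
A rough sketch of why subclassing `FileNotFoundError` is convenient here, assuming toy stand-ins rather than the real `hailtop` filesystem classes: `is_bucket_url`, the module-level `statfile`, and the `isdir` trailing-slash rule are all hypothetical. The point is only that existing `except FileNotFoundError` handlers keep working, while callers that care can catch the more specific error.

```python
import asyncio
from typing import Optional, Tuple
from urllib.parse import urlparse


class IsABucketError(FileNotFoundError):
    """A bare bucket URL was given where an object path was expected."""


def is_bucket_url(url: str) -> bool:
    # Hypothetical helper: true when nothing follows the bucket name.
    return urlparse(url).path in ('', '/')


async def statfile(url: str) -> Tuple[int, float]:
    # Stand-in for a cloud stat call; it pretends every object exists and is empty.
    if is_bucket_url(url):
        raise IsABucketError(url)
    return (0, 0.0)


async def size_bytes_and_time_modified_or_none(url: str) -> Optional[Tuple[int, float]]:
    try:
        return await statfile(url)
    except FileNotFoundError:
        # IsABucketError lands here too because it subclasses FileNotFoundError.
        return None


async def isfile(url: str) -> bool:
    # A bucket is never a file, so answer False instead of raising.
    try:
        await statfile(url)
        return True
    except FileNotFoundError:
        return False


async def isdir(url: str) -> bool:
    # Mirrors the choice described above: a bucket URL is rejected loudly.
    if is_bucket_url(url):
        raise IsABucketError(url)
    return url.endswith('/')  # toy rule, only for this sketch
```

With this shape, the stat helper returns `None` for `gs://bucket/`, so the directory listing running alongside it in `asyncio.gather` can still supply the result.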

---

Apparently `hfs.ls` currently had no tests because the globbing system doesn't work with Azure
https:// URLs. I fixed it to use `AsyncFSURL.with_new_path_component`, which is resilient to Azure
https weirdness. However, I had to change `with_new_path_component` to special-case an empty path.
I wanted this to hold:

```python3
actual = str(afs.parse_url('gs://bucket').with_new_path_component('bar'))
expected = 'gs://bucket/bar'
assert actual == expected
```

But `with_new_path_component` interacts badly with `GoogleAsyncFSURL.__str__` to return this:

```
'gs://bucket//bar'
```
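
One plausible way the double slash can arise, sketched with a hypothetical `BucketURL` class (not Hail's `GoogleAsyncFSURL`, whose internals are not shown here): if the path is stored without a leading slash, `__str__` always inserts one after the bucket, and a naive join prepends another for an empty path. Special-casing the empty path avoids that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketURL:
    # Hypothetical stand-in used only for illustration.
    bucket: str
    path: str  # object name without a leading slash; '' means the bucket itself

    def naive_with_new_path_component(self, component: str) -> 'BucketURL':
        # Always joins with '/', so an empty path becomes '/bar'.
        return BucketURL(self.bucket, f'{self.path}/{component}')

    def with_new_path_component(self, component: str) -> 'BucketURL':
        # Special case: an empty path takes the component as-is.
        if self.path == '':
            return BucketURL(self.bucket, component)
        return BucketURL(self.bucket, f'{self.path}/{component}')

    def __str__(self) -> str:
        # The separator after the bucket is added here, exactly once.
        return f'gs://{self.bucket}/{self.path}'


root = BucketURL('bucket', '')
print(root.naive_with_new_path_component('bar'))  # gs://bucket//bar
print(root.with_new_path_component('bar'))        # gs://bucket/bar
```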