Add example to create a virtual dataset using lithops #203
Conversation
This is awesome @thodson-usgs ! Thanks for trying this out!
I'll use #201 to track whatever regression has occurred there, so we can talk about the cool serverless stuff here.
Hmm. I pulled the latest version of VirtualiZarr in my testing environment, but I neglected to rebuild the runtime image, so I'll double-check that.
Right, I set up S3 storage for cubed, but I think this workflow is entirely in memory. So we'll invariably hit scaling issues unless we avoid the reduce by writing to disk during the map operation. Nevertheless, I'm really excited by how easy this was to set up, and I hope others will help improve upon it.
No, I rebuilt the runtime image using the latest VirtualiZarr commit, and the error persisted, so I believe it's real.
You're not using cubed at all here; that's for the actual rechunking.
Lithops does have the ability to persist things - did you set up the storage layer for that?
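(For context, a hypothetical sketch of pointing lithops at an S3 storage backend so intermediate results can be persisted; the backend, region, and bucket names are placeholders, not this PR's actual configuration.)

```python
import lithops

# Hypothetical lithops config: run on AWS Lambda and persist
# intermediate results in an S3 bucket. All values are placeholders.
config = {
    "lithops": {"backend": "aws_lambda", "storage": "aws_s3"},
    "aws": {"region": "us-east-1"},
    "aws_s3": {"storage_bucket": "my-lithops-bucket"},
}

fexec = lithops.FunctionExecutor(config=config)
```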
My plan for scaling this to arbitrary size is actually to use cubed for the virtualizarr array reduction too - see #123 (comment). I expect this to be pretty complicated to achieve though - I'm not even sure if it's possible yet.
Presumably this error can be reproduced without lithops involved at all?
Good point. I'll try that next, though something is fishy; otherwise, how did this work with previous versions?
Ah, okay. So maybe my hope of simply creating a skeleton zarr, then writing the meta-chunks during the map with `to_zarr(region=...)`, is half-baked. (The current workflow writes to JSON, but I'll test cloud-optimized formats once this is running.)
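(A minimal sketch of that skeleton-store idea using standard xarray APIs; the shapes, store path, and `open_one_piece` helper are hypothetical.)

```python
import numpy as np
import xarray as xr

n_files, chunk = 10, 100  # placeholder sizes

# Lazy template with the final shape and chunking of the combined dataset.
template = xr.Dataset({"var": (("time",), np.zeros(n_files * chunk))}).chunk({"time": chunk})

# Write the skeleton: metadata only, no chunk data is computed yet.
template.to_zarr("store.zarr", compute=False)

# Each map task then writes its own slice of the store independently.
def write_chunk(i):
    piece = open_one_piece(i)  # hypothetical loader for the i-th file
    piece.to_zarr("store.zarr", region={"time": slice(i * chunk, (i + 1) * chunk)})
```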
My current guess is that we simply introduced some accidental regression in virtualizarr recently. The way to find it is to (1) reproduce the error without all the lithops stuff (opening and concatenating 2 files should be enough), then (2) use `git bisect` to find the commit that introduced it.
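(A hedged sketch of such a minimal reproduction; the file paths are placeholders.)

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Open two files as virtual datasets and concatenate them, no lithops involved.
vds1 = open_virtual_dataset("file1.nc", indexes={})
vds2 = open_virtual_dataset("file2.nc", indexes={})
combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

# Serialize the references, then try to open the result with xarray
# to see whether the ValueError reproduces.
combined.virtualize.to_kerchunk("combined.json", format="json")
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "combined.json"},
    },
)
```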
If you write a manifest to zarr right now (i.e. a "virtual zarr store"), you have no way of opening or loading the data via xarray/zarr, because no zarr reader understands what a chunk manifest is.
Updated with the changes made in #206. Also fine to close this example, if cubed can do this better.
Thanks @thodson-usgs !
> Also fine to close this example, if cubed can do this better.

I think doing this with Cubed will be quite involved. If this works effectively for you then that's already awesome!
Co-authored-by: Tom Nicholas <tom@cworthy.org>
Thank you @thodson-usgs !!
print(f"{len(file_pattern)} file paths were retrieved.") | ||
|
||
|
||
def map_references(fil): |
Is there a reason not to use `file` instead of `fil`?
I must've copied that directly from an example. I'll need to check whether this follows some convention or is just a typo.
Maybe it was chosen to avoid shadowing `file`? Just a guess...
Ah, @douglatornell! I thought that was the case, but then I didn't see `file` in the reserved words list.
Yeah @thodson-usgs, I did a bit of a double take when I looked at the reserved words list and realized that `file` isn't actually a reserved word; it was a builtin in Python 2.
@thodson-usgs excited to see the lithops integration!
At the suggestion of @TomNicholas, I created a simple example using lithops (and serverless compute) to create a virtual dataset from a list of netCDF files hosted on S3.
This PR depends on the fix provided in #206 (now merged).
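For reference, the rough shape of the workflow (a hedged sketch, not the PR's exact code; the URLs and concat dimension are placeholders):

```python
import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

file_pattern = ["s3://bucket/file1.nc", "s3://bucket/file2.nc"]  # placeholder URLs

def map_references(fil):
    # Open one remote netCDF file as a virtual dataset (references only).
    return open_virtual_dataset(fil, indexes={})

def reduce_references(results):
    # Combine the per-file virtual datasets along the time dimension.
    return xr.combine_nested(results, concat_dim="time", coords="minimal", compat="override")

fexec = lithops.FunctionExecutor()
fexec.map_reduce(map_references, file_pattern, reduce_references)
combined = fexec.get_result()

# Serialize the combined references to a kerchunk-style JSON file.
combined.virtualize.to_kerchunk("combined.json", format="json")
```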
Resolved with #206
The workflow was broken in the latest version of VirtualiZarr.
The example runs fine on 5d08519.
However, using 179bb2a the workflow will run, but fails with `ValueError: Could not convert object to NumPy datetime` when I open the dataset using xarray.