Replies: 1 comment
- As a workaround I tried to coerce the variable to the vlen dtype that Xarray uses internally, but this had no effect:

  ```python
  VLEN_DTYPE = xr.coding.strings.create_vlen_dtype(str)
  ds["foo"] = ds["foo"].astype(VLEN_DTYPE)
  ```

  I think this is because
I want to concatenate a collection of datasets with a variable that contains variable-length strings and save them to Zarr. The strings are mostly very short (1-2 characters), but a few can be much longer (thousands of characters) and can be of arbitrary length, so they are not suitable for a fixed-size representation.
Zarr's VLenUTF8 is ideal for this. It is the approach recommended at http://xarray.pydata.org/en/latest/user-guide/io.html#zarr, which says: "To store variable length strings, convert them to object arrays first with dtype=object".
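For reference, the conversion the docs describe is just a cast to object dtype. A short illustrative snippet (the data and variable name `foo` here are made up to match the discussion below):

```python
import numpy as np
import xarray as xr

# A plain string array gets a fixed-width dtype ("<U1000" here),
# padded to the longest element.
ds = xr.Dataset({"foo": ("t", np.array(["a", "bc", "x" * 1000]))})

# Casting to object dtype makes the strings variable-length, which
# Xarray can then store in Zarr with the VLenUTF8 codec.
ds["foo"] = ds["foo"].astype(object)
```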
The datasets to be concatenated are produced in parallel, and the size of each is not known in advance.
Here's the MVCE:
This produces a warning, because it has loaded the entire `ds.foo` variable into memory to determine its dtype. Obviously this does not scale to large datasets.

So, my question is: how can I concatenate datasets with variable-length strings without materializing them in memory?