Replace to_arrayset/from_arrayset with to_buffers/from_buffers and deprecate the original. #592

Merged
merged 8 commits into from
Dec 11, 2020
4 changes: 2 additions & 2 deletions docs-src/_toc.yml
@@ -20,8 +20,8 @@
title: "Arrow and Parquet"
- file: how-to-convert-pandas
title: "Pandas"
- file: how-to-convert-arrayset
title: "Generic array-sets"
- file: how-to-convert-buffers
title: "Generic buffers"

- file: how-to-create
title: "Creating new arrays"
@@ -11,10 +11,10 @@ kernelspec:
name: python3
---

Generic array-sets
==================
Generic buffers
===============

Most of the conversion functions target a particular library: NumPy, Arrow, Pandas, or Python itself. As a catch-all for other storage formats, Awkward Arrays can be converted to and from "array-sets," sets of named arrays with a schema that can be used to reconstruct the original array. This section will demonstrate how an array-set can be used to store an Awkward Array in an HDF5 file, which ordinarily wouldn't be able to represent nested, irregular data structures.
Most of the conversion functions target a particular library: NumPy, Arrow, Pandas, or Python itself. As a catch-all for other storage formats, Awkward Arrays can be converted to and from sets of named buffers. The buffers are not (usually) intelligible on their own; the length of the array and a JSON document are needed to reconstitute the original structure. This section demonstrates how these buffers can be used to store an Awkward Array in an HDF5 file, which ordinarily wouldn't be able to represent nested, irregular data structures.

```{code-cell} ipython3
import awkward as ak
@@ -23,8 +23,8 @@ import h5py
import json
```

From Awkward to an array-set
----------------------------
From Awkward to buffers
-----------------------

Consider the following complex array:

@@ -37,18 +37,17 @@ ak_array = ak.Array([
ak_array
```

The [ak.to_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_arrayset.html) function decomposes it into a set of one-dimensional arrays (a zero-copy operation).
The [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html) function decomposes it into a set of one-dimensional arrays (a zero-copy operation).

```{code-cell} ipython3
form, container, num_partitions = ak.to_arrayset(ak_array)
form, length, container = ak.to_buffers(ak_array)
```
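
To make "a set of one-dimensional arrays" more concrete, here is a conceptual sketch in plain Python. It is not how Awkward Array is implemented; it only illustrates the idea behind a buffer such as `node0-offsets`: a list of variable-length lists can be stored as one flat `content` list plus an `offsets` list that marks where each sublist starts and stops. The helper names `to_flat` and `from_flat` are hypothetical.

```python
def to_flat(nested):
    """Split a list of lists into (offsets, content)."""
    offsets = [0]
    content = []
    for sublist in nested:
        content.extend(sublist)
        offsets.append(len(content))  # running end position of each sublist
    return offsets, content


def from_flat(offsets, content):
    """Rebuild the list of lists from (offsets, content)."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


nested = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
offsets, content = to_flat(nested)
print(offsets)  # [0, 3, 3, 5]
print(content)  # [1.1, 2.2, 3.3, 4.4, 5.5]
assert from_flat(offsets, content) == nested
```

Neither flat list is meaningful alone, which is why a description of the structure has to travel with the buffers.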

The pieces needed to reconstitute this array are:

* the [Form](https://awkward-array.readthedocs.io/en/latest/ak.forms.Form.html), which defines how structure is built from one-dimensional arrays,
* the one-dimensional arrays in the `container` (a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes)),
* the number of partitions, if any,
* the length of the original array or lengths of all partitions ([ak.partitions](https://awkward-array.readthedocs.io/en/latest/_auto/ak.partitions.html)) are needed if we wish to read it back _lazily_ (more on that below).
* the length of the original array or lengths of all of its partitions ([ak.partitions](https://awkward-array.readthedocs.io/en/latest/_auto/ak.partitions.html)),
* the one-dimensional arrays in the `container` (a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes)).

The [Form](https://awkward-array.readthedocs.io/en/latest/ak.forms.Form.html) is like an Awkward [Type](https://awkward-array.readthedocs.io/en/latest/ak.types.Type.html) in that it describes how the data are structured, but with more detail: it includes distinctions such as the difference between [ListArray](https://awkward-array.readthedocs.io/en/latest/ak.layout.ListArray.html) and [ListOffsetArray](https://awkward-array.readthedocs.io/en/latest/ak.layout.ListOffsetArray.html), as well as the integer types of structural [Indexes](https://awkward-array.readthedocs.io/en/latest/ak.layout.Index.html).

@@ -58,48 +57,42 @@ It is usually presented as JSON, and has a compact JSON format (when [Form.tojso
form
```

This `container` is a new dict, but it could have been a user-specified [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes).
In this case, the `length` is just an integer. It would be a list of integers if `ak_array` were partitioned.

```{code-cell} ipython3
container
length
```

This array has no partitions.
This `container` is a new dict, but it could have been a user-specified [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes) if passed into [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html) as an argument.

```{code-cell} ipython3
num_partitions is None
```

This is also what we find from [ak.partitions](https://awkward-array.readthedocs.io/en/latest/_auto/ak.partitions.html).

```{code-cell} ipython3
ak.partitions(ak_array) is None
container
```

From array-set to Awkward
-------------------------
From buffers to Awkward
-----------------------

The function that reverses [ak.to_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_arrayset.html) is [ak.from_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_arrayset.html). Its first three arguments are `form`, `container`, and `num_partitions`.
The function that reverses [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html) is [ak.from_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_buffers.html). Its first three arguments are `form`, `length`, and `container`.

```{code-cell} ipython3
ak.from_arrayset(form, container, num_partitions)
ak.from_buffers(form, length, container)
```

Saving Awkward Arrays to HDF5
-----------------------------

The [h5py](https://www.h5py.org/) library presents each group in an HDF5 file as a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes), which we can use as a container for an array-set. We must also save the `form`, `num_partitions`, and `length` as metadata for the array to be retrievable.
The [h5py](https://www.h5py.org/) library presents each group in an HDF5 file as a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes), which we can use as a container for the buffers. We must also save the `form` and `length` as metadata for the array to be retrievable.

```{code-cell} ipython3
file = h5py.File("/tmp/example.hdf5", "w")
group = file.create_group("awkward")
group
```

We can fill this `group` as a `container` by passing it in to [ak.to_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_arrayset.html).
We can fill this `group` as a `container` by passing it in to [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html).

```{code-cell} ipython3
form, container, num_partitions = ak.to_arrayset(ak_array, container=group)
form, length, container = ak.to_buffers(ak_array, container=group)
```

```{code-cell} ipython3
@@ -115,7 +108,7 @@ container.keys()
Here's one.

```{code-cell} ipython3
np.asarray(container["node0-offsets"])
np.asarray(container["part0-node0-offsets"])
```

Now we need to add the other information to the group as metadata. Since HDF5 accepts string-valued metadata, we can put it all in as JSON or numbers.
@@ -126,38 +119,27 @@ group.attrs["form"]
```

```{code-cell} ipython3
group.attrs["num_partitions"] = json.dumps(num_partitions)
group.attrs["num_partitions"]
```

```{code-cell} ipython3
group.attrs["partition_lengths"] = json.dumps(ak.partitions(ak_array))
group.attrs["partition_lengths"]
```

```{code-cell} ipython3
group.attrs["length"] = len(ak_array)
group.attrs["length"] = json.dumps(length) # JSON-encode it because it might be a list
group.attrs["length"]
```
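
The reason for JSON-encoding the length can be checked with the standard library alone. This small sketch (the variable names are illustrative) shows that both an integer length and a list of per-partition lengths survive a round trip through a string, so a single string-valued attribute can hold either case.

```python
import json

single = json.dumps(3)            # length of an unpartitioned array
partitioned = json.dumps([2, 1])  # one length per partition

print(single, partitioned)

# Decoding restores the original Python values in both cases.
assert json.loads(single) == 3
assert json.loads(partitioned) == [2, 1]
```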

Reading Awkward Arrays from HDF5
--------------------------------

With that, we can reconstitute the array by supplying [ak.from_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_arrayset.html) the right arguments from the group and metadata.
With that, we can reconstitute the array by supplying [ak.from_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_buffers.html) the right arguments from the group and metadata.

The group can't be used as a `container` as-is, since subscripting it returns `h5py.Dataset` objects, rather than arrays.

```{code-cell} ipython3
reconstituted = ak.from_arrayset(
reconstituted = ak.from_buffers(
ak.forms.Form.fromjson(group.attrs["form"]),
json.loads(group.attrs["length"]),
{k: np.asarray(v) for k, v in group.items()},
)
reconstituted
```

Like [ak.from_parquet](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_parquet.html), [ak.from_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_arrayset.html) has the option to read lazily, only accessing record fields and partitions that are accessed.

To do so, we need to pass `lazy=True`, but also the total length of the array (if not partitioned) or the lengths of all the partitions (if partitioned).
Like [ak.from_parquet](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_parquet.html), [ak.from_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_buffers.html) has the option to read lazily, loading only the record fields and partitions that are actually accessed.

```{code-cell} ipython3
class LazyGet:
@@ -168,11 +150,11 @@ class LazyGet:
print(key)
return np.asarray(self.group[key])

lazy = ak.from_arrayset(
lazy = ak.from_buffers(
ak.forms.Form.fromjson(group.attrs["form"]),
json.loads(group.attrs["length"]),
LazyGet(group),
lazy=True,
lazy_lengths = group.attrs["length"],
)
```
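
The wrapper idea behind `LazyGet` can be tried without HDF5. The sketch below is an illustration under stated assumptions: a plain dict stands in for the h5py group, the key names are made up for the example, and the hypothetical class `RecordingGet` records requested keys in a list instead of printing them.

```python
class RecordingGet:
    """Wraps a mapping and records which keys are requested."""

    def __init__(self, group):
        self.group = group
        self.requested = []

    def __getitem__(self, key):
        self.requested.append(key)  # note the access before delegating
        return self.group[key]


# A plain dict stands in for the HDF5 group; the key names are illustrative.
fake_group = {
    "part0-node0-offsets": [0, 3, 3, 5],
    "part0-node1-data": [1.1, 2.2, 3.3, 4.4, 5.5],
}
lazy_source = RecordingGet(fake_group)

offsets = lazy_source["part0-node0-offsets"]  # only this buffer is touched
print(lazy_source.requested)
```

A consumer that subscripts the wrapper only for the buffers it needs leaves the rest unread, which is the behavior the printed keys in `LazyGet` are meant to reveal.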

2 changes: 1 addition & 1 deletion docs-src/how-to-convert.md
@@ -20,4 +20,4 @@ Converting arrays
* **[ROOT via Uproot](how-to-convert-uproot)**
* **[Arrow and Parquet](how-to-convert-arrow)**
* **[Pandas](how-to-convert-pandas)**
* **[Generic array-sets](how-to-convert-arrayset)**
* **[Generic buffers](how-to-convert-buffers)**
32 changes: 31 additions & 1 deletion src/awkward/_util.py
@@ -761,7 +761,9 @@ def apply(inputs, depth, user):
outcontent = apply(nextinputs, depth + 1, user)
assert isinstance(outcontent, tuple)

return tuple(ak.layout.RegularArray(x, maxsize, maxlen) for x in outcontent)
return tuple(
ak.layout.RegularArray(x, maxsize, maxlen) for x in outcontent
)

elif not all_same_offsets(nplike, inputs):
fcns = [
@@ -1695,3 +1697,31 @@ def union_to_record(unionarray, anonymous):
)

return ak.layout.RecordArray(all_fields, all_names, len(unionarray))


+def adjust_old_pickle(form, container, num_partitions, behavior):
+    def key_format(**v):
+        if num_partitions is None:
+            if v["attribute"] == "data":
+                return "{form_key}".format(**v)
+            else:
+                return "{form_key}-{attribute}".format(**v)
+
+        else:
+            if v["attribute"] == "data":
+                return "{form_key}-part{partition}".format(**v)
+            else:
+                return "{form_key}-{attribute}-part{partition}".format(**v)
+
+    return ak.operations.convert.from_buffers(
+        form,
+        None,
+        container,
+        partition_start=0,
+        key_format=key_format,
+        lazy=False,
+        lazy_cache="new",
+        lazy_cache_key=None,
+        highlevel=False,
+        behavior=behavior,
+    )
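
The `key_format` callback in `adjust_old_pickle` reproduces the key names that old pickles used. The following standalone sketch copies that logic into a hypothetical factory, `make_key_format`, so the resulting names can be inspected without Awkward Array installed. It assumes the callback receives `form_key`, `attribute`, and `partition` as keyword arguments, as the format strings above imply.

```python
def make_key_format(num_partitions):
    """Return a key_format callback for the old array-set naming scheme."""

    def key_format(**v):
        if num_partitions is None:
            # Unpartitioned arrays had no "part" suffix.
            if v["attribute"] == "data":
                return "{form_key}".format(**v)
            return "{form_key}-{attribute}".format(**v)
        # Partitioned arrays appended the partition number.
        if v["attribute"] == "data":
            return "{form_key}-part{partition}".format(**v)
        return "{form_key}-{attribute}-part{partition}".format(**v)

    return key_format


unpartitioned = make_key_format(None)
partitioned = make_key_format(2)

print(unpartitioned(form_key="node0", attribute="offsets", partition=0))  # node0-offsets
print(unpartitioned(form_key="node1", attribute="data", partition=0))     # node1
print(partitioned(form_key="node0", attribute="offsets", partition=1))    # node0-offsets-part1
print(partitioned(form_key="node1", attribute="data", partition=1))       # node1-part1
```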
34 changes: 25 additions & 9 deletions src/awkward/highlevel.py
@@ -1386,16 +1386,24 @@ def numba_type(self):
         return numba.typeof(self._numbaview)

     def __getstate__(self):
-        form, container, num_partitions = ak.to_arrayset(self)
+        form, length, container = ak.operations.convert.to_buffers(self._layout)
         if self._behavior is ak.behavior:
             behavior = None
         else:
             behavior = self._behavior
-        return form, container, num_partitions, behavior
+        return form, length, container, behavior

     def __setstate__(self, state):
-        form, container, num_partitions, behavior = state
-        layout = ak.from_arrayset(form, container, num_partitions, highlevel=False)
+        if isinstance(state[1], dict):
+            form, container, num_partitions, behavior = state
+            layout = ak._util.adjust_old_pickle(
+                form, container, num_partitions, behavior
+            )
+        else:
+            form, length, container, behavior = state
+            layout = ak.operations.convert.from_buffers(
+                form, length, container, highlevel=False, behavior=behavior
+            )
         if self.__class__ is Array:
             self.__class__ = ak._util.arrayclass(layout, behavior)
         self.layout = layout
@@ -1975,17 +1983,25 @@ def numba_type(self):
         return numba.typeof(self._numbaview)

     def __getstate__(self):
-        form, container, num_partitions = ak.to_arrayset(self._layout.array)
+        form, length, container = ak.operations.convert.to_buffers(self._layout.array)
         if self._behavior is ak.behavior:
             behavior = None
         else:
             behavior = self._behavior
-        return form, container, num_partitions, behavior, self._layout.at
+        return form, length, container, behavior, self._layout.at

     def __setstate__(self, state):
-        form, container, num_partitions, behavior, at = state
-        array = ak.from_arrayset(form, container, num_partitions, highlevel=False)
-        layout = ak.layout.Record(array, at)
+        if isinstance(state[1], dict):
+            form, container, num_partitions, behavior, at = state
+            layout = ak._util.adjust_old_pickle(
+                form, container, num_partitions, behavior
+            )
+        else:
+            form, length, container, behavior, at = state
+            layout = ak.operations.convert.from_buffers(
+                form, length, container, highlevel=False, behavior=behavior
+            )
+        layout = ak.layout.Record(layout, at)
         if self.__class__ is Record:
             self.__class__ = ak._util.recordclass(layout, behavior)
         self.layout = layout
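
The `isinstance(state[1], dict)` check in `__setstate__` is what keeps old pickles loadable: in the old state tuple the second slot held the container (a dict), while in the new one it holds the length. A minimal sketch of the same dispatch on a toy class follows. `Box` and `restore` are hypothetical names, and the hooks are called directly rather than through `pickle` to keep the example self-contained.

```python
class Box:
    """Toy container whose saved state changed layout between versions."""

    def __init__(self, form, length, container):
        self.form, self.length, self.container = form, length, container

    def __getstate__(self):
        # New layout: (form, length, container)
        return self.form, self.length, self.container

    def __setstate__(self, state):
        if isinstance(state[1], dict):
            # Old layout: (form, container, num_partitions); no length stored
            form, container, _num_partitions = state
            length = None
        else:
            form, length, container = state
        self.form, self.length, self.container = form, length, container


def restore(state):
    """Rebuild a Box from a state tuple, as pickle would on load."""
    box = Box.__new__(Box)
    box.__setstate__(state)
    return box


from_new = restore(Box("form", 3, {"k": [1, 2, 3]}).__getstate__())
from_old = restore(("form", {"k": [1, 2, 3]}, None))

print(from_new.length, from_old.length)  # 3 None
```

Using `None` for the missing length mirrors the `None` that `adjust_old_pickle` passes as the second argument of `from_buffers`.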