Future Data Items

Chris Meyer edited this page Apr 4, 2024 · 1 revision

This page describes a feature planned for a future version of Nion Swift.

Data Items Version 2

The second version of data items has expanded capabilities, incorporating ideas from HDF5, pandas, HyperSpy, xarray, and other libraries.

As before, data items should be fully mappable both to HDF5 and to a simpler JSON + NumPy directory structure or zip file.
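
As a rough illustration of the simpler mapping, the sketch below packs one data item into a zip archive with a JSON entry for properties and a binary entry for values. The entry names `metadata.json` and `data.bin`, the header keys, and the functions `write_item` and `read_item` are assumptions made for this example, not the Nion Swift file format. To keep the sketch dependency-free, the `struct` module stands in for a NumPy array file.

```python
import io
import json
import struct
import zipfile


def write_item(data, shape, properties):
    """Pack a flat list of floats plus JSON properties into zip bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Hypothetical layout: one JSON entry for properties, one binary entry for values.
        header = {"shape": list(shape), "dtype": "<f8", "properties": properties}
        archive.writestr("metadata.json", json.dumps(header))
        archive.writestr("data.bin", struct.pack(f"<{len(data)}d", *data))
    return buffer.getvalue()


def read_item(blob):
    """Recover the values, shape, and properties written by write_item."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        header = json.loads(archive.read("metadata.json"))
        raw = archive.read("data.bin")
    count = len(raw) // 8  # eight bytes per little-endian float64
    values = list(struct.unpack(f"<{count}d", raw))
    return values, tuple(header["shape"]), header["properties"]


blob = write_item([0.0, 1.5, 2.5, 4.0], (2, 2), {"title": "example", "units": "nm"})
values, shape, properties = read_item(blob)
```

The same header and value payload could equally be written as HDF5 attributes and a dataset, which is the sense in which the two layouts are meant to be interchangeable.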

The new data items can support more complex organization. Key features include:

  • Numeric value structures: scalar, complex, rgb, rgba, vector, tensor.
  • General value structures including strings and timestamps.
  • Data dimensions 0d, 1d, 2d, and 3d.
  • Data organized into dimension sets, 1d, 2d, or 3d each.
  • Data stored contiguously or sparsely.
  • Datum dimension set corresponds to the final enclosed dimension set.
  • Collection/sequence dimension sets correspond to enclosing dimension sets.
  • Hierarchical organization into arrays, lists, structures and dictionaries.
  • Sub-views of data within hierarchy.
  • Intensity scales specified by formula.
  • Dimension scales specified by formula or coordinates. Shareable.
  • Arbitrary number of intensity scales attached to datum dimension set.
  • Arbitrary number of dimensional scales attached to each dimension.
  • Calibrations adhering to a unit standard.
  • Reference frames as list of dimension scales attached to dimension sets. Shareable.
  • Numeric data types, strings, timestamps, and references within data item.
  • Efficient conversions to various Python structures: NumPy, pandas, xarray.
  • Optionally include schema at various levels of organization.
    • Schema can separately describe recommended displays and reductions.
  • Attachable storage handlers (ndata, HDF5, zarr, etc.).
    • Supports partial paging to memory/disk
  • Fully observable (data, properties, insert/remove, etc.)
  • Data item is an interface with storage and memory drivers for implementation
  • Storage and memory drivers should be able to:
    • Be local or remote, with optimizations for each case
    • Optimize slicing and reduction
    • Provide asynchronous access, where results may not be available instantly
    • Pipeline updates (partial data updates)
    • Use the GPU
    • Specify a primary in-memory storage mechanism (NumPy array, pandas table, h5py memory map, GPU, etc.)
  • Improved definition of formal vs informal attributes.
  • Formal attributes:
    • Units (nm, ms, etc.), dimension scales, quantity type (length, time, etc.), reference frames
    • Domain (time or space vs frequency)
    • Provenance
    • Validity/timestamp of arbitrary data slices
      • Part of data from one scan, part from another.
    • Timestamps and timezone.
  • API
    • Fall back to old API when possible.
    • Improved indexing by calibrated value, similar to xarray's data.loc[calibrated] (TBD).
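
To make the scale-related items above more concrete, here is a minimal sketch of a dimension scale given by a formula, a dimension scale given by explicit coordinates, several scales attached to one dimension, and lookup by calibrated value. The class names `LinearScale` and `CoordinateScale`, their methods, and the `select` helper are all hypothetical; the eventual API is still to be determined.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LinearScale:
    """Scale given by a formula: calibrated = offset + scale * value."""
    offset: float
    scale: float
    units: str

    def to_calibrated(self, value: float) -> float:
        return self.offset + self.scale * value

    def to_index(self, calibrated: float) -> int:
        # Invert the formula and round to the nearest integer index.
        return round((calibrated - self.offset) / self.scale)


@dataclass(frozen=True)
class CoordinateScale:
    """Scale given by explicit coordinates, one per index."""
    coordinates: Sequence[float]
    units: str

    def to_calibrated(self, index: int) -> float:
        return self.coordinates[index]

    def to_index(self, calibrated: float) -> int:
        # Pick the index whose coordinate is closest to the requested value.
        return min(range(len(self.coordinates)),
                   key=lambda i: abs(self.coordinates[i] - calibrated))


def select(data, scale, calibrated):
    """Look up a datum by calibrated position along one dimension."""
    return data[scale.to_index(calibrated)]


data = [10.0, 12.0, 15.0, 11.0]

# Two dimension scales attached to the same dimension.
energy = LinearScale(offset=100.0, scale=0.5, units="eV")
time = CoordinateScale(coordinates=(0.0, 1.0, 2.5, 7.0), units="ms")

# An intensity scale specified by formula, applied to the stored values.
intensity = LinearScale(offset=0.0, scale=2.0, units="e-")

by_energy = select(data, energy, 101.0)         # index 2
by_time = select(data, time, 6.0)               # nearest coordinate is 7.0, index 3
calibrated_value = intensity.to_calibrated(by_energy)
```

Because both scale types expose the same two methods, a scale object could be shared between data items or grouped into a reference frame without the caller needing to know whether it is generated or explicit.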

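The driver-related items in the list above (a data item as an interface, asynchronous access, partial updates, observability) can be sketched as follows. This is one possible shape under stated assumptions: `StorageDriver`, `MemoryDriver`, `DataItem`, and every method name here are hypothetical and do not describe an existing Nion Swift interface.

```python
import asyncio
from typing import Callable, List, Protocol, Sequence, Tuple


class StorageDriver(Protocol):
    """Hypothetical driver interface; local and remote drivers share it."""

    async def read(self, start: int, stop: int) -> Sequence[float]: ...

    async def write(self, start: int, values: Sequence[float]) -> None: ...


class MemoryDriver:
    """In-memory driver; a file-backed or remote driver would offer the same methods."""

    def __init__(self, size: int) -> None:
        self._values = [0.0] * size

    async def read(self, start: int, stop: int) -> Sequence[float]:
        await asyncio.sleep(0)  # yield to the event loop, standing in for I/O latency
        return self._values[start:stop]

    async def write(self, start: int, values: Sequence[float]) -> None:
        await asyncio.sleep(0)
        self._values[start:start + len(values)] = values


class DataItem:
    """Data item as an interface: storage is delegated to the attached driver."""

    def __init__(self, driver: StorageDriver) -> None:
        self._driver = driver
        self._listeners: List[Callable[[int, int], None]] = []

    def add_listener(self, callback: Callable[[int, int], None]) -> None:
        self._listeners.append(callback)

    async def update(self, start: int, values: Sequence[float]) -> None:
        # Partial update: write one slice, then notify observers of the changed range.
        await self._driver.write(start, values)
        for listener in self._listeners:
            listener(start, start + len(values))

    async def read(self, start: int, stop: int) -> Sequence[float]:
        return await self._driver.read(start, stop)


async def demo() -> Tuple[Sequence[float], List[Tuple[int, int]]]:
    item = DataItem(MemoryDriver(6))
    changes: List[Tuple[int, int]] = []
    item.add_listener(lambda start, stop: changes.append((start, stop)))
    await item.update(0, [1.0, 2.0, 3.0])  # first half of the data arrives
    await item.update(3, [4.0, 5.0, 6.0])  # second half arrives later
    return await item.read(2, 5), changes


window, changes = asyncio.run(demo())
```

Keeping the driver behind a small asynchronous interface is what would let a GPU-backed or remote implementation replace `MemoryDriver` without changes to code that uses the data item.
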
Proposed Migration

  • Merge data item and data and metadata objects
  • Define better terminology and use it
  • Deprecate old methods and eliminate use within Nion libraries
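
One common way to deprecate an old accessor while keeping it working during migration is a thin shim that emits a `DeprecationWarning`. The sketch below assumes a merged object and a hypothetical old-style property named `legacy_data_and_metadata`; neither name is taken from the existing Nion Swift API.

```python
import warnings


class DataItem:
    """Hypothetical merged object holding data and metadata together."""

    def __init__(self, data, metadata):
        self.data = data
        self.metadata = dict(metadata)

    @property
    def legacy_data_and_metadata(self):
        # Hypothetical old-style accessor kept so that existing callers still work.
        warnings.warn(
            "legacy_data_and_metadata is deprecated; use the data item directly",
            DeprecationWarning,
            stacklevel=2,
        )
        return self


# Record warnings so the deprecation can be checked, for example in a test suite.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    item = DataItem([1, 2, 3], {"title": "example"})
    same = item.legacy_data_and_metadata
```

Running the Nion libraries' own tests with deprecation warnings promoted to errors would be one way to confirm that internal use of the old methods has been eliminated.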

HyperSpy

  • How is calibration supported?
  • Indexing is interesting
  • Possible data ordering issue
  • Not sure if it supports 5D data

Pandas

  • Only supports 1D Series and 2D DataFrame
  • No support for calibration info.

TensorFlow

  • Ragged lists

xarray

  • DataArray is very close to DataAndMetadata
  • Does not support data loading/unloading
  • Does not support generated coordinates; coordinates are explicit arrays
  • Missing intensity calibration?
  • Datasets have common dimensions shared across their variables.
  • Includes dimension names, coordinates

zarr

  • Storage
