Goals and use cases for DASStore #3

Open
John-Ragland opened this issue Jan 31, 2023 · 1 comment

Comments

@John-Ragland
Contributor

I thought it might be a good idea to talk through the goals of DASStore and brainstorm all of the use cases that we might want to cater to. This will help in figuring out how best to organize the package contents.

Convert to cloud friendly formats

Basically, what I was thinking is that DASStore could be a toolbox that would allow someone to convert existing DAS data formats (such as HDF5) to more cloud-friendly data structures (such as Zarr).

Supported Input formats

For this, it would be ideal to automatically support all of the data file types and organizational schemes produced by the available interrogators. The only file type that I've worked with so far is HDF5 from an OptaSense interrogator.

Supported Output Formats

  • Zarr
  • Parquet (?) I don't have any experience with this; I just know that it's also cloud-optimized
  • chunked HDF5 (?) I don't have any experience with this either

These output formats will probably initially be related to the benchmarking that @niyiyu is doing

Other goals

  • Ideally, it would be nice if everything was also compatible with xarray, although this creates some limitations on data structures (xarray Datasets only allow a single depth of variables).

Are there any other uses we should consider?

@niyiyu
Owner

niyiyu commented Feb 3, 2023

Thanks for bringing these up. I have more ideas about this DASStore package.

Use case

There are basically three use cases I need to consider.

  • Research group level: the whole solution should be easy to deploy on a local file system. The DAS data is local and arrives in real time. The computing is also local, or at least within the campus Ethernet with 1 Gigabyte bandwidth. The object storage server may not be as powerful as S3 and may not be able to handle too many requests.
  • S3 compatible: minimal modification should be required to move code to the commercial cloud (e.g., AWS). We can move and convert data to the cloud storage. The network is unlikely to saturate, with typical EC2 bandwidth of ~10 Gigabyte.
  • Potentially this data schema would be used by IRIS DMC in the future. Ideally this should not introduce too much burden for the data center.

Input format

  • HDF5/ASDF
  • SEG-Y: a lot of DAS data is also in this old seismic data format. The data is not arranged as a matrix; instead, it is arranged as header1-trace1-header2-trace2-....
  • TDMS: I also don't have much experience with this format.
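
To illustrate the interleaved SEG-Y layout mentioned above, here is a rough sketch that walks the header-trace pairs and collects the traces into rows of a matrix. It is deliberately simplified: it assumes a single 3200-byte textual header, fixed-length traces, and sample format code 5 (4-byte IEEE float). The helper `make_synthetic_segy` is hypothetical and only exists to build a tiny in-memory file for the demo.

```python
import struct

TEXT_HEADER = 3200   # textual file header, in bytes
BIN_HEADER = 400     # binary file header, in bytes
TRACE_HEADER = 240   # header in front of every trace, in bytes


def read_segy_traces(buf):
    """Return the traces as a list of rows, one list of floats per trace."""
    # Samples per trace: 2-byte big-endian integer at file bytes 3221-3222.
    (n_samples,) = struct.unpack_from(">H", buf, TEXT_HEADER + 20)
    # Sample format code at file bytes 3225-3226; 5 means 4-byte IEEE float.
    (fmt_code,) = struct.unpack_from(">H", buf, TEXT_HEADER + 24)
    if fmt_code != 5:
        raise ValueError("this sketch only handles format code 5 (IEEE float)")
    trace_len = TRACE_HEADER + 4 * n_samples
    offset = TEXT_HEADER + BIN_HEADER
    traces = []
    while offset + trace_len <= len(buf):
        # Skip the per-trace header, then decode the samples that follow it.
        samples = struct.unpack_from(f">{n_samples}f", buf, offset + TRACE_HEADER)
        traces.append(list(samples))
        offset += trace_len
    return traces


def make_synthetic_segy(rows):
    """Hypothetical helper: build a minimal SEG-Y-like byte string from rows."""
    n_samples = len(rows[0])
    bin_header = bytearray(BIN_HEADER)
    struct.pack_into(">H", bin_header, 20, n_samples)
    struct.pack_into(">H", bin_header, 24, 5)
    out = bytearray(b" " * TEXT_HEADER) + bin_header
    for row in rows:
        out += bytes(TRACE_HEADER) + struct.pack(f">{n_samples}f", *row)
    return bytes(out)


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
matrix = read_segy_traces(make_synthetic_segy(rows))
```

Because every trace carries its own header, getting a time slice across all channels means touching every trace in the file, which is part of why a matrix-shaped, chunked format is attractive for DAS.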

Cloud-optimized data format & solution

  • Zarr: I am not too worried about the data structure, as a single depth makes no difference to the data itself. The only concern I have about Zarr is that the chunk size has to be big (~MB); otherwise there would be too many small chunks, which slows down conversion and the file system. This potentially increases the network throughput/egress cost, which is not ideal for a big data center.
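
The chunk-size trade-off can be put into numbers with a small back-of-the-envelope helper. This is only a sketch: the array shape below (60000 time samples by 3000 channels of float32) is a made-up example, sizes are uncompressed, and `chunk_stats` is a hypothetical name.

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (bytes per uncompressed chunk, number of chunks) for an array."""
    chunk_bytes = itemsize * math.prod(chunks)
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return chunk_bytes, n_chunks


shape = (60000, 3000)  # hypothetical: time samples x channels
small = chunk_stats(shape, (100, 100), itemsize=4)    # (40000, 18000)
large = chunk_stats(shape, (6000, 300), itemsize=4)   # (7200000, 100)
```

With 40 kB chunks the same array becomes 18000 separate objects, while ~7 MB chunks give only 100. On the other hand, reading a small window from the large-chunk layout pulls in at least one full chunk.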

Just to bring more possibilities into this conversation, here are some more options.

  • kerchunk: basically, the library stores the byte range of each dataset in an HDF5 file as an index. It seems that this has to work with byte-range requests, but I think that is not well supported at the moment.
  • h5coro: a cloud-optimized, read-only library that allows reading an H5 dataset directly from S3. It doesn't work with local MinIO yet.
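
The byte-range index idea behind kerchunk can be sketched with the standard library alone. This does not use the kerchunk API; it only mimics the concept of mapping a chunk key to `[path, offset, length]` and then reading just those bytes. The function names and chunk keys are hypothetical, and on an object store the `seek`/`read` pair would be replaced by an HTTP range request.

```python
import os
import tempfile


def build_index(path, blobs):
    """Write blobs back to back into one file; return {key: [path, offset, length]}."""
    index = {}
    offset = 0
    with open(path, "wb") as f:
        for key, blob in blobs.items():
            f.write(blob)
            index[key] = [path, offset, len(blob)]
            offset += len(blob)
    return index


def read_range(index, key):
    """Fetch one chunk by reading only its byte range from the original file."""
    path, offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


demo_path = os.path.join(tempfile.mkdtemp(), "data.bin")
demo_index = build_index(demo_path, {"das/0.0": b"aaaa", "das/0.1": b"bbbbbb"})
second_chunk = read_range(demo_index, "das/0.1")
```

The appeal of this approach is that the original file stays untouched; only the small index has to be generated and stored alongside it.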

More thoughts?
