CSV to Parquet recipe #94

Open
rabernat opened this issue Apr 6, 2021 · 4 comments

@rabernat
Contributor

rabernat commented Apr 6, 2021

So far we basically only have NetCDF (or other things that Xarray can read; e.g. Grib) to Zarr recipes.

Some recipes will want to work with tabular data, e.g. transforming a collection of CSVs to Parquet. (Example: pangeo-forge/staged-recipes#3)

This will require an entirely new recipe class. Creating this class will force us to refactor the recipe module significantly. This will be laborious but hopefully relatively straightforward.

@cisaacstern
Member

I'm sitting here with @einatlev-ldeo at the EarthCube Annual Meeting in La Jolla. We are discussing if/how we may be able to provide cloud-optimized access to (at least some subset of) the data provided on

via Pangeo Forge.

Based on our discussions, it seems that this may be a great use case for a Parquet recipe. It strikes me that once we complete the work scoped in #376, writing a Parquet recipe is perhaps quite approachable (really just a few additional PTransforms).

While we're waiting for the first-phase Beam work to complete, perhaps we can start brainstorming what data objects would make sense to assemble from these raw data. For example, is there a set (or sets) of variables with the same time resolution that could all fit together in a single large table? If so, what are those variables and their access paths on the file server? Can we assemble a demonstration CSV from them using a simple standalone Python script? If so, that would be a very useful basis for building a larger table with Pangeo Forge.
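As a rough starting point, a standalone script of that kind might look something like the sketch below. It uses only the standard library and joins variables on their shared timestamps. The variable names, timestamps, and values are made up for illustration and do not come from the actual file server; `assemble_table` and `write_csv` are hypothetical helper names.

```python
import csv
import io


def assemble_table(series_by_variable):
    """Join per-variable {timestamp: value} mappings on their shared timestamps."""
    variables = sorted(series_by_variable)
    # Keep only timestamps present in every variable so each row is complete.
    shared = set.intersection(*(set(s) for s in series_by_variable.values()))
    rows = [
        [t] + [series_by_variable[v][t] for v in variables]
        for t in sorted(shared)
    ]
    return ["time"] + variables, rows


def write_csv(header, rows, fileobj):
    """Write the header and rows as CSV to any text file-like object."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical values, for illustration only.
example = {
    "temperature": {"2022-06-15T00:00": 21.5, "2022-06-15T01:00": 21.1},
    "pressure": {
        "2022-06-15T00:00": 1012.0,
        "2022-06-15T01:00": 1011.6,
        "2022-06-15T02:00": 1011.2,
    },
}

header, rows = assemble_table(example)
buffer = io.StringIO()
write_csv(header, rows, buffer)
demo_csv = buffer.getvalue()
```

Dropping timestamps that are missing from any variable is just one possible choice; filling gaps with empty cells would be an equally reasonable alternative, depending on how the data are meant to be used.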

Side note: there's some awesome webcam data available through the same project. I wonder what ARCO format might be suitable for webcam time series data?

@TomAugspurger
Contributor

Just FYI, I have some notes on how we think about tabular data for the Planetary Computer: https://gist.github.com/TomAugspurger/457a2288f6ef7490ab87546faf665e14

@cisaacstern
Member

Thanks, Tom, this is great.

@einatlev-ldeo

einatlev-ldeo commented Jun 15, 2022 via email
