Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pangeo use case: Advanced regridding using ESMF/ESMpy/OCGIS/xESMF/Xarray/Dask #197

Closed
jhamman opened this issue Apr 6, 2018 · 30 comments
Closed

Comments

@jhamman
Copy link
Member

jhamman commented Apr 6, 2018

I sat down with @bekozi and @rokuingh from NOAA/ESRL/NESII this afternoon. We discussed putting together a compelling use case that demonstrates the developing facility within the ESMF/ESMpy/ocgis/xESMF/Xarray/Dask libraries to do regridding of large geospatial datasets.

We discussed two basic applications:

  1. Regridding using xESMF, xarray, and dask of traditional gridded datasets
  2. Regridding using ocgis, xarray, and dask (and maybe ESMpy) to facilitate grid-->unstructured mesh (e.g. hydrologic basin polygons)

We generally agreed that most of the pieces are in place to do both of these and the development of a complete use case would provide the opportunity to work out the remaining kinks. This may be particularly useful as an oportunity to entrain the NESII developers in some of the current pangeo efforts.

also cc @JiaweiZhuang

@JiaweiZhuang
Copy link
Member

JiaweiZhuang commented Apr 6, 2018

This sounds great and I indeed want to get involved in.

My schedule has been crazy recently (having my PhD qualifying exam this semester), so I must wait until this summer to continue xESMF development. Planning to add parallel&out-of-core regridding support with dask in v0.2.

Right now I can talk about general ideas. I did some parallel regridding tests in this notebook (ref: JiaweiZhuang/xESMF#3). The entire process (read data from disk, perform regridding, write data to disk) is clearly I/O-bounded. The regridding computation (sparse matrix multiplication) has a throughput of 1~2 GB/s/core. A typical SSD just has ~500 MB/s throughput. The benefit of using >5 cores is negligible with serial I/O. To efficiently process large volumes of data, will need high-performance I/O hardware (often seen on HPC centers) that can deliver a total throughput of >100 GB/s.

Here are some practical considerations:
(1) xarray only does serial I/O right now (doesn't seem to be true, see comments below). The netcdf4-python library seems to support parallel IO with MPI (Unidata/netcdf4-python#717) but I haven't tried. Also, if mpi4py is used for I/O, then what's the role of dask?
(2) Which computational platform should we use? Local supercomputers with Lustre should make parallel I/O easier, but Pangeo seems to target more at commercial clouds (which are also more accessible to the public). I personally much prefer cloud, too.
(3) If the answer to (2) is cloud, then how to deal with the NetCDF+cloud storage issue as discussed in this blog post by @mrocklin ? It is definitely possible to get high throughput from cloud storage. For example, the pywren library achieved 60 GB/s read from S3 using AWS lambda. But it was just reading a txt file; NetCDF will make things much more complicated.

From another perspective, once the I/O performance is properly addressed, it will vastly help any data processing pipeline. There's nothing special about regridding operation (at least for structured grids that are computationally simple).

@niallrobinson
Copy link
Contributor

niallrobinson commented Apr 6, 2018

Hi - I need to talk it over with the team but...

I think we'd be really interested in doing this in the cloud, and we'd probably want to try and use Iris, at least in the first instance. One thing I would say about NetCDF in the cloud is that inefficient solutions aren't such a big deal if they parallelise - I think we'd probably try using FUSE in the first instance for this reason. I think this might be even more the case in a fairly computation heavy operation like regridding.

Also, in terms of MPI vs TCP, I'm not sure I see this use case necessarily needs to transfer lots of data between nodes. If you can chunks via load range requests from s3 onto compute nodes (S3+FUSE+Iris can do this), regrid them in isolation (admittedly lat/lon tiles would need some overlap), and then gather the metadata only onto a head node (again, I think Iris could do this), you can then operate on the head node object...didn't explain that very well, hope it made sense. btw I presume you could substitute the word "Iris" of "Xarray" in any of the above.

The one use case I can think of which might need a big data gather is writing back to S3, where the regridded data would all stream up through a head node and on to S3...unless you use something like Zarr.

As some other githubbers know, I think Zarr looks supercool, but the current tie to Python, and the move away from community standards makes me twitchy (that's my word for willing to be convince but my gut says currently bad idea). What would be really cool is for NetCDF/HDF to be Zarr compatible!

@jacobtomlinson @AlexHilson @dkillick

@rabernat
Copy link
Member

rabernat commented Apr 6, 2018

Since I have been obsessing over xarray / distributed I/O throughput, I will share my $0.02. I/O constraints are universal to most "big data" calculations we want to do, and regridding is no exception.

xarray only does serial I/O right now.

This is not really correct. If you use xarray with dask, the reads will happen in parallel. If you use dask distributed, the reads will come from many nodes of a cluster. What matters is the ability of the underlying storage medium to deliver scalable performance. This can be achieved on HPC and cloud platforms; single servers and workstations are fundamentally constrained by the physical speed of the disk.

In pangeo.pydata.org, we get basically linear scalability of read speed with the size of the cluster on datasets stored in zarr format. So there is absolutely no need to invoke MPI. We get extremely good I/O scaling on cheyenne (when it is working) using standard netCDF files.

The "netCDF in the cloud" issue is a very general one that has to be solved by our community as we transition to cloud. It is being discussed in many different contexts within and outside of Pangeo. There is a whole repo devoted to examining it: https://github.com/pangeo-data/storage-benchmarks. I recommend we not conflate the specific and actionable use cased proposed here with the much more general question of cloud storage formats and I/O bandwidth.

@mrocklin
Copy link
Member

mrocklin commented Apr 6, 2018

It seems like this conversation has shifted towards I/O concerns, which are valid and a major topic of this group in general.

However I think that the original purpose of the issue would be satisfied just by demonstrating the capability of doing scalable regridding on large datasets. I think it's fine if the regridding stage is not the bottleneck as long as we can do it in a scalable manner on large datasets. @JiaweiZhuang 's performance-oriented work has been great, we should connect the dots so that it is easy to apply it to these scenarios.

@niallrobinson
Copy link
Contributor

Maybe this was implicit, but I want us to be mindful to not conflate two things:

  1. I/O speed from S3 into memory on some node
  2. And I/O into a calculation

the implication being that 1 can be slow but, if it parallelises linearly then 2 (which I think is what matters) can still be fine.

@jhamman
Copy link
Member Author

jhamman commented Apr 6, 2018

Glad to see this has generated some interest. Full agreement from me on not conflating the storage of these datasets with the ability to operate on them.

@JiaweiZhuang - good luck on the quals.

If I go back to one of my original points, we have an opportunity to work with the ESMF folks here and streamline the development of any missing pieces.

@bekozi and @rokuingh, how do you see this proceeding? Should we work up some examples, and just rope you all in when things start breaking? I can get my ocgis/dask example up and running on Cheyenne working with NHD+.

@JiaweiZhuang
Copy link
Member

@jhamman Thanks!

If you use xarray with dask, the reads will happen in parallel. If you use dask distributed, the reads will come from many nodes of a cluster. What matters is the ability of the underlying storage medium to deliver scalable performance.

Thanks for pointing this out. A bit more documentations&benchmark on the I/O scaling of xarray would be useful. The current chapter on parallel computation only shows the scaling of in-memory computation. Indeed, scaling-up I/O can't be done on laptops. But examples using some well-known supercomputers (e.g. NASA Pleiades) will be a great reference.

https://github.com/pangeo-data/storage-benchmarks looks great. If anyone one can setup a demo achieving high read&write throughput with S3/Google Cloud Storage, I can simply add the regridding computation step in the middle.

@rabernat
Copy link
Member

rabernat commented Apr 6, 2018

@JiaweiZhuang -- I recommend you start playing around on pangeo.pydata.org. You should be able to get an example working there pretty easily.

edit for clarification: the example notebooks already in pangeo.pydata.org already demonstrate high-throughput reading.

@mrocklin
Copy link
Member

mrocklin commented Apr 6, 2018 via email

@JiaweiZhuang
Copy link
Member

JiaweiZhuang commented Apr 6, 2018

@rokuingh mentioned to me that ESMF regridding scales to 5000 cores (ref: ESMF benchmark). I wonder if that considers I/O performance? Were you trying to improve I/O? ESMF's regridding capability seems to mainly support online ESM simulations, so the data can live in memory for a long time and you can worry less about disk operation. Please tell me if I am wrong.

@JiaweiZhuang
Copy link
Member

JiaweiZhuang commented Apr 6, 2018

@rabernat @mrocklin Yes I've played with pangeo.pydata.org a little and it feels great. If I understand correctly the sample data are mostly in zarr format, which already integrates well with cloud storage. An example using giant NetCDF files will actually solve the core problem of large-scale regridding of existing Earth science data. You don't need to particularly care about regridding during your development. Just read NC files from object storage, do some light weight computation like a*dr + b (Think about unit conversion. This actually resembles regridding which is also a linear transform), and write data back. This will be an excellent framework for regridding and other custom operations.

The output might be in zarr format instead of NetCDF, as long as people don't need to read them by C/Fortran programs to support some model simulations.

@JiaweiZhuang
Copy link
Member

In terms of "giant NetCDF files", the NASA-NEX repository on AWS should be a good example. Many of them are produced by CMIP5, which is a great reference for CMIP6.

For example, the s3://nasanex/NEX-GDDP bucket is 11 TB. Here's a 1.4 TB subset (global 0.25 degree air temperature data):

$ aws s3 ls --summarize --human-readable s3://nasanex/NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/
[a bunch of files, each has a size of 600 MB]
Total Objects: 3986
   Total Size: 1.4 TiB

Notice that each file is quite small, and there are a large number of files (produced by different models, for each year). This should be a common pattern for CMIP-style output. Hopefully the small individual size will make NetCDF read easier.

@mrocklin
Copy link
Member

mrocklin commented Apr 7, 2018

I think that everyone agrees that resolving the HDF/NetCDF on cloud storage problem is important. There are a number of different approaches to this that are being investigated. FWIW I consider the data access problem orthogonal to the regridding problem. I think that we can make progress on them independently without blocking on the other.

@JiaweiZhuang
Copy link
Member

I think that we can make progress on them independently without blocking on the other.

Yes, this is exactly what I mean!

@niallrobinson
Copy link
Contributor

I don't know where to register this thought @rabernat @mrocklin @jhamman @jacobtomlinson - I've seen zarr mentioned a lot on github lately. It looks really cool but I've still got a few concerns, and there are other options that we've seen at EGU and elsewhere. zarr isn't officially part of Pangeo yet right? Before we sort of endorse an approach, we need to (a) benchmark performance (which is happening) but also (b) have a think about what kind of functionality/architecture we value. I think we all agree - just wanted to make sure 😄

@rabernat
Copy link
Member

I welcome discussion on functionality / architecture.

It is important to clarify that there is no such thing as "officially part of Pangeo." I'm not sure what that would even mean.

@niallrobinson
Copy link
Contributor

no such thing as "officially part of Pangeo."

I agree - I didn't mean to imply that people had already decided on zarr...

I guess we're aiming to get towards a "turn key" approach with Helm, which will effectively endorse some approaches (although a big part of Pangeo is that these can be swapped in and out).

I think what I mean is that there are several things to balance when assessing the different approaches. Here are the things that come to mind for me - I dare say that the Pangeo project is on top of all of this, I just wanted to register these somewhere:

  • do we value consistency? It really depends if we expect to be regularly write/reading. One school of thought says that s3 is expensive compared to compute so one might expect to not be writing data particularly regularly
  • does absolute access speed matter if it paralellizes better than an absolutely faster approach which doesn't? Again, depends on usage patterns
  • Can we transparently serve up data to people who aren't using Pangeo in a way that won't put the late adopters off accessing the data without Ppangeo
  • Finally how much do all of these approaches cost? This is a function of technology and of access pattern.

@rabernat
Copy link
Member

do we value consistency? It really depends if we expect to be regularly write/reading. One school of thought says that s3 is expensive compared to compute so one might expect to not be writing data particularly regularly

I think we should focus for now mostly on read-only situations. The use cases we are emphasizing are really about working with big, common datasets such as CMIP, reanalysis, etc. We shouldn't obsess too much about write performance...let's just get them online and move on. (Note this is a self-criticism, see #166)

That said, we need a way for users to persist temporary data in the cloud (see #209). Ideally we would have some sort of user "scratch space" that is quota's and regularly purged at some time interval (or, alternatively, billed to the user's organization). This could fit in well with a data catalog (see #39), which could track the owner and creation date of the temporary datasets. Building such a system sounds like considerable work, but important.

does absolute access speed matter if it paralellizes better than an absolutely faster approach which doesn't? Again, depends on usage patterns

Great question to investigate rigorously. It's hard to me to imagine an access approach using cloud storage that doesn't parallelize well. What did you have in mind?

Can we transparently serve up data to people who aren't using Pangeo in a way that won't put the late adopters off accessing the data without Pangeo

I believe opendap can play this role. I believe we should fork and revive pydap (which is currently basically dead) and repurpose it for this goal. If we put a cluster of pydap servers behind a load balancer with a scalable cloud storage backend (e.g. zarr), we will basically have a poor-man's version of HSDS.

Finally how much do all of these approaches cost? This is a function of technology and of access pattern.

We have been ignoring that. My overall impression is that cloud is just cheaper than I expected it to be. Would be great to have more concrete numbers corresponding to use cases.

@jhamman
Copy link
Member Author

jhamman commented Jun 5, 2018

@JiaweiZhuang, @bekozi and @rokuingh - I think we'd put this issue on hold until the summer which has now arrived. This issue went a bit sideways so I'll repeat my original ideas.

I want us to create a few jupyter notebooks that demonstrate the functionality in ESMF/ESMpy/OCGIS/xESMF/Xarray/Dask. Ideally, I'd like to develop and test these on pangeo.pydata.org.

  1. Regridding using xESMF, xarray, and dask of traditional gridded datasets
  2. Regridding using ocgis, xarray, and dask (and maybe ESMpy) to facilitate grid-->unstructured mesh (e.g. hydrologic basin polygons)

Are you up for picking this up in the coming weeks?

@tomLandry
Copy link

Hello @jhamman. I'd like to point out that @bekozi is an important collaborator in Birdhouse #220. I know he recently gave some thought on #271. After a few email exchanges with him, we concluded he would be a great mutual collaborator at Pangeo first dev workshop, both technically and geographically as his quite close to Boulder.

@tomLandry
Copy link

By the way, howdy from Fort Collins from OGC Technical Meeting. Went to see the Cheyenne supercomputer at NCAR yesterday. Fun times for a geek!

@JiaweiZhuang
Copy link
Member

Are you up for picking this up in the coming weeks?

Yes, I will start to write the next version of xESMF next week.

Do you have any particular data set to regrid? Otherwise, we have many TBs of NASA MERRA2 reanalysis data on the native cubed-sphere grid. This use-case has some scientific implications -- most global models (including those participating in CMPI5/CMIP6) do not run on regular lat-lon grids, while their output data are pre-regridded to lat-lon for easy analysis. With large-scale regridding capability it actually makes sense to release data on the native grid (which preserves more information), and regrid to lat-lon on-the-fly when necessary.

@bekozi
Copy link

bekozi commented Jun 8, 2018

Are you up for picking this up in the coming weeks?

@jhamman, yes, @rokuingh and I are working (albeit slowly) on our part. We need to get the multi-geometry support in ESMPy working - which is just adding a parameter to the mesh creation. The mesh stuff in ocgis is mostly complete. I'll put up a link to some example code on this issue for us to hack upon soon. Per our discussion, it will regrid to each mesh element from an exact source field (each regrid operation will process a single element so you can parallelize using dask).

@JiaweiZhuang, it's great you are working on the next xESMF version. Where can I go to find out what you are planning to add/rework?

@rabernat
Copy link
Member

rabernat commented Jun 8, 2018

@JiaweiZhuang this would be a good time to state clearly that a great many people are excited about xESMF, and you don't have to do all the heavy lifting yourself! Lots of Pangeo folks would be happy to contribute to developing xESMF if you can help clarify what the current challenges are. We would love to see this evolve into a community project if you are willing.

@JiaweiZhuang
Copy link
Member

Where can I go to find out what you are planning to add/rework?

Mostly the plans in JiaweiZhuang/xESMF#3. This update is mostly on the xarray/dask side, not the ESMPy side.

Lots of Pangeo folks would be happy to contribute to developing xESMF if you can help clarify what the current challenges are. We would love to see this evolve into a community project if you are willing.

Two biggest challenges/potentials are:

I welcome the Pangeo community discuss in those threads😃

@bekozi
Copy link

bekozi commented Jun 11, 2018

Hi, all. Here is a starting point for the regridding demo: https://github.com/NCPP/ocgis/blob/i481-esmf-mesh/doc/sphinx_examples/regridding_single_mesh_elements.py

Let's consider it a very much work in progress and go from there. The script will do a full regrid using ESMPy with conversion to xarray or write a weights file for using in xESMF. It works for the 671 catchments in the test dataset. You'll see at line ~40, the loop that needs unwrapping.

@jhamman, can you confirm the test hydrologic catchments shapefile dataset? I think this is the one we are targeting.

I don't know where this code should live, so maybe the first step is just to move it. We can then make a wish list about what's needed/desired? There is no dask or xESMF in this example code, so that would of course be great to add. This script also hides the ESMPy calls and maybe the idea is to display the full object usage in the toolchain. We could also include other methods to generate the weights.

One caveat. The underlying ESMF mesh is using Cartesian coordinates for both the grid and mesh. We hit a bug with regional spherical coordinates and grid-to-mesh regridding. Working on a fix.

In sum, there is at least one ESMPy patch incoming, and the code is obviously using an ocgis branch...

Ideal installation instructions:

conda create -n test -c conda-forge xarray esmpy ocgis nose mock
source activate test
conda remove ocgis

git clone -b i481-esmf-mesh https://github.com/NCPP/ocgis.git
cd ocgis
python setup.py install

python regridding_single_mesh_elements.py

Many thanks to @rokuingh and @oehmke for the ESMF help and of course @jhamman for getting this started.

@stale stale bot added the stale label Aug 10, 2018
@jhamman jhamman removed the stale label Aug 11, 2018
@pangeo-data pangeo-data deleted a comment from stale bot Aug 11, 2018
@stale
Copy link

stale bot commented Oct 10, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 10, 2018
@bekozi
Copy link

bekozi commented Oct 10, 2018

Still working on this. See JiaweiZhuang/xESMF#29 (comment) for a brief update.

@stale stale bot removed the stale label Oct 10, 2018
@stale
Copy link

stale bot commented Dec 9, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 9, 2018
@stale
Copy link

stale bot commented Dec 16, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed Dec 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants