-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pangeo use case: Advanced regridding using ESMF/ESMpy/OCGIS/xESMF/Xarray/Dask #197
Comments
This sounds great and I indeed want to get involved in. My schedule has been crazy recently (having my PhD qualifying exam this semester), so I must wait until this summer to continue xESMF development. Planning to add parallel&out-of-core regridding support with dask in v0.2. Right now I can talk about general ideas. I did some parallel regridding tests in this notebook (ref: JiaweiZhuang/xESMF#3). The entire process (read data from disk, perform regridding, write data to disk) is clearly I/O-bounded. The regridding computation (sparse matrix multiplication) has a throughput of 1~2 GB/s/core. A typical SSD just has ~500 MB/s throughput. The benefit of using >5 cores is negligible with serial I/O. To efficiently process large volumes of data, will need high-performance I/O hardware (often seen on HPC centers) that can deliver a total throughput of >100 GB/s. Here are some practical considerations: From another perspective, once the I/O performance is properly addressed, it will vastly help any data processing pipeline. There's nothing special about regridding operation (at least for structured grids that are computationally simple). |
Hi - I need to talk it over with the team but... I think we'd be really interested in doing this in the cloud, and we'd probably want to try and use Iris, at least in the first instance. One thing I would say about NetCDF in the cloud is that inefficient solutions aren't such a big deal if they parallelise - I think we'd probably try using FUSE in the first instance for this reason. I think this might be even more the case in a fairly computation heavy operation like regridding. Also, in terms of MPI vs TCP, I'm not sure I see this use case necessarily needs to transfer lots of data between nodes. If you can chunks via load range requests from s3 onto compute nodes (S3+FUSE+Iris can do this), regrid them in isolation (admittedly lat/lon tiles would need some overlap), and then gather the metadata only onto a head node (again, I think Iris could do this), you can then operate on the head node object...didn't explain that very well, hope it made sense. btw I presume you could substitute the word "Iris" of "Xarray" in any of the above. The one use case I can think of which might need a big data gather is writing back to S3, where the regridded data would all stream up through a head node and on to S3...unless you use something like Zarr. As some other githubbers know, I think Zarr looks supercool, but the current tie to Python, and the move away from community standards makes me twitchy (that's my word for willing to be convince but my gut says currently bad idea). What would be really cool is for NetCDF/HDF to be Zarr compatible! @jacobtomlinson @AlexHilson @dkillick |
Since I have been obsessing over xarray / distributed I/O throughput, I will share my $0.02. I/O constraints are universal to most "big data" calculations we want to do, and regridding is no exception.
This is not really correct. If you use xarray with dask, the reads will happen in parallel. If you use dask distributed, the reads will come from many nodes of a cluster. What matters is the ability of the underlying storage medium to deliver scalable performance. This can be achieved on HPC and cloud platforms; single servers and workstations are fundamentally constrained by the physical speed of the disk. In pangeo.pydata.org, we get basically linear scalability of read speed with the size of the cluster on datasets stored in zarr format. So there is absolutely no need to invoke MPI. We get extremely good I/O scaling on cheyenne (when it is working) using standard netCDF files. The "netCDF in the cloud" issue is a very general one that has to be solved by our community as we transition to cloud. It is being discussed in many different contexts within and outside of Pangeo. There is a whole repo devoted to examining it: https://github.com/pangeo-data/storage-benchmarks. I recommend we not conflate the specific and actionable use cased proposed here with the much more general question of cloud storage formats and I/O bandwidth. |
It seems like this conversation has shifted towards I/O concerns, which are valid and a major topic of this group in general. However I think that the original purpose of the issue would be satisfied just by demonstrating the capability of doing scalable regridding on large datasets. I think it's fine if the regridding stage is not the bottleneck as long as we can do it in a scalable manner on large datasets. @JiaweiZhuang 's performance-oriented work has been great, we should connect the dots so that it is easy to apply it to these scenarios. |
Maybe this was implicit, but I want us to be mindful to not conflate two things:
the implication being that 1 can be slow but, if it parallelises linearly then 2 (which I think is what matters) can still be fine. |
Glad to see this has generated some interest. Full agreement from me on not conflating the storage of these datasets with the ability to operate on them. @JiaweiZhuang - good luck on the quals. If I go back to one of my original points, we have an opportunity to work with the ESMF folks here and streamline the development of any missing pieces. @bekozi and @rokuingh, how do you see this proceeding? Should we work up some examples, and just rope you all in when things start breaking? I can get my ocgis/dask example up and running on Cheyenne working with NHD+. |
@jhamman Thanks!
Thanks for pointing this out. A bit more documentations&benchmark on the I/O scaling of xarray would be useful. The current chapter on parallel computation only shows the scaling of in-memory computation. Indeed, scaling-up I/O can't be done on laptops. But examples using some well-known supercomputers (e.g. NASA Pleiades) will be a great reference. https://github.com/pangeo-data/storage-benchmarks looks great. If anyone one can setup a demo achieving high read&write throughput with S3/Google Cloud Storage, I can simply add the regridding computation step in the middle. |
@JiaweiZhuang -- I recommend you start playing around on pangeo.pydata.org. You should be able to get an example working there pretty easily. edit for clarification: the example notebooks already in pangeo.pydata.org already demonstrate high-throughput reading. |
Done: http://pangeo.pydata.org/
https://youtu.be/mDrjGxaXQT4?t=1681
(aww, ryan beat me to it)
…On Fri, Apr 6, 2018 at 1:37 PM, Jiawei Zhuang ***@***.***> wrote:
@jhamman <https://github.com/jhamman> Thanks!
If you use xarray with dask, the reads will happen in parallel. If you use
dask distributed, the reads will come from many nodes of a cluster. What
matters is the ability of the underlying storage medium to deliver scalable
performance.
Thanks for pointing this out. A bit more documentations&benchmark on the
I/O scaling of xarray would be useful. The current chapter on parallel
computation
<http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization>
only shows the scaling of in-memory computation
<http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization>.
Indeed, scaling-up I/O can't be done on laptops. But examples using some
well-known supercomputers (e.g. NASA Pleiades) will be a great reference.
https://github.com/pangeo-data/storage-benchmarks looks great. If anyone
one can setup a demo achieving high read&write throughput with S3/Google
Cloud Storage, I can simply add the regridding computation step in the
middle.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#197 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AASszMXUQlrCYUNqgbqMV4eIpz3RSg0Eks5tl6fVgaJpZM4TJZMM>
.
|
@rokuingh mentioned to me that ESMF regridding scales to 5000 cores (ref: ESMF benchmark). I wonder if that considers I/O performance? Were you trying to improve I/O? ESMF's regridding capability seems to mainly support online ESM simulations, so the data can live in memory for a long time and you can worry less about disk operation. Please tell me if I am wrong. |
@rabernat @mrocklin Yes I've played with pangeo.pydata.org a little and it feels great. If I understand correctly the sample data are mostly in zarr format, which already integrates well with cloud storage. An example using giant NetCDF files will actually solve the core problem of large-scale regridding of existing Earth science data. You don't need to particularly care about regridding during your development. Just read NC files from object storage, do some light weight computation like The output might be in zarr format instead of NetCDF, as long as people don't need to read them by C/Fortran programs to support some model simulations. |
In terms of "giant NetCDF files", the NASA-NEX repository on AWS should be a good example. Many of them are produced by CMIP5, which is a great reference for CMIP6. For example, the
Notice that each file is quite small, and there are a large number of files (produced by different models, for each year). This should be a common pattern for CMIP-style output. Hopefully the small individual size will make NetCDF read easier. |
I think that everyone agrees that resolving the HDF/NetCDF on cloud storage problem is important. There are a number of different approaches to this that are being investigated. FWIW I consider the data access problem orthogonal to the regridding problem. I think that we can make progress on them independently without blocking on the other. |
Yes, this is exactly what I mean! |
I don't know where to register this thought @rabernat @mrocklin @jhamman @jacobtomlinson - I've seen zarr mentioned a lot on github lately. It looks really cool but I've still got a few concerns, and there are other options that we've seen at EGU and elsewhere. zarr isn't officially part of Pangeo yet right? Before we sort of endorse an approach, we need to (a) benchmark performance (which is happening) but also (b) have a think about what kind of functionality/architecture we value. I think we all agree - just wanted to make sure 😄 |
I welcome discussion on functionality / architecture. It is important to clarify that there is no such thing as "officially part of Pangeo." I'm not sure what that would even mean. |
I agree - I didn't mean to imply that people had already decided on zarr... I guess we're aiming to get towards a "turn key" approach with Helm, which will effectively endorse some approaches (although a big part of Pangeo is that these can be swapped in and out). I think what I mean is that there are several things to balance when assessing the different approaches. Here are the things that come to mind for me - I dare say that the Pangeo project is on top of all of this, I just wanted to register these somewhere:
|
I think we should focus for now mostly on read-only situations. The use cases we are emphasizing are really about working with big, common datasets such as CMIP, reanalysis, etc. We shouldn't obsess too much about write performance...let's just get them online and move on. (Note this is a self-criticism, see #166) That said, we need a way for users to persist temporary data in the cloud (see #209). Ideally we would have some sort of user "scratch space" that is quota's and regularly purged at some time interval (or, alternatively, billed to the user's organization). This could fit in well with a data catalog (see #39), which could track the owner and creation date of the temporary datasets. Building such a system sounds like considerable work, but important.
Great question to investigate rigorously. It's hard to me to imagine an access approach using cloud storage that doesn't parallelize well. What did you have in mind?
I believe opendap can play this role. I believe we should fork and revive pydap (which is currently basically dead) and repurpose it for this goal. If we put a cluster of pydap servers behind a load balancer with a scalable cloud storage backend (e.g. zarr), we will basically have a poor-man's version of HSDS.
We have been ignoring that. My overall impression is that cloud is just cheaper than I expected it to be. Would be great to have more concrete numbers corresponding to use cases. |
@JiaweiZhuang, @bekozi and @rokuingh - I think we'd put this issue on hold until the summer which has now arrived. This issue went a bit sideways so I'll repeat my original ideas. I want us to create a few jupyter notebooks that demonstrate the functionality in ESMF/ESMpy/OCGIS/xESMF/Xarray/Dask. Ideally, I'd like to develop and test these on pangeo.pydata.org.
Are you up for picking this up in the coming weeks? |
Hello @jhamman. I'd like to point out that @bekozi is an important collaborator in Birdhouse #220. I know he recently gave some thought on #271. After a few email exchanges with him, we concluded he would be a great mutual collaborator at Pangeo first dev workshop, both technically and geographically as his quite close to Boulder. |
By the way, howdy from Fort Collins from OGC Technical Meeting. Went to see the Cheyenne supercomputer at NCAR yesterday. Fun times for a geek! |
Yes, I will start to write the next version of xESMF next week. Do you have any particular data set to regrid? Otherwise, we have many TBs of NASA MERRA2 reanalysis data on the native cubed-sphere grid. This use-case has some scientific implications -- most global models (including those participating in CMPI5/CMIP6) do not run on regular lat-lon grids, while their output data are pre-regridded to lat-lon for easy analysis. With large-scale regridding capability it actually makes sense to release data on the native grid (which preserves more information), and regrid to lat-lon on-the-fly when necessary. |
@jhamman, yes, @rokuingh and I are working (albeit slowly) on our part. We need to get the multi-geometry support in @JiaweiZhuang, it's great you are working on the next |
@JiaweiZhuang this would be a good time to state clearly that a great many people are excited about xESMF, and you don't have to do all the heavy lifting yourself! Lots of Pangeo folks would be happy to contribute to developing xESMF if you can help clarify what the current challenges are. We would love to see this evolve into a community project if you are willing. |
Mostly the plans in JiaweiZhuang/xESMF#3. This update is mostly on the xarray/dask side, not the ESMPy side.
Two biggest challenges/potentials are:
I welcome the Pangeo community discuss in those threads😃 |
Hi, all. Here is a starting point for the regridding demo: https://github.com/NCPP/ocgis/blob/i481-esmf-mesh/doc/sphinx_examples/regridding_single_mesh_elements.py Let's consider it a very much work in progress and go from there. The script will do a full regrid using @jhamman, can you confirm the test hydrologic catchments shapefile dataset? I think this is the one we are targeting. I don't know where this code should live, so maybe the first step is just to move it. We can then make a wish list about what's needed/desired? There is no One caveat. The underlying In sum, there is at least one Ideal installation instructions:
Many thanks to @rokuingh and @oehmke for the |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Still working on this. See JiaweiZhuang/xESMF#29 (comment) for a brief update. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
I sat down with @bekozi and @rokuingh from NOAA/ESRL/NESII this afternoon. We discussed putting together a compelling use case that demonstrates the developing facility within the ESMF/ESMpy/ocgis/xESMF/Xarray/Dask libraries to do regridding of large geospatial datasets.
We discussed two basic applications:
We generally agreed that most of the pieces are in place to do both of these and the development of a complete use case would provide the opportunity to work out the remaining kinks. This may be particularly useful as an oportunity to entrain the NESII developers in some of the current pangeo efforts.
also cc @JiaweiZhuang
The text was updated successfully, but these errors were encountered: