more tests, more internal documentation, minor bug fixes #711
Conversation
I merged in some changes I made from master and turned on parallel tests for CI. Let me explore that and figure out what's going on; I don't believe this is a 'true' failure.
I see this branch is passing Travis again. Hopefully that means it can now be merged. ;-)
Compiler errors on Windows, but I don't expect them to be difficult to handle. Will follow up once fixed.
Got caught up in another project and then had to update Visual Studio (which has taken quite a while so far); still working on this.
At AGU, but pursuing this now.
tst_nccopy4 is failing. I will examine why and follow up.
Where is it failing? On Windows?
Hope you are having a good time at AGU. Enjoy the conference; this can wait until you get back.
At the moment I am deep in some PIO problems. Almost ready to put up a PR
with a fully working PIO built into netCDF. So very excited about that. ;-)
When you say built in, do you mean hooks to link against, like pnetcdf? Looking forward to seeing it! Take your time; the WiFi at the convention center is terrible.
Oh, it is failing on Windows, and I went "too clever" trying to fix it last night. Still working on it this morning.
When I say built-in, I mean built in. There is a new directory, libpio, and
a new test directory, pio_test.
It cannot be hooked in like pnetcdf; it is too fundamental.
It would be best if you and Dennis could make some time for me to come down and
explain in person.
Ed
BTW the features of PIO + netcdf are awesome. Worked out even better than I
had hoped. It will really position netcdf as the HPC leader. And it takes
full advantage of netCDF-4, netCDF-4 parallel, and pnetcdf (not yet but
soon).
And the usefulness of the features is already well proven by the CESM model.
Have you had to make significant changes outside of the libpio and pio_test directories?
Some changes in libdispatch/dfile.c in order to invoke things properly for opens/creates. I am isolating most of the changes in functions that will only be built and called if the library is built with PIO. Otherwise no changes outside the libpio and pio_test directories. Basically the libpio code sits on top of the libsrc/libsrc4 code and calls it as needed. I added several new functions, which have no analogs elsewhere in the library. This is the complete list:
I may combine the setframe function into the read/write_darray functions. So there will be some minor churn...
That is a lot of changes to the netcdf-c API, which makes me unhappy :-(
This is a maximum. A bunch of functions will go away. I need to get all tests working first. ;-)
PIO introduces two new items to the netCDF data model: IO systems and decompositions.
IO systems describe the HPC environment: how many cores there are, how many do I/O, and which ones do which. They allow the program to adjust at start-up to the number of available cores and how the user wants to divide the computation/IO workload. Decompositions map netCDF file data space across the available computational cores. For example, if a var is 10x10x10 and there are 100 cores, then you could put 10 values per core. The decomposition specifies this.
With IO systems I can say that if my program is running on 10K cores, then 1000 of them should be for I/O, 5K for an atmospheric model, and 4K for an ocean model. With decompositions I say how the 3D and 2D data arrays (each gets its own decomposition) are scattered across the 5K atmospheric cores. Now each of the 5K cores can call nc_write_darray() with their local array, and the 5000 sub-arrays will be automatically transferred to the 1000 I/O cores. Those I/O cores will (later) write that data to disk in the netCDF file. The decomposition(s) and IO system allow the PIO code to do all the donkey-work to make it happen.
So the decomposition and IO system exist independently of any netCDF file. Both might easily be used to create and/or read many netCDF files. In fact, that would be typical. The user will set up the IO system and decompositions once for the supercomputer, then all subsequent netCDF file access, reading or writing, will use those.
It's neat, so it's worth a few extra functions. However, after I take my next pass, if you can suggest any further reductions, that would be good.
If by extending the data model, you mean
Never mind. I will wait until you have a functioning system and then Ward and I will decide what changes need to be made.
I am open to any changes that will work, but I don't understand how your
data structure idea would work in this case.
Starting with IO systems -
The IO system must be set up before the first nc_create()/nc_open(),
because only some cores will be calling create/open (the computation cores,
trying to write their results). The IO cores already need to be looping in
a message mode, waiting for I/O calls to come in. It has to be initialized
and running before the first nc_create() call.
The code that splits up the work between different sets of cores is, by
necessity, user code. They must write their core Earth model engine. That
Earth model engine initializes the IO system, and then launches component
models on different sets of cores. Those component models contain netCDF
code.
So I believe, at least for I/O systems, that init/free functions are
required.
In any case, as I said, the important thing is to get a working version in a branch.
I'm afraid what you propose will not work for IO systems, due to the way MPI
works. (BTW, I should add that, as currently written, the PIO integration
requires netCDF be built for parallel, and the new functions are declared
in netcdf_par.h. So this is not really an expansion of the usual netCDF
API, but an expansion of the netCDF parallel API.)
In the example I gave above, when the atmospheric model wants to create a
file, it calls nc_create() on its 5000 cores. These cores (actually just
core 0) send the parameters to the 1000 IO processing cores (but this must
be set up first). If this is a serial netCDF classic file, that results in
NC3_create() being called on core 0 of the IO cores. The error code is
passed back, first to the 1000 IO cores, then to the 5000 atmospheric model
cores. The other 4000 cores on the machine do not get an nc_create() call at
this time.
So the processors on which nc_create() is NOT being called must also be
initialized. The IO cores must be running the messaging loop before any
netCDF calls are made. The IO system encompasses ALL communicators and must
know about all of them. It is set up at model initialization, knows
about all the existing computational components and the IO components, and
has their MPI communicators all set up.
It is truly a separate and worthy netCDF object, but one that is only needed
on HPC systems.
The PIO library is well designed to fit into existing Earth system models.
The setup of the model and all the sub-models is already a big thing, as
you can imagine. Lots of science stuff. IO is only a small part of it. So
they will not provide call-back functions. Instead, they will want to
initialize IO as part of their overall initialization. That is where they
decide how many cores to use for each computational component, the IO
component, etc. Then they launch the computational components on their
cores, and the IO component on its cores.
NetCDF/PIO fits into this existing workflow in order to maximize reuse and
make it easy for users to use in their netCDF-based models (which are very
large code bases that are expensive to change).
On Tue, Dec 12, 2017 at 2:55 PM, Dennis Heimbigner wrote:
> The IO system must be set up before the first nc_create()/nc_open().
We handle this like we do, e.g., HDF5 initialization. Namely, nc_create and
nc_open examine internal global variables to see whether the necessary PIO
initialization has been performed, and if not, do it.
> The code that splits up the work between different sets of cores is, by
> necessity, user code.
I am certainly not opposed to user-provided callback functions, but again,
that can be passed as data to the existing netcdf-c API functions.
I don't follow. You can pass all the info you want about the other cores as the parameter argument to nc_create.
OK, as you suggest, let me get a working system up and running, and
eliminate all the obviously unnecessary functions. Then we can discuss the
ones that are left. I am open to anything that can be made to work.
But to continue with this example, I need to create MPI communicators on
all cores, and yet nc_create() is only being called on some cores. So even
if I pass all information to those cores that are calling nc_create(), that
still leaves all the other cores, which are not running the correct code.
The init code has to run on *all* cores.
In practice, this happens at init time for the whole model system. Then the
computational components start executing, and they make netCDF calls.
Hi @edhartnett,
This sounds interesting. Could you please provide references or publications for this work? Thanks.
@wkliao I don't have any references or publications. I'm a programmer dude, not a scientist. :-) However, take a look at CESM here: http://www.cesm.ucar.edu/. Probably there are lots of papers about it, but probably none of them mention the "minor" implementation detail of how they do IO. ;-) Anyway, further discussion of PIO merges should take place on its newly created PR #720. Meanwhile, this PR contains minor cleanup and a bunch of documentation. Hopefully we can get it merged this week.
More tests for nc4var.c code. Also some minor fixes and a bunch of (internal) documentation.
This gets test coverage of nc4var.c from 70% to ~85%, but there is more to go.
Added or fixed doxygen documentation for all functions in libsrc4.
Also added documentation for a few missing public functions.
Part of #702.
Fixes #704.
Fixes #709.
Fixes #707.
Fixes #706.
Fixes #714.
Fixes #713.
Fixes #716.
From PR #700:
Fixes #699.
Fixes #698.
Fixes #697.
Fixes #696.
Fixes #694.
Fixes #662.
From PR #658:
Fixes #392.