
Clean import of code for GDAL in ESMF. Compiles/runs in ESMF #1981

Open · wants to merge 6 commits into base: main

Conversation

@msl3v (Collaborator) commented Jan 5, 2024

Need to diagnose parallelization issues.

@jedwards4b (Contributor) left a comment

You are writing a serial interface - you should only call GDAL from io_root, emulating the netcdf serial interface: NetCDF files are read on io_root and then the results are distributed to all tasks. By the way, what does your test look like? You will want to add it to the tests/cunit directory.
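A minimal sketch of that pattern, for orientation (not PIO's actual code: `read_with_gdal()` is a hypothetical placeholder, and PIO's real rearranger goes through `pio_swapm()`/`MPI_Alltoallw` with a decomposition map rather than a plain scatter):

```c
#include <stdlib.h>
#include <mpi.h>

void read_with_gdal(double *buf, int n);  /* hypothetical GDAL read */

/* Read the whole variable on io_root only, then hand equal-sized
 * blocks to every task -- the shape of the netcdf serial interface. */
int serial_read_and_scatter(MPI_Comm io_comm, int gdimlen,
                            double *local, int nlocal)
{
    int rank;
    double *full = NULL;

    MPI_Comm_rank(io_comm, &rank);
    if (rank == 0)  /* io_root: the only task that touches GDAL */
    {
        if (!(full = malloc(gdimlen * sizeof(double))))
            return 1;
        read_with_gdal(full, gdimlen);
    }
    /* sendbuf is only significant on the root task. */
    MPI_Scatter(full, nlocal, MPI_DOUBLE, local, nlocal, MPI_DOUBLE,
                0, io_comm);
    free(full);
    return 0;
}
```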

@@ -19,6 +19,8 @@
#include <netcdf.h>
#include <netcdf_meta.h>

@jedwards4b (Contributor) commented

Should also include an #ifdef here.
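For example (a sketch; `_GDAL` is an assumed macro name, chosen to match the `_PNETCDF`/`_NETCDF4` convention PIO uses elsewhere):

```c
#include <netcdf.h>
#include <netcdf_meta.h>
#ifdef _GDAL
#include <gdal.h>  /* guarded so builds without GDAL still compile */
#endif
```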

src/clib/pio_file.c (comment resolved)
@@ -949,6 +949,10 @@ PIOc_read_darray(int ncid, int varid, int ioid, PIO_Offset arraylen,
if ((ierr = pio_read_darray_nc(file, iodesc, varid, iobuf)))
return pio_err(ios, file, ierr, __FILE__, __LINE__);
break;
case PIO_IOTYPE_GDAL:
@jedwards4b (Contributor) commented

Will need #ifdef

@msl3v (Collaborator, author) commented

Hey, yeah. I tried to set up an #ifdef. It's on the to-do list.
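For reference, a sketch of what the guarded case could look like (`pio_read_darray_gdal()` is a hypothetical handler name; the actual function in this PR may differ). The same `#ifdef _GDAL` treatment would apply to the includes above and to the `file_desc_t` member flagged below:

```c
#ifdef _GDAL
    case PIO_IOTYPE_GDAL:
        /* Compiled away entirely when PIO is built without GDAL. */
        if ((ierr = pio_read_darray_gdal(file, iodesc, varid, iobuf)))
            return pio_err(ios, file, ierr, __FILE__, __LINE__);
        break;
#endif
```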

@@ -606,6 +608,13 @@ typedef struct file_desc_t
* feature. One consequence is that PIO_IOTYPE_NETCDF4C files will
* not have deflate automatically turned on for each var. */
int ncint_file;

@jedwards4b (Contributor) commented

#ifdef here too

src/clib/pio_gdal.c (outdated comments, resolved)
@jedwards4b (Contributor) commented

Also, you don't need to create a new PR for additional changes; just push them back to the same branch. How is the GDAL library available? Can it be obtained with apt-get? I'm just trying to figure out how to add it to the GitHub testing workflow.

@msl3v (Collaborator, author) commented Jan 5, 2024

GDAL is available in several forms. It is available in my Fedora distro via dnf as an RPM. The source is also available and is not complex to compile. We have it on the NCCS machines here at NASA.
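For the apt-get question, on an Ubuntu CI runner the stock Debian/Ubuntu development package should be enough (package name is the standard Ubuntu one, not verified against the eventual workflow):

```sh
sudo apt-get update
sudo apt-get install -y libgdal-dev
```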

@msl3v (Collaborator, author) commented Jan 8, 2024

@jedwards4b, the ESMF test results show that for a file with 4 points, all 4 points are being read correctly on io_root. The data being read are [1 2 3 4]. If I run on 4 procs, the local buffers are [1] [1] [1] [1]. If I run on 2 procs, the local buffers are [1 2] [1 2]. So it looks like MPI is sending/receiving, but the indexing isn't correct.

@jedwards4b (Contributor) commented

If you could provide your test program, I'll give it a try. Please remove the references to ESMF; let's take that library out of the equation for now.

@msl3v (Collaborator, author) commented Jan 8, 2024

ESMF does all the calling to PIO, and it would be a complex task (for me) to uncouple them. ESMF starts in Fortran, then calls into the C bindings, sets up arrays, etc., and then calls PIO. I'll see if I can isolate the PIO-specific pieces, but we may have to iterate.

@msl3v (Collaborator, author) commented Jan 8, 2024

Is there an existing test in PIO that I could modify for this?

@jedwards4b (Contributor) commented

Yes, look in the test directories, specifically tests/cunit, for a number of possible tests. test_darray.c might be a good one to try.

@msl3v (Collaborator, author) commented Jan 9, 2024

I've added test_gdal.c in the most recent commit to github.com/msl3v/ParallelIO:msl3v/GDAL_clean. It segfaults in pio_swapm() and I'm not able to diagnose why. If you have any insight, it would be appreciated. Thank you.

@jedwards4b (Contributor) commented

Great, thanks - I have a day full of meetings today, but I'll try to get to this as soon as I can.

@jedwards4b (Contributor) commented

What version of GDAL is required? My system has gdal/3.7.1 built with gcc - can I test with that?

@msl3v (Collaborator, author) commented Jan 9, 2024

3.7.1 works. Here's my cmake command:

env CC=mpicc FC=mpifort CFLAGS="-g -fPIC" cmake \
  -DCMAKE_INSTALL_PREFIX=../Install -Wno-dev \
  -DPIO_ENABLE_TIMING=OFF -DPIO_ENABLE_EXAMPLES=OFF \
  -DPIO_ENABLE_FORTRAN=OFF -DPIO_ENABLE_LOGGING=ON \
  -DCMAKE_BUILD_TYPE=DEBUG \
  -DNetCDF_C_INCLUDE_DIR=/usr/lib64/gfortran/modules \
  -DNetCDF_C_LIBRARY=/usr/lib64/libnetcdf.so \
  -DWITH_PNETCDF=off \
  -DGDAL_INCLUDE_DIR=<path_to_gdal>/include \
  -DGDAL_LIBRARY=<path_to_gdal_lib64>/libgdal.so \
  ../ParallelIO

@jedwards4b (Contributor) commented

The first problem is that your test doesn't run from the source directory, so it can't find the file data/cb_2018_us_region_20m.shp via its relative path. I'll look into how to handle this in the build, but for now you can create a link from tests/cunit/data to build/tests/cunit/data.

@msl3v (Collaborator, author) commented Jan 9, 2024

Interesting. It opens, queries, and reads data from the file in my local version. Curious what's different.

@msl3v (Collaborator, author) commented Jan 10, 2024

Could it be because I'm running the test directly from tests/cunit?
I'll create the link and push the change.

I don't see a build dir after cmake, make install, make tests. Is 'build' in this case simply the path to src?

@jedwards4b (Contributor) commented

Looks like you found it - that's great!

@msl3v (Collaborator, author) commented Jan 23, 2024

OK. The issue with the segfault was that recvbuf was NULL. It was fixed with a simple line in test_gdal.c. I'll keep banging on it to see what the recvbuf data look like.
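For the record, a line of the kind described (the type and length here are assumptions about the test context, not the actual test_gdal.c change): PIOc_read_darray() hands the buffer to the rearranger, and pio_swapm() segfaults if asked to unpack into a NULL recvbuf.

```c
/* Allocate the receive buffer before the read; ERR() is the
 * cunit test-harness error macro. */
float *test_data_in;
if (!(test_data_in = malloc(arraylen * sizeof(float))))
    ERR(PIO_ENOMEM);
if ((ret = PIOc_read_darray(ncid, varid, ioid, arraylen, test_data_in)))
    ERR(ret);
```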

@msl3v (Collaborator, author) commented Jan 23, 2024

@jedwards4b
In my pio_log output, for 4 data points on 4 MPI tasks, I get the following output:

0 sendcounts[0] = 1 sdispls[0] = 0 sendtypes[0] = 27742288 recvcounts[0] = 1 rdispls[0] = 0 recvtypes[0] = 27745072
0 sendcounts[1] = 1 sdispls[1] = 0 sendtypes[1] = 27743024 recvcounts[1] = 0 rdispls[1] = 0 recvtypes[1] = 2085777792
0 sendcounts[2] = 1 sdispls[2] = 0 sendtypes[2] = 27743696 recvcounts[2] = 0 rdispls[2] = 0 recvtypes[2] = 2085777792
0 sendcounts[3] = 1 sdispls[3] = 0 sendtypes[3] = 27744368 recvcounts[3] = 0 rdispls[3] = 0 recvtypes[3] = 2085777792

Is this telling me that only one data point is being received (first line of the output)?

(addendum: this is associated with the call to rearrange_io2comp())

(addendum 2: the data vector at the end of the test is [1 0 0 0]. The data vector read from file is [1 2 3 4])
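For reference when reading that log: pio_swapm() ends up in MPI_Alltoallw, where the counts are per-peer, so on any given rank recvcounts[i] is the number of items that rank receives from rank i, and sendcounts[i] the number it sends to rank i:

```c
int MPI_Alltoallw(const void *sendbuf, const int sendcounts[],
                  const int sdispls[], const MPI_Datatype sendtypes[],
                  void *recvbuf, const int recvcounts[],
                  const int rdispls[], const MPI_Datatype recvtypes[],
                  MPI_Comm comm);
```

Note that the arrays above are task 0's parameters only: they say task 0 sends one element to each of the four tasks and receives one from itself. What tasks 1-3 actually receive is governed by the recvcounts each of them passes, which would appear in their own logs.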

@msl3v (Collaborator, author) commented Jan 23, 2024

@jedwards4b I'm at a loss. I can keep trying, but it's just not clear what the problem might be. Is it the case that the data from the serially read vector are broadcast via the MPI_Alltoallw() call in rearrange_io2comp() -> pio_swapm()?

@jedwards4b (Contributor) commented

Yes, that's correct. I can try to have another look at this tomorrow.

@msl3v (Collaborator, author) commented Jan 24, 2024

Is it the case, then, that the first line in the output above
0 sendcounts[0] = 1 sdispls[0] = 0 sendtypes[0] = 27742288 recvcounts[0] = 1 rdispls[0] = 0 recvtypes[0] = 27745072
is saying that only task 0 is set to receive any data, since all the rest of the receive params are zero?

@jedwards4b (Contributor) commented

Yes, I will try to look at this again after lunch.

@msl3v (Collaborator, author) commented Jan 24, 2024

No hurry. I'm just banging away at the code & posting what comes up.

@msl3v (Collaborator, author) commented Jan 25, 2024

@jedwards4b just an update: still no progress. It looks like the iodesc->map might not be getting set up correctly, so MPI doesn't know which element to send to which proc... maybe. But I still haven't worked out what is actually happening here. I'll keep trying.
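For comparison, a minimal sketch (following the tests/cunit conventions) of how that map is normally built: each task's compdof lists, 1-based, the global elements it owns, and this is what tells the rearranger which element goes to which proc. If every task passed the same map (say, all {1}), the result would look exactly like the earlier [1] [1] [1] [1] symptom.

```c
#include <pio.h>

#define DIM_LEN 4  /* global length, matching the 4-point test file */

/* Block-decompose DIM_LEN elements across ntasks tasks. */
int init_decomp(int iosysid, int my_rank, int ntasks, int *ioid)
{
    int gdimlen[1] = {DIM_LEN};
    PIO_Offset maplen = DIM_LEN / ntasks;
    PIO_Offset compdof[maplen];

    /* PIO decomposition maps are 1-based (0 marks a hole). */
    for (PIO_Offset i = 0; i < maplen; i++)
        compdof[i] = my_rank * maplen + i + 1;

    return PIOc_InitDecomp(iosysid, PIO_FLOAT, 1, gdimlen, (int)maplen,
                           compdof, ioid, NULL, NULL, NULL);
}
```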

@jedwards4b (Contributor) commented

I think that you should walk through the netcdf serial test step by step - it should be exactly the same as this one.

@msl3v (Collaborator, author) commented Jan 26, 2024

@jedwards4b
It's not clear which test is the MPI serial test. I've modeled the GDAL test on test_darray.c. Is the serial test test_simple?

Also, in a totally vanilla version of ParallelIO, all the darray tests and the test_simple tests fail on my laptop. So maybe that is itself part of the problem. :-/

@jedwards4b (Contributor) commented

That does seem like a problem. On my system all of the tests are passing except the gdal test. In test_darray.c there is a loop over num_flavors at line 110. One of those flavors is netcdf-serial (PIO_IOTYPE_NETCDF). Your test should behave exactly the same as test_darray.c does with this flavor.
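Roughly, the pattern in test_darray.c is as follows (paraphrased; get_iotypes(), ERR(), and NUM_FLAVORS come from the cunit test harness):

```c
int num_flavors;          /* how many iotypes this build supports */
int flavor[NUM_FLAVORS];  /* filled with PIO_IOTYPE_* values */
if ((ret = get_iotypes(&num_flavors, flavor)))
    ERR(ret);

for (int fmt = 0; fmt < num_flavors; fmt++)
{
    /* flavor[fmt] == PIO_IOTYPE_NETCDF is the netcdf-serial path a
     * GDAL test should mirror: read on io_root, scatter to tasks. */
    if ((ret = PIOc_createfile(iosysid, &ncid, &flavor[fmt], filename,
                               PIO_CLOBBER)))
        ERR(ret);
    /* ... define dims/vars, write, read back with PIOc_read_darray(),
     * compare against expected values ... */
}
```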

@msl3v (Collaborator, author) commented Jan 26, 2024

Switched to another system. All tests pass (from remotes/origin/main). It seems to be a local MPI problem on my laptop.
Thanks for the pointer re test_darray.c. Onward.

@msl3v (Collaborator, author) commented Jan 30, 2024

Yeah. It may simply be that my laptop's OpenMPI was upgraded (to v4.1.9) and no longer works with ParallelIO. Switched to MPICH and things seem to be working, including the GDAL read. :-/
