
SegFaults in the sample data tests with netCDF4=1.6.1 #1727

Closed
valeriupredoi opened this issue Sep 20, 2022 · 31 comments
Labels: bear at a dinner party (Something very unexpected), testing

Comments

@valeriupredoi
Contributor

valeriupredoi commented Sep 20, 2022

@ESMValGroup/technical-lead-development-team I need your (rather quick) input here, please: we have seen that the new netCDF4=1.6.1 causes frequent segfaults in our CI tests, and we have pinned it to !=1.6.1 to sweep the problem under the carpet on our side. However, the good folk at Unidata/netCDF4 are scratching their heads and wondering what on earth is going on, so let's try to help them figure it out - even a somewhat narrowed-down picture is still helpful. For that, I have opened

(have a read through the discussion there, it's a lot of paint thrown at a white wall)

and I have managed to isolate our side of the problem to the sample data tests. Simplified, this is how the toy model looks:

import iris
import numpy as np
import pickle
import platform
import pytest

TEST_REVISION = 1

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data."""
    # cache the cubes to save about 30-60 seconds on repeat use
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    cubes = pickle.loads(data.encode('latin1'))

    return cubes


# @pytest.mark.skip
def test_io_1(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    _ = [c.data for c in cubes]  # this produces SegFaults


@pytest.mark.skip
def test_io_2(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    loaded_cubes = []
    for i, c in enumerate(cubes):
        iris.save(c, str(i) + ".nc")
        lc = iris.load_cube(str(i) + ".nc")
        loaded_cubes.append(lc)
    _ = [c.data for c in loaded_cubes]  # this doesn't produce SegFaults

From my tests I found that test_io_1 has a tendency to produce segfaults at that listcomp step (tested with -n 0 or -n 2, it doesn't really matter), whereas the other doesn't. Can we gauge anything from that without digging into the actual IO/threading (that is not our plot of land anyway)? Hive mind, folks! 🐝

UPDATE as of 20-Oct-2022 #1727 (comment)

@valeriupredoi added the "testing" and "bear at a dinner party" (Something very unexpected) labels on Sep 20, 2022
@bouweandela
Member

Have you tried removing the cache to simplify the problem even further?

@valeriupredoi
Contributor Author

Have you tried removing the cache to simplify the problem even further?

indeed, that's one thing I will test in the next hour, bud! 🍺

@valeriupredoi
Contributor Author

OK so I've done some serious testing with the cache test and without (simply reading files off disk):

cached case

7% fail rate:

  • 5% HDF errors
    • always at the second cube: I saved it and accessed c.data on it 100 times - no issue
  • 2% SegFaults

regular read off disk

no issues: 100% pass rate

So it is pretty clear that whatever happens when the cached sample data is read is what causes 1.6.1 to produce SegFaults and stray HDF errors - what exactly that is, I have no idea. Any clues, @bouweandela and maybe @stefsmeets?
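For reference, a rough sketch of the kind of harness behind these fail-rate numbers (a sketch only - the exact invocation may differ; the test file name is the one quoted later in the thread):

import subprocess

RUNS = 100
failures = 0
for _ in range(RUNS):
    # A segfaulting worker makes pytest exit non-zero under pytest-xdist,
    # so counting non-zero exit codes approximates the fail rate.
    result = subprocess.run(
        ["pytest", "-n", "2", "test_io_netcdf_2.py"],
        capture_output=True,
    )
    if result.returncode != 0:
        failures += 1
print(f"fail rate: {100 * failures / RUNS:.0f}% ({failures}/{RUNS})")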

@bouweandela
Member

Does the problem persist if you clear the cache before testing?

@bouweandela
Member

If it goes away, we either need to add the netcdf and hdf5 library versions to the cache key, or get rid of pickle altogether and create the cache using e.g. iris.
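Something along these lines for the first option (a sketch, extending the existing get_cache_key; netCDF4-python exposes the underlying C library versions via __netcdf4libversion__ and __hdf5libversion__):

import platform

import iris
import netCDF4
import numpy as np

TEST_REVISION = 1


def get_cache_key(value):
    """Build a cache key that also encodes the netCDF4/netCDF-C/HDF5 versions."""
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'netCDF4-{netCDF4.__version__}_'
            f'netcdf-{netCDF4.__netcdf4libversion__}_'
            f'hdf5-{netCDF4.__hdf5libversion__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')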

@valeriupredoi
Contributor Author

@bouweandela cache clearing doesn't work for this test - it deletes the cache, so the test fails because there is no data to run on. I don't want to reconstruct the data while doing this, because that simply adds time to the test runs: it fails once, then one has to rerun the test without cache clearing, so we are back in the same boat and just wasting time:

request = <SubRequest 'timeseries_cubes_month' for <Function test_io_1>>

    @pytest.fixture(scope="module")
    def timeseries_cubes_month(request):
        """Load representative timeseries data."""
        # cache the cubes to save about 30-60 seconds on repeat use
        cache_key = get_cache_key("sample_data/monthly")
        data = request.config.cache.get(cache_key, None)
>       cubes = pickle.loads(data.encode('latin1'))
E       AttributeError: 'NoneType' object has no attribute 'encode'

test_io_netcdf_2.py:27: AttributeError
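For reference, a sketch of what a cache-miss guard in that fixture could look like (load_sample_data() is a hypothetical stand-in for the slow rebuild; get_cache_key is the helper shown above):

import pickle

import pytest


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data, tolerating an empty cache."""
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    if data is None:
        # Cache miss (e.g. right after `pytest --cache-clear`): rebuild the
        # cubes and repopulate the cache instead of crashing on None.encode.
        cubes = load_sample_data()  # hypothetical slow loader
        request.config.cache.set(
            cache_key, pickle.dumps(cubes).decode('latin1'))
    else:
        cubes = pickle.loads(data.encode('latin1'))
    return cubes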

@valeriupredoi
Contributor Author

OK, running pytest --cache-clear once (going through the entire test suite, accounting for the failed test above, everything else passing) and then running 100 instances of pytest -n 2 test_io_netcdf_2.py, where that script is the one below (so that only the cached stuff that fails is exercised), results in no issues at all! Good idea to test the cache clearing @bouweandela 🍺

import iris
import numpy as np
import pickle
import platform
import pytest

TEST_REVISION = 1

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data."""
    # cache the cubes to save about 30-60 seconds on repeat use
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    cubes = pickle.loads(data.encode('latin1'))

    return cubes


def test_io_1(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    print("YYY")
    for i, c in enumerate(cubes):
        print("XXX", i)
        try:
            c.data
        except RuntimeError:
            print(c)
            print("SHIT")
            raise

This same test was used in #1727 (comment), with the results in the "cached case" section. It has something to do with the cache and the new netCDF4, since this test had never failed in the CI until now. Any further ideas?

@valeriupredoi
Contributor Author

OK I ran another set of 100 of those tests, just to be sure - bulletproof! No more issues after clearing the cache once

@valeriupredoi
Contributor Author

Also, let's keep an eye on the (hourly) tests that #1730 is running, just to be sure the only problematic 1.6.1-related failure is the cached-test one.

@bouweandela
Member

If clearing the cache once solved the issue, then the solution is to either stop using pickling to store the cached cubes on disk or add the netcdf library version number to the cache key, so that cached cubes are specific to each version of the library.
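For the pickle-free option, something like this sketch could work (load_sample_data() is again a hypothetical stand-in for the slow load): cubes are written to and re-read from NetCDF via iris, so the cache always goes through the normal netCDF read path instead of unpickling stale objects.

from pathlib import Path

import iris


def cached_sample_cubes(cache_dir, key, load_sample_data):
    """Return cached cubes stored as NetCDF, regenerating them on a miss."""
    cache_file = Path(cache_dir) / f"{key}.nc"
    if not cache_file.exists():
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        # Slow path: build the cubes once and store them as NetCDF.
        iris.save(load_sample_data(), str(cache_file))
    # Fast path: always read back through iris/netCDF4, so no pickled state
    # from an older library version is ever involved.
    return iris.load(str(cache_file))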

@bouweandela
Member

Remember #1058?

@valeriupredoi
Contributor Author

valeriupredoi commented Sep 22, 2022

Remember #1058?

A good reminder indeed - I seem to have closed that issue just because it was old and decrepit; we never really solved the problem. OK, I'll add the netCDF4 version to the cache key and run a set of tests. Thanks for all the suggestions, Bouwe! I still don't understand the underlying process that's causing this behaviour, but I am happy if we find a fix 👍

@bouweandela
Member

The cause is in the pickling. Pickling means that you save an instance of a class to a file, without the actual code that defines the class. If you then unpickle that instance using a different implementation of the class (e.g. with a newer library), chances are that things will go wrong.
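A toy illustration of that point (nothing to do with the actual cache, just showing what ends up in a pickle):

import pickle
import pickletools


class Toy:  # hypothetical stand-in for an iris cube
    def __init__(self, data):
        self.data = data


blob = pickle.dumps(Toy([1, 2, 3]))
# The disassembly shows only the class's import path ('__main__' / 'Toy')
# and the instance state dict - none of the class's code is stored, so
# unpickling uses whatever implementation is installed at that time.
pickletools.dis(blob)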

@valeriupredoi
Contributor Author

valeriupredoi commented Sep 22, 2022

nice try 😁 But the segfaults/HDF errors creep in even after I have recreated the sample data with the new netCDF4 - how is the pickler picking up the older version?

@bouweandela
Member

Now I'm confused, I thought you said clearing the cache once solved the issue? #1727 (comment)

@valeriupredoi
Contributor Author

yes, quoting myself (yay for modesty!)

running pytest --cache-clear once (going through the entire test suite, accounting for the failed test above, everything else passing) and then running 100 instances of pytest -n 2 test_io_netcdf_2.py, where that script is the one below (so that only the cached stuff that fails is exercised), results in no issues at all!

but that doesn't physically delete the files:

-rw-rw-r-- 1 valeriu valeriu 23299 Sep 16 15:42 timeseries_daily_365_day-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 23299 Sep 16 15:42 timeseries_daily_365_day-overlap-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25558 Sep 16 15:42 timeseries_daily_gregorian-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25558 Sep 16 15:42 timeseries_daily_gregorian-overlap-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25378 Sep 16 15:42 timeseries_daily_proleptic_gregorian-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25378 Sep 16 15:42 timeseries_daily_proleptic_gregorian-overlap-mean.nc
-rw-rw-r-- 1 valeriu valeriu 22899 Sep 16 15:42 timeseries_monthly-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 22899 Sep 16 15:42 timeseries_monthly-overlap-mean.nc

(note that my comment and testing were done on Sep 22nd). Frankly, I don't know what that removed - the local .pytest_cache dir?

@bouweandela
Member

You can look at the cache content as explained here: https://docs.pytest.org/en/7.1.x/how-to/cache.html#inspecting-cache-content

Note that the problem appears to be with pickled iris cubes, not with netcdf files.

@valeriupredoi
Contributor Author

valeriupredoi commented Sep 30, 2022

For us - yes, but as a whole, netCDF4 1.6.1 seems to be a bit more of a generalized problem; the iris folks are reporting SegFaults even when no gherkining (pickling) is done 🥒 What do you recommend doing, @bouweandela?

@bouweandela
Member

I recommend adding the version of the netCDF4 library to the cache key and seeing if that solves the problem for us. If not, we take it from there.

@valeriupredoi
Contributor Author

plan! Let me do that now..

@valeriupredoi
Contributor Author

OK @bouweandela I have added the key this way:

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f"netCDF44-{netCDF4.__version__}_"
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')

and ran 100 tests without regenerating the sample data files -> about 20% of them pooped the bed with SegFaults. Then I recreated the sample data with the current environment (which contains netCDF4=1.6.1) and am running the 100 tests again, aaand, against all odds and the King's horses, I still see SegFaults - it really is the new netCDF4 that is being troublesome 😢

@valeriupredoi
Contributor Author

OK, finished the run - 23% of the tests SegFaulted - not a joke

@valeriupredoi
Contributor Author

OK guys, a couple of updates here:

🍺

@bouweandela
Member

bouweandela commented Nov 21, 2022

@ESMValGroup/technical-lead-development-team Should we unpin our NetCDF4, now that the correct pinning is in place upstream on conda-forge conda-forge/conda-forge-repodata-patches-feedstock#358 and in the upcoming iris release SciTools/iris#5075?

Tests on CircleCI are working fine again without the pin: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore?branch=unpin-netcdf4

@valeriupredoi
Contributor Author

valeriupredoi commented Nov 21, 2022

yes, I am just about to do that now - note that for the Tool, iris=3.2.1 is picked up since we built the Core package against 3.2.1 (due to the pin on netCDF4), so that's why it's working

@valeriupredoi
Contributor Author

valeriupredoi commented Nov 21, 2022

ESMValGroup/ESMValTool#2929 - if all goes well we want to go to Core and to the conda packages too

@valeriupredoi
Contributor Author

ESMValGroup/ESMValTool#2929 - if all goes well we want to go to Core and to the conda packages too

OK, this is done: unpinned netCDF4 both for the Tool and Core; note that we can't move beyond netCDF4=1.6.0 because of iris (the repodata patch), nor can we move to Python=3.11 for that matter (although I would try a PR to include Python=3.11 just for s**ts and giggles)

@valeriupredoi
Contributor Author

@ESMValGroup/technical-lead-development-team very promising news from the iris folks - have a look at SciTools/iris#5095 and the test results I ran with Core and Martin's dev branch - we'll bin this soon! Props to @trexfeathers 🍺

@valeriupredoi
Contributor Author

even better stuff now, see SciTools/iris#5095 (comment)

@valeriupredoi
Contributor Author

this has been fully zapped by iris fixing their thread-safety danger, closing
