
SegFaults in the sample data tests with netCDF4=1.6.1 #1727

Closed
valeriupredoi opened this issue Sep 20, 2022 · 31 comments
Labels: bear at a dinner party (Something very unexpected), testing

Comments

@valeriupredoi
Contributor

valeriupredoi commented Sep 20, 2022

@ESMValGroup/technical-lead-development-team I need your (rather quick) input here, please: we have seen that the new netCDF4=1.6.1 causes frequent segfaults in our CI tests, and we have pinned it to !=1.6.1 to sweep the problem under the carpet on our side. However, the good folk at Unidata/netCDF4 are scratching their heads and wondering what on earth is going on, so let's try to help them figure it out - even a somewhat narrowed-down picture is still helpful. For that, I have opened

(have a read through the discussion there, it's a lot of paint thrown at a white wall)

and I have managed to isolate our side of the problem to the sample data tests. Simplified, this is how the toy model looks:

import iris
import numpy as np
import pickle
import platform
import pytest

TEST_REVISION = 1

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data."""
    # cache the cubes to save about 30-60 seconds on repeat use
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    cubes = pickle.loads(data.encode('latin1'))

    return cubes


# @pytest.mark.skip
def test_io_1(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    _ = [c.data for c in cubes]  # this produces SegFaults


@pytest.mark.skip
def test_io_2(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    loaded_cubes = []
    for i, c in enumerate(cubes):
        iris.save(c, str(i) + ".nc")
        lc = iris.load_cube(str(i) + ".nc")
        loaded_cubes.append(lc)
    _ = [c.data for c in loaded_cubes]  # this doesn't produce SegFaults

From my tests I found that test_io_1 has a tendency to produce segfaults at that listcomp step (tested with -n 0 or -n 2, it doesn't really matter), whereas the other doesn't. Can we gauge anything from that without digging into the actual IO/threading (that is not our plot of land anyway)? Hive mind, folks! 🐝

UPDATE as of 20-Oct-2022 #1727 (comment)

@valeriupredoi added the "testing" and "bear at a dinner party" (Something very unexpected) labels on Sep 20, 2022
@bouweandela
Member

Have you tried removing the cache to simplify the problem even further?

@valeriupredoi
Contributor Author

Have you tried removing the cache to simplify the problem even further?

indeed, that's one thing I will test in the next hour, bud! 🍺

@valeriupredoi
Contributor Author

OK so I've done some serious testing with the cache test and without (simply reading files off disk):

cached case

7% fail rate:

  • 5% HDF errors
    • always at the second cube: I saved it and accessed c.data on it 100 times - no issue
  • 2% SegFaults

regular read off disk

no issues: 100% pass rate

So it is pretty clear that whatever happens when the cached sample data is read is what causes 1.6.1 to produce SegFaults and stray HDF errors - what exactly that is, I have no idea. Any clues, @bouweandela and maybe @stefsmeets?
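For reference, a rough sketch of the kind of harness behind these fail-rate numbers (a sketch only - the exact invocation may differ; the test file name is the one quoted later in the thread):

import subprocess

RUNS = 100
failures = 0
for _ in range(RUNS):
    # A segfaulting worker makes pytest exit non-zero under pytest-xdist,
    # so counting non-zero exit codes approximates the fail rate.
    result = subprocess.run(
        ["pytest", "-n", "2", "test_io_netcdf_2.py"],
        capture_output=True,
    )
    if result.returncode != 0:
        failures += 1
print(f"fail rate: {100 * failures / RUNS:.0f}% ({failures}/{RUNS})")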

@bouweandela
Member

Does the problem persist if you clear the cache before testing?

@bouweandela
Member

If it goes away, we either need to add the netcdf and hdf5 library versions to the cache key, or get rid of pickle altogether and create the cache using e.g. iris.
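Something along these lines for the first option (a sketch, extending the existing get_cache_key; netCDF4-python exposes the underlying C library versions via __netcdf4libversion__ and __hdf5libversion__):

import platform

import iris
import netCDF4
import numpy as np

TEST_REVISION = 1


def get_cache_key(value):
    """Build a cache key that also encodes the netCDF4/netCDF-C/HDF5 versions."""
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'netCDF4-{netCDF4.__version__}_'
            f'netcdf-{netCDF4.__netcdf4libversion__}_'
            f'hdf5-{netCDF4.__hdf5libversion__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')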

@valeriupredoi
Contributor Author

@bouweandela cache clearing doesn't work for this test - it deletes the cache, so the test fails because there is no data to run on. I don't want to reconstruct the data while doing this, because that simply adds time to the test runs: it fails once, then one has to rerun the test without cache clearing, so we are back in the same boat and just wasting time:

request = <SubRequest 'timeseries_cubes_month' for <Function test_io_1>>

    @pytest.fixture(scope="module")
    def timeseries_cubes_month(request):
        """Load representative timeseries data."""
        # cache the cubes to save about 30-60 seconds on repeat use
        cache_key = get_cache_key("sample_data/monthly")
        data = request.config.cache.get(cache_key, None)
>       cubes = pickle.loads(data.encode('latin1'))
E       AttributeError: 'NoneType' object has no attribute 'encode'

test_io_netcdf_2.py:27: AttributeError
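For reference, a sketch of what a cache-miss guard in that fixture could look like (load_sample_data() is a hypothetical stand-in for the slow rebuild; get_cache_key is the helper shown above):

import pickle

import pytest


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data, tolerating an empty cache."""
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    if data is None:
        # Cache miss (e.g. right after `pytest --cache-clear`): rebuild the
        # cubes and repopulate the cache instead of crashing on None.encode.
        cubes = load_sample_data()  # hypothetical slow loader
        request.config.cache.set(
            cache_key, pickle.dumps(cubes).decode('latin1'))
    else:
        cubes = pickle.loads(data.encode('latin1'))
    return cubes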

@valeriupredoi
Contributor Author

OK, running pytest --cache-clear once (going through the entire test suite, accounting for the failed test above, everything else passing) and then running 100 instances of pytest -n 2 test_io_netcdf_2.py, where that script is the one below (so that only the cached stuff that fails is exercised), results in no issues at all! Good idea to test the cache clearing @bouweandela 🍺

import iris
import numpy as np
import pickle
import platform
import pytest

TEST_REVISION = 1

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data."""
    # cache the cubes to save about 30-60 seconds on repeat use
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    cubes = pickle.loads(data.encode('latin1'))

    return cubes


def test_io_1(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    print("YYY")
    for i, c in enumerate(cubes):
        print("XXX", i)
        try:
            c.data
        except RuntimeError:
            print(c)
            print("SHIT")
            raise

This same test was used in #1727 (comment), with the results in the "cached case" section. It has something to do with the cache and the new netCDF4, since this test had never failed in the CI until now. Any further ideas?

@valeriupredoi
Contributor Author

OK I ran another set of 100 of those tests, just to be sure - bulletproof! No more issues after clearing the cache once

@valeriupredoi
Contributor Author

Also, let's keep an eye on the (hourly) tests that #1730 is running, just to be sure the only problematic 1.6.1-related failure is the cached-test one.

@bouweandela
Member

If clearing the cache once solved the issue, then the solution is to either stop using pickling to store the cached cubes on disk or add the netcdf library version number to the cache key, so that cached cubes are specific to each version of the library.
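For the pickle-free option, something like this sketch could work (load_sample_data() is again a hypothetical stand-in for the slow load): cubes are written to and re-read from NetCDF via iris, so the cache always goes through the normal netCDF read path instead of unpickling stale objects.

from pathlib import Path

import iris


def cached_sample_cubes(cache_dir, key, load_sample_data):
    """Return cached cubes stored as NetCDF, regenerating them on a miss."""
    cache_file = Path(cache_dir) / f"{key}.nc"
    if not cache_file.exists():
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        # Slow path: build the cubes once and store them as NetCDF.
        iris.save(load_sample_data(), str(cache_file))
    # Fast path: always read back through iris/netCDF4, so no pickled state
    # from an older library version is ever involved.
    return iris.load(str(cache_file))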

@bouweandela
Member

Remember #1058?

@valeriupredoi
Contributor Author

valeriupredoi commented Sep 22, 2022

Remember #1058?

A good reminder indeed - I seem to have closed that issue just because it was old and decrepit; we never really solved the problem. OK, I'll add the netCDF4 version to the cache key and run a set of tests. Thanks for all the suggestions, Bouwe! I still don't understand the underlying process that's causing this behaviour, but I am happy if we find a fix 👍

@bouweandela
Member

The cause is in the pickling. Pickling means that you save an instance of a class to a file, without the actual code that defines the class. If you then unpickle that instance using a different implementation of the class (e.g. with a newer library), chances are that things will go wrong.
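A toy illustration of that point (nothing to do with the actual cache, just showing what ends up in a pickle):

import pickle
import pickletools


class Toy:  # hypothetical stand-in for an iris cube
    def __init__(self, data):
        self.data = data


blob = pickle.dumps(Toy([1, 2, 3]))
# The disassembly shows only the class's import path ('__main__' / 'Toy')
# and the instance state dict - none of the class's code is stored, so
# unpickling uses whatever implementation is installed at that time.
pickletools.dis(blob)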

@valeriupredoi
Contributor Author

valeriupredoi commented Sep 22, 2022

nice try 😁 But the segfaults/HDF errors creep in even after I have recreated the sample data with the new netCDF4 - how is the pickler picking up the older version?

@bouweandela
Member

Now I'm confused, I thought you said clearing the cache once solved the issue? #1727 (comment)

@valeriupredoi
Contributor Author

yes, quoting myself (yay for modesty!)

running pytest --cache-clear once (going through the entire test suite, accounting for the failed test above, everything else passing) and then running 100 instances of pytest -n 2 test_io_netcdf_2.py, where that script is the one below (so that only the cached stuff that fails is exercised), results in no issues at all!

but that doesn't physically delete the files:

-rw-rw-r-- 1 valeriu valeriu 23299 Sep 16 15:42 timeseries_daily_365_day-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 23299 Sep 16 15:42 timeseries_daily_365_day-overlap-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25558 Sep 16 15:42 timeseries_daily_gregorian-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25558 Sep 16 15:42 timeseries_daily_gregorian-overlap-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25378 Sep 16 15:42 timeseries_daily_proleptic_gregorian-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 25378 Sep 16 15:42 timeseries_daily_proleptic_gregorian-overlap-mean.nc
-rw-rw-r-- 1 valeriu valeriu 22899 Sep 16 15:42 timeseries_monthly-full-mean.nc
-rw-rw-r-- 1 valeriu valeriu 22899 Sep 16 15:42 timeseries_monthly-overlap-mean.nc

(note that my comment and testing were done on Sep 22nd). Frankly, I don't know what that removed - the local .pytest_cache dir?

@bouweandela
Member

You can look at the cache content as explained here: https://docs.pytest.org/en/7.1.x/how-to/cache.html#inspecting-cache-content

Note that the problem appears to be with pickled iris cubes, not with netcdf files.

@valeriupredoi
Contributor Author

valeriupredoi commented Sep 30, 2022

For us - yes, but as a whole, netCDF4 1.6.1 seems to be a bit more of a generalized problem; the iris folks are reporting SegFaults even when no gherkining (pickling) is done 🥒 What do you recommend doing, @bouweandela?

@bouweandela
Member

I recommend adding the version of the netCDF4 library to the cache key and seeing if that solves the problem for us. If not, we take it from there.

@valeriupredoi
Contributor Author

plan! Let me do that now..

@valeriupredoi
Contributor Author

OK @bouweandela I have added the key this way:

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f"netCDF44-{netCDF4.__version__}_"
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')

and ran 100 tests without regenerating the sample data files -> about 20% of them pooped the bed with SegFaults. Then I recreated the sample data with the current environment (which contains netCDF4=1.6.1) and am running the 100 tests again, aaand, against all odds and the King's horses, I still see SegFaults - it really is the new netCDF4 that is being troublesome 😢

@valeriupredoi
Contributor Author

OK, finished the run - 23% of the tests SegFaulted - not a joke

@valeriupredoi
Contributor Author

OK guys, a couple of updates here:

🍺

@bouweandela
Member

bouweandela commented Nov 21, 2022

@ESMValGroup/technical-lead-development-team Should we unpin our NetCDF4, now that the correct pinning is in place upstream on conda-forge conda-forge/conda-forge-repodata-patches-feedstock#358 and in the upcoming iris release SciTools/iris#5075?

Tests on CircleCI are working fine again without the pin: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore?branch=unpin-netcdf4

@valeriupredoi
Contributor Author

valeriupredoi commented Nov 21, 2022

yes, I am just about to do that now - note that for the Tool, iris=3.2.1 is picked up since we built the Core package against 3.2.1 (due to the pin on netCDF4), so that's why it's working

@valeriupredoi
Contributor Author

valeriupredoi commented Nov 21, 2022

ESMValGroup/ESMValTool#2929 - if all goes well we want to go to Core and to the conda packages too

@valeriupredoi
Contributor Author

ESMValGroup/ESMValTool#2929 - if all goes well we want to go to Core and to the conda packages too

OK, this is done: unpinned netCDF4 both for the Tool and Core; note that we can't move beyond netCDF4=1.6.0 because of iris (the repodata patch), nor can we move to Python=3.11 for that matter (although I would try a PR to include Python=3.11 just for s**ts and giggles)

@valeriupredoi
Contributor Author

@ESMValGroup/technical-lead-development-team very promising news from the iris folks - have a look at SciTools/iris#5095 and the test results I ran with Core and Martin's dev branch - we'll bin this soon! Props to @trexfeathers 🍺

@valeriupredoi
Contributor Author

even better stuff now, see SciTools/iris#5095 (comment)

@valeriupredoi
Contributor Author

this has been fully zapped by iris fixing their thread-safety danger, closing
