SegFaults in the sample data tests with netCDF4=1.6.1
#1727
Comments
Have you tried removing the cache to simplify the problem even further? |
indeed, that's one thing I will test in the next hour, bud! 🍺 |
OK so I've done some serious testing with the cache test and without (simply reading files off disk):
- cached case: 7% fail rate
- regular read off disk: no issues, 100% pass rate

So it is pretty clear that whatever happens when the cached sample data is read is what causes 1.6.1 to produce SegFaults and stray HDF issues - what exactly that is, I have no idea. Any clues @bouweandela and maybe @stefsmeets? |
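(For reference, a fail rate like the one above can be estimated with a small driver loop; a minimal sketch, where the test file name is hypothetical and a segfault shows up as a non-zero, often negative, pytest return code:)

import subprocess

# Hypothetical driver: run the toy test 100 times and count the runs
# that die; a crash in the test process surfaces as a non-zero
# (negative for a signal such as SIGSEGV) return code from pytest.
failures = 0
for _ in range(100):
    result = subprocess.run(
        ["pytest", "-x", "test_toy_cache.py::test_io_1"],  # hypothetical test file
        capture_output=True,
    )
    if result.returncode != 0:
        failures += 1

print(f"fail rate: {failures}%")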
Does the problem persist if you clear the cache before testing? |
If it goes away, we either need to add the netcdf and hdf5 library versions to the cache key, or get rid of pickle altogether and create the cache using e.g. iris. |
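(A minimal sketch of what such a key could look like, assuming the library versions are taken from the netCDF4 module's `__version__` and `__hdf5libversion__` attributes; the exact key format here is illustrative, not the final implementation:)

import platform

import iris
import netCDF4
import numpy as np

TEST_REVISION = 1


def get_cache_key(value):
    """Build a cache key that also encodes the netCDF4 and HDF5 library versions."""
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'netCDF4-{netCDF4.__version__}_'
            f'hdf5-{netCDF4.__hdf5libversion__}_'
            f'numpy-{np.__version__}_python-{py_version}_'
            f'rev-{TEST_REVISION}')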
@bouweandela cache clearing doesn't work for this test - it deletes the cache, so the test fails since there is no data to run it on. I don't want to reconstruct the data while I am doing this because that simply adds time to the test runs, i.e. it fails once, then one needs to run the test again without cache clearing, so we are in the same boat and just wasting time:
|
OK, running:

import iris
import numpy as np
import pickle
import platform

import pytest

TEST_REVISION = 1


def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data."""
    # cache the cubes to save about 30-60 seconds on repeat use
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    cubes = pickle.loads(data.encode('latin1'))
    return cubes


def test_io_1(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    print("YYY")
    for i, c in enumerate(cubes):
        print("XXX", i)
        try:
            c.data
        except RuntimeError:
            print(c)
            print("SHIT")
            raise

This same test was used here #1727 (comment), with results in the "cached case" section. It is something to do with the cache and the new netCDF4, since this test has never failed in the CI until now. Any further ideas? |
OK I ran another set of 100 of those tests, just to be sure - bulletproof! No more issues after clearing the cache once |
also, let's keep an eye on the (hourly) tests that #1730 is running, just to be sure the only problematic 1.6.1-related failure is the cache test one |
If clearing the cache once solved the issue, then the solution is to either stop using pickling to store the cached cubes on disk or add the netcdf library version number to the cache key, so cached cubes are specific for each version of the library. |
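(As a rough sketch of the first option - hypothetical fixture name, cache path and `load_monthly_sample_data` helper, not the actual ESMValCore implementation - the cubes could be cached as a NetCDF file via iris itself, so that no unpickling across library versions ever happens:)

import iris
import pytest


@pytest.fixture(scope="module")
def timeseries_cubes_month(tmp_path_factory):
    """Load representative timeseries data, cached as NetCDF rather than pickle."""
    # Hypothetical cache location: a real implementation would point at a
    # persistent directory so the cache survives between test sessions.
    cache_file = tmp_path_factory.getbasetemp() / "monthly_cubes.nc"
    if cache_file.exists():
        return iris.load(str(cache_file))
    cubes = load_monthly_sample_data()  # placeholder for the real loading code
    iris.save(cubes, str(cache_file))
    return cubes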
Remember #1058? |
A good reminder indeed - I seem to have closed that issue just because it was old and decrepit; we didn't really solve that problem. OK, I'll add the netCDF4 version to the cache key and run a set of tests. Thanks for all the suggestions, Bouwe! I still don't understand the underlying process that's causing this behaviour, but I am happy if we find a fix 👍 |
The cause is in the pickling. Pickling means that you save an instance of a class to a file, without the actual code that defines the class. If you then unpickle that instance using a different implementation of the class (e.g. with a newer library), chances are that things will go wrong. |
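(A tiny illustration of that point - the file name is a placeholder: the pickle blob contains only the object's state, while the class code is re-imported from whatever environment does the unpickling:)

import pickle

import iris

cube = iris.load_cube("sample.nc")  # placeholder file

# The blob stores the cube's state, but none of the iris/netCDF4 code
# that defines its classes.
blob = pickle.dumps(cube)

# Unpickling pulls the class definitions from the *currently installed*
# iris/netCDF4/HDF5 stack; if that differs from the stack that created
# the blob, the restored cube may be broken in subtle ways.
restored = pickle.loads(blob)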
nice try 😁 But the segfaults/HDF errors creep in even after I have recreated the sample data with the new netCDF4 - how is the pickler picking up the older version? |
Now I'm confused, I thought you said clearing the cache once solved the issue? #1727 (comment) |
yes, citing myself (yay for modesty!)
but that doesn't physically delete the files:
(note that my comment and testing were done on Sep 22nd). Frankly, I don't know what that removed - the pytest_cache dir locally? |
You can look at the cache content as explained here: https://docs.pytest.org/en/7.1.x/how-to/cache.html#inspecting-cache-content Note that the problem appears to be with pickled iris cubes, not with netcdf files. |
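(For reference, the linked page describes the `pytest --cache-show` command; a quick way to eyeball what is actually stored on disk, assuming the default `.pytest_cache` location, is something like:)

from pathlib import Path

# pytest's file cache lives under .pytest_cache/v/ by default; each key is
# stored as a JSON file (keys containing '/' become subdirectories).
for entry in sorted(Path(".pytest_cache/v").rglob("*")):
    if entry.is_file():
        print(entry, entry.stat().st_size, "bytes")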
for us - yes, but netCDF4 1.6.1 as a whole seems to be a bit more of a generalized problem; the iris guys are reporting SegFaults even when no gherkining is done 🥒 What do you recommend doing, @bouweandela? |
I recommend adding the version of the netcdf4 library to the cache key and seeing if that solves the problem for us. If not, we take it from there. |
plan! Let me do that now.. |
OK @bouweandela I have added the key this way:

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f"netCDF4-{netCDF4.__version__}_"
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')

and ran 100 tests without regenerating the sample data files -> got about 20% of them poop the bed with SegFaults. Then I recreated the sample data with the current env that contains netCDF4=1.6.1 and am running the 100 MM tests again, aaand, against all odds and the King's horse, I still see SegFaults - it's really the new netCDF4 that is being troublesome 😢 |
OK finished the run - 23% SegFaulted tests - not a joke |
OK guys, a couple of updates here:
🍺 |
@ESMValGroup/technical-lead-development-team Should we unpin our NetCDF4, now that the correct pinning is in place upstream on conda-forge conda-forge/conda-forge-repodata-patches-feedstock#358 and in the upcoming iris release SciTools/iris#5075? Tests on CircleCI are working fine again without the pin: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore?branch=unpin-netcdf4 |
yes, I am just about to do that now - note that for the Tool, iris=3.2.1 is picked up since we built the Core package against 3.2.1 (due to the pin on netCDF4), so that's why it's working |
ESMValGroup/ESMValTool#2929 - if all goes well we want to go to Core and to the conda packages too |
OK this is done: unpinned netCDF4 for both Tool and Core; note that we can't move upwards of netCDF4=1.6.0 via iris (repo patch) and we can't move to Python=3.11 either (although I would try a PR to include Python=3.11 just for s**ts and giggles) |
@ESMValGroup/technical-lead-development-team very promising news from |
even better stuff now, see SciTools/iris#5095 (comment) |
this has been fully zapped by iris fixing their thread-safety issues, closing |
@ESMValGroup/technical-lead-development-team I need your (rather quick) input here please: we have seen that the new netCDF4=1.6.1 is causing frequent segfaults in our CI tests, and we have pinned it to !=1.6.1 to brush the problem under the carpet for us. However, the good folk at Unidata/netCDF4 are scratching their heads and wondering what the heck's going on, so let's try and help them figure that out; even if we can only provide a bit of a narrowed-down picture, it's still helpful. For that, I have opened:
- netCDF4=1.6.1 (most probably) causing a fairly large number of Segmentation Faults: Unidata/netcdf4-python#1192
- netCDF4=1.6.1 (most probably) causing a fairly large number of Segmentation Faults: conda-forge/netcdf4-feedstock#141

(have a read through the discussion there, it's a lot of paint thrown at a white wall)

and I have managed to isolate our side of the problem to the sample data testing. Simplifying the problem, this is how the toy model looks:
From my tests I found out that test_io_1 has a tendency to produce segfaults at that listcomp step (can test with -n 0 or -n 2, doesn't really matter), whereas the other doesn't. Can we gauge anything from that without digging into the actual IO/threading (that is not our plot of land anyway)? Hive mind, folks! 🐝

UPDATE as of 20-Oct-2022: #1727 (comment)