Added additional validation for connectivity checks. #168

Open
wants to merge 2 commits into
base: main
3 changes: 2 additions & 1 deletion docs/releases/development.rst
@@ -2,4 +2,5 @@
Next release (in development)
=============================

* ...
* Added additional validation for ugrid
  connectivity(:issue:`165`, :pr:`168`).
25 changes: 25 additions & 0 deletions src/emsarray/conventions/ugrid.py
@@ -748,6 +748,31 @@ def has_valid_face_edge_connectivity(self) -> bool:
)
return False

        try:
            fill_value = data_array.encoding['_FillValue']
        except KeyError:
            return True
Comment on lines +751 to +754
Contributor

@mx-moth mx-moth Jan 8, 2025

Because of the nature of check functions like this, an early return of True can lead to bugs in the future. If another developer adds a further check below this one, just as you've added a check below the existing checks, that new check will be skipped whenever _FillValue is missing. An early return of False is always valid, as any failed check renders the whole connectivity array invalid. Consider instead:

if '_FillValue' in data_array.encoding:
    fill_value = data_array.encoding['_FillValue']
    ...
    if lower_bound < fill_value < upper_bound:
        warnings.warn(...)
        return False

return True

New checks can be safely added later by appending them below the existing set of checks without risk of the check being skipped.
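
A minimal, self-contained sketch of that structure is shown below. The function name `is_valid_connectivity`, its parameters, the second check, and the assumption that valid edge IDs run from `start_index` to `start_index + edge_count - 1` are all hypothetical and are not emsarray's API. The only point is that every check returns False on failure and the single `return True` sits at the very end.

```python
import warnings


def is_valid_connectivity(encoding: dict, start_index: int, edge_count: int) -> bool:
    """Hypothetical stand-in for a check function; not emsarray's API."""
    # Check 1: only applies when a _FillValue is present,
    # but a missing _FillValue does not return True early.
    if '_FillValue' in encoding:
        fill_value = encoding['_FillValue']
        # Assumption: valid edge IDs are start_index .. start_index + edge_count - 1.
        if start_index <= fill_value < start_index + edge_count:
            warnings.warn("_FillValue collides with a valid index")
            return False

    # Check 2 (imagined later addition): still reached when _FillValue is missing.
    if edge_count <= 0:
        warnings.warn("connectivity has no edges")
        return False

    return True
```

With an early `return True` in place of the `if '_FillValue' in encoding` guard, the second check would never run for an encoding without `_FillValue`.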


        lower_bound = _get_start_index(data_array)
        theoretical_upper_bound = self.face_count * self.max_node_count
        actual_upper_bound = numpy.nanmax(data_array)
Contributor

The maximum edge ID can be found more precisely with

upper_bound = self.edge_count + lower_bound

Because the underlying issue is variables being incorrectly masked and therefore missing valid data, any information gleaned by introspecting the data is potentially suspect. This check as written would miss a _FillValue exactly equal to the maximum edge ID: numpy.nanmax() would not find that value as it has been masked, and would return the second-largest value instead, which would then incorrectly cause the checks below to pass.
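
The failure mode can be reproduced without any real dataset. The sketch below uses a pure-Python stand-in for `numpy.nanmax()` and a hypothetical `mask_fill_value` helper that mimics decoding. The data and the assumption that the largest valid edge ID is `lower_bound + edge_count - 1` are illustrative only.

```python
import math


def nanmax(values):
    """Pure-Python stand-in for numpy.nanmax: largest value, ignoring NaN."""
    return max(v for v in values if not math.isnan(v))


def mask_fill_value(values, fill_value):
    """Mimic decoding: entries equal to _FillValue are replaced with NaN."""
    return [math.nan if v == fill_value else float(v) for v in values]


# Hypothetical data: 5 edges with zero-based IDs 0..4.
edge_count, lower_bound = 5, 0
raw_edge_ids = [0, 1, 2, 3, 4]
fill_value = 4  # collides with the maximum edge ID

decoded = mask_fill_value(raw_edge_ids, fill_value)

# Introspecting the decoded data misses the masked maximum ...
observed_max = nanmax(decoded)
# ... while a bound derived from the edge count does not.
max_edge_id = lower_bound + edge_count - 1

flagged_by_data = lower_bound < fill_value < observed_max
flagged_by_count = lower_bound <= fill_value <= max_edge_id
```

Here `flagged_by_data` is False even though edge 4 has been masked away, whereas `flagged_by_count` is True.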


        if lower_bound < fill_value < actual_upper_bound:
Contributor

These should be less-than-or-equal checks. A fill value of 0 with a lower bound of 0 is invalid because it masks out the first edge ID:

if lower_bound <= fill_value <= upper_bound:
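
To illustrate the boundary case, here is a small sketch with made-up values. The `decode` helper is hypothetical and only mimics how entries equal to `_FillValue` are masked; the bounds and edge IDs are not taken from any real dataset.

```python
import math


def decode(raw_ids, fill_value):
    """Mimic mask-and-scale decoding: entries equal to _FillValue become NaN."""
    return [math.nan if v == fill_value else float(v) for v in raw_ids]


# Hypothetical zero-based edge IDs for one triangular face.
lower_bound, upper_bound = 0, 23
raw_face_edges = [0, 5, 7]
fill_value = 0

# The strict comparison misses the collision at the lower bound ...
strict_check_flags = lower_bound < fill_value < upper_bound
# ... while the inclusive comparison catches it.
inclusive_check_flags = lower_bound <= fill_value <= upper_bound

decoded = decode(raw_face_edges, fill_value)
# Edge 0 is now indistinguishable from "no edge".
first_edge_lost = math.isnan(decoded[0])
```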

            warnings.warn(
                f"Got a face_edge_connectivity variable {data_array.name!r} with "
                f"a _FillValue inside the actual index range",
                ConventionViolationWarning,
            )
            return False

        if lower_bound < fill_value < theoretical_upper_bound:
            warnings.warn(
                f"Got a face_edge_connectivity variable {data_array.name!r} with "
                f"a _FillValue inside the theoretical index range",
                ConventionViolationWarning,
            )
            return False
Comment on lines +768 to +774
Contributor

This shouldn't be checked. The theoretical upper bound is a conservative overestimate, useful when computing our own _FillValue as a value that is guaranteed never to collide. In practice, lower values can safely be used as long as the _FillValue falls outside the actual index range. This test will give spurious warnings on valid datasets.
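
As a worked example of the gap between the two bounds, consider a hypothetical 3 x 3 grid of quadrilateral faces. The numbers below are illustrative and are not taken from the test dataset in this pull request.

```python
# Hypothetical mesh: a 3 x 3 grid of quadrilateral faces, zero-based indexing.
face_count = 9
max_node_count = 4
edge_count = 24  # 12 horizontal + 12 vertical edges
start_index = 0

# Conservative overestimate: every face contributes max_node_count unique edges.
theoretical_upper_bound = face_count * max_node_count
# Assumption: the largest edge ID actually in use.
max_edge_id = start_index + edge_count - 1

fill_value = 30

# The fill value cannot collide with any real edge ID ...
collides = start_index <= fill_value <= max_edge_id
# ... yet it sits inside the theoretical range and would trigger the warning.
inside_theoretical = start_index < fill_value < theoretical_upper_bound
```

A `_FillValue` of 30 is safe for this mesh because no edge has that ID, but the theoretical-range check would still reject it.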


        return True

    @cached_property
37 changes: 37 additions & 0 deletions tests/conventions/test_ugrid.py
@@ -924,3 +924,40 @@ def da(attrs: dict) -> xarray.DataArray:

    with pytest.raises(ConventionViolationError):
        _get_start_index(da({'start_index': 2}))


def test_has_valid_face_edge_connectivity():
    # Create dataset with face_edges
    dataset = make_dataset(width=3, make_edges=True, make_face_coordinates=True)
    topology = dataset.ems.topology
    topology.mesh_variable.attrs.update({
        'face_edge_connectivity': 'Mesh2_face_edges',
    })

    mesh2_face_edges_array = topology.face_edge_array

    mesh2_face_edges = xarray.DataArray(
        mesh2_face_edges_array,
        dims=[topology.face_dimension, topology.max_node_dimension],
    )

    dataset = dataset.assign({
        'Mesh2_face_edges': mesh2_face_edges,
    })

    dataset_fill_value_in_actual_range = dataset.copy()

    dataset_fill_value_in_theoretical_range = dataset.copy()

    # Make sure original dataset is valid
    assert dataset.ems.topology.has_valid_face_edge_connectivity is True

    dataset_fill_value_in_actual_range['Mesh2_face_edges'].encoding['_FillValue'] = 2

    with pytest.warns(ConventionViolationWarning):
        assert dataset_fill_value_in_actual_range.ems.topology.has_valid_face_edge_connectivity is not True

    dataset_fill_value_in_theoretical_range['Mesh2_face_edges'].encoding['_FillValue'] = 88

    with pytest.warns(ConventionViolationWarning):
        assert dataset_fill_value_in_theoretical_range.ems.topology.has_valid_face_edge_connectivity is not True