Sync series and dataframe in statistical_inefficiency() #199
Conversation
This adds a number of tests for slicing() and statistical_inefficiency():
* Test that slicing() respects upper and lower time bounds (currently passes)
* Test that statistical_inefficiency() respects upper and lower time bounds when it is used without series to subsample (currently passes)
* Test that statistical_inefficiency() respects upper and lower time bounds when it is used with series to subsample (currently fails)
* Test that first using slicing() on the data frame, then statistical_inefficiency() without time bounds yields the same results as a single call to statistical_inefficiency() with time bounds (currently fails)
Refs alchemistry#198
Make sure that both the subsampling series and the dataframe are sliced when calling statistical_inefficiency(). This ensures that the series and the dataframe are in sync before the subsampled indices of the series are applied to the dataframe. Fixes alchemistry#198
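For context, here is a minimal sketch of the behaviour this fix enforces, assuming a dataframe and series indexed by time; the helper name is hypothetical and not the actual alchemlyb implementation:
def _subsample_sketch(df, series=None, lower=None, upper=None):
    # Slice the dataframe to the requested time range (what slicing() does)
    df = df.loc[lower:upper]
    if series is not None:
        # The fix: slice the series with the same bounds, so that series and
        # dataframe are in sync before the subsample indices computed from
        # the series are applied to the dataframe
        series = series.loc[lower:upper]
    # ... compute the statistical inefficiency and the subsample indices from
    # `series`, then select the corresponding rows of `df`
    return df, series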
Codecov Report
@@           Coverage Diff           @@
##           master     #199   +/-  ##
=======================================
  Coverage   98.15%   98.15%
=======================================
  Files          24       24
  Lines        1352     1353    +1
  Branches      295      295
=======================================
+ Hits         1327     1328    +1
  Misses          5        5
  Partials       20       20
Many thanks for the contribution. This fix is very important.
I think the test could be simplified a bit using @pytest.mark.parametrize.
You could do something like:
@pytest.mark.parametrize('data', (gmx_benzene_dHdl(), gmx_benzene_u_nk()))
@pytest.mark.parametrize('test_series', (True, False))
def test_lower_and_upper_bound_slicer(self, data, test_series):
    """
    Test that the lower and upper time bounds are respected when using
    statistical_inefficiency without a series. In this case,
    statistical_inefficiency should behave like slicing.
    """
    lower, upper = 1000, 34000
    original_length = len(data)
    # Check that the input data is appropriate for the test
    assert any(data.reset_index()['time'] < lower)
    assert any(data.reset_index()['time'] > upper)
    if test_series:
        series = data.sum(axis=1)
    else:
        series = None
    # Slice data, and check that we don't observe times outside
    # the prescribed range
    sliced = self.slicer(data,
                         series=series,
                         lower=lower,
                         upper=upper,
                         step=5)
    assert all(sliced.reset_index()['time'] >= lower)
    assert all(sliced.reset_index()['time'] <= upper)
    # Make sure we didn't change input data
    assert len(data) == original_length
Then for test_slicing_inefficiency_equivalence just strip away all the checking.
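For instance, a stripped-down equivalence check could look roughly like the following sketch, which assumes slicing and statistical_inefficiency from alchemlyb.preprocessing.subsampling and is not the actual test code:
from alchemlyb.preprocessing.subsampling import slicing, statistical_inefficiency

def test_slicing_inefficiency_equivalence(data):
    lower, upper = 1000, 34000
    sliced = slicing(data, lower=lower, upper=upper)
    # Pre-slicing and then subsampling without bounds should match a single
    # call that passes the bounds directly
    pre_sliced = statistical_inefficiency(sliced, series=sliced.sum(axis=1))
    direct = statistical_inefficiency(data, series=data.sum(axis=1),
                                      lower=lower, upper=upper)
    assert pre_sliced.equals(direct)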
# Slice data, and check that we don't observe times outside
# the prescribed range
sliced = self.slicer(data,
                     series=None,
Sorry, I didn't quite see how this is different from test_lower_and_upper_bound, as the series defaults to None?
Well, this test class tests statistical_inefficiency(...), the one above tests slicing(...). So this test makes sure that statistical_inefficiency respects the upper and lower bounds when it's used without a series. The test above makes sure that slicing respects the upper and lower bounds.
This test could be rewritten as an equivalence between statistical_inefficiency and slicing, if you prefer.
@ptmerz Why do you need to check slicing when you are changing the code in statistical_inefficiency?
I don't need to. While investigating the bug I reported, I noticed that there was no test for ranges in slicing or in statistical_inefficiency, which made it possible for the bug in statistical_inefficiency to go unnoticed until now. I thought it would be helpful to broaden the test coverage also for slicing to avoid future bugs. Would you prefer me to remove them?
assert all(sliced.reset_index()['time'] >= lower)
assert all(sliced.reset_index()['time'] <= upper)
# Make sure we didn't change input data
assert len(data) == original_length
Note: We should treat this in the same way as the test above (see comment line 92).
@ptmerz It might be good to abstract the core of these three tests out as a fixture and use parametrize to go through them, something like:
@pytest.mark.parametrize(
    'output_data,original_length',
    [
        (lazy_fixture('inefficiency'), 100),
        (lazy_fixture('slicing'), 1000),
    ])
def test_original_length(output_data, original_length):
    assert len(output_data) == original_length
Agreed that this could unify the second and third added tests. So I'd suggest waiting until we have agreed on how to proceed with the remaining points raised, and I will implement this simplification if it's still viable!
Overall, looks good to me — just small comments on the tests.
(gmx_benzene_dHdl(), 1000, 34000),
(gmx_benzene_u_nk(), 1000, 34000),
It's not great to call the data provider functions here because this adds to the time during which tests are gathered in serial, as opposed to during test execution (which can be parallelized). I know that we already have plenty of examples of doing exactly that elsewhere, so it's really not your fault.
I am just wondering if there are better ways to do this, e.g., passing the function itself as data_generator and then calling it as data = data_generator() might be easiest. It would be better if we could generate them as fixtures (but parametrized fixtures are still awkward, IIRC).
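A sketch of that idea, assuming the gmx_benzene_dHdl and gmx_benzene_u_nk loaders used elsewhere in these tests are importable (the test body itself is only illustrative):
import pytest

@pytest.mark.parametrize('data_generator', (gmx_benzene_dHdl, gmx_benzene_u_nk))
def test_lower_and_upper_bound(data_generator):
    # The loader is only called when the test runs, so the (serial) test
    # collection phase stays fast
    data = data_generator()
    assert len(data) > 0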
@orbeckst The parametrized fixtures could be made easy with pytest-lazy-fixture. I have posted an example here, as noted on their website https://pypi.org/project/pytest-lazy-fixture/. This is conda-installable as well.
@pytest.fixture(params=[
    pytest.lazy_fixture('one'),
    pytest.lazy_fixture('two')
])
def some(request):
    return request.param

@pytest.fixture
def one():
    return 1

@pytest.fixture
def two():
    return 2
This makes a lot of sense @orbeckst, I wasn't aware of this issue. Unfortunately, I haven't had time to look into this yet; the current version therefore still has the greedy evaluation.
lazy_fixture looks like a convenient solution. I looked at the package and I am not sure how well maintained it is. There's a commit from Jan 2022 in https://github.com/tvorog/pytest-lazy-fixture but the PyPI page https://pypi.org/project/pytest-lazy-fixture/ states that the highest supported Python version is 3.7. @xiki-tempula, do you have experience with it in production code? It would be annoying to rewrite quite a few of our tests, just to discover that it breaks with a new release of pytest...
I had a look around in our tests and we do have examples of evaluating the loading functions later, see e.g. alchemlyb/src/alchemlyb/tests/test_fep_estimators.py, lines 145 to 168 in 72622c9:
class TestMBAR(FEPestimatorMixin):
    """Tests for MBAR.
    """
    cls = MBAR

    @pytest.fixture(scope="class",
                    params=[(gmx_benzene_coul_u_nk, 3.041, 0.02088),
                            (gmx_benzene_vdw_u_nk, -3.007, 0.04519),
                            (gmx_expanded_ensemble_case_1, 75.923, 0.14124),
                            (gmx_expanded_ensemble_case_2, 75.915, 0.14372),
                            (gmx_expanded_ensemble_case_3, 76.173, 0.11345),
                            (gmx_water_particle_with_total_energy, -11.680, 0.083655),
                            (gmx_water_particle_with_potential_energy, -11.675, 0.083589),
                            (gmx_water_particle_without_energy, -11.654, 0.083415),
                            (amber_bace_example_complex_vdw, 2.40200, 0.0618453),
                            (gomc_benzene_u_nk, -0.79994, 0.091579),
                            ])
    def X_delta_f(self, request):
        get_unk, E, dE = request.param
        return get_unk(), E, dE

    def test_mbar(self, X_delta_f):
        self.compare_delta_f(X_delta_f)
This is not too bad.
I raised #206 for fixing the other instances of upfront data loading.
@orbeckst Yes, I used lazy_fixture in our production code, but the reason I use it is that it was already there. It works fine for me, but I do share your concern about it not being updated for years.
On the other hand, one could also use the official pytest solution:
@pytest.fixture(params=['dir1_fixture', 'dir2_fixture'])
def dirname(request):
    return request.getfixturevalue(request.param)
Then let's use the pytest-internal solution!
I'll update #206.
This looks straightforward
@pytest.mark.parametrize(('dataloader', 'size'), [('gmx_benzene_dHdl', 661),
                                                  ('gmx_benzene_u_nk', 661)])
def test_basic_slicing(self, dataloader, size, request):
    data = request.getfixturevalue(dataloader)
    assert len(self.slicer(data, lower=1000, upper=34000, step=5)) == size
Thank you for your reviews. I did a first pass, which should address most of your comments. I'm copying the git log here as a summary:
* Extract test for unchanged data into a separate test
* Parametrize tests of statistical_inefficiency to merge cases which use series with the ones that don't
* Parametrize all tests to use conservative True and False
* Remove checks for unchanged data that are now unnecessary
This doesn't yet address the issue with greedy parametrization; I unfortunately did not have time to look into this yet.
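For illustration, stacked parametrization along those lines could look like the following sketch; the bounds, the data fixture, and the direct call to statistical_inefficiency are assumptions rather than the actual test code:
import pytest

@pytest.mark.parametrize('conservative', (True, False))
@pytest.mark.parametrize('use_series', (True, False))
def test_statistical_inefficiency_bounds(data, conservative, use_series):
    lower, upper = 1000, 34000
    series = data.sum(axis=1) if use_series else None
    subsampled = statistical_inefficiency(data, series=series,
                                          lower=lower, upper=upper,
                                          conservative=conservative)
    assert all(subsampled.reset_index()['time'] >= lower)
    assert all(subsampled.reset_index()['time'] <= upper)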
LGTM. Thanks for the work and for addressing the comments.
Thanks a lot for all the work @ptmerz and explaining your choices to us — very much appreciated!
- After looking at the up-front loading, I'd ask you to change your code to load during run time and not at setup. I posted an example from our code that should make it relatively straightforward to implement. I will raise an issue to eventually change the other parts of the tests in a similar fashion.
- Please add .. versionchanged:: 0.7.0 entries to the docs to summarize the change in behavior (a sketch of such an entry follows below). We count it as a bug fix, but because people may have unwittingly relied on the old behavior and produced results with it, we want to be clear that numbers will change.
- Please also add yourself to AUTHORS and add an entry to CHANGES.
Thank you!!
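For example, such an entry in the docstring of statistical_inefficiency might read along these lines (the signature and wording are only illustrative, not the actual alchemlyb docs):
def statistical_inefficiency(df, series=None, lower=None, upper=None, step=None,
                             conservative=True):
    """Subsample a DataFrame based on the statistical inefficiency of a timeseries.

    .. versionchanged:: 0.7.0
       The series is now sliced with the same bounds as the data before the
       subsample indices are computed, so subsampled results may differ from
       earlier releases.
    """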
@orbeckst, I uploaded new commits that should hopefully address your points. Note that currently the data loading functions had to be duplicated, as they can't be normal functions and fixtures at the same time. If, after resolving #206, you only use them as fixtures, you could remove this duplication.
I have some small changes that I will apply directly. Everything looks really good!
One source code line change, lots of other stuff around it ... but that's what makes the difference for a well-tested library. Thank you for your willingness to see the whole PR through to the end! We very much appreciate your contribution.
This adds a number of tests for slicing() and statistical_inefficiency():
* Test that slicing() respects upper and lower time bounds (currently passes)
* Test that statistical_inefficiency() respects upper and lower time bounds when it is used without series to subsample (currently passes)
* Test that statistical_inefficiency() respects upper and lower time bounds when it is used with series to subsample (currently fails)
* Test that first using slicing() on the data frame, then statistical_inefficiency() without time bounds yields the same results as a single call to statistical_inefficiency() with time bounds (currently fails)
It then makes sure that both the subsampling series and the dataframe are sliced when calling statistical_inefficiency(). This ensures that the series and the dataframe are in sync before the subsampled indices of the series are applied to the dataframe. This fixes the above tests.
Fixes #198