Add pV and U to reduced potential when they are given #62

Merged — 11 commits merged into alchemistry:master on Nov 26, 2018

Conversation

@harlor (Contributor) commented Oct 21, 2018

As explained in #59, the current parser assumes U and pV to be given in the second and last column, respectively, which is not very robust.

With these changes, the parser adds them to the potential differences only when they are given.

Moreover, I am interested in whether these changes pass the tests.
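For readers skimming this thread, a minimal sketch of the idea (the helper name, the beta handling, and the exact column labels are illustrative assumptions, not the actual gmx.py code):

    def add_energy_terms(u_k, df, beta):
        """Add U and pV to the reduced potentials only if the corresponding
        columns were actually written to the XVG file (sketch)."""
        # check for the columns by name instead of assuming fixed positions
        if 'Total Energy' in df.columns:
            u_k = u_k.add(beta * df['Total Energy'], axis=0)
        if 'pV' in df.columns:
            u_k = u_k.add(beta * df['pV'], axis=0)
        return u_k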

@codecov-io commented Oct 21, 2018

Codecov Report

Merging #62 into master will increase coverage by 0.04%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #62      +/-   ##
==========================================
+ Coverage   98.12%   98.17%   +0.04%     
==========================================
  Files           9        9              
  Lines         481      493      +12     
  Branches       94      100       +6     
==========================================
+ Hits          472      484      +12     
  Misses          4        4              
  Partials        5        5
Impacted Files Coverage Δ
src/alchemlyb/parsing/gmx.py 98.49% <100%> (+0.14%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eba1f9a...bfbe0b0.

@orbeckst (Member) left a comment

This looks sensible. However, there should also be a test case without the pV and/or U columns. You might be able to edit the benzene files or add new datasets. Either way, open a PR in https://github.com/alchemistry/alchemtest/ to add the data and then add tests here, please.

Review thread on src/alchemlyb/tests/test_preprocessing.py (resolved)
@harlor (Contributor, Author) commented Oct 26, 2018

This looks sensible. However, there should also be a test case without the pV and/or U columns.
You might be able to edit the benzene files or add new datasets. Either way, open a PR in https://github.com/alchemistry/alchemtest/ to add the data and then add tests here, please.

Agree - when I find time to work my way into the testing system I will do that!

@orbeckst (Member) commented Oct 26, 2018

I think you are right that our tests have been wrong regarding parsing (or lack thereof) of the correct columns. However, why does only the statistical inefficiency change and not the actual free energies? (Sorry, I'm typing from mobile and it is difficult to look at code to answer my question myself.)

@dldotson @ianmkenney can you also have a quick look and confirm?

@orbeckst (Member) commented

Er, Ping @dotsdl --- sorry, mistyped your handle in previous comment.

@dotsdl (Member) left a comment

See my comment on the possible appearance of "Potential Energy". We'll need to at least address this before proceeding. Thanks for this!

col_match = r"\xD\f{}H \xl\f{}"
h_col_match = r"\xD\f{}H \xl\f{}"
pv_col_match = 'pV'
u_col_match = 'Total Energy'
@dotsdl (Member) commented

Checking the files you provided @harlor (https://userpage.fu-berlin.de/dominikwille/nvt300K.tgz), I see "Potential Energy (kJ/mol)" in the ti_0.xvg file. Should we also try and match that?

@harlor (Contributor, Author) commented Nov 1, 2018

@mrshirts explained that in #59 (comment). So I think we need the total energy.

I uploaded a dataset with total energy here: http://dominikwille.userpage.fu-berlin.de/nvt300K_u.tgz

@dotsdl (Member) commented

Okay, but I think depending on MDP options given to gromacs it's possible to have a dataset without "Total Energy" but with "Potential Energy". I was thinking that we could handle this in another PR with kwargs, but I don't believe this should be neglected.

Could you add a check for the existence of a "Potential Energy" column as you did for the "Total Energy" column and, if present, use this instead of the "Total Energy" column? That would satisfy me for the purposes of this PR.

Review thread on src/alchemlyb/tests/test_preprocessing.py (resolved)
@dotsdl (Member) commented Nov 1, 2018

@harlor, thanks for putting this together. I see the issue as well. The Gromacs benzene dataset is indeed missing the potential energy (U_i) assumed by the u_nk parser, so currently it's giving really wrong results as in this test.

I am very much in favor of these changes. I'm trying to think of how we should alert users to cases where the parser can still give meaningful results, but where the result may not be in absolute terms (such as when the potential energy is missing).

Perhaps the best approach is to throw an exception when either the pV or Potential Energy column is missing? We can then have keyword arguments such as use_pV=False (default True) and use_U=False (default True) to make the parser proceed with ignoring these values, giving "reduced potentials" without them. We can do the same with Total Energy with a use_H (default False).

Exceptions will be raised if one of these keyword args is given but can't be honored with the given dataset, or if two are mutually exclusive (such as use_U=True and use_H=True).

Thoughts? This doesn't have to be in the scope of this PR, but I think it's time we add some knobs to these parsers.
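A rough sketch of how such keyword handling could look (a hypothetical helper following the proposal above, not existing alchemlyb code):

    def _validate_energy_columns(columns, use_pV=True, use_U=True, use_H=False):
        """Check the requested energy terms against the columns found in the file (sketch)."""
        if use_U and use_H:
            raise ValueError("use_U and use_H are mutually exclusive")
        if use_pV and 'pV' not in columns:
            raise ValueError("no pV column found; pass use_pV=False to proceed without it")
        if use_U and 'Potential Energy' not in columns:
            raise ValueError("no Potential Energy column found; pass use_U=False to proceed without it")
        if use_H and 'Total Energy' not in columns:
            raise ValueError("no Total Energy column found; pass use_H=False to proceed without it")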

@harlor (Contributor, Author) commented Nov 1, 2018

@dotsdl I like this idea! And I think in principle we also know what we need when we parse GROMACS datasets.

U is needed when the temperature is not the same for all states (the temperature is given in the header).
pV is needed when the volume is not constant (and as far as I know GROMACS always writes out pV in this case).
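For reference, the reduced potential that u_nk parsers target has the standard MBAR form (general background, not a formula quoted from this PR):

    u_{nk} = \beta_k \left( U_n + p V_n + \Delta H_{\lambda_k, n} \right), \qquad \beta_k = \frac{1}{k_B T_k}

Any term that enters every column k of a given sample n with the same prefactor only shifts that row by a constant, which is why U and pV can sometimes be dropped without changing free energy differences.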

This doesn't have to be in the scope of this PR

Strongly agree! This PR is a bugfix; what goes beyond that I suggest doing in a new PR.

@dotsdl (Member) commented Nov 3, 2018

I think if we address the presence of "Potential Energy", we'll be good to go on a merge for this.

@orbeckst, is there anything else you see? The benzene dataset doesn't have either "Total Energy" or "Potential Energy", so I think it satisfies your earlier comment already.

@orbeckst (Member) commented Nov 4, 2018

I agree with the above – do this one as a bare fix. However, I'd like to have tests with pV and U as part of this PR. I don't like adding code that is untested and deferring tests "to later".

Could you please raise an issue/PR in https://github.com/alchemistry/alchemtest/issues, reference this issue, and then we try to get the test data quickly into MDAnalysisTests. I know, this is all a bit of extra work, but one of the basic ideas of alchemlyb was to make heavy use of best software engineering practices to keep the library well-tested at all times. Issues still slip through, as you found, but at least by adding tests for the bug right away we will make sure that this bug does not show up again.

@dotsdl (Member) commented Nov 4, 2018

Agreed with @orbeckst. I look forward to accepting this contribution, so we'll need to make sure we test against examples that exhibit the problem it is addressing. @harlor, can you trim down the dataset you've linked to here and add to alchemtest as a PR? This will then allow us to build tests against it here in alchemlyb. That dataset will then be part of the infrastructure stabilizing alchemlyb going forward.

Check out the structure of some of the existing Gromacs datasets for examples.
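For orientation, the accessors return a bunch-like mapping whose 'data' entry maps legs to lists of file paths; a sketch (load_benzene is an existing alchemtest.gmx accessor, but the leg names shown are assumptions for illustration):

    from alchemtest.gmx import load_benzene

    dataset = load_benzene()
    print(dataset['data'].keys())          # legs, e.g. 'Coulomb' and 'VDW'
    print(dataset['data']['Coulomb'][0])   # path to the first XVG file of that leg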

@harlor (Contributor, Author) commented Nov 15, 2018

I opened a PR in alchemtest with datasets that allow writing a test for this PR.

@orbeckst Do you consider it sufficient to check a certain value of the reduced potential?

@orbeckst (Member) commented

I reviewed alchemistry/alchemtest#30.

For tests I would generally use aggregate calculations such as a mean or a sum, in addition to specific values.

@harlor (Contributor, Author) commented Nov 16, 2018

OK, I added tests for specific elements and for the sum of elements on the diagonal.

Btw: Is it possible/a good idea to use hash values of dataframes to write tests?

@orbeckst (Member) commented

Btw: Is it possible/a good idea to use hash values of dataframes to write tests?

Not a good idea if it contains floating point numbers.

For those you can never test for equality. Instead you do

from numpy.testing import assert_almost_equal

assert_almost_equal(calculated, reference)
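A quick illustration (generic Python, not taken from this thread) of why bit-exact float comparisons are fragile:

    >>> 0.1 + 0.2 == 0.3
    False
    >>> from numpy.testing import assert_almost_equal
    >>> assert_almost_equal(0.1 + 0.2, 0.3)   # passes: agrees to the default 7 decimals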

@orbeckst (Member) left a comment

be careful when checking floats!!

assert _diag_sum(dataset) == 16674041445589.646

# Check one specific value in the dataframe
assert float(extract_u_nk(dataset['data']['AllStates'][0], T=300).iloc[0][0]) == -15659.655560881085
@orbeckst (Member) commented

needs to use assert_almost_equal or similar. You cannot compare floats for equality because this depends on hardware etc.
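For example, the assertions from the snippet above could be written as (the decimal tolerance is illustrative; see the discussion further down in this thread):

    from numpy.testing import assert_almost_equal

    assert_almost_equal(_diag_sum(dataset), 16674041445589.646, decimal=6)
    assert_almost_equal(
        float(extract_u_nk(dataset['data']['AllStates'][0], T=300).iloc[0][0]),
        -15659.655560881085,
        decimal=6,
    )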

assert _diag_sum(dataset) == 20572988148877.555

# Check one specific value in the dataframe
assert float(extract_u_nk(dataset['data']['AllStates'][0], T=300).iloc[0][0]) == 18.134225023007403
@orbeckst (Member) commented

float comparison!

dataset = load_water_particle_without_energy()

# Check if the sum of values on the diagonal has the correct value
assert _diag_sum(dataset) == 20572988148877.555
@orbeckst (Member) commented

float comparison!

dataset = load_water_particle_with_potential_energy()

# Check if the sum of values on the diagonal has the correct value
assert _diag_sum(dataset) == 16674041445589.646
@orbeckst (Member) commented

float comparison!

@harlor (Contributor, Author) commented Nov 16, 2018

As you suggested, I use assert_almost_equal now.

Btw: Is it possible/a good idea to use hash values of dataframes to write tests?

Not a good idea if it contains floating point numbers.

Good to know!

@orbeckst (Member) commented

The tests fail because the values do not agree to the default precision. Add

assert_almost_equal(..., decimal=6)

to decrease the required accuracy.

When you work with continuous integration, check the output of the tests; if they failed (red X), fix the problem or ask for advice on how to fix it.

You should also be able to run the tests locally with

pytest

to catch anything obvious so that you don't have to wait for Travis.
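For example (the test path follows the file referenced earlier in this review; adjust to your checkout):

    pytest src/alchemlyb/tests/                       # run the whole suite locally
    pytest src/alchemlyb/tests/test_preprocessing.py  # or a single test module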

@orbeckst (Member) commented

P.S.: I think it was decimal, could also be decimals.

@orbeckst (Member) commented

P.P.S.: Also merge master into your branch.

@harlor (Contributor, Author) commented Nov 17, 2018

I will adopt local testing into my development workflow!

It turned out that I get the datasets from alchemtest in a different order on my machine - which causes not only these tests but also TestStatisticalInefficiency to give different results.

Is using something like sorted(glob(...)) instead of glob(...) in alchemtest, which sorts the datasets by filename, a good idea?

See: https://stackoverflow.com/questions/6773584/how-is-pythons-glob-glob-ordered
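A minimal illustration of the suggested change (the glob pattern is hypothetical; the actual change belongs in alchemtest's data accessors):

    from glob import glob

    files = glob('*.xvg.bz2')          # order is filesystem-dependent
    files = sorted(glob('*.xvg.bz2'))  # deterministic, sorted by filename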

@orbeckst (Member) commented

yes!

(I just merged your PR alchemistry/alchemtest#31)

@harlor (Contributor, Author) commented Nov 19, 2018

Thanks - is this PR also ready then?

@orbeckst (Member) left a comment

Looks great.

Minor comments – please address as necessary and ping me or @dotsdl when you're done.


# Check one specific value in the dataframe
assert_almost_equal(
    float(extract_u_nk(dataset['data']['AllStates'][0], T=300).iloc[0][0]),
@orbeckst (Member) commented

Why is float needed?

Member commented

Should probably work without.


# Check one specific value in the dataframe
assert_almost_equal(
    float(extract_u_nk(dataset['data']['AllStates'][0], T=300).iloc[0][0]),
@orbeckst (Member) commented

Why is float needed?


# Check one specific value in the dataframe
assert_almost_equal(
    float(extract_u_nk(dataset['data']['AllStates'][0], T=300).iloc[0][0]),
@orbeckst (Member) commented

Why is float needed?

for i in range(len(dataset['data'][leg])):
    ds += u_nk.iloc[i][i]

return ds
@orbeckst (Member) commented

add newline
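For readers following along, a self-contained sketch of what a diagonal-sum helper along these lines could look like (the dataset/leg layout is taken from the snippets in this review; the real test helper may differ in details):

    from alchemlyb.parsing.gmx import extract_u_nk

    def _diag_sum(dataset, T=300):
        """Sum u_nk[i, i] over all files of all legs of an alchemtest dataset (sketch)."""
        ds = 0.0
        for leg in dataset['data']:
            for filename in dataset['data'][leg]:
                u_nk = extract_u_nk(filename, T=T)
                # add entry (i, i): the i-th sample evaluated in the i-th state's column
                for i in range(len(dataset['data'][leg])):
                    ds += u_nk.iloc[i][i]
        return ds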

@orbeckst (Member) commented

@dotsdl you will also need to check and approve. Apparently, we have strict merge rules here... not enough to only satisfy one reviewer ;-).

@orbeckst (Member) commented

@dotsdl do you want to review or should I go ahead and merge? If I don't hear anything by 6pm AZ I'll merge using admin powers.

@dotsdl (Member) commented Nov 22, 2018

@orbeckst I should be able to review this during the weekend. If I don't manage to by Monday 11/26, please proceed with the merge.

@harlor (Contributor, Author) commented Nov 23, 2018

I don't know why the build of the tests failed - it worked after a restart.

@dotsdl (Member) commented Nov 23, 2018

It appeared to be a problem with building scikit-learn wheels, so something upstream was broken. Looks resolved now. I'll be reviewing this later today. Thanks @harlor!

@dotsdl (Member) commented Nov 26, 2018

Looks good to me, @harlor! Thanks for all the work you put into this. And thanks @orbeckst for moving this along! Merging!

@dotsdl merged commit 82ce951 into alchemistry:master on Nov 26, 2018.
@orbeckst mentioned this pull request on May 3, 2019.