Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various contributions #86

Draft
wants to merge 24 commits into
base: master
Choose a base branch
from
Draft

Various contributions #86

wants to merge 24 commits into from

Conversation

salotz
Copy link
Collaborator

@salotz salotz commented Sep 20, 2023

Opening this up for the possibility of merging as some interesting features were added and various bug fixes were made.

This likely shouldn't be merged as-is. The HDF5 file IIRC has some compatibility issues as it added support for variable numbers of walkers.

salotz and others added 24 commits March 30, 2021 19:18
This is needed for testing
Also removes the checks on the images for shapes and dtypes so that
any object can be used as an image.
This is a WIP and subject to change.

WARNING: This does not yet make all of the APIs aware of variable
numbers of walkers. There are likely bugs in the analysis code if you
have variable number of walkers. This support will likely get added
after this is merged to master.

What has changed?

First in the WepyHDF5 class itself:

- There are a number of new methods to support proper and robust frame/cycles:
  - run_cycle_idxs
  - run_cycle_walker_idxs
  - run_cycle_walker_traces
  - get_traj_field_names
  - get_traj_field_names_observables
  - get_traj_field_names_alt_reps
  - get_traj_cycle_idxs
  - get_traj_field_num_frames

- A number of methods were altered to not be lazy about getting
  frame/cycle idxs etc.
  - num_traj_frames
  - get_traj_field_cycle_idxs

The main method of supporting variable numbers of walkers is through
sparse fields over all of the fields.

These latter methods also now take into account the offset in the
beginning of a trajectory for a new larger number of
walkers.

This is both for contiguous and sparse fields.

This relies on the presence of a metadata attribute in each traj group
that is the 'start_cycle_idx' that is set in the reporter.

Additionally some small changes:

- add_traj now returns the traj_grp that it created

For the WepyHDF5Reporter there are some additions to the API.

Mainly this is the semantics around specifying the 'sparse_fields'
argument. It should be backwards compatible.

But now there is the option of both specifying a single Ellipsis,
which makes all fields sparse.

This would be the default/base recommendation for a variable number of
walker simulation.

And now also there are 3 total choices for a field by field specification:

- `int` for a frequency
- `None` if it is not sparse (ensures it is not sparse. Can act as an "exclude")
- `Ellipsis` which will initialize the field as sparse but will still save everything that it can.

It is the `Ellipsis` option which is used for the variable number of walkers.

It should be possible to always make this the default. But likely
there is too much code relying on the datasets being uncomplicated by
sparse fields.

Additionally the alt_reps and all_atoms_rep_freq options support the
Ellipsis option as well now.

The other major change is that now the reporter will record much more
metadata on the cycle indices for the runs and trajectories.

The new trajectory group metadata fields are:

- 'first_cycle_idx' :: previously it did set this as 'cycle_idx' kind
  of as a hail mary for supporting variable walkers.
- 'num_cycles' :: Which gets updated with each call to report
- 'last_cycle_idx' :: which also gets updated with each call to report

This allows for some consistency checks with both the shape of the
data and in a run as a whole which is much more efficient than
actually counting them.

Currently 'first_cycle_idx' is the only actualized dependency in the
WepyHDF5 class used as offsets for cycle indices.

The new run group metadata attributes are:

- 'num_cycles'
- 'last_cycle_idx'
also removed the sparse_fields_freq argument that was left in by mistake
The performance for both `run_cycle_walker_idxs` was abysmal because
of the repeated calls to `get_traj_field_cycle_idxs`.

So I added in a shim function `_get_all_run_traj_cycle_idxs` which
gets all of the trajectory cycles in one batch. This can be used as a
datastructure without having to read them all the time from disk.

`run_cycle_walker_idxs` was changed to `run_cycles_walker_idxs` and
supports getting multiple cycles at a time, which makes only one call
to `_get_all_run_traj_cycle_idxs` each time.

This was then used from `run_cycle_walker_traces` and dramatically
improves performance.
There was a mistake where the number of frames in the data assumed it
wasn't sparse. This was corrected.
The `get_traj_field` was broken with the new cycle_idxs semantics.

First the weights were not sparse as they should always be saved.

The current commit is a HACK to get around this. There is a shim
function for getting the weights specifically as a field which will
look at the sparse idxs of the 'positions' field.

As noted in the issue this should change in the future in a more
robust mechanism.

We also have a split in the semantics of the 'frames' argument to
`get_trajectory_field`. For that function it is actually the
"cycle_idxs" as one would logically expect.

The lower level functions for contiguous and sparse fields accept a
"frames" but this is actually the 'frame_idx' which is just the
counting index of the frames in that lane.
Previously there was just a 'frames' argument for choosing a subset of
the structures in a trajectory.

With the introduction of "sparse trajectories" and not just "sparse
fields" there is now a distinction between 'frame_idxs' and
'cycle_idxs'.

This was addressed in a lower level in earlier commits but the
`get_traj_fields` API stayed the same with the 'frames' argument
simply meaning 'cycle_idxs'.

Components that were using `get_traj_fields` (like in the distributed
analysis chunker) were using 'frames' in an ambiguous way since there
functionally wasn't a difference.

Now this difference matters. For tools like the chunker where the
actual time dependence between walkers is irrelevant they needed a way
to query just on the data shape itself, or the 'frame_idxs'.

'frame_idxs' was added as an argument to support this.

The 'frames' argument is a bit redundant and confusing with this (and
not clear it means actuall cycle_idxs) so we deprecated it and
replaced it with 'cycle_idxs'. It still exists for now but raises a
deprecation warning on usage and may be removed in the future.
Credit to @tom.dixon for the fix.
add/fix the progress section RDAS-37
port fixes for just the REVO merge bug RDAS-142
I also included all of the new networkx refactoring to get an extra
test that it is indeed working
@salotz salotz added the invalid label Sep 20, 2023
@salotz salotz self-assigned this Sep 20, 2023
@salotz salotz marked this pull request as draft September 20, 2023 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants