Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganizing stat.py #3535

Merged
merged 57 commits into from
Dec 14, 2020
Merged

Reorganizing stat.py #3535

merged 57 commits into from
Dec 14, 2020

Conversation

spxiwh
Copy link
Contributor

@spxiwh spxiwh commented Nov 25, 2020

stat.py has become a little bit difficult to work with due to the large amount of classes now being needed to implement even minor changes. I want to rework this according to the pasted image. So far I've implemented this basically as illustrated below, with only those classes. It's possible this could be compressed further, but perhaps this is better? Clearly this PR will not work alone as we will need to remove completely the old stat executables, and hook the new ones up to the renamed methods here. However, I think it's worth seeing first if this is how we want this to look. I think there are some design choices here (most of the questions in red/green in the image) which are not obviously clear.

There's also the question of how to handle singles. At the moment, as I am removing the "sngl_statistic" concept from stat.py there are two places singles can get information. Either via the single_multiifo method for getting a full ranking statistic, or via the get_ranking function in ranking.py to access anything that can be computed from the trigger alone. Are these two places sufficient? If only pycbc_multiifo_sngls_findtrigs uses the single_multiifo method, is it okay that the plotting codes cannot get to exp_fitted singles ranking? Is a requirement to hook that up in some way?

When implementing get_sngls_ranking_from_trigs I first thought of doing something complicated allowing arbitrary extensions to the statistic name (or kwargs) to turn off/on things like sgveto, cuts, psdvar etc. However as the PSDvar affects the chi-squared these things are not all nicely orthogonal. As there are only a small enough number of these rankings at the moment anyway, I did not do the more involved thing. Perhaps at some point we would need to swap over, but I think that could be done ... Or maybe it needs doing now??

IMG_20201124_142120

@spxiwh spxiwh requested review from ahnitz and tdent November 25, 2020 15:34
@tdent
Copy link
Contributor

tdent commented Nov 26, 2020

Thanks Ian. Are the failing builds meaningful?

@spxiwh
Copy link
Contributor Author

spxiwh commented Nov 27, 2020

@tdent I would expect failures on this, as the changes here are going to require small updates in other parts of the code to match. I'll address those before removing the Work-In-Progress flag ... For now I just post this to ensure that anyone has a chance to speak up if you think this should be done differently.

the threshold for each of the input triggers.
# FIXME: Why does this function not just call straight into single?
# FIXME: Should this be renamed for clarity if multiifo is being removed?
def single_multiifo(self, single_info):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single_multiifo() function is to generate a ranking statistic for a single-detector candidate which is equivalent to the ranking statistic of a coincident event.

This is distinct from the single() function as that is used as the intermediate step in the likelihood-based statistics. In sum-of-squares statistics, it is entirely equivalent.

I agree that a better name should be used for this. e.g. rank_stat_single (and coinc_multiifo/coinc renamed to rank_stat_coinc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GarethDaviesGW The naming is fine with me. Would you be okay with having rank_stat_single just call straight through to the single method and replace this:
https://github.com/gwastro/pycbc/blob/master/bin/hdfcoinc/pycbc_multiifo_sngls_findtrigs#L144
with just the one call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this comment. 'rank_stat_coinc' seems like an unnecessarily long name given that the module and classes in it and instances in the rest of the code already have the name 'ranking' or 'stat'. I'd suggest 'coinc_rank' and 'single_rank'.

Also if we're renaming things, how about renaming 'ranking.py' -> 'sngl_statistic.py' ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to rename these to whatever is most useful.

I would avoid single_rank to avoid confusion with the stuff defined in ranking.py. I'm not quite sure what the right words are to distinguish between "single detector ranking using only information from the trigger itself", which is being referred to as "single ranking" and "single detector ranking using bulk information and fits". Maybe coinc_stat and single_stat??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For similar reasons renaming ranking.py to sngl_statistic confuses in my head the distinction between these two things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I'm suggesting would end up with the logic that 'single detector number calculated using only trigger information' is a 'sngl stat(istic)' and things analogous to the coinc ranking stat which may use global information are called 'rank(ing)'
i.e. something calculated at single trigger label is always a 'statistic', but in general won't be a ranking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yes, this is a bit backwards on wrt the current naming of stat.py and ranking.py, but i think makes more sense. (also consistent with naming instances of a ranking-stat object as 'rank_method' in scripts)

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 3, 2020

This is closer now to something that works, but I'll try to separate off parts of this to get that merged first.

@tdent
Copy link
Contributor

tdent commented Dec 3, 2020

ok, sorry, i haven't managed to take much of a detailed look at this yet. presumably the thing to look at first is still ranking.py and stat.py.

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 3, 2020

@tdent Yeah, stat.py and ranking.py are the things to look at.

Really there are now 3 ways to get information here:

  • The coinc method of the stat class for coinc statistic
  • The single_multiifo (needs renaming) method of the stat class for single statistic (some followup codes e.g. minifollowups could call into this to decide on most interesting singles if they so desire).
  • The ranking function in ranking.py. This must be computable from the values contained in a single-detector trigger and nothing else.

I find this quite a nice breakdown of information (and the old sngl_statistic_dict is now removed, as Gareth basically duplicated it anyway with the single_multiifo method).

All statistics will have these methods, but in some cases they are not implemented so will fail if called (but one can always add them in if needed).

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 8, 2020

@ahnitz @tdent @GarethDaviesGW I've now rebased this against master. I have successfully tested this in a couple of end-to-end workflows now. I think I might still have a few loose ends to connect to the new code, but the main workflow runs and this is ready for review at this point.

Some things to highlight:

  • Obviously stat.py has had a major reorganization, but it still does the same things (excepting the removal of the old stat code).
  • When initializing statistics a new argument is now needed: the single-detector ranking. This avoids having to define 48 statistics to be able to do the various versions of PSD_Var on/off SgChisq on/off etc. The single-detector stuff in ranking.py could be reorganized if this starts to get out of hand, but stat.py can handle this.
  • A new command line group is added for stat.py. This makes it easier for things like the minifollowup code to call into this (very few codes could take the mchirp key-word added some time ago).
  • Single-detector statistics/rankings: If you want to rank/get statistic values for single-detector triggers you now have two choices. Either the stuff in ranking.py which just takes the trigger and must be usable even by pycbc_inspiral. Or the single-detector statistic as used in 2-OGC, the latter requires using the stat.py command line options.
  • This does necessitate changes to config files, and cannot be backwards compatible, but the changes are not too much.

@tdent
Copy link
Contributor

tdent commented Dec 9, 2020

@spxiwh thanks - I assume you mean "When initializing statistics a new argument is now needed".

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 9, 2020

@spxiwh thanks - I assume you mean "When initializing statistics a new argument is now needed".

Indeed, thanks .. .Fixed above.

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 9, 2020

I'm not now aware of any more changes that need to be added to this (other than deciding appropriate names for some of the methods as discussed above).

For the remaining CC points, I'll deal with the FIXME in a separate PR and the long line is better than splitting according to (https://black.now.sh/).

@@ -67,11 +62,12 @@ sngls = hdf.SingleDetTriggers(args.single_trig_file, args.bank_file,\

if args.statname is not None:
logging.info('Calculating stat')
statengine = stat.sngl_statistic_dict[args.statname](args.statistic_files)
# FIXME: inefficient, as we are calculating the stat on all
statengine = stat.get_statistic_from_opts(args, [args.instrument])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be called 'rank_method' as in other scripts

@tdent
Copy link
Contributor

tdent commented Dec 9, 2020

@spxiwh actually, it might make more sense to merge this once the functionality / testing aspects have been dealt with and then do any renaming in a subsequent PR (which someone else could deal with).

@tdent
Copy link
Contributor

tdent commented Dec 9, 2020

@spxiwh I think rather than ValueError the placeholder methods of the base class should raise NotImplementedError

be applied (unused in this statistic)
err_msg = "This function is a stub that should be overridden by the "
err_msg += "sub-classes. You shouldn't be seeing this error!"
raise ValueError(err_msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as noted in the main PR thread, this and analogous exceptions should preferably be NotImplementedErrors.

Copy link
Contributor

@tdent tdent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from the request to change the 'statengine' instance name to 'rank_method' for consistency, and change ValueError to NotImplementedError for stub methods, the changes look ok to me

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 9, 2020

Thanks @tdent I'll make the requested changes, and leave the renaming for a further commit as you suggest.

@tdent
Copy link
Contributor

tdent commented Dec 9, 2020

fwiw if you want to make pycbc generally into a '90-ish-character-long-line' codebase as opposed to '79-character-long-line' that should be taken up at a higher level. However in this case it looks like there's only actually one exceptionally long line, which is ignorable.

@GarethCabournDavies
Copy link
Contributor

N.B: Just seen another bit of old-workflow-compatible code in add_statmap

@titodalcanton
Copy link
Contributor

I am doing a quick test of PyCBC Live with this patch applied and the phasetd_newsnr coinc statistic, as the test in /examples/live does not use the PTA stat files. I switched the PyCBC Live call to use these options:

--sngl-ranking newsnr \
--ranking-statistic phasetd_newsnr \
--statistic-files /same/as/previous/working/tests.hdf \

I get this error:

  File "/somewhere/bin/pycbc_live", line 755, in <module>
    len(bank), args.analysis_chunk, list(combo)))
  File "/somewhere/lib/python3.6/site-packages/pycbc/events/coinc.py", line 864, in from_cli
    **kwargs)
  File "/somewhere/lib/python3.6/site-packages/pycbc/events/coinc.py", line 776, in __init__
    **kwargs
  File "/somewhere/lib/python3.6/site-packages/pycbc/events/stat.py", line 310, in __init__
    ifos=ifos, **kwargs)
  File "/somewhere/lib/python3.6/site-packages/pycbc/events/stat.py", line 58, in __init__
    f = h5py.File(filename, 'r')
  File "/somewhere/lib/python3.6/site-packages/h5py/_hl/files.py", line 395, in __init__
    name = filename_encode(name)
  File "/somewhere/lib/python3.6/site-packages/h5py/_hl/compat.py", line 111, in filename_encode
    filename = fspath(filename)
TypeError: expected str, bytes or os.PathLike object, not list

Apparently this happens because of this (this is output from a print() I added to pycbc_live just before instantiating the LiveCoincTimeslideBackgroundEstimator class, not actual Python code):

args.statistic_files = [['/home/pycbc.live/production/data/H1L1-PHASE_TIME_AMP_v2.hdf']]

Where does the list-of-lists comes from? I would expect a (correct) list of strings by looking at the add_argument() call. @spxiwh any idea?

@titodalcanton
Copy link
Contributor

PyCBC Live now gets past the initialization and crashes as soon as some triggers show up:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/somewhere/lib/python3.6/site-packages/pycbc/pool.py", line 42, in _lockstep_fcn
    return fcn(args)
  File "/somewhere/bin/pycbc_live", line 766, in get_coinc
    r = c.add_singles(results)
  File "/somewhere/lib/python3.6/site-packages/pycbc/events/coinc.py", line 1180, in add_singles
    _, coinc_results = self._find_coincs(results, ifos=valid_ifos)
  File "/somewhere/lib/python3.6/site-packages/pycbc/events/coinc.py", line 1053, in _find_coincs
    [0, -1]
  File "/somewhere/lib/python3.6/site-packages/pycbc/events/stat.py", line 628, in rank_stat_coinc
    raise ValueError(err_msg)
ValueError: Sorry! No-one has implemented this method yet!

Am I right that you expect this error @spxiwh? Do you prefer to fix this in a followup PR?

@titodalcanton
Copy link
Contributor

Also, FWIW maybe that ValueError should be a NotImplementedError?

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 11, 2020

@titodalcanton Yes, I expected this error. It should be fixed in a separate PR, as it would be adding new functionality, not migrating old. (I think PyCBC Live was running the old coinc code with an old version of the PT statistic, neither of which are supported now). Adding this back in should not take long, but I'm not quite sure what the rank_stat_coinc function should be here so I'm leaving the NotImplemented stuff to be added in as people need it.

... And yes, this should be a NotImplemented error. Tom already asked for that and I forgot it, I will change that.

@spxiwh
Copy link
Contributor Author

spxiwh commented Dec 11, 2020

This should now pass all tests, and I think I've addressed all comments/feedback given.

@spxiwh spxiwh changed the title [WIP] Reorganizing stat.py Reorganizing stat.py Dec 11, 2020
Copy link
Member

@ahnitz ahnitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge once Travis / any final cc passes.

@spxiwh spxiwh merged commit bde15e7 into gwastro:master Dec 14, 2020
@spxiwh spxiwh deleted the pr_stat_reworking branch December 14, 2020 16:01
OliverEdy pushed a commit to OliverEdy/pycbc that referenced this pull request Apr 3, 2023
* First cut at breaking down stat module

* Documentation

* More documentation

* More documentation

* More documentation

* Some more doc updates

* More doc updating

* Updating more docs

* Updates

* More updating

* More updates

* More updating stat.py

* CodeClimate-ing

* More codeclimating

* Some more codeclimating

* Update coinc code to use reorganized stat.py

* Updating exes to new interface

* Update eventmgr to new interface

* Remove some old stuff from ranking.py

* Remove old sngl_statistic concept

* Updating hdf.py to new interface

* Change name in exe

* Fix to ranking.py

* Typo fix in hdf.py

* Typo in stat.py

* Adding in some more missing info

* Update in coinc_findtrigs

* Missed a line in pycbc_plot_singles_vs_params

* Rewire single_multiifo

* Hook up mask_to_n_loudest properly

* Add new option to sngl_minifollowup

* Simplify sngls_findtrigs

* Implement a stat_option_group for convenience

* Add reminder FIXME

* Fix typo

* Fix another typo

* Fix typo in fit_sngls_by_template

* Fix bug in sngl_minifollowup

* CodeClimating

* Fix typo

* Hook up pycbc_live

* Fix typo

* Some codeclimating

* Rename coinc and single_multiifo as suggested

* Live must use new interface

* Typo

* More typos

* Fix another typo

* More CodeClimating

* Fix to pycbc_process_sngls

* Update testing config files

* Some minor renaming (remove "New")

* Fix PyCBC Live interface

* Make ValueErrors NotImplementedErrors

* Fix nasty typo

* Rename as requested by Tom

* Hopefully final CodeClimating
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants