More index optimisations #923

shawnlaffan · 2024-02-26T01:07:28Z

This PR includes several optimisations. Chief among them are:

- Implement _calc_abc_any for cases where a depending method only needs the label hash keys or element lists
- calc_abc grabs and modifies results from calc_abc2 or calc_abc3 if they have already been run, thus saving some processing
- calc_abc2 and calc_abc3 are always run before calc_abc
- Add a hierarchical _calc_abc variant for cases such as cluster node calcs, where the label hashes can be generated from the child node results instead of processing the full list of terminals
- Some other general index optimisations, such as result sharing when nbr set 2 is empty

This commit special cases the dependency calcs so at least one calc_abc sub is always run first when _calc_abc_any is a dependency. Many indices only need the abc counts or the keys of the label hashes. These can use any of calc_abc, calc_abc2 and calc_abc3. Recent commits added the capacity to grab any local precalc result, and this can be used to support a calc_abc_any sub.
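As a rough illustration of the "any of" dependency described above, here is a minimal Python sketch. It is not the Biodiverse Perl code: the `local_results` dict, the `labels_hash_all` key and the `run_calc_abc` callback are all assumptions made for the example.

```python
# Hypothetical sketch: names and data layout are assumptions, not Biodiverse code.
ABC_VARIANTS = ("calc_abc", "calc_abc2", "calc_abc3")


def calc_abc_any(local_results, run_calc_abc):
    """Return any already-computed calc_abc* result, else run calc_abc."""
    for name in ABC_VARIANTS:
        if name in local_results:
            return local_results[name]
    # Nothing precalculated, so run one variant and store it for later callers.
    local_results["calc_abc"] = run_calc_abc()
    return local_results["calc_abc"]


# calc_abc3 has already run; its hash values need not be 1, but only keys are used.
local_results = {"calc_abc3": {"labels_hash_all": {"sp1": 7, "sp2": 2}}}
result = calc_abc_any(local_results, run_calc_abc=lambda: {"labels_hash_all": {}})
labels = sorted(result["labels_hash_all"])
```

The point of the sketch is that a caller which only reads the hash keys does not care which variant produced them, so no extra pass over the elements is needed.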

Also fix a typo in metadata description.

calc_labels_on_tree reverts to depending on calc_abc, as it needs label hash values of 1, which calc_abc2 and calc_abc3 do not usually provide.

If calc_abc2 or calc_abc3 have already been calculated then we can just grab their results and set the label hashes to have values of 1. This avoids a lot of looping in several circumstances, for example where there are large neighbour sets in a spatial analysis, or where cluster indices are calculated per node.
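A minimal Python sketch of that adaptation step, under the assumption that a result is a dict holding three label hashes. The key names `labels_hash_all`, `labels_hash1` and `labels_hash2` are illustrative only, and this is not the Perl implementation.

```python
# Hypothetical sketch: key names are assumptions, not the Biodiverse data layout.
LABEL_HASH_KEYS = ("labels_hash_all", "labels_hash1", "labels_hash2")


def adapt_to_calc_abc(abc_result):
    """Copy a calc_abc2/calc_abc3 style result, setting all label values to 1."""
    adapted = dict(abc_result)  # shallow copy so the source result is untouched
    for key in LABEL_HASH_KEYS:
        adapted[key] = dict.fromkeys(abc_result.get(key, {}), 1)
    return adapted


abc3_result = {
    "labels_hash_all": {"sp1": 7, "sp2": 2},
    "labels_hash1": {"sp1": 7},
    "labels_hash2": {"sp2": 2},
}
abc_result = adapt_to_calc_abc(abc3_result)
```

Only the label keys are visited here, which is where the saving comes from when the neighbour sets contain many elements.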

calc_abc2 and calc_abc3 are now run before calc_abc. This saves some processing given calc_abc can now adapt their results.

AS_RESULTS_FROM_LOCAL is now a proper parameter, not a cache entry. Caches can be deleted at any time, so a cache entry might be interfered with in the middle of processing.

In _calc_phylo_abc_lists, if we have no second neighbour set then there is no need to run all the hash processing.

If labels_hash2 is empty then the rarity central and whole variants give the same results, so one can just grab the other's results and remap the keys.
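The copy-and-remap idea can be sketched in Python as below. The `RAREC_` and `RAREW_` prefixes, the function names and the `compute_whole` callback are hypothetical stand-ins chosen for the example, not names confirmed by this PR.

```python
# Hypothetical sketch: prefixes and function names are assumptions.
def remap_prefix(results, old_prefix, new_prefix):
    """Return a copy of results with old_prefix swapped for new_prefix in keys."""
    remapped = {}
    for key, value in results.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        remapped[key] = value
    return remapped


def rarity_whole(labels_hash2, central_results, compute_whole):
    """Reuse the central results when there is no second neighbour set."""
    if not labels_hash2 and central_results is not None:
        return remap_prefix(central_results, "RAREC_", "RAREW_")
    return compute_whole()


central = {"RAREC_CWE": 0.25, "RAREC_RICHNESS": 3}
whole = rarity_whole({}, central, compute_whole=lambda: {"RAREW_CWE": -1})
```

When the second label hash is non-empty the shortcut is skipped and the full calculation runs as before.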

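Similarly, calc_phylo_rpe_central and calc_phylo_rpe2 can share results if there are no neighbours in set 2. This avoids a lot of processing when both are being calculated, as the results will be the same in such cases, so there is no need to recalculate everything.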

The hierarchical calc_abc variant avoids a lot of extra computation for calculations on cluster nodes, as internal nodes can combine their child node results instead of iterating over all the terminal elements.
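A small Python sketch of the hierarchical idea, assuming a simple tree where each terminal node carries the labels of its element. The `Node` class and the per-node `cache` dict are invented for illustration and do not reflect the Biodiverse tree classes.

```python
# Hypothetical sketch: Node and the cache layout are assumptions.
class Node:
    def __init__(self, name, children=(), labels=()):
        self.name = name
        self.children = list(children)
        self.labels = list(labels)  # only meaningful for terminal nodes


def label_hash(node, cache):
    """Build a node's label hash from its children's cached hashes."""
    if node.name in cache:
        return cache[node.name]
    if not node.children:
        result = dict.fromkeys(node.labels, 1)
    else:
        result = {}
        for child in node.children:
            # Union of child hashes; terminals below are not revisited once cached.
            result.update(label_hash(child, cache))
    cache[node.name] = result
    return result


t1 = Node("t1", labels=["a", "b"])
t2 = Node("t2", labels=["b", "c"])
t3 = Node("t3", labels=["d"])
inner = Node("inner", children=[t1, t2])
root = Node("root", children=[inner, t3])

cache = {}
root_labels = sorted(label_hash(root, cache))
```

Each internal node does work proportional to its children's hashes rather than to the number of terminals beneath it, which matters most near the root of a large tree.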

shawnlaffan added 20 commits February 24, 2024 20:00

Indices metadata: better feedback on error

c2741dc

Indices: fix cache logic in _calc_abc_any, start using it

17d0834

Indices precalcs: more now use _calc_abc_any

8b2e965

Indices: calc_matrix_stats now uses _calc_abc_any

1acc8f5

Indices: several phylo calcs now depend on _calc_abc_any

a9c60e7

Indices: calc_labels_not_on_trimmed_tree now uses _calc_abc_any

edde81d

Indices::Rarity: Simplify inheritance and some metadata

3fe31fd

Indices: revert calc_labels_on_tree dep back to calc_abc

fd012ef

Indices: ensure calc_abc2 and 3 are run before calc_abc

466eeaa

Indices: make AS_RESULTS_FROM_LOCAL a proper param, not a cache entry

b7456f0

Indices: optimise _calc_phylo_abc_lists

b321c48

Indices: minor refactor of _calc_rarity_(central|whole)

496687a

Indices: rarity whole and central copy results when needed

0aa0afd

Indices: calc_phylo_rpe_central can copy calc_phylo_rpe2 if no set2 nbrs

2cd3978

Indices: rpe_central and rpe2 can share results if no nbr set2

71ef863

Indices: micro-optimise the aed calcs

c1a3f9d

Indices: add a hierarchical calc_abc variant

f1d872d

rand reintegration tests: change numeric tolerance to 1e-8

a046fe1

shawnlaffan merged commit 8216111 into master Feb 26, 2024
8 checks passed

shawnlaffan deleted the calc_abc_any branch February 26, 2024 01:24
