Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of new graph traversal tools #3686

Merged
merged 5 commits into from
Jan 10, 2020

Conversation

ramirezfranciscof
Copy link
Member

This PR incorporates two new tools for graph traversal:

  1. The AiiDA Graph Explorer (AGE) package of classes: this is a general tool that allows to do complex graph traversal operations in an AiiDA 'expanded graph' (in which both the AiiDA nodes and groups are considered to be graph nodes) by defining 'rules' from generic querybuilder instances. Then, given an initial set of nodes (that have to be set inside generic 'baskets'), these queries will be successively applied on top of the results of the previous one, repeating the whole cycle the desired number of times (which could be 'until no new nodes are added').

  2. The traverse_graph function: this is a simplified interface to use the AGE to search for AiiDA nodes and links using a reduced set of customizable rules. Contrary to the AGE, which retains basically the same versatility of the querybuilder and thus allows for very complex traversals, with traverse_graph you can only specify which type of links will be allowed to be traversed and in which direction, and these are all traversed in each iteration. The advantage is that one no longer needs to manually define and handle baskets and querybuilders.

The function traverse_graph (which uses AGE as its search engine) is now used by the delete, the export and the graph visualization procedures. For the first one, the delete_nodes function now uses this function instead of doing its own search but the procedure is similar to how it was performed before. On the other hand, for the export procedure every node added to the set was queried separately for ancestors and descendants, whereas now the whole sets of new nodes found are queried for more nodes (all of this happens inside the retrieve_linked_nodes function). Finally, the graph class methods recurse_descendants and recurse_ancestors used to work by calling the add_incoming and add_outgoing many times, inside of which nodes were loaded to get their pks. Now these are all independent and each has its own call to traverse_graph, where pks are obtained directly from the query projection.

This closes #3331

@ramirezfranciscof
Copy link
Member Author

(I specifically tagged @CasperWA because he had requested to review this once it was finished, but all reviews are most welcome)

Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ramirezfranciscof for pushing forward with this!
I've had a quick read through a couple of files, some minor comments here and there.

Perhaps @chrisjsewell would like to give a brief look at what is happening in graph.py ?

aiida/common/links.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/importexport/dbexport/utils.py Outdated Show resolved Hide resolved
aiida/tools/importexport/dbexport/utils.py Outdated Show resolved Hide resolved
aiida/tools/graph/graph_traversers.py Outdated Show resolved Hide resolved
aiida/tools/graph/graph_traversers.py Show resolved Hide resolved
aiida/tools/graph/age_rules.py Outdated Show resolved Hide resolved
@chrisjsewell
Copy link
Member

Perhaps @chrisjsewell would like to give a brief look at what is happening in graph.py ?

For the graph visualisation, obviously the principle thing is that test_graph.py still passes. But also you need to check whether visualising_graphs.ipynb still runs or requires any changes.
If there are any changes, these also need to be translated to visualising_graphs.rst.

@zooks97
Copy link
Contributor

zooks97 commented Dec 18, 2019

Perhaps @chrisjsewell would like to give a brief look at what is happening in graph.py ?

For the graph visualisation, obviously the principle thing is that test_graph.py still passes. But also you need to check whether visualising_graphs.ipynb still runs or requires any changes.
If there are any changes, these also need to be translated to visualising_graphs.rst.

All of the tests in test_graph.py do pass, but I have not tested visualising_graphs.ipynb. I purposefully did not change the interface for the Graph class or any of its methods to maintain full backward compatibility (except that providing a print function does not print the graph nodes and links as they are traversed), so I expect that there should be no issues with this notebook or the documentation that it produces.

@ltalirz
Copy link
Member

ltalirz commented Dec 19, 2019

@ramirezfranciscof Are you working on updating the branch? Do you need help?
Moving forward with this would be quite important since it addresses an major usability issue

@ramirezfranciscof
Copy link
Member Author

@ltalirz Yes, I will now apply the requested changes. If it is ok with you, I will make modify the respective existing commits with the corresponding change, so I will force push the modifications.

@ramirezfranciscof
Copy link
Member Author

ramirezfranciscof commented Dec 19, 2019

@ltalirz I applied all corrections except for the comment in the links (waiting for the ok) and the export_tree that you said should be a different PR (and also I didn't really got what you meant). Please check, specially if the keyset property (to replace get_keys) was done correctly (I would have liked to use self.keyset = None in the __init__ but pylint complains if _keyset is not initialized explicitly there =S).

@ramirezfranciscof
Copy link
Member Author

Ok, docs are now failing because it doesn't recognize a type (make html worked locally, but I didn't check doing just make). How can I specify in the type docstring that something can have several types and/or that it can also be None?

@greschd
Copy link
Member

greschd commented Dec 19, 2019

How can I specify in the type docstring that something can have several types and/or that it can also be None?

I think you can use the typing module: typing.Union[str, int] means "str or int", and typing.Optional[str] for "str or None".

@ramirezfranciscof
Copy link
Member Author

@greschd But you mean in the docstring? Like this?

"""
(...)
    :type max_iterations:  typing.Optional[int]
    :param max_iterations:
        The number of iterations to apply the set of rules (a value of 'None' will
        iterate until no new nodes are added).
(...)
"""

Because this is not working for me, its just rendering literally 'typing.Optional[int]'

@greschd
Copy link
Member

greschd commented Dec 19, 2019

Ah, no.. apparently that does not work 🙄

The following works, though:

:type max_iterations:  int or None

If we start adding type hints, we can consider the following sphinx plugin to avoid duplicating the info: https://pypi.org/project/sphinx-autodoc-typehints/

@ramirezfranciscof
Copy link
Member Author

Thanks @greschd ! That seemed to have worked.

@ltalirz ltalirz self-requested a review December 19, 2019 14:41
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ramirezfranciscof

@ltalirz I applied all corrections except for the comment in the links (waiting for the ok)

Made suggestion here

Please check, specially if the keyset property (to replace get_keys) was done correctly

Looks fine.

(I would have liked to use self.keyset = None in the init but pylint complains if _keyset is not initialized explicitly there =S).

Also this is correct - the constructor should initialize the underlying variable.

@sphuber @CasperWA From my side this looks good and brings enormous speedup to the export.
Who still wants to have a look before this goes in?

@sphuber
Copy link
Contributor

sphuber commented Dec 19, 2019

Thanks guys, I would like to give it a look still and will do so today.

@ltalirz
Copy link
Member

ltalirz commented Dec 19, 2019

Just for reference, here again the results from the test I did.

Test set: Exporting 67k groups with 5 nodes each (which expands to 1.4M nodes with provenance)

"Old implementation"[1]: ~450 minutes
AGE implementation: 6 minutes

[1] This is on a modified version of the progress bar PR, where I already removed slow-down from the progress bar.
It took 5h 15min to retrieve the linked nodes for 70% of the nodes, then I canceled. I extrapolated from there for the total time.

@ramirezfranciscof
Copy link
Member Author

(I would have liked to use self.keyset = None in the init but pylint complains if _keyset is not initialized explicitly there =S).

Also this is correct - the constructor should initialize the underlying variable.

I'm a little confused by this. Part of the idea of the setter is to avoid duplication of checks, parsings, etc correct? The typical example being a property Temp and checking that it is being set to some value > -273. If the __init__ has to set _Temp itself, then I have to repeat the check in there. Sure, in the case this is just hardcoded to a value, you know what to use, but what if the initializer takes an input for this? You either have to make the checks yourself or initialize _Temp to a random valid value just to appease pylint (or more generally to appease the guideline) and then call the setter to put the real value? I don't know, somehow it feels cleaner if the hidden attribute _Temp could only appear inside the property methods.

@sphuber
Copy link
Contributor

sphuber commented Dec 19, 2019

I'm a little confused by this. Part of the idea of the setter is to avoid duplication of checks, parsings, etc correct? The typical example being a property Temp and checking that it is being set to some value > -273. If the __init__ has to set _Temp itself, then I have to repeat the check in there. Sure, in the case this is just hardcoded to a value, you know what to use, but what if the initializer takes an input for this?

This may be just a bug in pylint. If you are initializing a class instance attribute through a setter property in the constructor then that is perfectly fine and indeed the right practice. You probably mean that pylint about something like "warning: class attribute initialized outside-constructor" or something to that effect? In that case it is simply a false positive and I think that pylint does not realize that you are in fact initializing it.

@ltalirz
Copy link
Member

ltalirz commented Dec 19, 2019

@ramirezfranciscof I think you make a valid point.
What I meant to say was: It is good practice to explicitly initialize instance variables in the constructor, since this is easy to read and you are less likely to run into cases where the variable is not defined.

I agree that there are also cases, where it makes sense to use the setter already for initialization, in particular if you were passing along arguments of the constructor to the variable.

Here, you are setting it to None, however, so I think both solutions are fine.

@ramirezfranciscof
Copy link
Member Author

This may be just a bug in pylint. If you are initializing a class instance attribute through a setter property in the constructor then that is perfectly fine and indeed the right practice. You probably mean that pylint about something like "warning: class attribute initialized outside-constructor" or something to that effect? In that case it is simply a false positive and I think that pylint does not realize that you are in fact initializing it.

Yes, this is indeed the case. All I found regarding this problem was this issue in the pylint github.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My compliments to the AGE team, this is some great code! This should really clean a lot of the code base and as @ltalirz has noticed make it a lot faster too. I still have some minor requests but these should be relatively easy to address

aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
Copy link
Contributor

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to submit my review - and GitHub failed with a Unicorn! ...

It seems it has saved all my review changes, but this message was not cached, so I'll try again.... hmm...

Very good job on this massive PR!

I have been quite meticulous in my comments, considering changes to comments and variable names and more ... Don't worry about it, but do try and consider my comments fairly.

Be careful not to add Python2-specific code (e.g., imports from __future__).
Be aware that pylint is only trying to help, not annoy - although it definitely can be annoying. It is only trying to suggest that your code may be better - which is not always the case, sometimes you need certain things to be the way they are, but at least it should make you reconsider your code.

Concerning the export utility functions, we've discussed this offline, so I'm looking forward to seeing the update from that.

Otherwise, good job! And thanks for the hard work on implementing this in AiiDA.

aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
@ramirezfranciscof
Copy link
Member Author

Thanks to @ltalirz @sphuber and @CasperWA for the reviews! I know reviewing such a big PR (+2.5K lines) can be a PITB so I appreciate it. I will now work on these and as I modify things in my local code I will mark the comments resolved to better keep track of stuff (just letting you know in case you notice this before actually seeing any modifications applied and you wonder why).

@csadorf csadorf removed their request for review December 20, 2019 17:07
@csadorf
Copy link
Contributor

csadorf commented Dec 20, 2019

@ramirezfranciscof Unfortunately I won't be able to review this before the holidays anymore. Feel free to re-request review on Jan 6 if it's not done by then.

@ramirezfranciscof
Copy link
Member Author

I have incorporated almost all of the changes requested; a few "unresolved" comments remain on which I am still waiting for confirmation or some kind of feedback. I must warn that due to some of the comments and discussions with @sphuber , I have made some important changes in the traverse_graph function:

  1. It now knows nothing of TraverseRules, as it directly receives the links to traverse forward and backward.
  2. Export and delete functionalities now do not use the traverse_graph directly, but go through get_nodes_delete and get_nodes_links_export instead: these functions will receive the optional flags and pass those together with the adequate GraphTraversalRule to a common verification and translation functionality parse_traversal_rules, and then will use the traverse_graph with these options. The get_xxx functions return different things.
  3. Visualization does use the traverse_graph and parse_traversal_rules directly, and, I must admit, somewhat unnecessarily: actually in some sections the links are received and parsed into traversal rules and the de-parsed into links again, for example. Although this is not ideal, I didn't want to get into this now because I realized I was starting to re-write the whole graph class and didn't want to delay this PR any further (while also maybe get a bit out of scope). Perhaps this would be better for a posterior graph class refactor and cleanup PR?
  4. I didn't add tests for the get_xxx functions because their functionality is currently being tested by the export and delete tests (since the checks and such were previously in this functions). Eventually the tests should be re-organized, but again I didn't wanted to invest too much time in this now since the functionality was actually being tested, just not in the ideal place.

Let me know what yo think and if you think it would be important to do now any of the things I mentioned that I was leaving to tackle later in a different PR (tests and/or graph class).
(BTW: I am also not particularly married with the name of any of these functions, so if anybody has a better idea please do share)

@ramirezfranciscof
Copy link
Member Author

Ah, also, I get this weird error when compiling the documentation, where I cannot add set to the supported types because it won't recognize it (hence some of the docstrings clarify that sets are accepted in the description of the parameter instead of adding it to the :type xxx:). I tried looking for this problem and this might be a bug in the python documentation? Feedback is welcome.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, but still some small things and questions about design

aiida/tools/graph/graph_traversers.py Show resolved Hide resolved
aiida/tools/graph/graph_traversers.py Show resolved Hide resolved
aiida/tools/graph/graph_traversers.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/visualization/graph.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_rules.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/graph_traversers.py Outdated Show resolved Hide resolved
aiida/tools/graph/graph_traversers.py Outdated Show resolved Hide resolved
Copy link
Contributor

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great. Though I still have some comments and suggested changes.

aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/backends/tests/tools/graph/test_age.py Show resolved Hide resolved
aiida/backends/tests/tools/graph/test_age.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_entities.py Outdated Show resolved Hide resolved
aiida/tools/graph/age_rules.py Outdated Show resolved Hide resolved
to_be_exported = traverse_output['nodes']
graph_traversal_rules = traverse_output['rules']

# I create a utility dictionary for mapping pk to uuid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done already some other place in this function.
Please try to realign this so we're not doing it several times.

So it's fine to do it here, but then it shouldn't be done again later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe it is fine to do it like this, and then I'll try to patch the export function with a separate PR later.

lekah and others added 5 commits January 10, 2020 10:54
The AiiDA Graph Explorer (AGE) is a general purpose tool to perform
graph traversal of AiiDA graphs. It considers AiiDA nodes and groups
(eventually even computers and users) as if they were both 'graph nodes'
of an 'expanded graph', and generalizes the exploration of said graph.
The 'rules' that indicate how to traverse this graph are configured by
using generic querybuilder instances (i.e. with information about the
connections but without specific initial nodes/groups and without any
projections). The initial set of nodes/groups is provided directly to
the rule, which then will perform successive applications of the query,
each on top of the results of the previous one. This cycle is repeated
for a specified number of time, which can be specified to be 'until no
new nodes are found'.

The current implementation works with the following (public) classes:

* Basket: generic container class that can store sets of nodes, groups,
node-node edges (aiida links) and group-node edges. These are the
objects that the rule-objects receive and return.

* UpdateRule: initialized with a querybuilder instance (and optionally a
max number of iterations and the option to track edges), it can then be
run with an initial set of nodes to obtain the result of the accumulated
traversal procedure described by the iterations of the query.

* ReplaceRule: same as the update rule, except that at the end of the
procedure the returned basket contains not the accumulation of the
traversal steps but only the nodes obtained during the last step. This
is rule is not compatible with the 'until no new nodes are found' end
iteration criteria.

* RuleSequence: this can concatenate the application of different rules
(it basically works like an UpdateRule that iterates over a chain of
rules instead of a single querybuilder instance).

* RuleSaveWalkers and RuleSetWalkers: rules that can be provided in a
chain of rules given to a RuleSequence to save a given state of the
current basket (Save) that can later be used to overwrite the content
of said working basket (Set). This is useful in the case where one might
need to do two operations 'in parallel' (i.e. on the same set of nodes)
instead of doing the second on the results of the first one.

Co-Authored-By: ramirezfranciscof <ramirezfranciscof@users.noreply.github.com>
The function traverse_graph works as a simplified interface to interact
with the AGE that also removes the need to manually handle the basket
and the querybuilder instance:

 * The price to pay for hiding the basket is that this function can only be
   used with sets of nodes and links (so, no groups).

 * The price to pay for hiding the querybuilder is that complex traversal
   procedures can no longer be specified, the user simply defines which
   links can be traversed forwards and which backwards, and this criteria
   is then applied in every iteration (so one could not, in a single call,
   search only for all called calc nodes of the  called work nodes of an
   initial workflow node, as one will also obtain the calc nodes directly
   called by that initial workflow).

Besides the starting nodes (pks) and links, the user can also provide the
number of max iterations desired (which by default is None, which means
'until no new nodes are found') and a boolean that indicates if the links
(edges) should be returned.

Additionally, two other interfaces are included for ease of use when
deleting and exporting. These functions only take the starting set of
pks and the rules provided by the user (as 'rule_name_dir' = False/True)
and can automatically check if the rule is toggable, set defaults (using
aiida.common.links.GraphTraversalRules), and also parse the ruleset into
two lists with the links for forward and backward traversal. They will
return a dictionary containing the 'nodes' list, the 'links' list (if
this was requested, else this will contain `None`) and a dict with the
way in which all the rules were applied (using the following format:
'rule_name' = True/False).

Co-Authored-By: Leonid Kahle <leonid.kahle@epfl.ch>
The node deletion function now uses the get_nodes_delete function (with
the traverse_graph underlying interface using AGE as main engine) to
collect the extra nodes that are needed to keep a consistent provenance.
The procedure is not very different than the one that was initially
implemented so no significant performance improvement is expected, but
this is an important first step to homogenize graph traversal throughout
the whole code.
The export function now uses the get_nodes_delete function (with the
traverse_graph underlying interface using AGE as the main engine) to
collect the extra nodes that are needed to keep a consistent provenance.
This is performed, more specifically, by the 'retrieve_linked_nodes'
function. Whereas previously a different query was performed for each
new node added in the previous query step, this new implementation
should do a single new query for all the nodes that were added in the
previous query step. So these changes are not only important as a first
step to homogenize graph traversal throughout the whole code: an
improvement in the export procedure is expected as well.
The graph visualization feature now uses the traverse_graph function
(with AGE as the main engine) to collect the requested nodes to be
visualized. This was implemented in the methods of the graph class:
previously, `recurse_descendants` and `recurse_ancestors` used to
work by calling `add_incoming` and `add_outgoing` many times, which
in turn have to load nodes during the procedure. Now these are all
independent and they all call the traverse_graph function, so the
information is obtained directly from the query projections and no
nodes are loaded. So these changes are not only important as a first
step to homogenize graph traversal throughout the whole code: an
improvement in the visualization procedure is expected as well.
@sphuber sphuber dismissed CasperWA’s stale review January 10, 2020 10:09

Agreed in private that everything was addressed

@ramirezfranciscof ramirezfranciscof merged commit 47cfe34 into aiidateam:develop Jan 10, 2020
@ramirezfranciscof ramirezfranciscof deleted the agepr branch January 13, 2020 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Graph consistent/recursive exploration tool
9 participants