QueryBuilder: with_ancestors does not work as some might expect (+ speed issues) #3755
I guess the simple solution would be to use
I'd say your example works exactly as expected. Similarly, if you had 2 links connecting two nodes (1 and 2), you'd get node 2 twice when traversing from node 1: the Cartesian product (the join) has 2 solutions, as is the case in the example above. If you want unique solutions, use `qb.distinct().all()`, which asks the backend to return unique solutions.
Hehe, it may do what you expect, but let me assure you that this is not what a new AiiDA user expects when asking for a node X that has ancestor Y. I also believe that finding unique nodes with a given set of ancestors is the most common use case of `with_ancestors`, and I believe we should support it.
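The duplicate-results behaviour can be reproduced in plain Python (a minimal sketch with a hypothetical diamond-shaped graph, no AiiDA needed): a traversal that enumerates paths, like the Cartesian-product join, reports the bottom node once per path, and deduplication is exactly what asking the backend for distinct results does.

```python
# Hypothetical diamond-shaped graph: two paths lead from "top" to "bottom".
edges = {
    "top": ["left", "right"],
    "left": ["bottom"],
    "right": ["bottom"],
    "bottom": [],
}

def descendants_by_path(node):
    """Yield one entry per path, mirroring the Cartesian-product join."""
    for child in edges[node]:
        yield child
        yield from descendants_by_path(child)

duplicated = list(descendants_by_path("top"))
unique = sorted(set(duplicated))  # the analogue of qb.distinct().all()

print(duplicated)  # 'bottom' appears twice, once per path
print(unique)      # each node exactly once
```

Note that deduplicating after the traversal does not make the traversal itself cheaper; it only cleans up the result, which is why the speed issue below is a separate concern.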
Thanks, I wasn't aware of this. In our example, we are querying a DB of 500 materials for a single property each:

```python
qb = QueryBuilder()
qb.append(Group, filters={"label": {"like": "curated-cof_%"}}, tag="group")
qb.append(CifData, filters={'extras.curated-cof_tag': 'opt_cif_ddec'}, with_group="group", tag="cif_ddec")
qb.append(Dict, filters={'extras.curated-cof_tag': 'pe_out'}, with_ancestors='cif_ddec')
qb.distinct().all()
```

This query takes 30 s, which is about 400 times slower than getting the same information via an alternative method where we simply put the corresponding Dict in a Group and look it up via the group (73.5 ms). @giovannipizzi: before we suggest this to users, I guess we'll need another implementation that stops after having found the first path.
After further testing we found a significant discrepancy in the timing of these two equivalent queries:

and

They both return the material's label and the property of interest, e.g.,

but while the first takes 27 ms, the second takes as much as 144 s (ca. 4 orders of magnitude more!). Mentioning @giovannipizzi for discussion.
With a more realistic test on 500 materials (ca. 1000 nodes per material), the parsing of a single property takes 10x longer when using the `with_ancestors` query instead of picking the property nodes directly from a group (the method we are currently using for https://www.materialscloud.org/discover/curated-cofs#mcloudHeader): 5 s vs 0.5 s. This means that, for real use, one would have to consider some caching before relying on the `with_ancestors` query.
In terms of functionality, I agree with @lekah that the current behaviour is the intended one (find all possible paths that connect the two nodes). On the other hand:

For both of these: @danieleongari, could you do a

Great if you could report here, and possibly we can then meet in person if we need to do more debugging. Finally, I feel we are going in the right direction in terms of speed improvement, and we can still gain something with further improvements, after which the speed should be more than good enough for the website.
For the slowness: as @danieleongari reports, it is very weird that the slowness does not come from the ancestors part, but from its combination with a group. This is a bit surprising to me.

The explanation is the following: the `with_ancestors`/`with_descendants` query works with a common table expression. Essentially, it needs to construct in memory, on the fly, a table with the necessary paths when you ask for it, since storing all these paths is prohibitive (see #258). The question is then: how big is this in-memory table storing all the paths? The worst-case scenario is all possible paths in the AiiDA graph, which is probably the case in the second, slow example. (The PSQL engine does not by itself figure out that this is much more expensive than first figuring out which nodes are necessary.)

As to why the first example is so fast: if the user provides specific filters on the origin node(s) of the paths, these are incorporated into the common table expression (explicitly by us, see `QueryBuilder._join_descendants_recursive`). The engine is therefore told to construct this table only for the subset of origin nodes given by the query. This difference in speed is thus (for me) expected behavior, although maybe not well documented.

As for fixing these speed issues: we were aware of them and couldn't figure out a way to tell the engine to perform the query operations in a specific order. If someone manages to force the engine to create the paths "table" last, and only for the nodes that are necessary, that would solve the issue (but maybe create other issues). It's not easy though...
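As an illustration of the argument above, here is a minimal plain-Python sketch (hypothetical link table, not AiiDA code) of what the recursive CTE computes: seeding the closure only with the filtered origin nodes keeps the in-memory "paths table" small, while seeding it with every node materialises the full closure.

```python
# Plain-Python model of the recursive CTE (assumption: the provenance graph
# is a DAG, as in AiiDA, so the loop terminates).
def ancestor_closure(links, origins):
    """links: (input_id, output_id) pairs; returns (ancestor, descendant) pairs."""
    # Non-recursive term: seed only with links leaving the filtered origin nodes.
    frontier = [(a, b) for (a, b) in links if a in origins]
    closure = list(frontier)
    while frontier:
        # Recursive term (the UNION ALL step): extend every known path by one link.
        frontier = [(anc, out) for (anc, desc) in frontier
                    for (inp, out) in links if inp == desc]
        closure.extend(frontier)
    return closure

links = [(1, 2), (2, 3), (4, 3)]  # hypothetical link table
print(ancestor_closure(links, origins={1}))        # small table: paths from node 1 only
print(ancestor_closure(links, origins={1, 2, 4}))  # full table: all paths in the graph
```

Filtering inside the seed (the `origins` check) plays the role of the origin-node filter that ends up inside the CTE in the fast query below.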
Indeed, I had forgotten what @lekah writes:
(link to the function)
This is also confirmed by the following, where I paste the actual SQL queries (different PKs):

```python
%%timeit
qb = QueryBuilder()
qb.append(Node, project=['label'], filters={'id': 638}, tag='cif')
qb.append(Node, project=['id'], with_ancestors='cif')
qb.distinct().all()
```

output of

```sql
WITH RECURSIVE anon_2(ancestor_id, descendant_id, depth) AS
(SELECT db_dblink_1.input_id AS ancestor_id, db_dblink_1.output_id AS descendant_id, CAST(0 AS INTEGER) AS depth
FROM db_dbnode AS db_dbnode_3 JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_3.id
WHERE CAST(db_dbnode_3.node_type AS VARCHAR) LIKE '%%' AND db_dbnode_3.id = 638 AND db_dblink_1.type IN ('create', 'input_calc')
UNION ALL
SELECT anon_2.ancestor_id AS ancestor_id, db_dblink_2.output_id AS descendant_id, anon_2.depth + CAST(1 AS INTEGER) AS current_depth
FROM anon_2 JOIN db_dblink AS db_dblink_2 ON db_dblink_2.input_id = anon_2.descendant_id
WHERE db_dblink_2.type IN ('create', 'input_calc'))
SELECT DISTINCT db_dbnode_1.label, db_dbnode_2.id
FROM db_dbnode AS db_dbnode_1 JOIN anon_2 AS anon_1 ON anon_1.ancestor_id = db_dbnode_1.id JOIN db_dbnode AS db_dbnode_2 ON anon_1.descendant_id = db_dbnode_2.id
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE '%%' AND db_dbnode_1.id = 638 AND CAST(db_dbnode_2.node_type AS VARCHAR) LIKE '%%'
```

With this instead:

```python
%%timeit
qb = QueryBuilder()
qb.append(Group, filters={'id': '25'}, tag='group')
qb.append(Node, project=['label'], with_group='group', tag='cif')
qb.append(Node, project=['id'], with_ancestors='cif')
qb.distinct().all()
```

we get

```sql
WITH RECURSIVE anon_2(ancestor_id, descendant_id, depth) AS
(SELECT db_dblink_1.input_id AS ancestor_id, db_dblink_1.output_id AS descendant_id, CAST(0 AS INTEGER) AS depth
FROM db_dbnode AS db_dbnode_3 JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_3.id
WHERE CAST(db_dbnode_3.node_type AS VARCHAR) LIKE '%%' AND db_dblink_1.type IN ('create', 'input_calc')
UNION ALL
SELECT anon_2.ancestor_id AS ancestor_id, db_dblink_2.output_id AS descendant_id, anon_2.depth + CAST(1 AS INTEGER) AS current_depth
FROM anon_2 JOIN db_dblink AS db_dblink_2 ON db_dblink_2.input_id = anon_2.descendant_id
WHERE db_dblink_2.type IN ('create', 'input_calc'))
SELECT DISTINCT db_dbnode_1.label, db_dbnode_2.id
FROM db_dbgroup AS db_dbgroup_1 JOIN db_dbgroup_dbnodes AS db_dbgroup_dbnodes_1 ON db_dbgroup_dbnodes_1.dbgroup_id = db_dbgroup_1.id JOIN db_dbnode AS db_dbnode_1 ON db_dbnode_1.id = db_dbgroup_dbnodes_1.dbnode_id JOIN anon_2 AS anon_1 ON anon_1.ancestor_id = db_dbnode_1.id JOIN db_dbnode AS db_dbnode_2 ON anon_1.descendant_id = db_dbnode_2.id
WHERE db_dbgroup_1.id = '25' AND CAST(db_dbnode_1.node_type AS VARCHAR) LIKE '%%' AND CAST(db_dbnode_2.node_type AS VARCHAR) LIKE '%%'
```

Note the difference: in the first one, the filter `db_dbnode_3.id = 638` appears inside the recursive common table expression, while in the second the CTE carries no filter on the origin nodes (the group filter only appears in the outer SELECT).

So, I think there are 2 aspects, independent and both relevant:
As a note, if you try this on a small, disconnected DB the timings will be very similar (e.g. I get ~20 ms for both on a DB of only 2000 nodes).
One comment for @danieleongari: for performance, it would be interesting to check if you can rewrite your query using

A second comment, also for @danieleongari: I was looking at how to avoid getting duplicates. I simplified the query a bit (removing the last part, which is not really of interest for this discussion), and I tried to remove the depth and unify results. Here is the result: I changed SELECT to SELECT DISTINCT (even if this is probably not needed) and, most importantly, I replaced UNION ALL with UNION, which removes duplicates. One needs to replace the PK of the node in the query, but it would be interesting to know if this is (much?) faster than the query with the depth in your case (note that I also removed all filters here, so the query below may need some adaptation).

EDIT: next optimisation strategy: see if it's possible in some way to 'stop' the query as soon as a relevant node is found? Not sure it's possible, though.
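The 'stop early' idea can be sketched in plain Python (hypothetical graph, not AiiDA code): a breadth-first search that returns at the first hit never enumerates the remaining paths, unlike the full recursive closure that the CTE builds.

```python
from collections import deque

def has_descendant(edges, start, target):
    """Breadth-first search that stops at the first hit, visiting each node once."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child == target:
                return True   # early exit: no need to explore further paths
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False

# Hypothetical diamond graph: two paths from 1 to 4, but node 4 is visited once.
edges = {1: [2, 3], 2: [4], 3: [4]}
print(has_descendant(edges, 1, 4))
```

Whether standard SQL can be made to behave like this is exactly the open question: the recursive CTE has no equivalent of the early `return`.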
Just for reference, we had a discussion with @danieleongari, @ramirezfranciscof & @giovannipizzi; notes can be found here

Recommendations:

Todo
I think I also found a performance issue of

If I have a group of nodes and I query their children using:

it is extremely slow, and almost never finishes. Here, the structure data can have multiple relaxed (intermediate) structures. On the other hand, searching in the reverse direction seems to work better:

and actually can finish. There are about 40 nodes and 2000 children structures. On my computer, it took about 16 s. So far the best result is achieved by querying each

as it takes less than a second to finish and gives the same results. I suppose the difference lies in the internals of how the SQL queries are constructed, which can be difficult to change.
Yes, unfortunately with 'standard' PostgreSQL syntax, and without injecting Postgres functions, it's hard to make it very efficient (or at least I didn't manage, and I'm getting convinced it's not possible). Could you try to convert your last query to use the AiiDA Graph Explorer (AGE)? Maybe @ramirezfranciscof can help you.
Consider the following simple graph
With the following query
I would expect exactly one result (the node at the bottom).
However, I get this:
The reason is that there are two paths to arrive at the final node (one through the left, and one through the right). Discovered with @danieleongari
Mentioning @giovannipizzi and @lekah for comment.
P.S. @giovannipizzi We did test the `with_ancestors` approach for the CURATED COFs database and, with the current implementation, it is very slow. As you can imagine, you can easily get back the same node hundreds of times.

P.P.S. Python script to create the graph