Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix import/export for workchains and workfunctions #1102

Closed
sphuber opened this issue Feb 2, 2018 · 17 comments
Closed

Fix import/export for workchains and workfunctions #1102

sphuber opened this issue Feb 2, 2018 · 17 comments

Comments

@sphuber
Copy link
Contributor

sphuber commented Feb 2, 2018

Currently the export and import do not respect all the properties of WorkCalculation nodes. We will have to make sure that all links are properly exported and imported and that the DbLog entries are also copied over, which are used for the report functionality of the WorkChains

@sphuber
Copy link
Contributor Author

sphuber commented Feb 13, 2018

The entities that are definitely not preserved when exporting/importing a workchain node:

  • The DbLog entries, visible through verdi work report
  • The output nodes of the workchain, through the RETURN links
  • The called nodes of the workchain, through the CALL links

@szoupanos szoupanos self-assigned this Feb 19, 2018
@giovannipizzi
Copy link
Member

A few other things that should be imported/exported are

  • comments of calculations (DbComment)

@szoupanos
Copy link
Contributor

szoupanos commented Feb 19, 2018

From #687 we know that:

  • Data -> Calculation: INPUT
  • Calculation -> Data: CREATE
  • Work -> Data: RETURN
  • Data -> Work: INPUT
  • Work -> Calculation: CALL
  • Code -> Calculation: INPUT

The initial code for descendants was the following
qb = QueryBuilder()
qb.append(Calculation, tag='high_node', filters={'id': {'in': [185]}})
qb.append(Data, output_of='high_node', project=['*'], edge_filters={'type': {'==': LinkType.CREATE.value}})
qb.all()

We need (regarding the links)

  • The output nodes of the workchain, through the RETURN links
  • The called nodes of the workchain, through the CALL links

After discussing with Sebastiaan (@sphuber):

  • We need to follow links recursively. E.g.
    • a workcalculation A that calls another workcalculation B
    • a workcalculation C that calls another calculation D
  • We need the data created by B & D

So we need to be in a loop where we follow CREATE, RETURN and CALL links until the set does not change anymore

@giovannipizzi
Copy link
Member

Just a couple of comments:

  1. the constraint on link types goes in the other direction: e.g. a CREATE is always between a calc and a data, but the reverse is not true (between calc and data there could be a CREATE or a RETURN, for instance). Also, at the moment I think one should "merge" Work into Calculation (it's a subclass). E.g. Work can also CREATE data, WORK can call WORK, ...

  2. in the import, depending on the cases one should look for the ancestors rather than descendants. I.e.:

  • if I export a data node, I also want its parents, recursively, to keep the provenance (following CREATE+INPUT links).
  • On the other hand, I think if I have a calculation, beside exporting its inputs (and possibly recursively up, as discussed in the previous point), I want to see all its outputs (CREATE+RETURN) but only 1 level deep, and typically not go further down.
  • If I have a calculation, I want to follow (recursively) all the CALL links going down (I don't think that by default one also wants to go up, but maybe this is a useful option for users).
  • And as you say, this should be done in a loop (until the set does not change).

For reference, one can reuse probably some of the logic of the 'verdi node delete' implemented recently, that also has a very similar recursive logic: https://github.com/aiidateam/aiida_core/pull/1083/files (even if the rules might be slightly different).

I think it would be also good to divide 'following' rules in three groups, so we make sure we don't forget use cases:

  • compulsory (to ensure consistency etc.), see e.g. above
  • optional for the user (see e.g. following CALL links upwards)
  • forbidden (if any)
  • possibly other links that are possible but not implemented as an option yet

And for the optional ones we have user options.

One think to discuss is: should we allow (with some "big" warning) to export stuff explicitly breaking the provenance? Originally I was against, but there have been a few use cases where this is actually needed... (e.g. exporting data but leaving out some inputs because they have some license; or exporting only the results to have slimmer databases, if I want to study something, and I want to have the same UUIDs, but I don't care having the full provenance in my local DB). In this case, it would be good to still have somewhere a 'note' that the provenance was in some way cut or broken (in verdi node delete, we actually add an entry to the DbLog table:
https://github.com/aiidateam/aiida_core/pull/1083/files#diff-1ddc7a2c3920cfbae75335771b1d0522R164
but probably it is not a good idea for the export, as this will be logged "forever" even if later we import back the provenance. Maybe instead use extras? to discuss

@szoupanos
Copy link
Contributor

Available AiiDA link discussion can be found at #1174

@szoupanos
Copy link
Contributor

szoupanos commented Apr 20, 2018

This is an overview of the export algorithm (that enriches a given set of nodes by the user). It decides which is the final set of nodes and the links to be exported based on the user choices (private nodes, & rules that can be overridden).


Private nodes
The set of private nodes is defined by the user. It can contain Data nodes but also Calculation nodes

Node export rules

Link type Tail & head of the link Link traversal Export direct successor (forward link traversal) / direct predecessor (reverse link traversal) Exceptions Comments
INPUT (Data, Calculation) Forward No User can modify default behaviour By default we don’t care if a Data node candidate for export is used as input to Calculations that are not selected yet for export.
    Reversed Yes Except Data nodes that are marked explicitly private We need the predecessors of a Calculation node. Note that the non-direct predecessors will be found by following either INPUT or CREATE links depending on the type predecessor found in the reverse link traversal.
CREATE (Calculation, Data) Forward Yes   We need the direct Data node successors of the Calculation nodes.
    Reversed Yes ? We need the predecessors of a Data node. Note that the non-direct predecessors will be found by following either INPUT or CREATE links depending on the type predecessor found in the reverse link traversal.
RETURN (Calculation, Data) Forward Yes   The direct successors of a calculation node are exported
    Reversed No   We are not interested in Calculations that just returned a Data node but not created it.
CALL (Calculation [caller], Calculation [called]) Forward Yes   We need to know how the top WorkCalculation obtained its results
    Reversed No User can modify default behaviour  

Node export algorithm

  • There are three sets. The to_be_visited set, the to_be_exported set and the private set
  • Initially the to_be_exported and the to_be_visited sets are populated by the user (by explicitly passing node ids, group of nodes etc). Any node in the private set is not added to the to_be_exported and the to_be_visited sets
  • We repeat the following until the to_be_visited set is empty
    • Remove a Node n from the to_be_visited set
      • For every link tp(n, h), where tp is the type of the link, n is the tail of the link and h is its head, add h to the to_be_exported and to the to_be_visited sets if
        • h is not in the private set
        • tp(n, h) satisfies the aforementioned rules
        • h is not already in the to_be_exported set
      • For every link tp(t, n), where tp is the type of the link, t is the tail of the link and n is its head, add t to the to_be_exported and to the to_be_visited sets if
        • t is not in the private set
        • tp(t, n) satisfies the aforementioned rules
        • t is not already in the to_be_exported set
  • Return the to_be_exported set

Link export rules

For every node in the to_be_exported, export the corresponding links based on the following rules

  • The Calculation nodes should be shielded
Link type Tail & head of the link Link export
INPUT (Data, Calculation) Backward, by the Calculation node / Also forward too by the Data node, if the user modifies the default behaviour
CREATE (Calculation, Data) Forward, by the Calculation node / Backward, by the Data node
RETURN (Calculation, Data) Forward, by the Calculation node
CALL (Calculation [caller], Calculation [called]) Forward, by the Calculation [caller] node / Also backward too by the Calculation [called] node, if the user modifies the default behaviour

At the receiver side:

  • The import is performed only if the links point to nodes that exist at the receiver side.
  • The import file should work in an empty database and it should not crash, when it is not overridden

@giovannipizzi
Copy link
Member

Thanks! A few comments:

  • Create reversed: I think one should stop at private calculations and not export them
  • input reversed: here we have to be a bit careful - I'm not sure it's a good idea to export a calculation directly running out of a private node, as most probably the "private" input data inside the private node will be copied in the raw_input and possibly also in the raw_output depending on the code - moreover, we would store a calculation with a subset of its input nodes. Maybe in this case (with a user option to decide this behavior?) we don't export at all the calculation having private data inputs (and log this to the output)
  • return reversed: maybe there should be an option to follow also this (e.g. in combination with the option of following call/reverse) when the user is explicitly interested in exporting the whole workflow. Maybe following return-reversed can be limited to only those nodes provided explicitly in input by the user, and it is not extended recursively

As a remark, as I'm sure there will be new/slighlty different rules and requirements as soon as new users will use this, we should have each link direction/type with an option, to make it easy to change the rules for different usecases.

Also, in the last sentence, I think it should read "The import file should work in an empty database and it should not crash, when the default link/node selection rules are not overridden"

@szoupanos
Copy link
Contributor

Thanks for the comments!

I think that the exceptions and the private nodes the whole procedure starts to be more complicated than it should be.

I tend to believe that these should be 2 independent issues:
a) rules for graph traversal
b) private nodes that when found in traversal should not be exported

I also tend believe for the (b) that the private node set should not stop the graph exploration procedure (i.e. a) . I think that it should continue and just the nodes in the private node set should not be exported.

I also tend to believe that if we would like not to export some nodes because some others are marked as private (i.e. a calculation that has as input private data), we should just enrich the set of private nodes with more nodes based on these assumptions.

That way we separate the problem in (clearly defined) sub-problems:
(a) graph exploration rules that enrich an initial set of nodes based on graph connectivity
(a.1) how we override these rules based on user desires on graph exploration
(b) private nodes that should not be exported
(b.2) how we enrich the private node set based on how these nodes are connected and our assumptions

Maybe it would be better to discuss face-to-face with some concrete examples.

@szoupanos
Copy link
Contributor

What we discussed & the override rules can be found at https://github.com/szoupanos/aiida_core/tree/issue_1102_b

The hidden nodes have not been implemented yet.
We can check if what we have works as we would like.

@szoupanos
Copy link
Contributor

@ltalirz @giovannipizzi
Did you find the time to test this? Do you continue to experience the problems that you had before?

@ltalirz
Copy link
Member

ltalirz commented May 2, 2018

When I tried to export the DB with 70k CifData, WorkFunction and ParameterData, I got this:

STARTING EXPORT...
STORING DATABASE ENTRIES...
Exporting a total of 209596 db entries, of which 209593 nodes.
STORING NODE ATTRIBUTES...
STORING NODE LINKS...
STORING GROUP ELEMENTS...
STORING DATA...
STORING FILES...
Traceback (most recent call last):
  File "/Users/leopold/Applications/miniconda3/envs/aiida_master/bin/verdi", line 9, in <module>
    sys.exit(run())
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_core/aiida/cmdline/verdilib.py", line 1050, in run
    aiida.cmdline.verdilib.exec_from_cmdline(sys.argv)
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_core/aiida/cmdline/verdilib.py", line 1035, in exec_from_cmdline
    CommandClass.run(*argv[command_position + 1:])
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_core/aiida/cmdline/verdilib.py", line 878, in run
    exec (f, globals_dict)
  File "script_new.py", line 20, in <module>
    export(nodes)
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_core/aiida/orm/importexport.py", line 2472, in export
    export_tree(what, folder=folder, silent=silent, **kwargs)
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_core/aiida/orm/importexport.py", line 2221, in export_tree
    dest_name='.')
  File "/Users/leopold/Personal/Postdoc-MARVEL/repos/aiida/aiida_core/aiida/common/folders.py", line 216, in insert_path
    raise ValueError("insert_path can only insert files or paths, not symlinks or the like")
ValueError: insert_path can only insert files or paths, not symlinks or the like

I'm not sure whether this is a problem of the state of my DB or of the new export. l'll check with @yakutovicha that he tests the export with a completely new DB again.

@szoupanos
Copy link
Contributor

Hello, is there any update on this? Did you have the time to check?

@sphuber
Copy link
Contributor Author

sphuber commented May 7, 2018

I tried to check out your branch and run a simple verdi export create but get the following exception:

Traceback (most recent call last):
  File "/home/sphuber/.virtualenvs/aiida_legacy/bin/verdi", line 11, in <module>
    sys.exit(run())
  File "/home/sphuber/code/aiida/env/legacy/aiida-core/aiida/cmdline/verdilib.py", line 1050, in run
    aiida.cmdline.verdilib.exec_from_cmdline(sys.argv)
  File "/home/sphuber/code/aiida/env/legacy/aiida-core/aiida/cmdline/verdilib.py", line 1035, in exec_from_cmdline
    CommandClass.run(*argv[command_position + 1:])
  File "/home/sphuber/code/aiida/env/legacy/aiida-core/aiida/cmdline/baseclass.py", line 217, in run
    function_to_call(*args[1:])
  File "/home/sphuber/code/aiida/env/legacy/aiida-core/aiida/cmdline/commands/exportfile.py", line 31, in cli
    verdi()
  File "/home/sphuber/.virtualenvs/aiida_legacy/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/sphuber/.virtualenvs/aiida_legacy/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/sphuber/.virtualenvs/aiida_legacy/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sphuber/.virtualenvs/aiida_legacy/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sphuber/.virtualenvs/aiida_legacy/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sphuber/.virtualenvs/aiida_legacy/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/sphuber/code/aiida/env/legacy/aiida-core/aiida/cmdline/commands/exportfile.py", line 126, in create
    outfile=outfile, overwrite=overwrite, **additional_kwargs
  File "/home/sphuber/code/aiida/env/legacy/aiida-core/aiida/orm/importexport.py", line 2438, in export_zip
    export_tree(what, folder=folder, silent=silent, **kwargs)
TypeError: export_tree() got an unexpected keyword argument 'also_calc_outputs'

@szoupanos
Copy link
Contributor

@giovannipizzi The above algorithm expects that the Code is Data instance which is not the case for AiiDA v0.12. I will modify it accordingly and I will change it back when we change the AiiDA hierarchy to what we have agreed (Code is an instance, and subclass, of Data)

@szoupanos
Copy link
Contributor

szoupanos commented Jul 12, 2018

Things go well. I exported the following groups from a test database, I imported them to an empty one and all nodes and links are exported as expected (checked all nodes and all links of the imported nodes of the imported groups)

 PK  GroupName                                                                                NumNodes  User
----  -------------------------------------------------------------------------------------  ----------  -----------------------
 200  structures_final_test_bench_small_v2_insulating_magnetic                                       12  nm@l.h
 201  structures_final_test_bench_small_v2_insulating                                                15  nm@l.h
 202  structures_final_test_bench_small_v2                                                           54  nm@l.h
 203  pw_scf_final_test_bench_small_v2_metallic_magnetic                                             12  nm@l.h
 204  pw_scf_final_test_bench_small_v2_metallic                                                      15  nm@l.h
 205  pw_scf_final_test_bench_small_v2_insulating_magnetic                                           12  nm@l.h
 206  pw_scf_final_test_bench_small_v2_insulating                                                    15  nm@l.h
 207  structures_final_test_bench_small_v2_metallic                                                  15  nm@l.h
 208  structures_final_test_bench_small_v2_metallic_magnetic                                         12  nm@l.h
 209  pw_scf_final_test_bench_small_v2                                                               54  nm@l.h
 211  seekpath_structures_final_test_bench_small_v2                                                  54  am@l.h
 213  workchain_out_1                                                                                54  am@l.h
 233  w90wc_trial_1                                                                                  20  am@l.h
 270  w90_scdm_insulators_v1                                                                          6  am@l.h
 273  w90_scdm_insulators_v2                                                                         15  am@l.h
 278  w90_scdm_insulators_v3                                                                         14  am@l.h
 284  w90_scdm_metals_v1                                                                             12  am@l.h
 294  w90_scdm_insulators_v3_mlwf                                                                    14  am@l.h
 333  test_topowf                                                                                     2  am@l.h
 335  w90_scdm_insulators_v3_mlwf_convtol                                                            13  am@l.h
 338  w90_scdm_insulators_v3_mlwf_convtol_denser                                                      4  am@l.h
 365  w90_scdm_metals_v1_mlwf                                                                        12  am@l.h
 368  w90_scdm_metals_v1_dis                                                                         12  am@l.h
 369  w90_scdm_metals_v1_dis_mlwf                                                                    12  am@l.h
 373  structures_gptest_1                                                                             1  gp@l.h
 374  gptest_out_1                                                                                    2  gp@l.h
 380  w90_scdm_v4_dense_insulating_mlwf                                                              14  gp@l.h
 383  w90_scdm_v4_dense_withconduction_mlwf_insulating                                               10  gp@l.h
 390  gptest_out_w90_1_withconduction_mlwf                                                            1  gp@l.h
 394  w90_scdm_v4_dense_withconduction_mlwf_insulating_projectability090                             10  gp@l.h
 395  w90_scdm_v4_dense_withconduction_mlwf_metallic_projectability090                               12  gp@l.h
 398  w90_scdm_v4_dense_withconduction_mlwf_insulating_projectability095-sigma4                      10  gp@l.h
 399  w90_scdm_v4_dense_withconduction_mlwf_metallic_projectability095-sigma4                        12  gp@l.h
 406  w90_scdm_v4_dense_withconduction_mlwf_insulating_projectability095-sigma4-sigmashift3          14  gp@l.h
 407  w90_scdm_v4_dense_withconduction_mlwf_metallic_projectability095-sigma4-sigmashift3            14  gp@l.h
 414  w90_scdm_v4_dense_withconduction_mlwf_insulating_musigmaauto1-sigmashift3                      14  gp@l.h
 415  w90_scdm_v4_dense_withconduction_mlwf_insulating_musigmaauto1-sigmashift0                      14  gp@l.h
 416  w90_scdm_v4_dense_withconduction_mlwf_metallic_musigmaauto1-sigmashift0                        14  gp@l.h
 417  w90_scdm_v4_dense_withconduction_mlwf_metallic_musigmaauto1-sigmashift3                        14  gp@l.h
 418  gptest_out_w90_1_withconduction_mlwf_musigmaauto1-sigmashift0                                   1  gp@l.h
 419  gptest_out_w90_1_withconduction_mlwf_musigmaauto1-sigmashift3                                   1  gp@l.h
 420  workchain_out_2                                                                                54  am@l.h

I will do some ping-pong exchange of data to see if I get correct results too.

@szoupanos
Copy link
Contributor

The work on the new export algorithm that supports workcalculation/workchain nodes and the respective link rules will be implemented under the ticket #1758

@szoupanos
Copy link
Contributor

The work of the export & import of DbLog entries will be performed at #1759

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants