
Initial implementation of the BaseRestartWorkChain #79

Conversation

JPchico
Collaborator

@JPchico JPchico commented May 25, 2023

Initial implementation of the BaseLammpsWorkChain; it is based on the BaseRestartWorkChain.

The idea of this workchain is to handle the most basic calculations and to serve as a building block for more specific workchains handling minimization, MD, and other property-specific runs.

@JPchico JPchico requested review from chrisjsewell and sphuber May 25, 2023 14:07
@JPchico JPchico self-assigned this May 25, 2023
@JPchico
Collaborator Author

JPchico commented May 25, 2023

Hi @chrisjsewell and @sphuber, this is an initial implementation of the BaseRestartWorkChain for a LammpsBaseCalculation; the tests and the docs are missing. It is just a basic implementation to deal with very basic restarts for now. But we can use it as a place to discuss what needs to be added and/or fixed.

@codecov

codecov bot commented May 25, 2023

Codecov Report

Merging #79 (85c2b62) into develop (22cf69c) will decrease coverage by 2.68%.
The diff coverage is 82.58%.

@@             Coverage Diff             @@
##           develop      #79      +/-   ##
===========================================
- Coverage    88.65%   85.98%   -2.68%     
===========================================
  Files           19       17       -2     
  Lines         1455     1356      -99     
===========================================
- Hits          1290     1166     -124     
- Misses         165      190      +25     
Flag      Coverage Δ
pytests   85.98% <82.58%> (-2.68%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                      Coverage Δ
aiida_lammps/parsers/inputfile.py                   87.07% <75.00%> (-0.16%) ⬇️
aiida_lammps/workflows/base.py                      75.24% <75.24%> (ø)
aiida_lammps/parsers/base.py                        85.00% <83.67%> (-0.23%) ⬇️
aiida_lammps/calculations/base.py                   96.62% <96.66%> (+0.35%) ⬆️
aiida_lammps/parsers/parse_raw/lammps_output.py     90.66% <100.00%> (+1.60%) ⬆️
aiida_lammps/utils.py                               83.33% <100.00%> (+83.33%) ⬆️

... and 1 file with indirect coverage changes

Member

@sphuber sphuber left a comment

Thanks @JPchico. Gave it a first read and added some comments. Haven't looked at the actual restart logic because I am not too familiar with how this is done in LAMMPS anyway.

"""Define the process specification."""
# yapf: disable
super().define(spec)
spec.expose_inputs(LammpsBaseCalculation)
Member

Should definitely use a namespace here, because now you are replacing the metadata inputs of the workchain with those of the LammpsBaseCalculation.

Suggested change
-spec.expose_inputs(LammpsBaseCalculation)
+spec.expose_inputs(LammpsBaseCalculation, namespace='lammps')
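With the namespace in place, callers nest the calculation inputs under lammps when launching. A minimal sketch, assuming the workchain class is named LammpsBaseWorkChain and code/structure are existing nodes:

    from aiida.engine import submit

    builder = LammpsBaseWorkChain.get_builder()
    builder.lammps.code = code            # inputs exposed from LammpsBaseCalculation
    builder.lammps.structure = structure  # now live under the `lammps` namespace
    submit(builder)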

# yapf: disable
super().define(spec)
spec.expose_inputs(LammpsBaseCalculation)
spec.input(
Member

Why are you redefining this input? It is already defined on the base class. Same goes for the clean_workdir.

Collaborator Author

Good point, I thought I still needed to make them accessible at this level; at least that is how it was done in aiida-vasp. But maybe that is because that restart workchain was written before the BaseRestartWorkChain was officially out.

Member

aiida-vasp has indeed been around for a while and it may be doing some things that are no longer optimal. Frankly, for a recent example of a BaseRestartWorkChain implementation, I would recommend having a look at this one: https://github.com/microsoft/aiida-pyscf/blob/main/src/aiida_pyscf/workflows/base.py

That should follow all latest best practices and should provide a good example I think.

""",
)
spec.input(
"verbosity",
Member

I don't think I see this used anywhere. Anyway, I think it might be better to just use the logger with the various levels that already exist. You can always use self.logger.warning or self.logger.info to log messages at various levels and then the user can control the level with the global logging.aiida_loglevel flags.

Collaborator Author

Nice, I did not know about this; in aiida-vasp there is a flag that can be used to decide whether some messages should be printed or not via self.report.

Does the logger work in a similar way?

I know it does nothing right now, but I wanted to start planning for it. If it can be done via the logger, that is much better, as it does not generate more nodes that one does not really need.

Member

self.report simply forwards to self.logger.report. It logs a message on the REPORT level, which is a level AiiDA defines in between INFO and WARNING. If you want to log verbose information that is not shown by default, I recommend you simply use self.logger.info(message). This will not show up by default, but the user can set verdi config set logging.aiida_loglevel INFO and then the messages will show. This controls logging for the entire application. I think this is better than each workflow adding custom solutions.
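For illustration, a minimal sketch of the two levels inside a workchain step (the method name run_process is hypothetical):

    def run_process(self):
        # Shown by default: REPORT sits between INFO and WARNING.
        self.report("submitting a new LammpsBaseCalculation")
        # Hidden by default; shown after `verdi config set logging.aiida_loglevel INFO`.
        self.logger.info("inputs for this iteration: %s", self.ctx.inputs)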

Collaborator Author

Fantastic, I was not aware of it, to be perfectly honest. Will also forward this to aiida-vasp. I think this is nice, since it is much better if we do not need to store those nodes.

@JPchico
Collaborator Author

JPchico commented May 26, 2023

I was discussing with my colleagues what to do when a minimization does not reach the tolerances within the maximum number of iterations. It seems that what one does is pretty much the same as when one runs out of walltime: take the restartfile if it exists and use it as the initial configuration, otherwise use the last step of the trajectory.

As this is exactly the same, I'll see to it that the same restart strategy is used, unless you have some idea of what could be done differently.

@JPchico
Collaborator Author

JPchico commented May 31, 2023

Hi @sphuber, I have added a couple of tests based on what is done in the quantum-espresso plugin. I was thinking that maybe one could add a couple of tests for restarting from a stored restartfile, or from a file in the remote.

The one with the stored restartfile should be possible if one can create a file using the method that creates a file from a string.

The file in the remote might be harder, as one would have to create a file in the remote folder and that is not a good idea.

@JPchico JPchico marked this pull request as ready for review May 31, 2023 16:12
@JPchico
Collaborator Author

JPchico commented Jun 1, 2023

I have now also added a test that checks that, if a restartfile is found, it is used when restarting, and another one checking that, if no restartfile is found and the trajectory is stored, the trajectory is used instead.

The only case that is not covered is using the remote, since that one requires a restartfile to be present; however, it indirectly checks that that option is not supposed to be used if the file is not found, so that is something, I guess.


@JPchico JPchico requested a review from sphuber June 1, 2023 18:23
@JPchico
Collaborator Author

JPchico commented Jun 5, 2023

ping @sphuber and @chrisjsewell . If there are no objections I'll merge this today.

@@ -93,39 +93,43 @@ def define(cls, spec):
         spec.input(
             "metadata.options.input_filename",
             valid_type=str,
-            default=cls._INPUT_FILENAME,
+            default=cls._DEFAULT_VARIABLES.get("input_filename"),
Member

why use .get("input_filename") instead of just ["input_filename"]?

Collaborator Author

No reason, to be honest; it can easily be changed to normal subscript access.

Comment on lines +78 to +75
if "store_restart" in self.ctx.inputs:
self.ctx.inputs.settings.store_restart = self.ctx.inputs.store_restart.value
Member

This belongs more in the setup method as it is not really validating anything

Collaborator Author

Sure thing, will move it there.

Comment on lines 81 to 88
if "script" not in self.ctx.inputs and any(
key not in self.ctx.inputs
for key in ["parameters", "potential", "structure"]
):
raise InputValidationError(
"Unless `script` is specified the inputs `structure`, "
"`potential` and `parameters` have to be specified."
)
Member

This validation is already done by LammpsBaseCalculation, is it not? In that case you don't have to do it here. If you expose LammpsBaseCalculation, its validators are called when you launch the LammpsBaseWorkChain.

Collaborator Author

Aha, okay, so when exposing the inputs the validators are also called? That means this is indeed unnecessary.

Member

@sphuber sphuber Jun 5, 2023

Exactly. This was necessary, because otherwise the exposing functionality as a way to compose workchains would be pretty much useless, since having to copy all validation each time would be untenable. Note that there are some tricky caveats. If some process defines a validator that checks the validity of the values of ports a and b, and another process exposes that subprocess but excludes port b, the validator will no longer work. It is possible to write validators in such a way that they guard against this and won't fail but just skip the validation, but it is not immediately obvious to users.
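A sketch of such a defensive validator (the ports a and b are hypothetical; the second argument is the port namespace being validated):

    @classmethod
    def validate_inputs(cls, value, port_namespace) -> str | None:
        """Validate the top-level namespace, tolerating excluded ports."""
        # If an exposing workchain excluded port `b`, skip the cross-port check.
        if 'b' not in port_namespace:
            return None
        if 'a' in value and 'b' in value and value['a'].value > value['b'].value:
            return '`a` may not be larger than `b`.'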

"`potential` and `parameters` have to be specified."
)

if "parameters" in self.ctx.inputs:
Member

The rest of this validation should really be done in an actual validator set on the relevant input port. The advantage is that this way the validation is run as soon as the process is submitted, and if the input is invalid, the nodes won't even be created. With the current approach, the process is always submitted, the node is created, and the process starts running, only to except in the first steps.

For example the first one, which is validating the parameters input, should really be done in the LammpsBaseCalculation class.

class LammpsBaseCalculation(CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.input('parameter', validator=cls.validate_parameters)

    @classmethod
    def validate_parameters(cls, value, ctx) -> str | None:
        # validate here, value contains the value passed to the port.
        # If there is a problem, return the error message as a string.

The ctx variable will have the Port or PortNamespace that is being validated. This is useful if you need to validate a PortNamespace, for example the top-level inputs. You can use that to access multiple port values in case the validation depends on the values of multiple ports.

Collaborator Author

Sure thing, will move those to the calculation. I was not aware of that difference; that is nice to know.

Comment on lines 310 to 312
if calculation.exit_status == 401:
    self.report("Energy not converged during minimization, attempting restart")
if calculation.exit_status == 402:
Member

Probably best to use LammpsBaseCalculation.exit_codes.ERROR_FORCE_NOT_CONVERGED.status instead of hard-coding the numbers, just in case they were ever to change.
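Applied to the snippet above, that would look something like this (ERROR_ENERGY_NOT_CONVERGED and ERROR_FORCE_NOT_CONVERGED are assumed labels for exit codes 401 and 402):

    if calculation.exit_status == LammpsBaseCalculation.exit_codes.ERROR_ENERGY_NOT_CONVERGED.status:
        self.report("Energy not converged during minimization, attempting restart")
    if calculation.exit_status == LammpsBaseCalculation.exit_codes.ERROR_FORCE_NOT_CONVERGED.status:
        self.report("Force not converged during minimization, attempting restart")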

Collaborator Author

Sure thing, will change it to that.


restartfiles = [
    entry
    for entry in calculation.outputs.remote_folder.listdir()
Member

There are two potential problems with using RemoteData.listdir, as it will open a transport to the remote computer:

  1. The opening of the transport can fail (for example connection problems) and the exceptions are not caught here, which therefore will except the entire workchain, making this approach very vulnerable to connection problems.
  2. The opening of the transport is done directly and not through the transport pool that is maintained by the engine for CalcJobs. This means that each workchain will open its own connection. If you run many workchains, that means you can hit the remote server too often and get banned.

Unfortunately, there is no API that allows reusing, here in WorkChains, all the engine's safety mechanisms surrounding transports that are used for CalcJobs. If possible, I would strongly recommend not using this approach, but if there is no other option, then we should maybe document this so that people are aware of these pitfalls.

Collaborator Author

@JPchico JPchico Jun 5, 2023

Yes, this is a tricky one.
If you look in the calculation, I do a similar check:

        # check if there is a parent folder to restart the simulation from a previous run
        if "parent_folder" in self.inputs:
            # Check if one should do symlinks or if one should copy the files
            # By default symlinks are used as the file can be quite large
            symlink = settings.pop("parent_folder_symlink", self._default_symlink_usage)
            # Find the name of the previous restartfile, if none is given the default one is assumed
            # Setting the name here will mean that if the input file is generated from the parameters
            # that this name will be used
            _read_restart_filename = settings.pop(
                "previous_restartfile", self._DEFAULT_VARIABLES.get("restart_filename")
            )

            if _read_restart_filename not in self.inputs.parent_folder.listdir():
                raise exceptions.InputValidationError(
                    f'The name "{_read_restart_filename}" for the restartfile is not present in the '
                    f'remote folder "{self.inputs.parent_folder.uuid}"'
                )

I have no idea how else to check if the file is present in the remote, because that is basically what this is trying to do: check if the file exists in the parent folder. In the case of the RestartWorkChain it is of course a bit trickier, since we also need to find the name of the restartfile. This is necessary because you can write the restartfile at regular intervals during your calculation, which produces files that look like {BASE_RESTARTFILE_NAME}.{step}. Hence you do not know exactly how your restartfile will be named, and you need to check.

The only way I can possibly think of solving this is to store this information somewhere during the parsing of the calculation, since if you give the instruction that those intermediate restartfiles are printed, they are placed in the retrieve_temporary_list. Then you could find out the name of the restartfile, if present, and add it somewhere in the parsing output. Maybe in the results dictionary? I think that one should add a section there with WARNINGS and so on, so maybe there?

What do you think @sphuber? That would mean that we know the name; we would however not check for its existence. But maybe that is an acceptable compromise.
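As a side note, once the filenames are known, picking the newest intermediate restartfile is straightforward. A sketch assuming the {BASE_RESTARTFILE_NAME}.{step} pattern described above:

    def latest_restartfile(filenames, basename):
        """Return the restartfile with the highest step number, or None."""
        prefix = f"{basename}."
        steps = [
            (int(name[len(prefix):]), name)
            for name in filenames
            if name.startswith(prefix) and name[len(prefix):].isdigit()
        ]
        return max(steps)[1] if steps else None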

@JPchico
Collaborator Author

JPchico commented Jun 5, 2023

I have removed the calls to listdir in the workchain; I only keep one in the CalcJob, which checks if the file actually exists. Otherwise it parses the filename of the restart and adds it to the results; perhaps that is sufficient. Would that be okay?

I also added the validation as suggested in the actual input ports and added a couple of more tests.

@JPchico JPchico requested a review from sphuber June 7, 2023 05:32
@JPchico JPchico force-pushed the 39-sparkles-add-workflows-for-basic-types-of-runs branch from 70130b7 to b581598 Compare June 7, 2023 06:10
@JPchico
Collaborator Author

JPchico commented Jun 7, 2023

@sphuber I have rebased this PR to try to make sure that your changes to the base parser are taken into account.

I can remove the inputs and parsing of the raw parser that were previously in the base calcjob, and the same for the base workchain. Or I can do it in another PR, not sure what is cleaner.

I think I have addressed most of your concerns; as I mentioned, the only thing that remains is the checking of the parent folder in the calculation. That can be removed if it is a big problem, but then we will need to add a warning to the documentation explaining the issue.

@sphuber
Member

sphuber commented Jun 7, 2023

> Or I can do it in another PR, not sure what is cleaner.

Cleanest would be to do that first in a separate PR. And then update this one so it no longer assumes those inputs. But feel free to do whatever you prefer.

> I think I have addressed most of your concerns; as I mentioned, the only thing that remains is the checking of the parent folder in the calculation. That can be removed if it is a big problem, but then we will need to add a warning to the documentation explaining the issue.

I think you can put it in for now, I don't have the time to dive more deeply into how LAMMPS restart functionality works in order to try and come up with another method that doesn't rely on having to use the transport. We can always come back to it later. No need to hold up this PR.

@JPchico
Collaborator Author

JPchico commented Jun 7, 2023

> Cleanest would be to do that first in a separate PR. And then update this one so it no longer assumes those inputs. But feel free to do whatever you prefer.

Can do that, it should not be too much of a hassle.

> I think you can put it in for now, I don't have the time to dive more deeply into how LAMMPS restart functionality works in order to try and come up with another method that doesn't rely on having to use the transport. We can always come back to it later. No need to hold up this PR.

Yeah, the lammps restart is kinda tricky, as it uses a binary file that is generated in one of two ways: either whenever you call the command (usually after your run loop) or at regular intervals during the run loop.

This file can then be read in by another simulation and it will take the structure, forces, and other quantities that were being calculated at run time and initialize the simulation with them.

As the file can be very large, we give two choices: either you store it in the database/repo, or you can use the RemoteData from a previous calculation. Checking for the first case is pretty simple; it is just a SinglefileData that is passed as an input (we do not do any checking on the validity of this file, since it is a binary file that depends on the version of lammps, compiler, etc.). With the remote, the only thing that is done now is to check if the expected file exists in the remote, and that is it. Maybe it is overkill and it should be the user's responsibility to make sure that the file exists in the node that they are passing.

@sphuber
Member

sphuber commented Jun 7, 2023

> With the remote, the only thing that is done now is to check if the expected file exists in the remote, and that is it. Maybe it is overkill and it should be the user's responsibility to make sure that the file exists in the node that they are passing.

If that is really the only thing, then I would indeed recommend that approach. Simply allow passing a RemoteData as parent calculation to LammpsBaseCalculation, have that symlink the restart file (which will be ignored by the engine, logging a warning, if the file doesn't exist) and let the calculation fail. The parser should then simply detect this specific error and return an exit code indicating that the remote didn't contain the expected file. I think this is fair enough and gets rid of the problems with having to open transports, which is really not ideal.
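A sketch of what that detection could look like on the parser side (the exit-code label and the exact LAMMPS error string are assumptions):

    def parse(self, **kwargs):
        """Check the stdout for a missing-restartfile error before regular parsing."""
        content = self.retrieved.get_object_content(self.node.get_option('output_filename'))
        if 'Cannot open restart file' in content:
            return self.exit_codes.ERROR_RESTART_FILE_MISSING
        # ... regular parsing continues here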

@JPchico
Collaborator Author

JPchico commented Jun 7, 2023

> If that is really the only thing, then I would indeed recommend that approach. Simply allow passing a RemoteData as parent calculation to LammpsBaseCalculation, have that symlink the restart file (which will be ignored by the engine, logging a warning, if the file doesn't exist) and let the calculation fail.

This we do: we symlink the file unless explicitly asked to copy it. Basically, I looked up how the QE plugin does it and did it the same way.

> The parser should then simply detect this specific error and return an exit code indicating that the remote didn't contain the expected file. I think this is fair enough and gets rid of the problems with having to open transports, which is really not ideal.

Sure, I think that lammps itself will write that error in the log. I will try to produce the error and then just add an exit code for it in the parser.

Jonathan Chico added 7 commits June 7, 2023 14:49
Added special entries in the parsing for the minimization stop criterion, energy and force threshold values.
Changing how the reference values are stored to make use of the pytest-regression package.
… This should help for some of the older versions of lammps.
Jonathan Chico added 3 commits June 7, 2023 15:09
…s arising from listdir.

Setting the validation of the parameters in the input ports and moving it out of the input generation.

Adding test for the validation of the parameters in the calculation generation.
@JPchico JPchico force-pushed the 39-sparkles-add-workflows-for-basic-types-of-runs branch from b581598 to 635df61 Compare June 7, 2023 13:09
@JPchico
Collaborator Author

JPchico commented Jun 7, 2023

Funny thing: it turns out that errors are not printed to the log but to the screen; we are storing that in the file lammps_output. It is basically the same as the log but with the errors and so on.

It is a bit annoying, since I'm unsure whether one needs to parse both files. For the time being I'll add a minimal parser of the lammps_output that checks for errors and warnings.

If any of you has an idea on whether one could consolidate the log parsing to use this file instead, that would be great. I honestly have no idea.

@sphuber
Member

sphuber commented Jun 7, 2023

Looking at the plugin, it tells AiiDA to redirect all stdout to lammps_output and then manually adds the -log log.lammps. If you are saying that the content of lammps_output is simply the content of log.lammps plus the errors, then I think we can simply get rid of the log.lammps file and rely only on stdout. Right now we are retrieving the same content twice. You can then have a single parser that just parses lammps_output (which, btw, I would just keep as aiida.out, which is the default and used across other plugins) for relevant data and errors.
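For reference, a minimal sketch of how the single-file redirection would be wired up in the CalcJob (names are illustrative, not the plugin's actual code):

    from aiida.common.datastructures import CalcInfo, CodeInfo

    def prepare_for_submission(self, folder):
        codeinfo = CodeInfo()
        codeinfo.code_uuid = self.inputs.code.uuid
        codeinfo.cmdline_params = ['-in', self.metadata.options.input_filename]
        # Redirect all stdout to a single file instead of also writing log.lammps.
        codeinfo.stdout_name = self.metadata.options.output_filename  # e.g. 'aiida.out'

        calcinfo = CalcInfo()
        calcinfo.codes_info = [codeinfo]
        calcinfo.retrieve_list = [codeinfo.stdout_name]
        return calcinfo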

@JPchico
Collaborator Author

JPchico commented Jun 7, 2023

> Looking at the plugin, it tells AiiDA to redirect all stdout to lammps_output and then manually adds the -log log.lammps. If you are saying that the content of lammps_output is simply the content of log.lammps plus the errors, then I think we can simply get rid of the log.lammps file and rely only on stdout. Right now we are retrieving the same content twice. You can then have a single parser that just parses lammps_output (which, btw, I would just keep as aiida.out, which is the default and used across other plugins) for relevant data and errors.

Yes, I think that is indeed the case. I'll double check a bit more and add some tests for the warnings and errors parsing. If that is the case, then one can consolidate. I'm just unsure, since lammps has so many commands affecting the output that I'm a bit confused.

…different depending on the lammps version.
@JPchico
Collaborator Author

JPchico commented Jun 8, 2023

Looking at the LAMMPS documentation, I think it is indeed the case that the screen and log files are basically the same:
https://docs.lammps.org/Run_output.html

The only case I can see where the screen and log files will differ is that you can decide in the script to close the current log file and open a new one. I can see this being useful for very long simulations, but right now those cases are not even possible. So I think that the best solution is to just retrieve and parse the screen output and ignore the log file.

If for some reason someone wants to add the logfile, it should be possible to indicate that it should be retrieved at submission time via the additional_retrieve_list, as in the sketch below.
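For example (a sketch using aiida-core's standard option name; the builder namespace follows the exposed inputs discussed earlier):

    builder.lammps.metadata.options.additional_retrieve_list = ['log.lammps']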

A question though, does one need to pass the additional_retrieve_list inside the CalcJob to tell the engine to actually retrieve it? Or is it done automatically if the option is passed?

I'll make a different PR tackling the changes to the parsing.

@JPchico
Collaborator Author

JPchico commented Jun 9, 2023

So I have now fixed this to take into account the parsing of the output file. I removed the listdir checks, and if a lammps error is detected the parser will fail with an exit code referring to the error raised by lammps.

@JPchico
Collaborator Author

JPchico commented Jun 11, 2023

@sphuber, @chrisjsewell any other comment that should be fixed? Or could I just merge this?

Member

@sphuber sphuber left a comment

Thanks @JPchico. Noticed one minor thing; feel free to fix it or not, and then merge this.

Comment on lines 44 to 45
for index, line in enumerate(data):
    line = line.strip()
Member

This is not doing anything really

Collaborator Author

Aha, yes, I think that is a remnant of a conflict resolution during a merge. Well spotted. Removing it now.

@JPchico JPchico merged commit 5f56b29 into aiidaplugins:develop Jun 12, 2023
@JPchico JPchico deleted the 39-sparkles-add-workflows-for-basic-types-of-runs branch June 12, 2023 05:44