Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SandboxFolder: remove option to create in profile repository #5496

Merged
merged 2 commits into from
May 13, 2022

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Apr 23, 2022

Fixes #5494

The SandboxFolder by default would create the temporary directory in
the sandbox subfolder of the repository directory, defined by the
repository_uri key in the profile configuration.

However, this folder path should mainly serve to designate the location
of the file repository of the storage backend. So far the sandbox folder
hijacking the same key wasn't a problem as there was only one storage
backend implementation. However, when we allow other storage backend
implementations, having to keep this config key just for the sandbox
folder doesn't make sense. It would be more logical to always create the
sandbox folders in the temporary directory created through the builtin
method tempfile.mkdtemp.

@sphuber
Copy link
Contributor Author

sphuber commented Apr 25, 2022

@giovannipizzi do you maybe still know what the original reasoning was to having the sandbox folder in a subpath of repository_uri as opposed in the temp directory of the OS? Was this out of concern for guaranteeing writing rights? Or maybe for backup reasons? I don't see why you would want to backup the contents of the sandbox folders though.

@sphuber sphuber force-pushed the fix/5494/repository-uri branch 2 times, most recently from d835cf0 to 6228dfc Compare April 27, 2022 20:32
@giovannipizzi
Copy link
Member

In general you want to ensure the sandbox is on the same filesystem (and /tmp is often in another one), as this can guarantee atomic move of the files once you want to transfer them from sandbox to actual repository. This was needed before, and is needed inside the disk-objectstore. For the sandbox of AiiDA, (possibly?) this might not be needed anymore (but I'm not sure). Note also that we need to check with method we use to move files from sandbox to repository - some that guarantee atomicity don't work (=fail) between different filesystems (I think it's os.rename, but I'd need to double check).

Even if the limitation is not there, also note that people might put the repository on a large filesystem, while TMP might be on a different one (even in RAM, or a smaller FS). So I could imagine some rare but possible situations (I actually had one when trying to migrate an archive - I had to set an ENV VAR to change the location of the TMP folder) where there's not enough space on the TMP partition and the process fails (in this case adding large files to a node), even if there is a lot of empty space in the partition of the repository. So I'm a bit hesitant - maybe we should have an optional key for the tmp location, and by default if missing we keep the old (current) behaviour?

@sphuber
Copy link
Contributor Author

sphuber commented May 3, 2022

For the sandbox of AiiDA, (possibly?) this might not be needed anymore (but I'm not sure).

This is no longer needed, because we no longer use an atomic move. The reason is that we no longer assume that the file repository interface can use move operations. The files are therefore always copied through streams.

some that guarantee atomicity don't work (=fail) between different filesystems (I think it's os.rename, but I'd need to double check).

Indeed it is os.rename that requires it to be on the same filesystem. Across filesystems one should use shutil.move instead. But as mentioned above, this is no longer applicable.

So I'm a bit hesitant - maybe we should have an optional key for the tmp location, and by default if missing we keep the old (current) behaviour?

It is ok to keep it I guess, I just think it shouldn't be hijacking the repository_uri key from the profile. We should just make it a default path in the profile folder. It can simply be statically defined as Config.dirpath / Profile.name / 'sandbox'.

@chrisjsewell
Copy link
Member

I just think it shouldn't be hijacking the repository_uri key from the profile.

I think you mean, you agree what I'd already said should happen in #5330 😉

It can simply be statically defined as Config.dirpath / Profile.name / 'sandbox'.

No, it should be a configuration option on the profile, with a default, not hard-coded

@sphuber sphuber force-pushed the fix/5494/repository-uri branch from 6228dfc to 8070bd6 Compare May 3, 2022 20:46
@sphuber
Copy link
Contributor Author

sphuber commented May 3, 2022

No, it should be a configuration option on the profile, with a default, not hard-coded

Fine by me. It was never configurable before, but I have added it.

@chrisjsewell
Copy link
Member

Fine by me. It was never configurable before, but I have added it.

Cool cheers, the point is that it was always in an "accessible" place before, where you could inspect it if necessary, but now it will not be.

where there's not enough space on the TMP partition and the process fails

I'd note also, "cleaning" of sandbox folders is not particularly reliable, since it's only deleted on (python) garbage collection. So if the process doesn't stop cleanly it won't be removed.
I guess this is improved by having it in a tmp folder, since at least it will always automatically be removed when restarting your system

@chrisjsewell
Copy link
Member

Note, the goal (in this PR or another) should be to fully remove from Profile:

def repository_path(self) -> pathlib.Path:

and place it on the PsqlDosBackend, since it is purely specific to the disk-objectstore

@sphuber
Copy link
Contributor Author

sphuber commented May 4, 2022

I guess this is improved by having it in a tmp folder, since at least it will always automatically be removed when restarting your system

This was also my argument indeed to use that as a default. I personally have a lot of junk in the repository/sandbox folders, and I doubt that a lot of users even know that it exists and look there.

Note, the goal (in this PR or another) should be to fully remove from Profile

Agree, happy to create a follow-up PR to do this.

@chrisjsewell can you also review the code changes perhaps?

@giovannipizzi
Copy link
Member

For the junk in the sandbox - probably we should add the cleaning of the sandbox to the verdi storage maintain - with the full option, we already know the profile is locked so we can clean up safely that folder. If you agree, OK to do in a different PR for me (then, maybe open an issue if you agree?)

@sphuber
Copy link
Contributor Author

sphuber commented May 4, 2022

@giovannipizzi See #5514

@ramirezfranciscof
Copy link
Member

Just commenting to mention that #5514 turned out to be more complex than anticipated, mainly because when doing the maintenance there is no easy way to offer safety for other profiles that use the same folder. You can check the issue for more details, but for this feature would we agree to move forward even without that functionality or is there anything that needs to be adapted?

@sphuber
Copy link
Contributor Author

sphuber commented May 10, 2022

I think we should move ahead with this and leave the advanced functionality for #5514 . These changes are not making things worse. The sandbox folder is currently also not cleaned automatically or performed by some CLI command

@sphuber sphuber force-pushed the fix/5494/repository-uri branch from 8070bd6 to f6dc92b Compare May 12, 2022 15:15
The `SandboxFolder` by default would create the temporary directory in
the `sandbox` subfolder of the repository directory, defined by the
`repository_uri` key in the profile configuration.

However, this folder path should mainly serve to designate the location
of the file repository of the storage backend. So far the sandbox folder
hijacking the same key wasn't a problem as there was only one storage
backend implementation. However, when we allow other storage backend
implementations, having to keep this config key just for the sandbox
folder doesn't make sense. Additionally, the `SandboxFolder` can and
should be a generic utility that is not tightly coupled to the concept
of a `Profile`, as it was, by directly using it in the constructor.

The original reason for storing the sandbox folder in the folder of the
file repository is that this path is configurable and so users could
define a location on file system with plenty space available or a
filesystem that is in any other way more suited than the default
temporary directory of the operating system.

To keep the possibility for this configuration, the `SanboxFolder`
constructor now takes the optional `filepath` argument. If defined, the
sandbox folder will be created in this directory. By default, the folder
will be created in the default temporary directory of the operating
system. The `tempfile.mkdtemp` built in method is used for this purpose.
@sphuber sphuber force-pushed the fix/5494/repository-uri branch from f6dc92b to c34d633 Compare May 13, 2022 07:37
Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is all good in general, just have a couple of minor comments / questions / checks.

aiida/cmdline/commands/cmd_archive.py Show resolved Hide resolved
aiida/engine/processes/calcjobs/calcjob.py Show resolved Hide resolved
aiida/storage/sqlite_temp/backend.py Outdated Show resolved Hide resolved
Comment on lines -10 to 13
"""Tests for `verdi import`."""
# pylint: disable=redefined-outer-name
"""Tests for `verdi archive import`."""
from click.exceptions import BadParameter
from click.testing import CliRunner
import pytest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referring to the whole file, not just these imports: Oh wow, github really goes haywire when you change a couple of indentations 😅. Why did you change this here? I don't see any modifications related to the sandbox folder feature, just the change from a class with an init to methods using the run_cli_command, am I missing anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is just a refactoring to use the pytest fixtures. I think that since I removed the class and just made them normal functions, the deindent makes almost every line seem changed. There are no functional changes though, just refactoring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see. In the future maybe it would be good to separate the commits in which we do this kind of refactoring into a different PR, specially if it is not using the feature changed in the PR. I hope it doesn't sound to nagging, its just that it is a bit more mentally costly to be going over this without being clear on the type of changes I'm supposed to be looking for, more so since github is not properly identifying the changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, it's just that I came across it and it was so tempting to refactor it as it made it so much more succinct. But I agree that it is probably best left to separate refactor PRs.

@sphuber sphuber force-pushed the fix/5494/repository-uri branch from c34d633 to 38bb426 Compare May 13, 2022 09:20
@sphuber sphuber requested a review from ramirezfranciscof May 13, 2022 09:20
This config option can be used to define the directory into which all
sandbox folders will be created. This can be useful if the default
temporary directory of the operating system that would otherwise be
used is not suited, for example because it is on a slower file system or
it has insufficient empty space.

The option is passed to the `SandboxFolder` constructor whenever it is
used by AiiDA's internal code. Since the option requires the presence of
a configuration, the retrieval of the option is added as high up in the
API as possible. This is in order to not couple components to the
configuration that should remain independent.
@sphuber sphuber force-pushed the fix/5494/repository-uri branch from 38bb426 to fbd1c22 Compare May 13, 2022 09:37
Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, @sphuber don't know if you were expecting another review, otherwise maybe set a deadline for anybody else to take a look or hold their peace.

@sphuber
Copy link
Contributor Author

sphuber commented May 13, 2022

Cheers @ramirezfranciscof . More reviews would have been great, but it has been open for almost three weeks with plenty of invitations and have had no response, so I will go ahead and merge this.

@sphuber sphuber merged commit 96a4a6b into aiidateam:main May 13, 2022
@sphuber sphuber deleted the fix/5494/repository-uri branch May 13, 2022 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not use Profile.repository_uri as the location for sandbox creation
4 participants