Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

verdi: add repository maintenance commands to operate on the repository #4321

Closed
sphuber opened this issue Aug 26, 2020 · 15 comments
Closed
Assignees
Labels
priority/critical-blocking must be resolved before next release topic/repository topic/verdi type/accepted feature approved feature request
Milestone

Comments

@sphuber
Copy link
Contributor

sphuber commented Aug 26, 2020

Once the new repository implementation will be merged, the disk object storage (DOS) will need some maintenance now and again to improve performance, that needs to be controllable from verdi. Operations include but are potentially not limited to:

  • Packing all loose objects
  • Deleting packed loose objects
  • Repacking
  • Deleting unreferenced objects
  • Vacuuming the index database

Some of these operations can be executed during operation of AiiDA, whereas some can only be executed as a maintenance operation where the DOS cannot be used. There is also a difference in operation efficiency, where some can be very quick and efficient, whereas others (such as repacking) can potentially be very expensive. Currently the idea is that there should be one command that will be used for the majority of users that just needs to be called periodically to optimize the repository. However, it would be good to expose additional commands for the sub operations listed above such that it provides more control and for example can be called in CRON jobs.

Current proposal of subcommands for the new command verdi repository:

  • backup: perform an rsync of the DOS.
    • ? Should this perform rsync of the state as is, or should it automatically perform some maintenance operations first, or alternatively propose the user with these operations and ask for confirmation
  • maintenance: single-stop-shop command for most users. Will automatically or through prompting optimize the DOS in the most effective way possible.
  • pack: Pack all loose objects
  • clean: Delete packed loose objects, delete unreferenced objects and vacuum the database
    • --delete-loose/--no-delete-loose:
    • --delete-unreferenced/--no-delete-unreferenced:
    • --vacuum-index/--vacuum-index:
    • --repack: will fully repack the DOS, which is not always necessary and potentially expensive, but can greatly improve performance

This is an incomplete proposal and names are in no way definitive. Let's discuss below and I will keep updating the design here once certain things become agreed upon.

@sphuber
Copy link
Contributor Author

sphuber commented Aug 26, 2020

Pinging @ramirezfranciscof and @giovannipizzi

@giovannipizzi
Copy link
Member

Do you prefer to add comments here, or that we modify directly the initial comment?
One comment is that I would move --repack as an option to clean - it's the most aggressive operation and needed to recover space after deleting/cleaning, so logically for me it's the 'last' of clean. I would keep pack instead only to pack loose and to things that can be executed while AiiDA is in use.

@sphuber sphuber changed the title verdi: add vrepository maintenance commands to verdi: add repository maintenance commands to operate on the repository Aug 27, 2020
@sphuber
Copy link
Contributor Author

sphuber commented Aug 27, 2020

I would add comments here as new posts and once we agree I will update the OP to reflect the current design proposal

@ramirezfranciscof
Copy link
Member

  1. I would keep backup separated and have it only do the rsync, no extra optimization operations. Any reasons to include that other stuff?
  2. I'm guessing maintenance is the equivalente of pack + clean --all?
  3. How would clean work by default? All flags are on?

@giovannipizzi
Copy link
Member

giovannipizzi commented Aug 27, 2020

Some tentative answers:

  1. ok to rsync only but there are a few things that it might be good to do before backing up - e.g., calling clean_storage to remove loose objects that have been packed in the meantime. Maybe we can add these as (optional) flags
  2. I think it should prompt through all possible steps down to the most aggressive repacking, with hints on whether it's needed (e.g. quickly compute if the SQLite needs vacuuming, or how many holes there are and if repacking is worthwhile). If the user answers no at some point and it does not make sense to go to the next step it would then just stop asking questions. Essentially, an interactive prompting guide through all maintenance commands.
  3. I think I would have clean just call clean_storage internally, with no vacuum, by default (quick and quite safe). Then add flags for --vacuum, --delete-unrereferenced, and --repack, in order of complexity. I do not know if for these we want to have also the --no version, because I think it would be good e.g. to always re-vacuum after repacking, for instance (I don't remember if this is also enforced), so if not specified, it will be vacuumed if --repack is specified (but only after). It might make things a bit more complex in terms of logic, but we can try to do the right thing for the user (in terms of operations to do and order in which to do them).

@giovannipizzi giovannipizzi added this to the v2.0.0 milestone Mar 24, 2021
@chrisjsewell chrisjsewell self-assigned this May 5, 2021
@sphuber sphuber added priority/critical-blocking must be resolved before next release and removed priority/important labels May 5, 2021
@chrisjsewell
Copy link
Member

chrisjsewell commented May 6, 2021

Correct me if I'm wrong, but there currently seems no performant way to retrieve all file hashes from the AiiDA DB?

Take this example:

n = Dict()
n.put_object_from_filelike(StringIO("a"), "a/b/c/d")
n.put_object_from_filelike(StringIO("a"), "a/b/c/e")
n.put_object_from_filelike(StringIO("b"), "a/b/c/f")
n.store()
pprint.pprint(n.repository_metadata)
{'o': {'a': {'o': {'b': {'o': {'c': {'o': {'d': {'k': 'ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb'},
                                           'e': {'k': 'ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb'},
                                           'f': {'k': '3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d'}}}}}}}}}

To my knowledge, this is the only place these hashes are stored in the DB (i.e. DbNode.repository_metadata = JSONField(default=dict, null=True)), and so there is no "direct" SQL approach to retrieve every hash.
Instead, you have to query though every node and run a (complex/costly) function to flatten the recursive dictionary. Something like:

import collections.abc

def flatten(dictionary, parent_key=False, separator='/'):
    items = []
    for key, value in dictionary.items():
        new_key = str(parent_key) + separator + key if parent_key else key
        if isinstance(value, collections.abc.MutableMapping):
            items.extend(flatten(value, new_key, separator).items())
        elif isinstance(value, list):
            for k, v in enumerate(value):
                items.extend(flatten({str(k): v}, new_key).items())
        else:
            items.append((new_key, value))
    return dict(items)

qb = QueryBuilder()
qb.append(Node, project=["repository_metadata"])
hashes = set()
for data, in qb.iterall():
	hashes.update(flatten(data).values())

why did we choose to store the metadata in this way, rather than just a simpler:

repository_metadata = {
	"a/b/c/d": "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb"
}

it seems like there are multiple places where we have to unnecessarily go through this flattening/unflattening process, when all you want to do is just map an object path to its key/hash.

Related, I find it a bit odd that we have aiida.repository.common.File that can be either a file or a directory.

(also also, on a minor nitpick we use pathlib.Path in the repository code, whereas technically we should be using pathlib.PurePosixPath)

Lastly, I'm not going to push this, because I know its falling on deaf ears, but as I already said in #2046 (comment), if you were coming at this purely from a performance perspective, then you would take the SQLite table (https://github.com/aiidateam/disk-objectstore/blob/a2561a40cf2e9a4d58325387f913ccf08111f5fd/disk_objectstore/models.py#L8) and move it to the AiiDA (postgres) database, from where you could implement a more powerful many-to-one relationship between the node files and their existence/location in the pack files.

@giovannipizzi
Copy link
Member

You'r not falling of deaf ears, it's just that before changing a design decision we should avoid that we then incur on other performance bottlenecks so it would be good to discuss and see pros and cons before changing.

Two comments:

  1. you don't necessarily need to move the whole SQLite table to PSQL; the table you mention in Proposal to add DbFile table #4919 would already make queries fast (but could make other things slower so needs benchmarking as discussed there). Also, importantly: the SQLite DB only contains references to the packed objects; you still need to look to the disk to know if a loose version exists. Therefore, I would still keep the SQLite DB as an internal implementation detail of the object store, and possibly look into strategies like Proposal to add DbFile table #4919 or similar if performance needs to be improved
  2. Do you know how long (seconds) it takes to iterate over everything for a large DB? When I tested, it was still quite fast and acceptable also for large DBs. It would be important to create the structure for a large DB (me, or even better @sphuber or @mbercx can provide you with all the JSON data from one of the big DB after migration), and I'd be interested to compare the speed for various operations (not only querying but also storing, and fetching all UUIDs) both with the current schema, and any future proposal (Proposal to add DbFile table #4919 or others) to compare and decide if we need to change or not

@giovannipizzi
Copy link
Member

(Note that I had benchmarked a preliminary implementation of the repository in this test (now archived) repo and the performance was acceptably fast for all operations that I tested at the time. This included converting to the new format, and back to the old one.
Admittedly, I don't know if I benchmarked only extracting all UUIDs, but it would be good to get an order of magnitude; if it's of the order of 30s even for very large DBs, I think it's acceptable for a maintenance clean operation; if it's much longer, I agree we need to find a better solution

@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2021

We should also keep in mind that querying for specific file names was never a requirement for the design. This has never been a feature and I don't remember anyone asking for it. In addition, collecting all file hashes from the database is indeed only necessary for a maintenance operation as @giovannipizzi stated, and so we can expect that it does not have to be executed very often and also can take a bit longer if necessary. On the other hand, introducing a new table might introduce a slowdown in every day operations, such as retrieving files from a node, because now there is a join involved. I am not saying that that penalty is by definition not acceptable, but I think we should keep this in mind that there are tradeoffs to both designs probably.

@chrisjsewell
Copy link
Member

chrisjsewell commented May 6, 2021

thanks for the responses, although it seems like both of them are more aimed at #4919, rather than the question I asked here, which was:
"why did we choose to store the metadata in this way, rather than just a simpler (path -> key)" (given that this is mainly what all the operations like get_object/put_object require)
and related to that, is it necessary to store empty folders?

@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2021

Empty folders have been a use case for at least RemoteData nodes for CalcJob plugins, so that is why kept the possibility of representing empty folders. Granted, I don't think the current Repository interface actually provides a direct mkdir method but that would be trivial to add. Similarly, there was a discussion whether the door should be kept open to store metadata on both files, directories included. This may have been a premature optimization, but not completely unthinkable given the recent discussion on #2046

Related, I find it a bit odd that we have aiida.repository.common.File that can be either a file or a directory.

Traditionally, a "file" can be understood to be both a file or directory. The file_type attribute is there to distinguish which one it is for each specific case.

@giovannipizzi
Copy link
Member

why did we choose to store the metadata in this way, rather than just a simpler (path -> key)

to guarantee for consistency (see #4919 (comment)) and as we envisioned that when fetching a node, you anyway need all the structure: so it's just a one-row fetch (that BTW you already fetched in memory when you asked for that node), rather than one more JOIN query at the SQL level.

is it necessary to store empty folders?

Yes, see #4919 (comment)

@chrisjsewell
Copy link
Member

chrisjsewell commented May 7, 2021

Traditionally, a "file" can be understood to be both a file or directory. The file_type attribute is there to distinguish which one it is for each specific case.

I'd be interested to see where you got that "traditional" description from?
Anyhow IMO it's definitely a "not ideal" abstraction, exampled by

(a) that you have to enforce certain polymorphic constraints on a File

        if file_type == FileType.DIRECTORY and key is not None:
            raise ValueError('an object of type `FileType.DIRECTORY` cannot define a key.')

        if file_type == FileType.FILE and objects is not None:
            raise ValueError('an object of type `FileType.FILE` cannot define any objects.')

and (b) the fact there are these three methods on the Repository:

def get_object(self, path: FilePath = None) -> File:
def get_directory(self, path: FilePath = None) -> File:
def get_file(self, path: FilePath) -> File:

It would probably be better to mirror the pathlib.Path API

@sphuber
Copy link
Contributor Author

sphuber commented Dec 10, 2021

@ramirezfranciscof can this now be closed since #4965 has been merged? There may be more specific features that we put here originally, such as verdi storage backup, but I suggest we put that in separate issues. All critical functionality for v2.0 is in there, no?

@ramirezfranciscof
Copy link
Member

As far as I understood it, yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/critical-blocking must be resolved before next release topic/repository topic/verdi type/accepted feature approved feature request
Projects
None yet
Development

No branches or pull requests

4 participants