
Implement the new file repository #4345

Merged

Conversation

sphuber
Contributor

@sphuber sphuber commented Sep 1, 2020

Fixes #3445
Fixes #1663
Fixes #64
Fixes #1675
Fixes #3437
Fixes #2580
Fixes #3764
Fixes #4020

@codecov

codecov bot commented Sep 3, 2020

Codecov Report

Merging #4345 (44a3c11) into develop (4d90f37) will increase coverage by 0.06%.
The diff coverage is 89.09%.

❗ Current head 44a3c11 differs from pull request most recent head 0e250e6. Consider uploading reports for the commit 0e250e6 to get more accurate results

@@             Coverage Diff             @@
##           develop    #4345      +/-   ##
===========================================
+ Coverage    79.62%   79.67%   +0.06%     
===========================================
  Files          519      523       +4     
  Lines        37111    37166      +55     
===========================================
+ Hits         29546    29609      +63     
+ Misses        7565     7557       -8     
Flag         Coverage Δ
django       74.27% <82.96%> (-0.09%) ⬇️
sqlalchemy   73.19% <79.60%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                       Coverage Δ
aiida/cmdline/commands/cmd_database.py               62.41% <0.00%> (ø)
aiida/cmdline/commands/cmd_devel.py                  43.48% <ø> (+0.63%) ⬆️
aiida/cmdline/commands/cmd_setup.py                  49.13% <0.00%> (-3.70%) ⬇️
aiida/manage/database/integrity/duplicate_uuid.py    31.71% <ø> (-61.77%) ⬇️
aiida/orm/nodes/process/calculation/calcjob.py       76.15% <ø> (+1.59%) ⬆️
aiida/orm/nodes/process/process.py                   86.09% <ø> (-0.51%) ⬇️
aiida/restapi/translator/nodes/node.py               85.44% <0.00%> (ø)
aiida/tools/importexport/common/utils.py              26.83% <ø> (-4.98%) ⬇️
aiida/backends/manager.py                             66.67% <33.34%> (-3.44%) ⬇️
aiida/manage/configuration/__init__.py                78.58% <40.00%> (-0.42%) ⬇️
... and 72 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d90f37...0e250e6. Read the comment docs.

@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 13 times, most recently from 96b49db to 7941131 on September 26, 2020 23:09
@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 5 times, most recently from 274264a to 4d6f6b1 on October 3, 2020 21:12
@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 2 times, most recently from 198ea83 to 4ea60c7 on October 11, 2020 22:17
@chrisjsewell
Member

I've totally broken your PR now @sphuber 😆
I suggest that, when you are ready to pick this up again, you revert all changes to aiida/tools/importexport and then rebase (I've made a copy of your branch to preserve these changes for reference); then I can make a PR to your branch with the archive updates.

@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 2 times, most recently from 9ff5293 to 21cb32f on November 5, 2020 12:14
@chrisjsewell
Member

chrisjsewell commented Nov 6, 2020

Discussion point (cc @sphuber, @giovannipizzi, @ltalirz)

In the current PR implementation, for the archive format, a disk-objectstore container is embedded inside the archive. I think this is an assumption that requires further discussion:

The disk-objectstore, as I understand it, is essentially a compression format with concurrent read/write access and the ability to remove/overwrite objects in place. These write features are not part of standard compression formats (zip, tar, ...) and are critical for the functioning of the AiiDA repository.
However, it is not necessarily required for the archive, where (at least currently) we only require the archive to be written from scratch (no overwrite) and from a single process (no write concurrency).

The drawback of having a disk-objectstore within the archive is that you are essentially storing one compression format inside another. The key cons are then:

  1. It increases the complexity of the archive format, making it less easy to manipulate and potentially having implications for long-term support.
  2. You cannot introspect/read these objects without first fully extracting the archive to a folder, then using the disk-objectstore API (which includes initialising an sqlite connection)

An alternative approach would be to simply store the archive as:

metadata.json
data.json
objects/
    <hash-key-1>
    <hash-key-2>
    ...

It should then be possible to directly stream the object bytes to/from the (compressed) archive <-> AiiDA disk-objectstore, and I see no reason why this could not be reasonably performant compared with directly streaming from one container to another (but without the initial decompression overhead).
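For illustration, a rough sketch of that kind of streaming (not code from this PR): members under `objects/` are read from the zip as streams and handed to a hypothetical `store_stream` callable, which stands in for whatever streamed-write API the target container exposes.

```python
import zipfile
from typing import BinaryIO, Callable


def import_objects(archive_path: str, store_stream: Callable[[BinaryIO], str]) -> dict:
    """Stream every member under ``objects/`` from the zip into a container.

    ``store_stream`` is a hypothetical callable that consumes a binary stream
    and returns the key under which the content was stored.
    """
    key_map = {}
    with zipfile.ZipFile(archive_path, 'r') as archive:
        for name in archive.namelist():
            if not name.startswith('objects/') or name.endswith('/'):
                continue  # skip anything that is not an object file
            with archive.open(name, 'r') as handle:
                key_map[name.rsplit('/', 1)[-1]] = store_stream(handle)
    return key_map
```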

I'd note one con/issue I've found with Python's zipfile.ZipFile: it generates a ZipInfo list for all objects (files + folders) on initiation, which can lead to high memory usage for a zip containing a large number of objects (O(n)) (see #4534 (comment)).
Although this will be partially mitigated by the fact that objects are no longer duplicated and no nested folders need to be stored (as in the current format).
I also have some ideas to "fix" this (at least for targeted read operations).

Actually I read the memory wrong: reading a zip file with ~600,000 objects peaks at 0.4 GB, not 4 GB 🤦, so it is not quite as big a deal.

@chrisjsewell chrisjsewell linked an issue Nov 11, 2020 that may be closed by this pull request
@giovannipizzi
Member

Main comment to @chrisjsewell: your analysis is correct, I think.
The important benchmark to do is whether adding millions of objects to a ZipFile object is performant or not.
If you do some tests and verify that putting, say, 10 million empty files in a zip is efficient from a memory and time point of view, both when reading and writing, and that it is also performant to move data from a disk-objectstore to a zip in this format and vice versa, then I agree your suggestion is better.
Before any complex implementation, a simple script benchmarking this would be great, and depending on the memory usage and CPU time we can decide what to do.
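A minimal sketch of such a benchmark (standard library only, not from this PR); the object count is a placeholder to be scaled up towards the 10 million mentioned above:

```python
import time
import tracemalloc
import zipfile

N = 100_000  # number of objects; increase towards 10 million for the real test


def benchmark_zip(path: str = 'bench.zip', n: int = N) -> None:
    """Write ``n`` empty members to a zip, read them back, report time and peak memory."""
    tracemalloc.start()

    start = time.perf_counter()
    with zipfile.ZipFile(path, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
        for i in range(n):
            archive.writestr(f'objects/{i:08x}', b'')  # empty objects, as suggested
    write_time = time.perf_counter() - start

    start = time.perf_counter()
    with zipfile.ZipFile(path, 'r') as archive:
        for name in archive.namelist():
            with archive.open(name) as handle:
                handle.read()
    read_time = time.perf_counter() - start

    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f'write: {write_time:.1f}s  read: {read_time:.1f}s  peak memory: {peak / 1e6:.1f} MB')


if __name__ == '__main__':
    benchmark_zip()
```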

@sphuber
Contributor Author

sphuber commented Nov 11, 2020

Note that the implementation I chose here was just to get things working, given that we were planning to overhaul the entire archive format as a whole. I therefore chose to change as little as possible and keep the implementation as simple as possible. This at least guarantees that the entire code base works and is operational. Since we wouldn't release this before the new archive format is there anyway, we can even omit this intermediate format and its migrations once the new format and implementation are in place.

sphuber added 12 commits April 28, 2021 07:53
The `Repository` class is the new interface to interact with the file
repository. It contains a file hierarchy in memory to know what is
contained within it. A new instance can be constructed from an existing
hierarchy using the `from_serialized` class method.

The actual storing of the objects in a repository is delegated to a
backend repository instance, represented by `AbstractRepositoryBackend`.
This backend class is only tasked with consuming byte streams and
storing them as objects in some repository under a unique key that
allows the object to be retrieved at a later time. The backend
repository is not expected to keep a mapping of what it contains. It
should merely provide the interface to store byte streams and return
their content, given their unique corresponding key.

This also means that the internal virtual hierarchy of a ``Repository``
instance does not necessarily represent all the files that are stored by
the repository backend. The repository exposes a mere subset of all the file
objects stored in the backend. This is why object deletion is also
implemented as a soft delete, by default, where the files are just
removed from the internal virtual hierarchy, but not in the actual
backend. This is because those objects can be referenced by other
instances.

The interface is implemented for the `disk-objectstore` which is an
efficient key-value store on the local file system. This commit also
provides a sandbox implementation which represents a temporary folder on
the local file system.
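As a rough sketch of the backend contract described above (the method names here are illustrative and not necessarily those of the actual `AbstractRepositoryBackend`):

```python
import abc
import typing as t


class AbstractRepositoryBackend(abc.ABC):
    """Store byte streams under unique keys; keeps no file hierarchy itself."""

    @abc.abstractmethod
    def put_object_from_filelike(self, handle: t.BinaryIO) -> str:
        """Consume a byte stream and return the key under which it was stored."""

    @abc.abstractmethod
    def open(self, key: str) -> t.BinaryIO:
        """Return a stream with the content of the object identified by ``key``."""

    @abc.abstractmethod
    def has_object(self, key: str) -> bool:
        """Return whether an object with the given key exists."""
```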
Since now all nodes will have a full representation of the virtual file
hierarchy of their repository content, which is stored in the database
in the `repository_metadata` column, it is crucial that the format of
that metadata is optimized to use as little data as possible, to prevent
the database from bloating unnecessarily.

Currently, we only define two file types:

 * Directories: which can contain an arbitrary number of sub objects
 * Files: which cannot contain objects, but instead have a unique hashkey

This means that we can serialize the metadata without encoding the
explicit type: if an entry contains a hashkey, it is necessarily a file;
otherwise, it has to be a directory.

The serialization format therefore can make do with simple nested
dictionaries, where each entry either contains the key `k`, indicating
it is a file, or `o`, which can be a dictionary itself, containing the
objects contained within that directory. Note that for an empty directory
we can even safely omit the `o` key: since there is no `k`, we know it has
to be a directory anyway.

This logic is implemented in `FileObject.from_serialized`, which can fully
reconstruct all instance properties by inferring them from the sparse
serialized data.
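For illustration, a possible shape of such serialized metadata and the corresponding file test (the hashkeys are truncated placeholders, and this is a sketch rather than the actual AiiDA code):

```python
# A file entry carries the hashkey under `k`, a directory carries its children
# under `o`, and an empty directory can omit `o` altogether.
serialized = {
    'o': {
        'input.txt': {'k': '2f42ab...'},   # file: has a hashkey
        'raw_inputs': {                    # directory: has sub objects
            'o': {
                'job.sh': {'k': '9a1c77...'},
            },
        },
        'empty_dir': {},                   # directory: no `k`, `o` omitted
    },
}


def is_file(entry: dict) -> bool:
    """An entry is a file if and only if it carries the hashkey `k`."""
    return 'k' in entry
```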
This code was put in place because a naive backup of the file repository
was already impossible for reasonably sized projects. The problem was the
underlying design, where each node had an associated folder on disk, even
if it contained no files, and all files were stored separately, creating
a huge number of files. This meant that just rsync'ing the folder could
take days, even just to calculate the difference with the backed-up
folder.

The ad-hoc solution was this script that would only transfer the folders
of the nodes added since the last backup date, the list of which was
determined by a simple query. However, now that the new file repository
is implemented, which has been explicitly designed to be easily backed
up by tools like `rsync`, this custom solution is no longer needed.
The new `Repository` implementation is integrated by wrapping it in a
new class `NodeRepository` that is mixed into the `Node` class. This
approach serves multiple purposes.

 1. The class serves as a backwards compatibility layer between the old
    repository interface, which would allow clients to set the mode for
    file handles for reading, as well as passing in text-like streams,
    and the new interface, which exclusively operates with bytes. The
    interface will decode the bytes when the caller is expecting normal
    strings, in order not to break the old interface (see the sketch after
    this list).

 2. It makes it possible to implement the concept of immutability, which is
    a concept of the `Node` and is completely unknown to the `Repository`
    class.

 3. It nicely separates the file repository functionality into a
    separate file, preventing the already huge `Node` implementation from
    getting even larger.

The downside of the latter is that the `NodeRepository` class contains
some potentially confusing logic. It accesses `Node` properties, but
from the code it is not clear where they come from. Since this class is
only meant to be used as a mixin for the `Node` class, I think that is
a sacrifice worth the benefits.
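A minimal sketch of the backwards-compatibility shim described in point 1, assuming a wrapped `Repository` that always returns bytes (the method and attribute names are illustrative, not the actual interface):

```python
import typing as t


class NodeRepositoryMixin:
    """Sketch of the bytes-to-text shim between the old and new interfaces."""

    def __init__(self, repository):
        self._repository = repository  # wrapped ``Repository``-like object, operates on bytes

    def get_object_content(self, path: str, mode: str = 'r') -> t.Union[str, bytes]:
        """Return the content of ``path``, decoding when the caller expects text."""
        raw: bytes = self._repository.get_object_content(path)
        if 'b' in mode:
            return raw  # new-style caller: hand back the bytes untouched
        return raw.decode('utf-8')  # old-style caller: expects a regular string
```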
This ensures that the repository is still mutable for sealable nodes as
long as they are not yet sealed. After sealing, the repository of even
these nodes becomes fully immutable.

Note that the implementation completely overrides the methods of the
`NodeRepositoryMixin`, because it needs to override the check in those
methods that raises if the node is stored. An alternative would have been
to call `super` from the `Sealable`, which, due to the correct MRO, would
in fact call the underlying method on the `NodeRepositoryMixin`, but it
would need to accept an argument, for example `force`, to skip the
mutability check. This is not desirable, however, since those repository
methods are public methods of the `Node` interface, and so any user would
be able to disable the mutability check for any node.

This solution does not suffer from that vulnerability, but the downside
is that the implementation of the method from the `NodeRepositoryMixin`
needs to be copied and kept in sync.
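A schematic sketch of the trade-off described above (the class and method shapes are illustrative, not the actual AiiDA code):

```python
class NodeRepositoryMixin:
    """Sketch: repository mutators refuse to run once the node is stored."""

    is_stored = False
    is_sealed = False

    def put_object_from_filelike(self, handle, path):
        if self.is_stored:
            raise RuntimeError('the node is stored and therefore immutable')
        self._write(handle, path)

    def _write(self, handle, path):
        ...  # actual write to the repository


class Sealable(NodeRepositoryMixin):
    """Sealable nodes stay mutable after storing, until they are sealed."""

    def put_object_from_filelike(self, handle, path):
        # The check is copied rather than calling ``super()`` with a ``force``
        # flag, because such a flag would let any caller bypass immutability.
        if self.is_sealed:
            raise RuntimeError('the node is sealed and therefore immutable')
        self._write(handle, path)
```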
This section describes in detail the new file repository design,
including the design guidelines that were kept in mind and what
requirements it was designed to respect. It also gives a detailed
overview of how the implementation is integrated into the existing
code base.
The migration of an existing legacy file repository consists of:

 1. Making an inventory of all files in the current file repository
 2. Writing these files to the new backend (disk object store)
 3. Storing the metadata of a node's repository contents, which contains the
    virtual hierarchy, in the corresponding `repository_metadata` column
    in the database.

The repository metadata will contain a hashkey for each file that was
written to the disk object store; this key is generated by the latter and
can be used to retrieve the content of the file.

The migration is performed in a single database migration, meaning that
everything is executed in a single transaction: either the entire
migration succeeds or, in case of an error, it is a no-op. This is why the
migration will not delete the legacy file repository in the same
migration.

The advantage of this approach is that there is no risk of data loss
and/or corruption: if the migration fails, all original data will be
intact. The downside, however, is that the content of the file repository
is more or less duplicated. This means that the migration will require a
potentially significant amount of disk space, at worst equal to the size
of the existing file repository. This should be the upper limit, since
the disk object store will both deduplicate and compress the content
that is written.
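A rough sketch of those three migration steps; `backend`, `get_nodes` and `set_repository_metadata` are hypothetical stand-ins for the disk object store backend and the database accessors, not the actual migration code:

```python
import os


def migrate_legacy_repository(legacy_root, backend, get_nodes, set_repository_metadata):
    """Inventory the legacy folders, write files to the new backend, store the metadata."""
    for node_uuid, node_folder in get_nodes(legacy_root):
        metadata = {'o': {}}

        for dirpath, _, filenames in os.walk(node_folder):
            for filename in filenames:
                filepath = os.path.join(dirpath, filename)
                with open(filepath, 'rb') as handle:
                    hashkey = backend.put_object_from_filelike(handle)
                relative = os.path.relpath(filepath, node_folder)
                _insert(metadata, relative, hashkey)

        set_repository_metadata(node_uuid, metadata)


def _insert(metadata: dict, relative_path: str, hashkey: str) -> None:
    """Insert a file at ``relative_path`` into the nested ``k``/``o`` structure."""
    *parents, name = relative_path.split(os.sep)
    current = metadata
    for part in parents:
        current = current.setdefault('o', {}).setdefault(part, {})
    current.setdefault('o', {})[name] = {'k': hashkey}
```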
Use the `Container.export` method in export/import functionality
The new repository implementation, using the `disk-objectstore`
underneath, now provides a UUID for each repository. Currently, each
database can only be configured to work with a single repository. By
writing the UUID of the repository into the database, it will become
possible to enable a consistency check of the repository and database
that are configured for a particular profile. This will prevent
accidental misconfigurations where the wrong repository is coupled to a
particular database.

The repository UUID is generated automatically by the `disk-objectstore`
library when the container is created and it provides an interface to
retrieve it. This value is written to the database in the `verdi setup`
command when a new profile is created. If the database already has the
repository UUID setting defined, it will be cross-referenced with that
of the repository to make sure they are compatible. This case is to
facilitate the creation of a new profile for an existing repository and
database. However, if the UUIDs do not match, the setup command fails.

For databases that were created before this commit, the database
migration that performed the migration of the legacy file repository
includes a statement that will insert the UUID of the new object store
container once it has been created.
Since the migration to the new repository implementation, each file
repository has its own UUID. This UUID is now written to the settings
table of the database that it is associated with. This makes it possible to check
that the repository and database that are configured for a profile are
compatible.

The check is added to the `Manager._load_backend` method, as this is the
central point where the database is loaded for the currently loaded
profile. We need to place the check here since in order to retrieve the
repository UUID from the database, the corresponding backend needs to be
loaded first.

If the UUID of the repository and the one stored in the database are
found to be different, a warning is emitted instructing the user to make
sure the repository and database are correctly configured. Since this is
new functionality and its stability is not known, a warning is chosen
instead of an exception, in order to prevent AiiDA from becoming unusable in
the case of an unintentional bug in the compatibility checking. In the
future, when the check has been proven to be stable and reliable, the
warning can be changed into an exception.
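A minimal sketch of such a check, assuming a hypothetical accessor for the setting stored in the database (this is not the actual `Manager._load_backend` code):

```python
import warnings


def check_repository_compatibility(backend, repository_uuid: str) -> None:
    """Warn if the repository UUID does not match the one recorded in the database.

    ``backend`` stands in for the loaded database backend and is assumed to
    expose the repository UUID written to its settings table; the accessor
    name and setting key below are placeholders.
    """
    database_uuid = backend.get_global_setting('repository|uuid')  # hypothetical accessor and key

    if database_uuid is not None and database_uuid != repository_uuid:
        warnings.warn(
            'The database and repository configured for this profile do not seem to belong '
            f'together: the database expects repository {database_uuid}, but the configured '
            f'repository has UUID {repository_uuid}. Please check the profile configuration.'
        )
```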
The `from disk_objectstore import Container` import has a significant
loading time. The `aiida.manage.configuration.profile` module imported
it at the top level, and since this module is loaded when `verdi` is
called, the load time of `verdi` was significantly increased. This would
have a detrimental effect on the tab completion speed.

The workaround is to import the module only within the methods that use
it. Ideally, the loading of this library would not be so costly.
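A sketch of the workaround, with the import deferred to the method that needs it (the surrounding class and method are illustrative, only the placement of the import matters):

```python
class Profile:
    """Sketch of the deferred import to keep module import time low."""

    def __init__(self, repository_path: str):
        self._repository_path = repository_path

    def get_repository_container(self):
        # Imported here instead of at module level: importing ``Container`` is
        # costly and would slow down ``verdi`` start-up and tab completion.
        from disk_objectstore import Container
        return Container(self._repository_path)
```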
@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch from c3eb2d2 to 0e250e6 on April 28, 2021 05:53
@chrisjsewell
Member

chrisjsewell commented Apr 28, 2021

Cheers @sphuber
Can you confirm that anything outstanding (bugs/improvements) has been captured in an issue (or issues)?
Or actually, it may be good anyway to open a "meta" discussion outlining the good, the bad and the ugly about the new repository system.
A few things that come to mind:

  • If I remember correctly, the user is now responsible for periodically running container packing
  • There is no longer an "easy" way to look/search through stored files (i.e. you now have to go through the API/CLI if you want to look at simulation raw outputs)
  • the documentation for repository backup has been replaced with a todo
  • the archive format (+ migration) is not finalised

@sphuber
Contributor Author

sphuber commented Apr 28, 2021

As far as I am aware there are no outstanding known bugs with the implementation. There are quite a few features that we still want to add before the v2.0 release though, which are part of the relevant milestone:

#4321
#4601
#695

There might be other things that do not yet have an issue, but I would suggest just adding them to the v2.0 milestone, unless it is a non-pressing feature that can be released later, of course.

I noticed someone created the milestone "Preparations for the new repository", but the issues don't seem to be related to the new repository whatsoever. If the bugs need to be fixed for 2.0, I would simply move them to that milestone and close this one.

@chrisjsewell
Member

but the issues don't seem to be related to the new repository whatsoever.

yeh I think @ramirezfranciscof changed the name from the 1.6.x milestone, but they can all go in 2.0

@sphuber sphuber mentioned this pull request Apr 28, 2021
@sphuber
Contributor Author

sphuber commented Apr 28, 2021

Just so we are clear: this branch should be merged with a normal merge commit, not rebase-merged or squashed.

@sphuber
Contributor Author

sphuber commented Apr 28, 2021

yeh I think @ramirezfranciscof changed the name from the 1.6.x milestone, but they can all go in 2.0

I moved them all there and closed that milestone.

@chrisjsewell
Member

chrisjsewell commented Apr 28, 2021

Just so we are clear: this branch should be merged with a normal merge commit, not rebase-merged or squashed.

See now you're just daring me to squash
