Support abstract files and directories #1009

Closed · 9 tasks done · Tracked by #10672
lukpueh opened this issue Mar 30, 2020 · 27 comments

@lukpueh (Member) commented Mar 30, 2020

So far we have built on the assumption that both target files and TUF metadata can be loaded from and written to the local filesystem. This, however, is not a necessity. In a large-scale production environment (e.g. Python's Warehouse, see PEP 458), the TUF repository management code (most notably repository_tool and its underlying repository_lib) can run, and is likely to run, on a different node than the one where the TUF metadata files or target files reside. To support distributed operation, the TUF repository code needs to be updated as outlined below.

metadata files

  • Provide an abstract file interface that supports at least reading and writing files, creating directories and listing files in a directory (implement this in securesystemslib).

  • Provide a file service that implements the abstract file interface and performs said file operations on the local file system to be used below as default file backend (implement this in securesystemslib).

  • Update TUF repository code to create a new or load an existing TUF repository, to obtain hashes and sizes of metadata files, and to persist metadata files, all using a customizable file backend.

    (repository_lib.generate_snapshot_metadata, repository_lib.write_metadata_file, repository_tool.create_new_repository, repository_tool.load_repository (**))

  • Update securesystemslib code that is currently used by TUF repository code for file operations to support the use of a customizable file backend.

    (util.get_file_details, util.load_json_file, util.persist_temp_file, hash.digest_filename(**))

  • Revise file existence checks (os.path.{isfile,isdir,exists}) in TUF repository code, and depending on which seems less invasive, or generally better suited, either

    • switch to a file-system-agnostic EAFP style (e.g. catching IOError; see the sketch below), or
    • allow TUF to perform these checks using a customizable file backend.

    (repository_lib.generate_targets_metadata, repository_lib.write_metadata_file, repository_lib._check_directory, repository_lib._delete_obsolete_metadata, repository_lib._load_top_level_metadata(**))

(**) Non-exhaustive list of probably affected functions.
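
To illustrate the EAFP option in the item above, a file-system-agnostic existence check might look like this (a sketch only; the path is illustrative):

from securesystemslib.util import load_json_file

metadata_path = 'repository/metadata.staged/root.json'  # illustrative path

# LBYL, tied to the local filesystem:
#   if os.path.isfile(metadata_path): ...

# EAFP, agnostic of where the file actually lives:
try:
  metadata = load_json_file(metadata_path)
except IOError:
  metadata = None  # a missing file is handled like a failed existence check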

target files

@joshuagl and @sechkova have submitted PRs that decouple abstract targets in TUF metadata from their physical equivalents on disk. This work includes:

  • removal of file existence checks in user functions that add target files to the internal TUF metadata store (Adopt a consistent behavior when adding targets and paths #1008),

  • support for a fileinfo argument on add target user functions to pass out-of-band obtained hashes and sizes of files (Enhancements for hashed bin delegation #1007),

  • support for a use_existing_fileinfo argument on write metadata user functions, to use previously passed hashes and sizes instead of obtaining them by reading files on disk.

  • Update TUF repository code to obtain hashes and sizes of target files using a customizable file backend. (Note that the above PRs suffice to operate TUF with non-local target files, hence this sub-feature request is low-prio.)

@woodruffw (Contributor)

Thanks for adding the tracker for this!

I'm adding some of my own notes below, for reference:

def create_new_repository(repository_directory, repository_backend=FilesystemBackend, repository_name='default'):
    # ...

def load_repository(repository_directory, repository_backend=FilesystemBackend, repository_name='default'):
    # ...

where repository_backend(repository_directory, repository_name) instantiates the provider. Direct I/O would then become something like this:

backend = repository_backend(repository_directory, repository_name)
# ...
backend.put(filename, file_contents)

@lukpueh (Member, Author) commented Mar 30, 2020

Thanks for sharing your notes, @woodruffw. I think we can brainstorm here a bit and then fork out into separate issues (if necessary) or directly into PRs, one for each of the items marked with a checkbox above.

@JustinCappos (Member)

We should talk a bit about what the snapshot role means and how it functions in this environment. Already we have a need in other domains for greater scale without a huge amount of state. We'd need to consider how to get rollback protection while still having a design where synchronization / data transmitted is minimized...

@lukpueh (Member, Author) commented Mar 30, 2020

Thanks for chiming in, @JustinCappos. What do you have in mind? Do you think the snapshot role should change?

@trishankatdatadog (Member)

@JustinCappos did you mean to say this in the context of Notary v2 or here?

@JustinCappos (Member)

I'm thinking about both contexts. I see it as a potential problem in each place.

@trishankatdatadog (Member) commented Mar 31, 2020

I'm thinking about both contexts. I see it as a potential problem in each place.

Are we removing snapshot here? Not sure what you mean...

@JustinCappos (Member)

I certainly wouldn't be in favor of removing it. I'd like to perhaps discuss if there are ways to have rollback prevention functionality without explicitly listing all versions in the snapshot for all packages so all clients see (i.e., download) all version numbers.

One possible solution is as follows. Create a Merkle tree that gets published periodically, where a signed copy of the latest version number for every package is published by the repo. To show freshness, one could show that a version number exists in each tree and is non-decreasing. This has roughly a (log2(N) * age / period) cost, where age is how long the package has been in the repo (or all time). So, slightly increased per-download cost, but usually this will be much smaller than snapshot.
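
To make that cost concrete, a back-of-the-envelope sketch (all numbers are illustrative assumptions, not Warehouse measurements):

import math

num_packages = 100_000     # N, roughly Warehouse-sized
hash_len = 32              # bytes per hash in a Merkle inclusion proof
age_over_period = 52       # age / period: one year of weekly trees

# One inclusion proof per published tree, ~log2(N) hashes each.
proof_nodes = math.ceil(math.log2(num_packages))        # 17
total_bytes = proof_nodes * hash_len * age_over_period  # 28288, ~28 KiB

print(proof_nodes, total_bytes)

Even at tens of kilobytes of proofs, this would typically stay well below a snapshot that lists version numbers for every package in the repo.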

This scheme does lose protection from changes less than the period if done as described above. (I believe I have a partial fix for this, but it adds more complexity.)

Anyways, I'll stop here for thoughts before moving further along...

@JustinCappos (Member)

Cc @SantiagoTorres @mnm678, in case they want to follow this as well.

@woodruffw (Contributor)

I might be missing something, but I don't think this work necessarily requires any changes to the volume or size of data transmitted -- having (tens of) thousands of metadata files isn't a concern for Warehouse, they just need to be stored somewhere that isn't a local filesystem on a particular host 🙂

@JustinCappos (Member)

Sure, it's more that things like the snapshot metadata file need to list the versions and names of all of these. We did a bunch of work to show this will work for Warehouse, but it does get to be somewhat large in b/w cost. We could analyze whether a different approach may be more efficient in terms of space / bandwidth. From your side, I hope this will just be a different type of file stored on the backend, so it probably shouldn't matter.

@lukpueh (Member, Author) commented Apr 1, 2020

I'd like to perhaps discuss if there are ways to have rollback prevention functionality without explicitly listing all versions in the snapshot for all packages so all clients see (i.e., download) all version numbers.

I agree that it's a good idea to discuss storage + b/w optimization. But I think we can handle the feature request here independently. We would need to support file operations on remote filesystems, no matter what is stored in snapshot, right?

@trishankatdatadog (Member)

Yes, agreed, we should separate the snapshot issue vs this issue.

@joshuagl (Member) commented Apr 8, 2020

I spent a couple of hours sketching out the feasibility of this. Some notes:

  • Observation: securesystemslib.util.persist_temp_file() is used to ensure a consistent file is written somewhat atomically to disk. This can effectively become an implementation detail of the local filesystem storage backend and may not need to be a public method any longer.
  • Recommendation: we'll need to decide what to do with the consistent snapshot writing. Currently tuf defaults to copying files, with an option to hardlink. I propose we remove tuf.settings.CONSISTENT_METHOD and have the local filesystem backend implementation default to copying (the only option that works on Windows, afaict). If we hear that folks were using the hardlink method and would like us to add that back we can figure out how to support it later.
  • Open: for a local storage backend we can do everything without maintaining any state, and therefore as static/class-level methods. For a non-local storage backend I don't think we can get away with that (i.e. we'll need some authentication information, at least). This will require a mechanism to initialise and configure the filesystem backend, such as reading a config file and/or environment variable at package or module initialisation. It feels somewhat cleaner to implement such a mechanism in tuf. The securesystemslib functions accessing storage can all take an object that quacks like a FilesystemBackend; if one isn't passed, securesystemslib can use a default FilesystemBackend (see the sketch after this list). tuf can then have a configuration option to choose which storage backend to use, initialise it (at repo initialisation?) and pass it to securesystemslib functions that require a FilesystemBackend.
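
A sketch of that defaulting behaviour (the module path and function body are assumptions for illustration, not the final securesystemslib API):

from securesystemslib.storage import FilesystemBackend  # assumed import path

def get_file_details(filepath, storage_backend=None):
  # Fall back to the local filesystem backend if the caller passes none.
  if storage_backend is None:
    storage_backend = FilesystemBackend()

  file_length = storage_backend.getsize(filepath)
  # ... read via storage_backend.get(filepath) to compute the hashes ...
  return file_length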

Looking at the TUF repository code to create a new or load an existing TUF repository, to obtain hashes and sizes of metadata files, and to persist metadata files (features provided by repository_lib.generate_snapshot_metadata, repository_lib.write_metadata_file, repository_tool.create_new_repository, repository_tool.load_repository, and possibly more), we'll need abstract implementations of:

  • open()
  • write()
  • os.path.getsize()
  • os.makedirs()
  • os.listdir()

We could implement an interface something like the following pseudocode:

class FilesystemBackend(object):
  def get(self, filepath):
    # Return a readable file-like object for the file at filepath.
    raise NotImplementedError

  def put(self, fileobj, filepath):
    # Store the contents of the file-like object fileobj at filepath.
    raise NotImplementedError

  def getsize(self, filepath):
    # Return the size in bytes of the file at filepath.
    raise NotImplementedError

  def create_folder(self, filepath):
    # Create a directory, including any missing parents, at filepath.
    raise NotImplementedError

  def list_folder(self, filepath):
    # Return a list of the names of the files in the directory at filepath.
    raise NotImplementedError

@lukpueh (Member, Author) commented Apr 8, 2020

This all sounds great. Thank you for thinking it through, @joshuagl!

Regarding hard link vs. copy:
Your suggestion here also aligns with what we agreed on in the tuf/warehouse meeting (with @woodruffw, @mnm678, @trishankatdatadog, @joshuagl and @sechkova).

@JustinCappos, as someone involved in the original discussion way back, are you fine with removing support for hard-linking files for now?

If the need arises, I suggest we add an option such as hard_link_on_consistent_snapshot to any of the FilesystemBackend implementations that support linking files, and implement it in something like a mixin (see the sketch below).
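
A rough sketch of that idea (all names hypothetical; FilesystemBackend stands in for whichever local backend implementation we end up with):

import os
from securesystemslib.storage import FilesystemBackend  # assumed import path

class HardLinkMixin(object):
  # Hypothetical mixin for backends whose storage supports hard links.
  hard_link_on_consistent_snapshot = True

  def link(self, source_path, destination_path):
    # Hard-link instead of copying when writing consistent snapshot files.
    if os.path.exists(destination_path):
      os.remove(destination_path)
    os.link(source_path, destination_path)

# A backend opts in to linking by mixing the behaviour in.
class LinkingFilesystemBackend(HardLinkMixin, FilesystemBackend):
  pass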

@JustinCappos (Member) commented Apr 8, 2020 via email

@woodruffw (Contributor)

That proposed interface looks fantastic, thanks a ton @joshuagl!

@lukpueh (Member, Author) commented Apr 9, 2020

Here are some notes from a conversation with @joshuagl and @sechkova yesterday:

  • securesystemslib should only provide an abstract base class / interface for a FilesystemBackend and one implementation of the interface for local files (see tasks 1 and 2 in issue description).
  • Suggestions on how to define the abstract base class / interface are welcome. Should we consider using abc or similar tools? (cc @woodruffw, @SantiagoTorres)
  • Implementations of the interface to support a particular remote filesystem (e.g. S3, GCS, etc.) should be provided by a TUF adopter such as warehouse.
  • Instances of the interface should receive all information to set up remote connections (URLs, credentials, etc.) explicitly from TUF (as opposed to loading and parsing TUF specific settings or configuration files on their own).
  • Instances of the interface should raise only exceptions of a known base class, so that TUF code can handle them independently of the type of FilesystemBackend. (see task 5 in issue description). Suggestions on how to implement this are welcome.
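
One way to satisfy that last point is for each backend to translate implementation-specific errors into the agreed-upon base class. A sketch (StorageError, S3Backend, and the client attributes are all invented for illustration):

class StorageError(Exception):
  # The known base class that every backend raises.
  pass

class S3Backend(object):
  # Hypothetical remote backend wrapping a boto3 S3 client (setup not shown).
  def __init__(self, client, bucket):
    self._client = client
    self._bucket = bucket

  def get(self, filepath):
    try:
      return self._client.get_object(Bucket=self._bucket, Key=filepath)["Body"]
    except Exception:
      # Re-raise as the known base class so TUF code can handle storage
      # failures without knowing anything about boto3.
      raise StorageError("can't read {}".format(filepath))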

@lukpueh (Member, Author) commented Apr 9, 2020

Here is a list of different approaches, discussed with @joshuagl and @sechkova and backed by some good reads about Python interfaces and metaclasses:

  1. Use standard inheritance
    Pro: An OOP-concept that most programmers know (readability remains a primary goal here)
    Con: Hard/quirky to enforce the interface
  2. Use a virtual base class with a custom metaclass
    Pro: Enforcing the interface becomes easier
    Con: A lot of magic (i.e. lesser known Python OOP-concepts) in plain sight
  3. Use abc
    Pro: Enforcing the interface becomes even easier than in 2.
    Pro/Con: Magic is somewhat happening under the hood

We all agreed that approach 2 seems the least favorable.

I advocate for approach 3, i.e. using abc, which, after briefly skimming its code, looks like a generic implementation of the hard/quirky parts it would take to enforce the interface when going for approach 1. Together with some comments it should be readable for the regular Python coder. Maybe something along the lines of...

import abc

# ABC helps to enforce strict adherence to the interface in its
# implementations. That is, a concrete FilesystemStorage can only
# be instantiated if it implements all the methods defined below.
class FilesystemStorageInterface(abc.ABC):
  @abc.abstractmethod
  def get(self, filepath): ...
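
For illustration, abc enforces this at instantiation time; a subclass that misses any abstract method cannot be created:

class IncompleteBackend(FilesystemStorageInterface):
  pass  # implements none of the abstract methods

IncompleteBackend()
# TypeError: Can't instantiate abstract class IncompleteBackend ...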

@woodruffw (Contributor)

Approach 3 makes sense to me -- I think abc is intended for exactly this use case 🙂.

  • Instances of the interface should receive all information to set up remote connections (URLs, credentials, etc.) explicitly from TUF (as opposed to loading and parsing TUF specific settings or configuration files on their own).
  • Instances of the interface should raise only exceptions of a known base class, so that TUF code can handle them independently of the type of FilesystemBackend. (see task 5 in issue description). Suggestions on how to implement this are welcome.

Agree completely! Thanks for the design foresight here. In terms of configuration, passing it at the TUF layer as another argument to create_new_repository/load_repository would work well in the context of Warehouse.

For example:

def create_new_repository(repository_directory, repository_name='default', repository_backend=FilesystemBackend, repository_configuration={}):
    backend = repository_backend(repository_directory, repository_name, configuration=repository_configuration)

or similar would work well for us.
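
For illustration, an adopter-side call might then look like this (S3Backend and its configuration keys are hypothetical, adopter-provided):

repo = create_new_repository(
    '/srv/tuf/metadata.staged',
    repository_name='default',
    repository_backend=S3Backend,
    repository_configuration={'bucket': 'warehouse-tuf', 'region': 'us-east-1'})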

joshuagl added a commit to joshuagl/securesystemslib that referenced this issue Apr 17, 2020
Implement an abstract base class (ABC) which defines an abstract interface
for storage operations, regardless of backend.

The aim is to enable securesystemslib functions to operate as normal on
local filesystems by implementing the interface for local filesystem
operations within securesystemslib, with users of the library able to
provide implementations for use with their storage backend of choice when
this is not a local filesystem, e.g. S3 buckets as used by Warehouse for
the PEP 458 implementation.

For more context see tuf issue #1009:
theupdateframework/python-tuf#1009

Signed-off-by: Joshua Lock <jlock@vmware.com>

@joshuagl (Member)

I've opened PRs to address the first four sub-tasks under metadata files:

  • Implement filesystem abstraction secure-systems-lab/securesystemslib#232 defines a StorageBackendInterface, a local filesystem implementation of that interface, LocalFilesystemBackend, and switches securesystemslib to use the abstract storage backend in securesystemslib.hash.digest_filename() and securesystemslib.util.[get_file_details|ensure_parent_dir|load_json_file]()
  • tuf Port to securesystemslib with abstract files and directories (securesystemslib PR 232) #1024 ports tuf to use the updated securesystemslib.
    • repository_lib.generate_snapshot_metadata and repository_lib.write_metadata_file now take a storage_backend parameter, which should be an object implementing StorageBackendInterface.
    • repository_tool.create_new_repository and repository_tool.load_repository both take a storage_backend parameter, an object implementing StorageBackendInterface, and pass it to the repository_tool.Repository they instantiate. The Repository object keeps the storage_backend in an attribute and passes it to the functions in repository_lib when required.

That leaves the fifth sub-task under metadata files still to address:

Revise file existence checks (os.path.{isfile,isdir,exists}) in TUF repository code, and depending on which seems less invasive, or generally better suited, either

switch to a file system agnostic EAFP-style (e.g. catching IOError), or
allow TUF to perform these checks using a customizable file backend.
(repository_lib.generate_targets_metadata, repository_lib.write_metadata_file, repository_lib._check_directory, repository_lib._delete_obsolete_metadata, repository_lib._load_top_level_metadata (**))

@joshuagl (Member)

PR #1024 completes the work defined here to support abstract files and directories in repository_lib and repository_tool.

Note that developer_tool and the updater still assume local access to files.

@joshuagl (Member)

The changes landed in #1024 were incomplete: although they document a storage_backend argument to generate_timestamp_metadata(), they miss actually adding that argument and handling it 🤦 (thanks for spotting that @MVrachev). I'll create a fix for that today.

In addition, we clearly need a test for the abstract files and directories support. Because the default case is to use the local file storage backend, which is also the only current implementation of StorageBackendInterface, we need to (a) create a new class implementing StorageBackendInterface and (b) ensure that this test class can use local files and directories while also revealing errors in the abstract files and directories handling.

Two ways we might achieve this are:

  1. mutate filenames on put()/get() so that trying to read the expected file paths from local storage doesn't find the files (unless going through the test file interface)
  2. serialise files on put() and deserialise on get() (pickle?), or otherwise make the files unreadable when stored, so that attempts to read the files without going through the test storage backend result in failures to read.

For the relatively small amount of code required, it may make sense to implement both.
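
A minimal sketch of approach 1, assuming the get()/put() methods from StorageBackendInterface (the class and prefix names are made up for illustration):

import os
import shutil

class FilenameMutatingBackend(object):
  # Test backend: every file is stored under a mangled name, so reads that
  # bypass the backend fail to find the expected file paths.
  PREFIX = 'mutated-'

  def _mutate(self, filepath):
    # Prepend a marker to the basename, keeping the directory intact.
    dirname, basename = os.path.split(filepath)
    return os.path.join(dirname, self.PREFIX + basename)

  def put(self, fileobj, filepath):
    fileobj.seek(0)
    with open(self._mutate(filepath), 'wb') as destination:
      shutil.copyfileobj(fileobj, destination)

  def get(self, filepath):
    # Only reads that go through the backend know the mangled name.
    return open(self._mutate(filepath), 'rb')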

@woodruffw (Contributor)

  1. mutate filenames on put()/get() so that trying to read the expected file paths from local storage doesn't find the files (unless going through the test file interface)

This sounds like a pretty good approach for testing! If we wanted to be really aggressive about it, we could create a redis or memcached backend that's completely isolated from the underlying filesystem, although that would definitely complicate the testing story.

@MVrachev (Collaborator)

  1. mutate filenames on put()/get() so that trying to read the expected file paths from local storage doesn't find the files (unless going through the test file interface)

I agree that this is the better solution. It sounds to me like it would be simpler and easier for newcomers to understand.

@lukpueh (Member, Author) commented Jun 12, 2020

Thanks to @joshuagl and @sechkova for implementing and adopting this feature!! ❤️ 🎉 💯

@trishankatdatadog (Member)

Thanks to @joshuagl and @sechkova for implementing and adopting this feature!! ❤️ 🎉 💯

The world is a better place with them...

woodruffw pushed a commit to trail-of-forks/securesystemslib that referenced this issue Jun 18, 2020