Improve integrations with data sources #1297

mmourafiq · 2021-04-24T08:11:50Z

mmourafiq
Apr 24, 2021
Maintainer

This discussion started on Slack and was moved to GitHub for future reference.

Original content:

These are good questions, both “is there an integration for DVC?” and “how do you want to use DVC?”. They have come up a couple of times in the past, and I have had similar discussions in demos and feedback sessions, where I shared some info with users about upcoming work.

I can share some ideas and some of the features we are planning. I think some of them will materialize in the next few months.

First, I think it depends on the kind of data: how it’s stored, how and how frequently it changes, and who changes it. But we can look at the integration from Polyaxon’s point of view, since it provides the orchestration and scheduling mechanisms that should work with different external systems and libraries.

DVC, DBT, Feast, … are all good candidates for integration, in addition to the direct access with boto, GCS, Azure client, mounted paths, git, … that we currently provide.

We provide an abstraction called connection; that’s how we integrate Polyaxon with an external system. Some systems support a versioning mechanism, like git, docker registries, and data/volumes/buckets (by calculating a hash and storing the path).
For the systems that we currently integrate with, we provide 3 mechanisms: an optional way to fetch the data automatically (an initializer), an optional way to collect some outputs automatically, and a custom container (main) that runs the user’s logic. The main container comes with tracking methods to log the version summary, plus some predefined logic (i.e. tracking the commit, calculating the hash/path, storing the image hash, …) that users can rely on, or they can provide their own summary (a JSON blob) about such versions.
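To make the "calculating a hash and storing the path" idea concrete, here is a minimal sketch in plain Python. It is not Polyaxon's implementation: `hash_path` and `version_summary` are hypothetical helper names, and the choice of SHA-256 and the shape of the JSON blob are assumptions made for illustration only.

```python
import hashlib
import json
from pathlib import Path


def hash_path(path: str) -> str:
    """Return a SHA-256 digest over the files under `path` (a file or a directory)."""
    root = Path(path)
    if root.is_file():
        files = [root]
    else:
        # Sort so the digest does not depend on filesystem listing order.
        files = sorted(p for p in root.rglob("*") if p.is_file())
    digest = hashlib.sha256()
    for file in files:
        # Mix in the relative name so that renames change the digest too.
        rel = file.name if root.is_file() else file.relative_to(root).as_posix()
        digest.update(rel.encode("utf-8"))
        digest.update(file.read_bytes())
    return digest.hexdigest()


def version_summary(connection: str, path: str) -> str:
    """Build a JSON blob describing the data version a run used (hypothetical shape)."""
    summary = {"connection": connection, "path": path, "hash": hash_path(path)}
    return json.dumps(summary, sort_keys=True)
```

A run could compute such a summary once the initializer has fetched the data and attach it to its tracked metadata, so that two runs reading the same path can be compared by hash.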

Adding an integration as an initializer for DVC, DBT, Feast, … is a matter of extending the connection schema so that the user can define the default access method when multiple options are possible. For example, for a dataset the user can use the native client or DVC, and for a database the user can use the native library or DBT.
Linking and exploring artifacts and lineage for each connection. In Polyaxon EE/Cloud we have a connections catalog metadata layer; this layer will be promoted to the same level as projects/component hub/model registry in the coming releases. Users can explore all artifacts related to a connection, e.g. all container images related to a registry, all commits used in Polyaxon related to a git connection, or all metrics referenced in an artifacts store. They can also see the runs that requested those git commits or artifact versions under a specific connection, as well as the profile of the runs that interacted with those connections (e.g. duration, resources GPU/CPU, ...).
Adding an integration to the logging system. This is actually coming rather soon; we are still thinking about how best to handle CE. The tracking code will allow specifying the connection for log_artifact_ref, log_code_ref, log_data_ref, log_file_ref, log_dir_ref, and the generic log_artifact_lineage, giving users the tools to create such rich metadata.
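As a rough illustration of the planned items above, the sketch below models a connection that declares a default access method, and a small catalog that groups logged references by connection. Everything here is hypothetical: `ConnectionSketch`, `LineageCatalogSketch`, and their fields are made-up names for this example, not Polyaxon's connection schema or tracking API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConnectionSketch:
    """Hypothetical connection entry; not Polyaxon's actual schema."""
    name: str
    kind: str                        # e.g. "dataset", "database", "git"
    access_methods: tuple            # e.g. ("native", "dvc")
    default_access: Optional[str] = None

    def resolve_access(self, requested: Optional[str] = None) -> str:
        """Pick the requested method, else the default, else the first option."""
        method = requested or self.default_access or self.access_methods[0]
        if method not in self.access_methods:
            raise ValueError(f"{method!r} is not supported by {self.name!r}")
        return method


@dataclass
class LineageCatalogSketch:
    """Hypothetical in-memory catalog that groups references by connection."""
    refs: dict = field(default_factory=lambda: defaultdict(list))

    def log_ref(self, run: str, connection: str, kind: str, version: str) -> None:
        """Record that `run` used `version` of something under `connection`."""
        self.refs[connection].append({"run": run, "kind": kind, "version": version})

    def runs_for(self, connection: str, version: str) -> list:
        """List the runs that requested a given version under a connection."""
        return [r["run"] for r in self.refs[connection] if r["version"] == version]
```

With something along these lines, a dataset connection could default to DVC while still allowing the native client per run, and the catalog could answer "which runs used this version under this connection".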
If anyone wants to discuss such features or would like to share more ideas, feel free to comment.
