Simple artifact versioning and caching for scientific workflows
- Free software: MIT license
- Documentation: https://taggedartifacts.readthedocs.io.
taggedartifacts exists to provide a simple interface for versioning functions that produce artifacts.
An "artifact" could be anything -- maybe you have some sort of ETL pipeline that writes intermediate files,
or a plotting function that writes a bunch of plots to disk, or a machine learning workflow that produces
a bunch of model files somewhere. The purpose of taggedartifacts is to allow you to write output normally
-- give your output its regular name, like plot.png -- and automatically attach git commit and
configuration information as part of the path.
The following example shows how to use taggedartifacts to tag an output file with commit and config info:

.. code-block:: python

    from taggedartifacts import Artifact

    @Artifact(keyword='outpath', config={}, allow_dirty=True)
    def save_thing(outpath):
        with open(outpath, 'w') as outf:
            outf.write('good job')

    save_thing(outpath='foo.txt')
The resulting file would be foo-<commit>-<config-hash>.txt, without having to litter string formatting
and git commit lookups throughout the code.
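For intuition, here is a minimal sketch of one way that path rewriting could be implemented. This is an
illustration, not taggedartifacts' actual implementation: it assumes the commit tag comes from git
rev-parse and the config tag from hashing a JSON dump of the config dict, and the helper name tagged_path
is hypothetical.

.. code-block:: python

    import hashlib
    import json
    import subprocess
    from pathlib import Path

    def tagged_path(path, config):
        """Hypothetical helper: splice commit and config-hash tags into a filename."""
        # Short hash of the current commit; assumes we're inside a git checkout.
        commit = subprocess.check_output(
            ['git', 'rev-parse', '--short', 'HEAD'], text=True
        ).strip()
        # Stable digest of the configuration; assumes config is JSON-serializable.
        config_hash = hashlib.sha1(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:8]
        p = Path(path)
        return p.with_name(f'{p.stem}-{commit}-{config_hash}{p.suffix}')

    print(tagged_path('foo.txt', {}))  # e.g. foo-1a2b3c4-bf21a9e8.txt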
It's really easy, once you start running a lot of experiments, to end up with a ton of output files
produced at different times with names like plot.png, plot2.png, plot-please-work.png, etc.
Later, you may want to show someone a plot, and they'll try to reproduce it, and you won't be able to
tell them the state of the code when the plot was produced. That's not great! taggedartifacts offers one
solution to this problem: you can tell at a glance whether two files were produced by the same code and
the same configuration.
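To make that concrete, here is a sketch reusing the decorator from the example above. The config values
are made up for illustration, and it's assumed (based on that example) that config accepts a plain dict.

.. code-block:: python

    from taggedartifacts import Artifact

    # Made-up config for illustration; the example above suggests config
    # accepts a plain dict.
    @Artifact(keyword='outpath', config={'smoothing': 0.1}, allow_dirty=True)
    def save_plot(outpath):
        with open(outpath, 'w') as outf:
            outf.write('plot data')

    # Writes something like plot-<commit>-<config-hash>.png; re-running at a
    # different commit or with a different config produces a visibly different
    # filename, while identical code and config reproduce the same name.
    save_plot(outpath='plot.png')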
Isn't this just another workflow library? It's not! I promise.
The workflow library ecosystem in Python already has a lot of entrants, like Luigi, Airflow, Pinball,
and probably many I haven't heard of. There are also experiment and data/code versioning systems around
like DVC, and older DAG solutions that understand how not to redo work, like make. taggedartifacts isn't
really like any of those. It isn't aware of a DAG of all of your tasks at any point, and it doesn't know
anything about data science workflows in general. It only knows about tagging some sort of file-based
output with git commit and configuration information, so that you can tell whether two artifacts,
potentially produced on different computers, should match.
As a result, you don't have to have a separate daemon running, you don't get anything like task
distribution and parallelization for free, and you don't get a special CLI. taggedartifacts only
attempts to solve one problem.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.