Bunnies

Bunnies is a Python API for writing scalable and reproducible scientific workflows/pipelines. It shares many ideas with other data-driven pipeline frameworks such as Snakemake, Nextflow, and Luigi, but strives to achieve a far higher level of reproducibility. It is in the early stages of development, but it has already been used successfully to run bioinformatics pipelines on AWS.

Bunnies captures a snapshot of all the information involved, directly or indirectly, in the creation of a data file. This is necessary for reproducibility. It records the software versions (in the form of container images), scripts (git commits), data inputs (digests), and application-specific parameters (JSON config) involved at all stages of the pipeline.

Most existing frameworks are content with the above definition of reproducibility. The main objective of a "reproducible" pipeline is to allow multiple users of the pipeline to produce the same result (be it a file, a report, or a verdict).

We have found that in many cases this is insufficient. Bunnies aims to address several typical shortcomings of reproducible frameworks:

  • Detecting changes Pipelines that involve many core-years of computation can rarely be run from start to end in one go -- there is an iterative aspect to their development. It is sometimes necessary to change parameters (or software) along the way. A good reproducible framework should detect a change and precisely determine whether any existing result is affected by it (and needs to be regenerated). Bunnies gives you that choice. Bunnies also allows multiple versions of the same data result to co-exist.

  • Data provenance At some point, the data generated by a pipeline inevitably leaves the framework: it is moved to a different service or storage, shared with other teams, or backed up. It then becomes important for anyone working on a data file to be able to determine exactly how it was produced, without any guesswork. This allows others downstream to use the data with confidence, and in our opinion it is the only way to achieve scientific results with fidelity.

  • Reusing existing results safely If a pipeline produces reproducible results, it must be possible for two people working on the same pipeline to share their partial results. Bunnies not only makes results reproducible, it also lets users take advantage of pre-existing results that satisfy exactly their own parameters. Objects generated by Bunnies have a predictable, unique id, which allows storage caches to be used aggressively and safely. There is no need to re-compute an object which has previously been computed with the exact same set of parameters. Two different users can also run pipelines with overlapping steps without stepping on each other's toes or corrupting each other's data.

  • Predicting costs A reproducible pipeline should be able to give you a hint about costs. Bunnies aims to record the time, resource usage, and costs associated with delivering every data object. It can provide cost and time forecasts for obtaining a new result, and it can attribute a dollar value to each piece of data previously generated. Users can use this information to make an informed decision about whether to store an object long-term or regenerate it on demand.

Bunnies pipelines

Bunnies allows users to express data-driven pipelines as a graph that you assemble using plain Python 3 objects. There is no new language to learn if you already know Python. Conceptually, at a high level, you build a graph of dependencies using transformation objects, then make a call to compile a pipeline that will generate one or more objects of your choice in that graph. It works similarly to a Makefile (or SCons), except that Bunnies works at the granularity of objects rather than files -- each transformation can generate one or more files.
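
As a purely illustrative sketch -- plain Python, not the actual Bunnies API (the Transform class and the "align"/"merge" names below are made up) -- a dependency graph of transformation objects could look like this:

    # Illustration only: a toy dependency graph of "transformation" objects.
    # The real Bunnies API differs; Transform, "align", and "merge" below are
    # hypothetical stand-ins.

    class Transform:
        def __init__(self, name, inputs=(), params=None):
            self.name = name
            self.inputs = list(inputs)   # upstream Transform objects
            self.params = params or {}   # application-specific parameters

        def run(self):
            # resolve upstream dependencies first, then "produce" this node
            resolved = [dep.run() for dep in self.inputs]
            return {"name": self.name, "params": self.params, "inputs": resolved}

    # assemble the graph with plain Python objects
    sample_a = Transform("align", params={"sample": "A", "ref": "genome.fa"})
    sample_b = Transform("align", params={"sample": "B", "ref": "genome.fa"})
    merged = Transform("merge", inputs=[sample_a, sample_b], params={"min_qual": 30})

    # "compiling" the pipeline amounts to walking the graph from the target node
    print(merged.run())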

For each file generated by a transformation step, Bunnies fully captures the inputs, the software, and the software parameters which generated the file. This is recorded in a structured manifest document for each node of the graph.

One requirement for creating new Bunnies transformations is that it should be possible to determine a set of parameters that, when presented again to the transformation, will generate an equivalent output. The default list of parameters includes the Docker image id, the inputs (and their parameters, recursively), and the parameters/flags for the transformation.

Running the transformation with new parameters or inputs will trigger the transformation to run again. Running with the same parameters, however, will reuse a previous result for that transformation, if one is available.
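
The caching behaviour can be illustrated with a small, self-contained sketch (this is not the Bunnies implementation; the function names and the in-memory cache are stand-ins for illustration): hash the Docker image id, the input ids, and the parameters into a canonical id, and reuse any result already stored under that id.

    import hashlib
    import json

    def canonical_id(image_id, input_ids, params):
        # Serialize the identifying information deterministically, then hash it.
        # Identical image + inputs + parameters always yield the same id.
        doc = {"image": image_id, "inputs": sorted(input_ids), "params": params}
        blob = json.dumps(doc, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    cache = {}  # stand-in for a persistent result store (e.g. an S3 prefix)

    def run_or_reuse(image_id, input_ids, params, compute):
        key = canonical_id(image_id, input_ids, params)
        if key not in cache:        # new parameters: run the transformation
            cache[key] = compute()
        return cache[key]           # same parameters: reuse the previous result

    r1 = run_or_reuse("sha256:abc", ["in1", "in2"], {"q": 30}, lambda: "output-v1")
    r2 = run_or_reuse("sha256:abc", ["in1", "in2"], {"q": 30}, lambda: "output-v1")
    assert r1 == r2  # the second call hit the cache; nothing was recomputed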

For each output generated, Bunnies records provenance:

  • the collection of software used to generate the file, in the form of git commits and Docker image ids;
  • the full list of parameters configuring the transformation;
  • the input files that were fed to the transformation.
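
As an illustration, a manifest for a single output might contain fields along these lines (the field names and values here are made up, not the exact Bunnies schema):

    # Hypothetical manifest contents, shown as a Python dict for illustration.
    manifest = {
        "output": "merged.vcf.gz",
        "software": {
            "git_commit": "0a1b2c3d",            # script version
            "docker_image": "sha256:9f8e7d",     # container image id
        },
        "parameters": {"min_qual": 30},          # transformation flags
        "inputs": [
            {"name": "sampleA.bam", "digest": "sha256:1111"},
            {"name": "sampleB.bam", "digest": "sha256:2222"},
        ],
    }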

Installation

  1. Create a virtualenv:

    virtualenv -p python3 --prompt="(bunnies) " .venv
    

    Note: if you don't have virtualenv, you can install it first with pip install virtualenv, or use the built-in module in Python 3, i.e. python3 -m venv .venv

    Note: If you're on an older distribution with a different default python3, and/or you don't have root access to install packages, you can bootstrap a suitable Python as a regular user with conda:

         ~/miniconda3/bin/conda create -n python36 python=3.6
         source ~/miniconda3/bin/activate python36
    
         # inside the conda environment, you have python3.6
         (python36) $ pip install --upgrade pip
         (python36) $ pip install virtualenv
    
         # create a python3.6 virtual env
         (python36) $ virtualenv --prompt="(bunnies) " -p python3.6 .venv
    
         # from this point on you no longer need the conda environment.
         # a copy of the python3.6 runtime was added to the .venv virtualenv
         # folder
         source ~/miniconda3/bin/deactivate
    
  2. Activate the environment:

    source .venv/bin/activate
    
  3. Install Python dependencies (includes the awscli tools):

    # optional, but recommended before you install deps:
    pip install --upgrade pip
    
    # platform dependencies
    pip install -r requirements.txt
    
  4. Configure your AWS credentials. This is detailed elsewhere, but here's one way:

    mkdir -p ~/.aws
    

    Add a section to ~/.aws/config:

    [profile reprod]
    region=us-west-2
    output=json
    

    Update your credentials file ~/.aws/credentials (section header syntax differs from config file):

    [reprod]
    aws_access_key_id=AKIAIOSFODNN7EXAMPLE
    aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    

    It's a good idea to chmod go-rw ~/.aws/credentials too.

    Note: The region you pick here should be one where FARGATE is supported.

Setup

While you're working on reprod, you may wish to export the AWS_PROFILE environment variable to pick the desired profile. If this is not customized, the AWS CLI tools will use the default profile.

   export AWS_PROFILE=reprod

Resources

To get started, a few platform resources need to be created and configured in your AWS account.

  • IAM roles and permissions for S3, EC2, ECS.
  • S3 buckets
  • EC2 VPCs
  • API Gateway pipeline management endpoints.

More resources will be generated when the pipeline definitions are converted into AWS concepts:

  • Lambdas
  • ECS Tasks
  • S3 Buckets to store temporary data

These resources are created using the scripts provided in ./scripts/. FIXME provide more detailed description.

  • ./scripts/setup-lambda.sh creates roles with permissions for platform-created lambdas. Creates ./lambda-settings.json.

  • ./scripts/setup-network.sh creates network configuration usable by tasks. Outputs the created ids in ./network-settings.json.

  • ./scripts/setup-tasks.sh creates task configuration based on available tasks. It currently uses mostly hardcoded values, sufficient to drive the example. The created entities are saved in cluster-settings.json.

  • You will need to create ./storage-settings.json with the names of the buckets you intend to use for temporary and build storage. Example contents:

    {
      "storage": {
        "tmp_bucket": "reprod-temp-bucket",
        "build_bucket": "reprod-build-bucket"
      }
    }

  • ./scripts/setup-key-pair.sh creates the key pair that will be associated with the new instances. This is the key to use to ssh into the created VMs or containers. Outputs ./key-pair-settings.json and a key-pair.pem private key.

  • python -m bunnies.environment setup will create the Amazon roles and permissions necessary for scheduling instances and submitting jobs in the context of a compute environment.
