add scipy2023 virtual poster #458

Merged: 2 commits, Jul 12, 2023
117 changes: 117 additions & 0 deletions docs/scipy-2023.md
@@ -0,0 +1,117 @@
# conda-lock SciPy 2023 virtual poster

## Abstract

[`conda-lock`](https://conda.github.io/conda-lock) is a tool designed to aid reproducible science and analysis by providing a reliable and easy-to-use means of ensuring consistent creation of computing environments.

This is not a document on how to use `conda-lock`. For that, consult the [documentation](https://conda.github.io/conda-lock/).

This document covers the design considerations as well as common usage patterns.

## Why conda?

A large number of commonly used Python libraries make extensive use of extension modules written in a different language (usually C/C++). Building these libraries has historically been challenging since Python package management tools (like pip) cannot be used to install the dependencies that native compilers need.

Conda solves this by building _both_ the native libraries and the Python libraries that make use of them. It performs a few adjustments to the compiled artifacts to ensure that the built binaries can be installed more easily without needing to recompile.

## Why conda-lock?

Conda was designed as a developer-facing tool, much like its more purely Python sibling `pip`. This means that when `conda` is used where reproducibility is required, it has a number of shortcomings that surface from time to time and can result in inconsistent execution environments.

conda-lock addresses these shortcomings by leaning on existing package management tools (conda, mamba and poetry) and on the concept of a dependency lockfile (popularized by systems like npm and cargo), allowing users to generate a lockfile that covers both conda and PyPI packages.

## Design principles

### 1. The environment created by conda-lock should be consistent across all machines on a given platform

Since conda packages are platform-specific binaries, we cannot guarantee any resolution more general than platform level.

### 2. Conda-lock should be adaptable to support alternate conda frontends provided they adhere to the same cli patterns as conda

Since conda-lock relies on subprocess-based execution, the library does not need to be altered to support additional conda solvers, provided that they implement a sufficiently compatible command line interface.
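The idea can be illustrated with a minimal sketch. The helper names (`find_frontend`, `build_command`, `run_frontend`) and the preference order are hypothetical and are not conda-lock's actual implementation. The point is only that swapping the frontend amounts to swapping the executable name while the arguments stay the same.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence

# Hypothetical preference order; any executable with a conda-compatible
# command line interface could be listed here.
FRONTENDS: Sequence[str] = ("mamba", "micromamba", "conda")


def find_frontend(candidates: Sequence[str] = FRONTENDS) -> Optional[str]:
    """Return the first candidate executable found on PATH, or None."""
    for name in candidates:
        if shutil.which(name) is not None:
            return name
    return None


def build_command(frontend: str, args: Sequence[str]) -> List[str]:
    """Prefix the shared CLI arguments with the chosen frontend executable."""
    return [frontend, *args]


def run_frontend(frontend: str, args: Sequence[str]) -> str:
    """Invoke the frontend as a subprocess and return its standard output."""
    result = subprocess.run(
        build_command(frontend, args),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Because only the first element of the argument list differs between frontends, no frontend-specific code path is needed as long as the CLI stays compatible.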

### 3. At installation time conda-lock should not be needed. Installing a locked set of dependencies should not invoke the conda solver as that would make the environment created no longer reproducible

Conda-lock makes use of some lesser-known features of conda to provide an explicit installation file that can be consumed by conda to perform a solve-less installation.

Additionally, mamba and micromamba support the conda-lock format natively and can create an environment directly from it.
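To make the explicit file less abstract: conda's explicit specification format is a plain text file containing an `@EXPLICIT` marker followed by one package URL per line, optionally with a hash after `#`. The sketch below parses such a file. The function name `parse_explicit_lock` is hypothetical, and the URLs and hashes in `EXAMPLE` are placeholders rather than real packages.

```python
from typing import Dict, List


def parse_explicit_lock(text: str) -> List[Dict[str, str]]:
    """Extract package URLs (and optional hashes) listed after @EXPLICIT."""
    packages: List[Dict[str, str]] = []
    seen_marker = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines carry no packages
        if line == "@EXPLICIT":
            seen_marker = True
            continue
        if seen_marker:
            # Everything after '#' on a package line is the hash fragment.
            url, _, digest = line.partition("#")
            packages.append({"url": url, "hash": digest})
    return packages


# Illustrative content only: the URLs and hashes below are placeholders.
EXAMPLE = """\
# platform: linux-64
@EXPLICIT
https://example.invalid/linux-64/libfoo-1.0-0.conda#0123456789abcdef0123456789abcdef
https://example.invalid/noarch/bar-2.1-py_0.conda#fedcba9876543210fedcba9876543210
"""

for package in parse_explicit_lock(EXAMPLE):
    print(package["url"])
```

Since every entry is a fully qualified URL, there is nothing left for a solver to decide at installation time.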

### 4. conda-lock should be able to solve for non-native platforms

Developers of software are frequently running different operating systems than those used by production systems.

Since the lockfile generated by conda-lock is merely a set of packages to install (and their installation order), conda-lock forces the conda frontend and poetry to perform the dependency resolution as if they were running on that foreign platform.

Whilst this does not guarantee that the versions of dependencies resolved for all platforms are identical, performing the package resolution at the same time for all target platforms drastically increases the chances of a mostly compatible set of packages.
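One mechanism conda exposes for this is the `CONDA_SUBDIR` environment variable, which overrides the platform that conda solves for. The sketch below prepares a subprocess environment per target platform. It assumes this variable is the relevant mechanism, and the helper name `env_for_platform` is hypothetical. Handling of system-level virtual packages on the foreign platform is not shown.

```python
import os
from typing import Dict, Mapping, Optional


def env_for_platform(
    platform: str, base: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment that asks conda to solve for `platform`."""
    env = dict(os.environ if base is None else base)
    env["CONDA_SUBDIR"] = platform  # e.g. "linux-64", "osx-arm64", "win-64"
    return env


# Each target platform gets its own environment mapping, which would be
# passed as `env=` to the subprocess that runs the solver.
for target in ("linux-64", "osx-arm64", "win-64"):
    env = env_for_platform(target, base={"PATH": "/usr/bin"})
    print(target, env["CONDA_SUBDIR"])
```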

### 5. Subprocessing and code-vending

Both conda's and mamba's user-facing interfaces take the form of a command-line application. This interface has remained consistent and stable throughout most of conda's existence. This stability allows tools like conda-lock to reliably invoke conda/mamba as a subprocess instead of being subject to breaking changes when making use of internal APIs not meant for end users.

For resolving Python packages from PyPI, conda-lock includes the entirety of poetry as a subpackage. Neither poetry nor pip provided a stable developer API at the time that the library was developed.

### 6. Meeting users where they are

In addition to supporting the standard `environment.yaml` format for conda environment specifications, conda-lock also supports the use of `pyproject.toml` files as these are commonly used to define environments for python software projects.

By allowing for this single-source way of defining the library dependencies, the tool reduces the effort required to maintain both conda and PyPI dependency sets for a given project.

conda-lock achieves this by leveraging the pip <-> conda-forge crosswalk that is automatically maintained by conda-forge.

#### pip <-> conda-forge crosswalk

Since conda and PyPI packages live in different namespaces, the same name cannot be guaranteed to point to the same package for conda and PyPI. This naming problem is very common across packaging ecosystems.

Conda-forge uses a dependency-graph-based heuristic for determining which conda package corresponds to a particular PyPI package. This heuristic has access to the conda package recipe.

Selection heuristics:

1. If a package has been [manually mapped](https://github.com/regro/cf-scripts/blob/master/conda_forge_tick/pypi_name_mapping_static.yaml), prefer that over all other heuristics.
2. The package recipe must contain a "source" that points to a PyPI source destination. This allows us to match conda package names to PyPI package names.
3. The package recipe must contain an "import" test section that performs the Python import associated with this package. This gives us an additional vote that the package is indeed a Python package.
4. Since many packages can declare the same "import" section, we make use of a graph measure (the HITS algorithm) that prefers packages that have fewer ancestors and more successors. This ensures that we correctly resolve the Python import `numpy` as the package `numpy` instead of `jax`, which provides an alternate implementation of the numpy API.
5. Once an import name is determined to belong to a (conda, pypi) pair, it cannot be resolved to another package. This ensures that namespaced packages that declare a common base import are handled correctly.
6. If a package is not present in the crosswalk assume that the conda and pypi names are the same.
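
The lookup precedence (rules 1 and 6) and the one-owner-per-import rule (rule 5) can be sketched as follows. This is an illustration under stated assumptions, not the cf-scripts implementation: the function names are hypothetical, the mapping tables are illustrative stand-ins rather than the contents of the real mapping files, and the graph ranking of rule 4 is assumed to have already ordered the candidates.

```python
from typing import Dict, Iterable, Tuple

# Illustrative stand-ins for the two mapping tables (PyPI name -> conda name).
MANUAL_MAPPING: Dict[str, str] = {"torch": "pytorch"}
INFERRED_MAPPING: Dict[str, str] = {"tables": "pytables"}


def pypi_to_conda_name(
    pypi_name: str,
    manual: Dict[str, str] = MANUAL_MAPPING,
    inferred: Dict[str, str] = INFERRED_MAPPING,
) -> str:
    """Resolve a PyPI name: manual mapping first, then inferred, then identity."""
    if pypi_name in manual:
        return manual[pypi_name]
    if pypi_name in inferred:
        return inferred[pypi_name]
    return pypi_name  # not in the crosswalk: assume the names are the same


def assign_imports(
    ranked: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Give each import name to the first (conda, pypi) pair that claims it.

    `ranked` yields (import_name, conda_name, pypi_name), best candidate first.
    """
    claimed: Dict[str, Tuple[str, str]] = {}
    for import_name, conda_name, pypi_name in ranked:
        # setdefault keeps the earliest claim and ignores later ones.
        claimed.setdefault(import_name, (conda_name, pypi_name))
    return claimed
```

With candidates ranked so that `numpy` precedes `jax`, the import name `numpy` stays with the `numpy` package even though both declare it.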

Whilst this set of metrics is admittedly fairly arbitrary, it exhibits enough desirable characteristics to work for a large number of environments.

For details on the implementation see the [code](https://github.com/regro/cf-scripts/blob/master/conda_forge_tick/pypi_name_mapping.py).

## Common usage patterns observed

conda-lock has been observed to have a couple of common usage patterns across open source ecosystems.

### Human refreshes

The initial use case conda-lock was designed for was to provision a number of exactly reproducible conda environments across a large fleet of worker machines. These environments would share a lot of dependencies, which meant that alternate tools like conda-constructor would impose too much of a disk-usage cost to be feasible. (By default, conda installs files by making hard links, minimizing the disk usage needed to provision a large number of environments when they share some dependencies.)

These environments are generally only updated whenever there is a need to do so.

### Consistent CI/test environments

For some continuous integration workflows conda-lock can be used as a means of generating consistent environment definitions used later on in the workflow.

These systems generally generate a lockfile dynamically whenever the workflow is invoked. When the resulting lockfile changes, this change is either committed back to the repository or propagated to downstream jobs as an artifact.

This approach allows projects with larger continuous integration processes to have a consistently reproducible test/execution environment that can be used by both developers and CI systems.
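
The "did the lockfile change" decision can be sketched as a content comparison. This is a hypothetical helper, not part of conda-lock or any particular CI system. It assumes the previously committed lockfile and the freshly generated one are both available as files.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def lockfile_changed(previous: Path, regenerated: Path) -> bool:
    """True when there is no previous lockfile or its contents differ."""
    if not previous.exists():
        return True
    return file_digest(previous) != file_digest(regenerated)
```

A workflow step could call `lockfile_changed` and only commit or upload the regenerated file when it returns `True`.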

This approach is used by [ibis](https://ibis-project.org/).

### Container building

Whilst conda-lock does not directly provide a way to build docker/oci containers, it is commonly used to aid this task.

This is generally done by performing the following steps.

1. Outside the Dockerfile

    a. Generate the lockfile

    b. Render the lockfile to a platform-specific explicit lock

2. Inside the Dockerfile

    a. `COPY` the generated lockfile into the container

    b. Install a conda environment from the explicit lock

For exact details see the excellent [article](https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html) by Uwe Korn.
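
As a rough sketch of the commands behind these steps, the helpers below assemble the argument lists involved. The helper names are hypothetical, the flags reflect the `conda-lock` and `conda` command lines as commonly documented and are worth checking against `--help` for the installed version, and the file names (`environment.yml`, `conda-lock.yml`, `conda-linux-64.lock`) are assumed defaults.

```python
from typing import List


def lock_command(source: str, platform: str) -> List[str]:
    """Step 1a: solve the environment and write the unified lockfile."""
    return ["conda-lock", "lock", "--file", source, "--platform", platform]


def render_command(platform: str, lockfile: str = "conda-lock.yml") -> List[str]:
    """Step 1b: render the lockfile to a platform-specific explicit lock."""
    return [
        "conda-lock", "render",
        "--kind", "explicit",
        "--platform", platform,
        lockfile,
    ]


def install_command(env_name: str, explicit_lock: str) -> List[str]:
    """Step 2b: create the environment from the explicit lock, with no solve."""
    return ["conda", "create", "--name", env_name, "--file", explicit_lock]


# Steps 1a and 1b run on the host; step 2b would be a RUN line in the
# Dockerfile after the explicit lock has been copied in (step 2a).
for command in (
    lock_command("environment.yml", "linux-64"),
    render_command("linux-64"),
    install_command("app", "conda-linux-64.lock"),
):
    print(" ".join(command))
```

Only `conda` (or a compatible frontend) is needed inside the image, since the explicit lock already lists every package to install.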