This is a tool for bootstrapping real-world datascience projects that are easy to understand, easy to deploy, easy to customise, and easy to maintain.
The quickstarter lets you set up a new project with the following components:
- 📚 Python library
- 📨 Service
- ⚓ Docker container
- ✨ Streamlit demo(s)
This repo also contains a few examples of datascience projects that we bootstrapped with the quickstarter:
- A zero-shot text classifier which runs out-of-the-box, with accompanying research notebooks and a streamlit demo.
pip install git+https://github.com/AYLIEN/datascience-project-quickstarter.git
After installation finishes, the following new commands will be available:
quickstart-project
quickstart-demo
To start a new project, simply run quickstart-project and you will be guided through the process.
You can also provide all required arguments directly, e.g.:
quickstart-project --path cool-project --libname cool_library
This will create a project in cool-project, including a Python package/library named cool_library.
Next, create and activate a new project-specific environment (we like miniconda):
# skip the next two lines if you prefer to create python environments in a different way
conda create -n cool-project python=3.10
conda activate cool-project
Go to the new project and install it:
cd cool-project && make dev
New projects are already set up with a mock service that receives POST requests. Back in your project directory, start the service by running:
make run
The default service includes two routes as toy examples: /reverse, which takes a text argument, and /count, which takes no arguments. Once the service is running, you can try sending requests, e.g. using
make example-request-count
make example-request-reverse
or by using the python script which shows how to send requests and receive responses as a client:
python examples/example_requests.py
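For orientation, here is a minimal sketch of what such a client could look like using only the standard library. The base URL (port 8000, borrowed from the Docker example in this README) and the JSON body shape are assumptions, and build_request and send are hypothetical helper names; examples/example_requests.py in your project remains the reference.

```python
import json
import urllib.request

# Assumption: the service listens on localhost:8000, as in the Docker example.
BASE_URL = "http://localhost:8000"


def build_request(route, payload=None):
    """Build a JSON POST request for a route such as /reverse or /count."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send the request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumption: /reverse reads its argument from a "text" field in the JSON body.
reverse_request = build_request("/reverse", {"text": "hello"})
# /count takes no arguments, so an empty JSON object is sent.
count_request = build_request("/count")
```

With the service running, calling send(reverse_request) or send(count_request) would return the decoded response.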
Deploying your service will be easy once you have a working Docker image! Run this to containerize the service implemented in the project:
# create Docker image
make build
# run container locally
docker run -p 8000:8000 --rm -it <image name>:0.1
You can interact with the containerized service in the same way as earlier, e.g. by running python examples/example_requests.py.
We begin many projects by creating a proof-of-concept in a Streamlit demo. Demos live inside a project. Simply run:
quickstart-streamlit
This will create a new demo, e.g. called cool-demo, in the demos/ subdirectory of your new data science project. Move into the new demo directory and run the demo in the browser:
cd demos/cool-demo && make run
Within the demo directory demos/cool-demo you can develop the demo, which is implemented in the script demos/cool-demo/demo.py.
You can also containerize the whole demo using Docker! Within the demo folder, simply run:
make build
The Docker image will make sharing or deploying the demo easier.
Here is a checklist to turn the new project into a fully functional tool:
- implement your project's core functionality in the Python package
- write unit tests for the key functionality in each new module in the Python package (we like TDD 😉)
- maintain dependencies in requirements.txt
- implement a demo
- implement the service
- build Docker image & make sure containerized service works afterwards (this often takes a few debugging cycles)
Let's have a closer look at how projects created by our quickstarter are built. The top-level structure of our projects usually looks like this:
<project directory>/
├── <python package name>/
├── bin/
├── Makefile
├── README.md
├── requirements.txt
├── demos/
├── research/
├── resources/
├── setup.py
├── VERSION
An overview of each component of this template follows. Let's use the zero-shot classification project in examples/aylien-zs-classifier as an example.
Data science projects differ from other software projects because they often result in both a body of exploratory research and a codebase that is used in production. Some engineering teams prefer to take prototypes from research and data-science teams and re-implement them from scratch, which is totally ok. However, we believe that it is good practice for researchers and data science teams to strive to produce code libraries that can be used in production, meaning that the code is well-tested and follows good API design principles.
Below we explain how we structure our projects to support both exploratory research and production-ready code in the same repo. We have used this simple pattern effectively in many real-world projects, ranging from research papers with accompanying codebases, to production services wrapping ML-models which handle millions of requests per day.
The research/ directory
In this directory, anything goes. The research/ directory is the home of Jupyter notebooks and other exploratory analysis tools. This directory gives us the freedom to iterate quickly and break things, while still using git to keep track of the code and to facilitate easy sharing and collaboration. Any code that is not ready for production, but that you still want to keep track of, can go into this directory.
If multiple members of the team are working on different ideas in parallel, just create multiple subdirectories in research/, such as research/GAN-graph-based-meta-reinforcement-learning/... and research/bayesian-flow-multi-horizon-hypercubes/....
We don't like to use branches for non-production code because ideas tend to get lost in unmerged branches. So we commit research code directly to the main branch, but we put it in the research/ directory.
We only create branches for production features (see below).
The Python package directory (for example: aylien_zs_classifier/)
This is where the main source code of a project lives. We generally structure each project around one Python package. In the early stages of a project, we tend to prototype new features in notebooks or scripts in the research/ directory. Once we're confident that we have something working and useful, we add it to an existing or new module of the Python package, from where it can be imported easily. For each module (.py file) in the package, we write unit tests in a file with a consistent naming convention: e.g. test_classifier.py for the module classifier.py.
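As an illustration of the naming convention, a test_classifier.py file could look roughly like the sketch below. The classify function is a hypothetical stand-in defined inline so that the example is self-contained; in a real project the test file would import the function under test from classifier.py, and the actual zero-shot classifier works differently.

```python
import unittest


def classify(text, labels):
    """Hypothetical stand-in for a function in classifier.py.

    Picks the label that shares the most words with the text.
    Ties go to the first label, because max() returns the first maximum.
    """
    words = set(text.lower().split())
    return max(labels, key=lambda label: len(words & set(label.lower().split())))


class TestClassifier(unittest.TestCase):
    def test_picks_label_with_most_overlap(self):
        predicted = classify("the football match", ["football", "politics"])
        self.assertEqual(predicted, "football")

    def test_tie_goes_to_first_label(self):
        predicted = classify("nothing in common", ["a", "b"])
        self.assertEqual(predicted, "a")
```

Tests written this way can be collected with python -m unittest from the project directory.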
Code that is added to the main Python package should be submitted in a branch, and ideally reviewed by at least one other person. In our projects, multiple review cycles are common, and we sometimes even end up moving an idea to the research/ directory if it's cool, but somehow not well-suited or relevant to the primary use case of the project.
Once the project is mature, the code in the main Python package should be ready for production, meaning that it can be integrated into a larger system, shared on PyPI, or shipped in a docker container that exposes a service.
The main Python package also requires the requirements.txt, setup.py and VERSION files. Make sure to keep the dependencies in requirements.txt updated and, depending on your deployment scenario, maintain the package version in the VERSION file.
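To show how these three files can fit together, here is a minimal sketch of two helpers a setup.py might use to read the version and the dependencies. The helper names read_version and read_requirements are hypothetical, and this is not necessarily how the setup.py generated by the quickstarter is written.

```python
from pathlib import Path


def read_version(project_dir):
    """Return the version string stored in the VERSION file."""
    return (Path(project_dir) / "VERSION").read_text(encoding="utf-8").strip()


def read_requirements(project_dir):
    """Return the entries of requirements.txt, skipping blank lines and comments."""
    text = (Path(project_dir) / "requirements.txt").read_text(encoding="utf-8")
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]
```

In a setup.py, the results could be passed to setuptools.setup via its version and install_requires arguments, so that VERSION and requirements.txt stay the single source of truth.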
The demos/ directory
This is the newest addition to our template. Over the last few years, amazing libraries like Streamlit have drastically reduced the effort required to make interactive demos of data science projects. Streamlit in particular is fast becoming an essential library for anyone building Python-based prototypes. In the demos/ directory we put self-contained demos that are expected to have their own requirements.txt and make run commands. Interactive demos are one of the main ways for data scientists to communicate their work to the rest of an organization.
Check out our example for zero-shot-classification: demos/zs-classifier-demo
The bin/ directory
This directory contains executable scripts, usually written in Python or bash. These are typically one-off data processing or shell scripts that we keep separate from the Python package modules for clarity.
The resources/ directory
We usually store any large files required in a project, such as model binaries or database-like files, in resources/. We usually add a Makefile command to obtain these resources locally from an external storage source, e.g. Google Cloud Storage, and do not track them with git.
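As a small sketch of how a download step might decide what to fetch, the helper below lists expected resource files that are not yet present locally. The function name missing_resources and the idea of an explicit list of expected files are assumptions for illustration; the actual download from external storage is left out.

```python
from pathlib import Path


def missing_resources(resources_dir, expected_files):
    """Return the expected resource files that are not present locally yet."""
    root = Path(resources_dir)
    return [name for name in expected_files if not (root / name).exists()]
```

A Makefile target could call a script built around this check and download only the files it reports.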
Check out Testing.md for instructions on testing the datascience project quickstarter, e.g. when making changes.
The datascience project quickstarter was conceived of and implemented by Demian Gholipour Ghalandari and Chris Hokamp. Aishwarya Radhakrishnan provided feedback and code review, and created the current version of the model-serving library. Many of the ideas in this template are based on John Glover's excellent approach to ml-ops and productionization of research work, in particular the use of Makefiles to expose the main entrypoints to projects.