conda create -n dqm_playground_ds python=3.8
conda activate dqm_playground_ds
conda install pip
pip3 install kedro
kedro info
kedro new --starter=spaceflights
> project_name: DQM Playground DS
> repo_name: dqm-playground-ds
> python_package: dqm_playground_ds
cd dqm-playground-ds
pip install -r src/requirements.txt
Copy the files from eos to the local data folder (it would be nice to use a direct call to the API in the future):
rm data/01_raw/*
cp /eos/user/x/xcoubez/SWAN_projects/ml4dqm-return/starting_data_analysis/pickles/Run_316187_ALL_clusterposition_PXLayer_* data/01_raw/.
Add the new files to the catalog:
vim conf/base/catalog.yml
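For reference, an entry for one of the raw pickle files could look roughly like the sketch below; the dataset name and exact file name are illustrative, not the real ones.
# conf/base/catalog.yml (illustrative entry)
clusterposition_pxlayer_1:
  type: pickle.PickleDataSet
  filepath: data/01_raw/Run_316187_ALL_clusterposition_PXLayer_1.pkl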
Two pipelines are provided in the starter kit:
- a data processing pipeline
- a data science pipeline
Two additional pipelines could be useful:
- an optional data extraction pipeline, allowing the command-line interface to be bypassed by using the website APIs directly
- a data visualization pipeline producing time series, correlation plots...
In order to create the new pipelines:
kedro pipeline create data_extraction
kedro pipeline create data_visualization
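Each kedro pipeline create call generates a skeleton under src/dqm_playground_ds/pipelines/<pipeline_name>/ with (nearly) empty nodes.py and pipeline.py files. Filled in, the pipeline definition looks roughly like the sketch below; the node, input and output names are placeholders, not the real ones.
# src/dqm_playground_ds/pipelines/data_visualization/pipeline.py (sketch)
from kedro.pipeline import Pipeline, node

from .nodes import plot_cluster_positions  # hypothetical node, sketched further below


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=plot_cluster_positions,
                inputs="run_histograms",          # placeholder dataset name
                outputs="cluster_position_plot",  # placeholder dataset name
                name="plot_cluster_positions_node",
            ),
        ]
    )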
Starting with the data visualization pipeline
The overall idea would be to create a new task on the website (a list of runs/lumisections used to develop a strategy plus a list of runs/lumisections on which to apply it). The data science pipeline (i.e. this repository) would then produce the predictions for each strategy and re-upload them to the website. However, it would be good to produce a few plots within the pipeline to check that it ran successfully.
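As a hypothetical example of such a check plot, the node below draws a simple time series from the input DataFrame; the column names are assumptions, not the real schema.
# src/dqm_playground_ds/pipelines/data_visualization/nodes.py (sketch)
import matplotlib.pyplot as plt
import pandas as pd


def plot_cluster_positions(run_histograms: pd.DataFrame) -> plt.Figure:
    """Quick sanity-check plot: histogram mean as a function of lumisection."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(run_histograms["lumisection"], run_histograms["mean"], ".", label="PXLayer 1")
    ax.set_xlabel("Lumisection")
    ax.set_ylabel("Cluster position mean")
    ax.legend()
    return fig
The returned figure could then be written out through the catalog, for instance with a matplotlib.MatplotlibWriter entry, so that the plot ends up under data/08_reporting.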
Visualisation can be achieved using Kedro-Viz. In order to install kedro-viz:
pip install kedro-viz
To run kedro-viz:
kedro viz
The command will run a server on http://127.0.0.1:4141. To run kedro-viz on lxplus with a no-browser option, edit .ssh/config on your personal computer with the following lines:
Host lxplus*
HostName lxplus.cern.ch
User your_username
ForwardX11 yes
ForwardAgent yes
ForwardX11Trusted yes
Host *_f
LocalForward localhost:4141 localhost:4141
ExitOnForwardFailure yes
Then log in to lxplus, move to the top of the kedro repository and run the commands:
ssh lxplus_f
cd project
kedro viz --no-browser
Continuous Integration is added using GitHub Actions. In order to keep a catalog with the full eos path and another one for CI, a new configuration folder is created under conf/ci and used in the CI workflow:
kedro run --env=ci
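Concretely, conf/ci can hold a catalog with the same dataset names as conf/base but pointing at small local test files instead of eos, for example (names illustrative):
# conf/ci/catalog.yml (illustrative entry for the CI environment)
clusterposition_pxlayer_1:
  type: pickle.PickleDataSet
  filepath: data/01_raw/test_clusterposition_PXLayer_1.pkl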
Instructions on how to create a Docker image can be found here. Lxplus doesn't allow the use of Docker, which could be related to this. Two solutions are therefore available to build the image:
- make a local copy
- create image using Github Actions
Starting from a local copy
Install Docker Desktop (for Mac in my case); the daemon runs by default.
Install kedro-docker:
pip install kedro-docker
Initialize the files and build the image:
kedro docker init
kedro docker build
In case an error appears during the build, try logging out of Docker and back in.
[Optional] Analyse the image using Dive:
kedro docker dive
Check that the image has been created and inspect the size of each layer; to learn more, head over here.
docker images
docker history dqm-playground-ds:latest
Using the Python buster base image leads to very large layers for the pip install step:
<missing> 2 hours ago RUN /bin/sh -c pip install -r /tmp/requireme… 1.45GB buildkit.dockerfile.v0
<missing> 2 hours ago COPY src/requirements.txt /tmp/requirements.… 587B buildkit.dockerfile.v0
Moving to slim (kedro docker build --base-image="python:3.8-slim") doesn't change the situation... The Dive report shows a 22MB gain from removing some files from the layers; this could be done in the future but is negligible with respect to the current image size.
Upload to a registry (Docker Hub for now)
docker tag dqm-playground-ds <DockerID:xavier2c>/dqm-playground-ds
docker push <DockerID:xavier2c>/dqm-playground-ds
In order to make the image creation automatic via GitHub Actions, a new workflow is created and triggered when the tests pass successfully. The image is pushed to Docker Hub and can be found here.
Once done, the Docker image can be pulled by OpenShift; the only remaining task is then to add eos storage as a volume in order to allow I/O. Instructions on how to do so can be found here.
The extraction pipeline aims at getting data using the website API. The API is protected and requires token authentication via headers. Example code to access the RunHistograms information can be found below:
import requests
# replace <token> with a valid API token
endpoint = "https://ml4dqm-playground.web.cern.ch/api/run_histograms/"
response = requests.get(endpoint, headers={'Authorization': 'Token <token>'})
print(response.text)
The goal is to use the APIDataSet extra dataset from Kedro to load the data. Unfortunately, providing credentials through the headers is not supported. After a discussion on the Kedro Discord channel, the easiest solution was chosen: creating a TunedAPIDataSet which allows passing the credentials through the headers.
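A minimal sketch of what such a dataset could look like is shown below; this is an illustration, not the exact implementation, and it assumes the token arrives through the catalog credentials.
# Sketch of a token-aware API dataset, assuming credentials = {"token": "..."}
from typing import Any, Dict

import requests
from kedro.io import AbstractDataSet


class TunedAPIDataSet(AbstractDataSet):
    """Read-only API dataset that injects a token into the request headers."""

    def __init__(self, url: str, credentials: Dict[str, str] = None, timeout: int = 60):
        self._url = url
        self._timeout = timeout
        token = (credentials or {}).get("token", "")
        self._headers = {"Authorization": f"Token {token}"}

    def _load(self) -> requests.Response:
        response = requests.get(self._url, headers=self._headers, timeout=self._timeout)
        response.raise_for_status()
        return response

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> Dict[str, Any]:
        return {"url": self._url}
The catalog entry then references a credentials key (e.g. dqm_playground_token defined in conf/local/credentials.yml), which Kedro resolves and passes to the dataset constructor.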
In order for the CI workflow to keep running, a dummy credential is created inside the ci configuration in order to provide a fake dqm_playground_token.
Instructions on how to proceed to create an Argo workflow (with the aim of deploying to Openshift) can be found here.