
Setting up the data science framework (initial setup)

Creating a virtual environment

conda create -n dqm_playground_ds python=3.8
conda activate dqm_playground_ds
conda install pip

Installing kedro

pip3 install kedro
kedro info

Creating a new project

kedro new --starter=spaceflights
> project_name: DQM Playground DS
> repo_name: dqm-playground-ds
> python_package: dqm_playground_ds

Installing dependencies

cd dqm-playground-ds
pip install -r src/requirements.txt

Setting up the data

Copy the files from EOS to the local data folder (it would be nice to use a direct call to the API in the future):

rm data/01_raw/*
cp /eos/user/x/xcoubez/SWAN_projects/ml4dqm-return/starting_data_analysis/pickles/Run_316187_ALL_clusterposition_PXLayer_* data/01_raw/.

Add to the catalog

vim conf/base/catalog.yml
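The exact entries depend on how the pickles are consumed downstream; as a rough illustration, one entry could look like the sketch below (the dataset name and file name are assumptions, and pickle.PickleDataSet is only one possible choice of type):

# conf/base/catalog.yml -- hypothetical entry, one per layer file
clusterposition_pxlayer_1:
  type: pickle.PickleDataSet
  filepath: data/01_raw/Run_316187_ALL_clusterposition_PXLayer_1.pkl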

Modifying the pipelines

Two pipelines are provided in the starter kit:

  • a data processing pipeline
  • a data science pipeline

Adding some pipelines

Two additional pipelines could be useful:

  • an optional data extraction pipeline that bypasses the command line interface by calling the website APIs directly
  • a data visualization pipeline producing time series, correlation plots, etc.

To create the new pipelines:

kedro pipeline create data_extraction
kedro pipeline create data_visualization

Starting with the data visualization pipeline

The overall idea would be to create a new task on the website (a list of runs/lumisections to develop a strategy plus a list of runs/lumisections to apply the strategy to). The data science pipeline (i.e. this repository) would then produce the predictions for each strategy and upload them back to the website. However, it would be good to produce a few plots within the pipeline to check that it ran successfully.
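As a first idea, such a visualization node could be as simple as the following minimal sketch (matplotlib-based, hypothetical names; it assumes the histograms are loaded as a pandas DataFrame with one row per lumisection and one column per histogram bin):

import matplotlib.pyplot as plt
import pandas as pd


def plot_mean_per_lumisection(histograms: pd.DataFrame) -> plt.Figure:
    """Quick sanity-check plot: mean histogram content per lumisection."""
    fig, ax = plt.subplots()
    # average over bins, giving one point per lumisection
    histograms.mean(axis=1).plot(ax=ax)
    ax.set_xlabel("Lumisection")
    ax.set_ylabel("Mean bin content")
    return fig

The function could then be wrapped in a Kedro node and its output figure written out through the catalog.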

Visualising the pipelines

Visualisation can be achieved using Kedro-Viz. In order to install kedro-viz:

pip install kedro-viz

To run kedro-viz:

kedro viz

The command will run a server on http://127.0.0.1:4141. To run kedro-viz on lxplus with the no-browser option, add the following lines to .ssh/config on your personal computer:

Host lxplus*
    HostName lxplus.cern.ch
    User your_username
    ForwardX11 yes
    ForwardAgent yes
    ForwardX11Trusted yes

Host *_f
    LocalForward localhost:4141 localhost:4141
    ExitOnForwardFailure yes

Then log in to lxplus, move to the top of the Kedro repository, and run:

ssh lxplus_f

cd project

kedro viz --no-browser

Adding Continuous Integration

Continuous Integration is added using GitHub Actions. In order to keep one catalog with the full EOS paths and a separate one for CI, a new conf folder is created under conf/ci and used in the CI workflow:

kedro run --env=ci
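A rough sketch of the corresponding workflow is given below (file name, action versions and job layout are assumptions, not the actual workflow in the repository):

# .github/workflows/ci.yml -- hypothetical workflow
name: CI
on: [push, pull_request]
jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - run: pip install -r src/requirements.txt
      - run: kedro run --env=ci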

Making project into a Docker image

Instructions on how to create a Docker image can be found here. Lxplus doesn't allow the use of Docker, which could be related to this. Two solutions are therefore available to turn the project into a Docker image:

Starting from a local copy

Install Docker Desktop (for Mac in my case); the daemon runs by default.

Install kedro-docker:

pip install kedro-docker

Initialize the files and build the image:

kedro docker init
kedro docker build

In case an error appears during the build, try logging out of Docker and back in.

[Optional] Analyse the image using Dive:

kedro docker dive

Check that the image has been created and inspect the size of each layer (to learn more, head over to here):

docker images
docker history dqm-playground-ds:latest

Using the Python buster base image leads to very large layers for the pip install step:

<missing>      2 hours ago   RUN /bin/sh -c pip install -r /tmp/requireme…   1.45GB    buildkit.dockerfile.v0
<missing>      2 hours ago   COPY src/requirements.txt /tmp/requirements.…   587B      buildkit.dockerfile.v0

Moving to slim (kedro docker build --base-image="python:3.8-slim") doesn't change the situation... The Dive report shows a 22 MB gain from removing some files from the layers; this could be done in the future but is negligible with respect to the current image size.

Upload to a registry (Docker Hub for now)

docker tag dqm-playground-ds <DockerID:xavier2c>/dqm-playground-ds
docker push <DockerID:xavier2c>/dqm-playground-ds

To make the image creation automatic via GitHub Actions, a new workflow is created that is triggered when the tests pass. The image is pushed to Docker Hub and can be found here.
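For illustration only, such a workflow could look like the following sketch (file name, trigger, secrets and action versions are assumptions, not the repository's actual workflow):

# .github/workflows/docker.yml -- hypothetical workflow
name: Docker image
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]
jobs:
  build-and-push:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v3
        with:
          context: .
          push: true
          tags: xavier2c/dqm-playground-ds:latest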

Once done, the Docker image can be pulled by OpenShift; the only remaining task is then to add EOS storage as a volume to allow the IO. Instructions on how to do so can be found here.

Adding extraction pipeline

The extraction pipeline aims at getting data through the website API. The API is protected and requires token authentication via headers. Example code to access the RunHistograms information can be found below:

import requests

# query the protected RunHistograms endpoint with token authentication
endpoint = "https://ml4dqm-playground.web.cern.ch/api/run_histograms/"

response = requests.get(endpoint, headers={'Authorization': 'Token <token>'})

print(response.text)

The goal is to use Kedro's APIDataSet extra dataset to load the data. Unfortunately, it does not support providing credentials through the headers. After a discussion on the Kedro Discord channel, the easiest solution was chosen: creating a TunedAPIDataSet which allows passing the credentials into the headers.
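The actual implementation lives in this repository; the following is only a minimal sketch of what such a custom dataset could look like, assuming the token is passed through Kedro's credentials mechanism (class and key names follow the text, the rest is assumed):

from typing import Any, Dict

import pandas as pd
import requests
from kedro.io import AbstractDataSet


class TunedAPIDataSet(AbstractDataSet):
    """APIDataSet-like dataset that injects a token into the request headers."""

    def __init__(self, url: str, credentials: Dict[str, str]):
        self._url = url
        self._token = credentials["dqm_playground_token"]

    def _load(self) -> pd.DataFrame:
        response = requests.get(
            self._url, headers={"Authorization": f"Token {self._token}"}
        )
        response.raise_for_status()
        # the endpoint returns JSON; a DataFrame is convenient for downstream nodes
        return pd.DataFrame(response.json())

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> Dict[str, Any]:
        return {"url": self._url}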

For the CI workflow to keep running, a dummy credential is created inside the ci configuration to provide a fake dqm_playground_token.
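For illustration, the ci configuration could contain something like the following (file path and credential name are assumptions based on the text):

# conf/ci/credentials.yml -- hypothetical dummy credential
dqm_playground_api:
  dqm_playground_token: dummy_token_for_ci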

Creating an Argo workflow

Instructions on how to create an Argo workflow (with the aim of deploying to OpenShift) can be found here.
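As a rough illustration, a minimal Argo workflow running the pipeline from the Docker Hub image could look like the sketch below (all names and the run command are assumptions):

# argo-workflow.yml -- hypothetical minimal workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dqm-playground-ds-
spec:
  entrypoint: run-pipeline
  templates:
    - name: run-pipeline
      container:
        image: xavier2c/dqm-playground-ds:latest
        command: ["kedro"]
        args: ["run", "--env=ci"]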