Came across a pistachio dataset for classification. The intent here is to use it to develop an ML pipeline (Kubeflow / Vertex AI) and deploy it on GCP.
Will develop a simple model in a JupyterLab notebook first, and use that as a starting point for pipeline development.
Pistachio Image Dataset downloaded from Kaggle here.
Will use the 16-feature version, which contains 1718 records across two pistachio types.
pandera for schema/data validation
installing packages into image: just use pip
- base image for all python functionality
- python scripts to handle arguments for each component definition
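A sketch of the base image Dockerfile — the package list and paths are assumptions; the `ENTRYPOINT` is what lets a component script be passed as the container command:

```dockerfile
# base image sketch -- versions unpinned here; pin them for real builds
FROM python:3.11-slim

# just use pip; all components share one set of dependencies
RUN pip install --no-cache-dir \
    pandas pyarrow scipy scikit-learn xgboost pandera

# component scripts handle their own arguments via argparse
COPY components/*.py /app/
WORKDIR /app

# so `docker run pistachio_base:0.0.1 load_data.py <args>` works
ENTRYPOINT ["python"]
```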
Things to work out:
- some sort of config: project, storage, artifact locations, etc.
- build images locally and push to Artifact Registry, vs. Cloud Build
- component definitions, including image location
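The config could be as simple as a YAML file that the build scripts and component definitions both read; every value below is a placeholder:

```yaml
# pipeline config sketch -- all values are placeholders
project_id: my-gcp-project
region: us-central1
pipeline_root: gs://my-bucket/pistachio/pipeline-root   # storage/artifact location
artifact_registry: us-central1-docker.pkg.dev/my-gcp-project/pistachio
base_image_tag: pistachio_base:0.0.1
```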
test load_data
docker run -v ./pipeline/data:/data pistachio_base:0.0.1 load_data.py /data/Pistachio_16_Features_Dataset.arff /data/pistachio_imagetest_train.pqt /data/pistachio_imagetest_test.pqt
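A sketch of what load_data.py might look like — the `Class` column name, split ratio, and bytes-decoding details are assumptions about the ARFF file:

```python
# load_data.py -- sketch: reads an ARFF file, writes train/test parquet files
import argparse
import sys

import pandas as pd
from scipy.io import arff
from sklearn.model_selection import train_test_split


def load_arff(path: str) -> pd.DataFrame:
    data, _meta = arff.loadarff(path)
    df = pd.DataFrame(data)
    # nominal ARFF columns come back as bytes; decode them to str
    for col in df.select_dtypes(object):
        df[col] = df[col].str.decode("utf-8")
    return df


def split_and_write(df, train_path, test_path, test_size=0.2, seed=42):
    train, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df["Class"]
    )
    train.to_parquet(train_path)
    test.to_parquet(test_path)


def main(argv=None):
    p = argparse.ArgumentParser(description="ARFF -> train/test parquet")
    p.add_argument("input_arff")
    p.add_argument("train_output")
    p.add_argument("test_output")
    args = p.parse_args(argv)
    split_and_write(load_arff(args.input_arff), args.train_output, args.test_output)


# guard on argv so the module can also be imported/tested without CLI args
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```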
test validate_data
docker run -v ./pipeline/data:/data pistachio_base:0.0.1 validate_data.py /data/pistachio_imagetest_train.pqt /data/pistachio_schema.json
- kfp has a local runner / Docker setup for testing components; look at this instead of test_images.sh
- XGBoost warnings: disable them in the container code with verbosity=0 or a similar flag