This is the code used to run simulations associated with our Mix&Match paper.
Please cite the paper if this code is used in any publication.
Code to set up the infrastructure to run these experiments in Google Cloud can be found in the other project folder.
The main entrypoint to run an experiment is `run_single_experiment.py`. This script can be run locally, or on k8s using one of the scripts in the `experiment_running` folder.
The scripts in the `experiment_running` folder demonstrate exactly how the experiments were run to generate the results for the paper.
Here is how the scripts correspond to our experiments/datasets:
- `experiment_running/allstate_aimk_alt_newfeats_alt2.sh`: Allstate experiment corresponding to Figures 1a & 2 and to dataset `neurips_datasets/allstate_fig1aand2.p`
- `experiment_running/wine_new_country_alt11_no_price_quantiles.sh`: Wine experiment corresponding to Figure 3 and to dataset `neurips_datasets/wine_fig3.p`
- `experiment_running/wine_ppq_even_smaller_us.sh`: Wine experiment corresponding to Figure 4 and to dataset `neurips_datasets/wine_fig4.p`
- `experiment_running/mnist.p`: Amazon experiment corresponding to Figure 5 and to dataset `neurips_datasets/mnist_fig5.p`
- `experiment_running/mnist_diff_test.p`: Amazon experiment corresponding to Figure 6 and to dataset `neurips_datasets/mnist_fig6.p`
- `experiment_running/mnist_slight_shift.p`: Amazon experiment corresponding to Figures 1b and 7 and to dataset `neurips_datasets/mnist_fig1band7.p`
The datasets currently supported are those used by the scripts listed above.
I have run these scripts on a MacBook Pro with a 2.7 GHz Intel Core i7 and 16 GB of RAM to launch the experiments on a Kubernetes cluster (created by the other included project) running in Google Cloud on n1-standard-4 servers with 40 GB of disk space.
If you would like to run the scripts as I have them set up (i.e., run them in a Kubernetes cluster), then you'll need to have the following environment variables set:
- `GCLOUD_DATASET_BUCKET`: The base bucket name where datasets and experiment results will be stored, e.g. `gs://your-bucket-name`.
- `GCLOUD_PROJECT`: The project name associated with all Google Cloud resources you're using. Note that this can be different from your Google Cloud project name.
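A minimal sketch of this environment setup; the values below are placeholders, not real resources:

```bash
# Placeholder values -- replace with your own bucket and project.
export GCLOUD_DATASET_BUCKET=gs://your-bucket-name
export GCLOUD_PROJECT=your-gcloud-project
```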
The infrastructure for running all code in this project can be created by following the setup instructions in the other project folder.
Running `run_single_experiment.py` with no arguments will describe the available argument settings. The Python requirements are provided in `requirements.txt`.
Note that I use `python:3.7.3` for the Docker containers that run these scripts, so if you're running locally, you should also be using Python 3.
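A minimal sketch for running locally, assuming a Python 3 interpreter is available as `python3` with `pip3` as its package manager:

```bash
# Install the Python dependencies for this repo.
pip3 install -r requirements.txt

# With no arguments, the entrypoint describes the available argument settings.
python3 run_single_experiment.py
```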
The codebase is designed to easily create a new custom dataset from the original Kaggle datasets (e.g., make a custom split, transform features, etc.). The following is not necessary if you simply want to use the datasets from the `neurips_datasets` folder. However, if you'd like to make a new dataset, follow these instructions.
To create a new dataset (e.g., an Allstate dataset), take the following steps (a command-line sketch follows the list):

- Run one of the `transform_...` scripts in the `/dataset_transformation` folder. This script will download the dataset from the original source (e.g., Kaggle) and perform the feature transformation used for the experiments in our paper. It will not, however, split the dataset into K sources. Note that these scripts call the `transform_<DATA SOURCE>.py` Python script; you can also just run this script directly.
- Use the output dataset `csv` of the `transform_...` script as an input to the dataset creation scripts, which can be found in the `dataset_creation` folder, each named like `create_...`. This script will split the dataset in the desired proportions and pickle the dataset in a format ready to use by the `experiment_running` scripts. Note that these scripts all call the `create.py` Python script; you can also just run this script directly.
- Optionally, use the output of a `dataset_creation` script as an input to the `dataset_postprocessing` script. Currently, the only functionality of these scripts is to compute importance weights for the training data, to be used downstream for importance-weighted ERM.
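A command-line sketch of the three steps above for an Allstate-style dataset. The specific script names below are hypothetical placeholders; use the actual `transform_...`, `create_...`, and postprocessing scripts in the respective folders:

```bash
# 1. Download the raw Kaggle data and apply the paper's feature transformation
#    (does NOT split the data into K sources). Hypothetical script name.
bash dataset_transformation/transform_allstate.sh

# 2. Split the transformed csv into the desired proportions and pickle it in the
#    format expected by the experiment_running scripts. Hypothetical script name.
bash dataset_creation/create_allstate.sh

# 3. (Optional) compute importance weights for the training data, for use by
#    downstream importance-weighted ERM. Hypothetical script name.
bash dataset_postprocessing/postprocess_allstate.sh
```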
Hyperparameter tuning was performed in the Katib framework, using the scripts in the `hyperparameter_tuning` folder. Running these scripts requires a running Kubernetes cluster; alternatively, you can simply run the Python scripts that these scripts invoke. Note that these scripts also call the `run_single_experiment.py` script, using the validation loss instead of the test loss as the metric to report.
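For example, one of the Katib tuning jobs could be launched along these lines (the script name below is a hypothetical placeholder; see the `hyperparameter_tuning` folder for the actual `katib_experiment_*.sh` scripts, which require a running Kubernetes cluster):

```bash
# Hypothetical script name -- substitute an actual katib_experiment_*.sh script.
bash hyperparameter_tuning/katib_experiment_allstate.sh
```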
All experiments for this paper were run inside Docker containers so that the experiments can easily be run by others. The Docker container packaging the code from this repo to run experiments and create datasets can be created either by running the `./build_docker.sh` script, or by setting up a Jenkins project and running the `Jenkinsfile`. Note that the `project` variable in the Jenkinsfile should be changed to point to your Google Cloud project ID.
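A sketch of building and publishing the image, assuming Docker and the gcloud SDK are installed and authenticated; the image name and tag below are hypothetical placeholders (use whatever `build_docker.sh` or the Jenkinsfile actually tags):

```bash
# Build the image that packages this repo's code.
./build_docker.sh

# Publish to the Google Container Registry (hypothetical image name/tag).
docker push gcr.io/${GCLOUD_PROJECT}/mix-and-match:latest
```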
After the container has been published (to the Google Container Registry), the experiments can be run using one of the scripts in the `experiment_running` folder. Hyperparameter tuning is performed using Katib by running the `katib_experiment_*.sh` scripts, and testing experiments can be run using an `experiment_running` script.
Experiments can be visualized using the code in `ExamineGraphs.ipynb`. A convenience script (and instructions for running it) that can be used to spin up a Jupyter server and run this notebook can be found in the other project folder.
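Alternatively, the notebook can be opened with a stock local Jupyter install (assuming the Python requirements above are installed; Jupyter itself may not be in `requirements.txt`):

```bash
# Install Jupyter Notebook and open the analysis notebook locally.
pip3 install notebook
jupyter notebook ExamineGraphs.ipynb
```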
The outputs of running the experiments in the paper are saved in `datasets_in_paper/experiment_running`. These are the results that can be loaded into the Jupyter notebook (`ExamineGraphs.ipynb`) to generate all of the plots. Note that there are scripts in the sister infrastructure project to generate a Jupyter notebook with these files.
Also, note that the `datasets_in_paper/transformed` folder contains all of the pre-processed datasets used for the paper, before they were split.
Code dealing with data lives in the `datasets/` dir, and code dealing with experiment running and defining the tree search data structures lives in the `mf_tree/` dir. Entrypoint code is in the base dir of this project.
This project is licensed under the terms of the Apache 2.0 License.