This repository contains the source code for the article "Towards Feature Selection for Ranking and Classification Exploiting Quantum Annealers", published at SIGIR 2022. See the website of our quantum computing group for more information on our team and work.
Here we explain how to install the dependencies, set up the connection to the D-Wave Leap quantum cloud service, and run the experiments included in this repository.
If you want to cite us or use our repository, you can use the following BibTeX entry:
@inproceedings{DBLP:conf/sigir/DacremaMN0FC22,
author = {Maurizio {Ferrari Dacrema} and Fabio Moroni and Riccardo Nembrini and Nicola Ferro and Guglielmo Faggioli and Paolo Cremonesi},
editor = {Enrique Amig{\'{o}} and Pablo Castells and Julio Gonzalo and Ben Carterette and J. Shane Culpepper and Gabriella Kazai},
title = {Towards Feature Selection for Ranking and Classification Exploiting Quantum Annealers},
booktitle = {{SIGIR} '22: The 45th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022},
pages = {2814--2824},
publisher = {{ACM}},
year = {2022},
url = {https://doi.org/10.1145/3477495.3531755},
doi = {10.1145/3477495.3531755},
}
NOTE: This repository requires Python 3.8 and has been developed for Linux.
We suggest installing all the required packages into a new Python environment. After checking out the repository, enter its folder and run the following commands to create a new environment:
If you're using conda:
conda create -n QFeatureSelection python=3.8 anaconda
conda activate QFeatureSelection
If you run the experiments from the terminal, it may be necessary to add this project to the PYTHONPATH environment variable:
export PYTHONPATH=$PYTHONPATH:/path/to/project/folder
Then, make sure the environment is activated and install all the required packages through pip:
pip install -r requirements.txt
In order to make use of D-Wave cloud services you must first sign up for D-Wave Leap and get your API token.
Then, you need to run the following command in the newly created Python environment:
dwave setup
This is a guided setup for the D-Wave Ocean SDK. When asked to select non-open-source packages to install, you should answer y and install at least the D-Wave Drivers (the D-Wave Problem Inspector package is not required, but can be useful to analyse problem solutions when solving problems directly on the QPU).
Then, continue the configuration by setting custom properties (or keeping the default ones, as we suggest), apart from the Authentication token field, where you should paste the API token obtained from the D-Wave Leap dashboard.
You should now be able to connect to D-Wave cloud services. To verify the connection, you can use the following command, which sends a test problem to D-Wave's QPU:
dwave ping
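If the ping fails, it is worth checking the configuration file written by dwave setup (on Linux, typically ~/.config/dwave/dwave.conf). A minimal sketch of its contents, with a placeholder token, might look like:

```ini
[defaults]
; Placeholder value: replace with the API token from your D-Wave Leap dashboard
token = DEV-xxxxxxxxxxxxxxxxxxxxxxxx
```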
PyMIToolbox is a Python wrapper for the C library MIToolbox, which is used to compute mutual information.
In order to use PyMIToolbox you first need to download and compile the MIToolbox library in the PyMIToolbox directory. To download the MIToolbox source code, execute the following commands:
cd PyMIToolbox/
wget https://github.com/Craigacp/MIToolbox/archive/refs/tags/v3.0.2.zip
Unzip the file with:
unzip v3.0.2.zip
and rename the extracted folder with:
mv MIToolbox-3.0.2 MIToolbox
Now, go into the MIToolbox directory and compile the C library. If you are on Linux or macOS, run the following commands:
cd MIToolbox/
make x64
while on Windows, install MinGW, add the MinGW binaries to the PATH, and run:
make x64_win
This will result in a compiled library file (.so on Linux/macOS, .dll on Windows) being placed in the PyMIToolbox/MIToolbox/ folder. If you don't see the file there, it may have been compiled to another directory and should be moved to the correct folder.
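To make the role of MIToolbox concrete, the following pure-Python sketch computes the discrete mutual information between a candidate feature and a target, which is the kind of quantity the library provides. This is illustrative only: the function name and interface here are hypothetical and are not the MIToolbox/PyMIToolbox API.

```python
# Illustrative sketch of discrete mutual information, not the MIToolbox API.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p(x) * p(y)) written with counts: c*n / (px[x]*py[y])
        mi += p_joint * log2(c * n / (px[x] * py[y]))
    return mi

# A feature identical to the target carries I(X;X) = H(X) bits:
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
# An independent feature carries no information about the target:
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```

In feature selection, scores like these can populate the linear and quadratic terms of the QUBO: high relevance to the target is rewarded, high redundancy between feature pairs is penalized.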
To run the experiments, enter the root folder of the project, activate the environment, and run the following commands:
conda activate QFeatureSelection
python run_feature_selection.py
This Python script will automatically download and split the datasets used in the experiments.
The resulting splits are saved in the results_classification/[dataset_name]/data directory.
The script will then proceed to run all experiments: baselines and QUBO methods with both classical and quantum-based solvers.
All the results will be saved in the results_classification/[dataset_name]/[method_name] directory.
NOTE: Running all the experiments requires a significant amount of QPU time and will exhaust the free time included with the developer plan on D-Wave Leap. If the available time runs out, the runs will fail with errors or produce invalid selections. We suggest selecting a limited number of datasets at a time.
For each dataset the script will also generate a dataframe summarizing the results, and at the end of all experiments it will generate summary tables in LaTeX format.
Within each dataset folder, the file result_dataset_summary.csv will contain one row per feature selection method and, for QUBO methods, per QUBO solver. Each row is selected as the one with the best validation score across all target numbers of features (i.e., k) for that experiment.
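The selection rule above can be sketched in a few lines: among all runs of one method (at different target feature counts k), keep the row with the best validation score. The column names below are hypothetical placeholders, not the exact columns of result_dataset_summary.csv.

```python
# Sketch of the "best row per method" selection rule; column names are illustrative.
rows = [
    {"method": "QUBO-MI", "k": 10, "validation_score": 0.71, "test_score": 0.69},
    {"method": "QUBO-MI", "k": 20, "validation_score": 0.74, "test_score": 0.70},
    {"method": "baseline", "k": 10, "validation_score": 0.68, "test_score": 0.66},
]

best = {}
for row in rows:
    m = row["method"]
    # Keep only the run with the highest validation score for each method.
    if m not in best or row["validation_score"] > best[m]["validation_score"]:
        best[m] = row

for row in best.values():
    print(row["method"], row["k"])  # QUBO-MI 20, then baseline 10
```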
To run the ranking experiments on LETOR, you first need to download the datasets.
After downloading, unzip them into the folder data/letor/ so that you have the structure data/letor/[dataset_name].
In order to execute the learning-to-rank algorithm you need the RankLib library. The experiments use RankLib 2.17, which you can download here.
Place the downloaded RankLib-2.17.jar file in the RankLib/ directory.
To run the experiments, enter the root folder of the project, activate the environment, and run the following commands:
conda activate QFeatureSelection
python run_letor_feature_selection.py
The script will proceed to run all experiments: baselines and QUBO methods with both classical and quantum-based solvers.
All the results will be saved in the results_ranking/[dataset_name]/[method_name] directory.
NOTE: Running all the experiments requires a significant amount of QPU time and will exhaust the free time included with the developer plan on D-Wave Leap. If the available time runs out, the runs will fail with errors or produce invalid selections. We suggest selecting a limited number of datasets at a time.
For each dataset the script will also generate a dataframe summarizing the results.
Within each dataset folder, the file result_dataset_summary.csv will contain all the feature selection information (in a slightly different format from the classification experiments).
The results of the ranking algorithm are instead saved in the results_ranking/processed/[dataset_name]_eval.csv files.
Note that the ranking experiments have been tested only on a Unix system with bash. Running them on Windows may require additional setup.