L. Gallagher, A. Mallia, J. S. Culpepper, T. Suel, and B. Barla Cambazoglu. 2020. Feature Extraction for Large-Scale Text Collections. In Proc. CIKM. DOI: https://doi.org/10.1145/3340531.3412773
This repository contains scripts to build the dataset and reproduce the experiments from the paper "Feature Extraction for Large-Scale Text Collections", published in the CIKM 2020 Resource Track.
The dataset is available for download at the links below:
- Release 1.0.1 - cikm20ltr-1.0.1 (sha1 cca713f3d331921f4d5d3093832d5f182da79c25)
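Once downloaded, the archive can be checked against the SHA-1 digest listed above. The following is a minimal sketch, assuming GNU coreutils `sha1sum` is installed and that the archive was saved as `cikm20ltr-1.0.1.tar.gz`. The file name and the `verify_sha1` helper are assumptions for illustration and are not part of this repository.

```shell
#!/usr/bin/env bash
# Compare a file's SHA-1 digest against an expected value.
# verify_sha1 is a hypothetical helper, not part of this repository.
verify_sha1() {
    local file="$1" expected="$2" actual
    # sha1sum prints "<digest>  <file>"; keep only the digest.
    actual="$(sha1sum "$file" | cut -d' ' -f1)"
    [ "$actual" = "$expected" ]
}

# Assumed archive name; adjust to match the downloaded file.
ARCHIVE="cikm20ltr-1.0.1.tar.gz"
EXPECTED="cca713f3d331921f4d5d3093832d5f182da79c25"

if [ -f "$ARCHIVE" ]; then
    if verify_sha1 "$ARCHIVE" "$EXPECTED"; then
        echo "checksum OK"
    else
        echo "checksum mismatch" >&2
    fi
fi
```

A mismatch usually points to an incomplete or corrupted download, so fetching the archive again is the first thing to try.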
The following environment configuration was used to build the dataset and run the experiments. We assume you have a working conda installation (recommended).
Clone this repo and set up the Conda environment:
git clone https://github.com/ten-blue-links/cikm20
cd cikm20
git submodule update --init --recursive --depth 1
conda env create -f env.yml
conda activate cikm20fxt
./src/sh/lgbm.sh
pip install -r requirements.txt
The following details the prerequisites and steps to configure and run the build scripts.
- Indri index of ClueWeb09B (example config)
- ~350GiB RAM
- ~300GiB disk space
- Webgraph data ClueWeb09_WG_50m.graph-txt.gz and ClueB-ID-DOCNO.txt.tar.gz. Once downloaded, decompress only ClueB-ID-DOCNO.txt.tar.gz:
  - ClueWeb09_WG_50m.graph-txt.gz: leave this as is.
  - ClueB-ID-DOCNO.txt.tar.gz: decompress to ClueB-ID-DOCNO.txt.
- Gradle build system (used for the AlexaRank data)
- GCC 8.x (not tested with Clang)
- Boost (tested with 1.65.1)
- CMake 3.x
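Before starting a long build, it can help to check that the tools listed above are on the `PATH` and to unpack the DOCNO mapping. The sketch below is illustrative only: `missing_tools` and `major_is` are hypothetical helpers, and it assumes the two webgraph files sit in the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; not part of this repository.

# Print the name of every given command that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed if the major component of a dotted version equals the wanted value.
major_is() {
    local version="$1" wanted="$2"
    [ "${version%%.*}" = "$wanted" ]
}

missing="$(missing_tools gcc cmake gradle tar)"
[ -z "$missing" ] || echo "missing tools: $missing" >&2

if command -v gcc >/dev/null 2>&1; then
    major_is "$(gcc -dumpversion)" 8 ||
        echo "warning: the build was tested with GCC 8.x" >&2
fi

# Unpack the DOCNO mapping; the web graph file stays gzipped.
if [ -f ClueB-ID-DOCNO.txt.tar.gz ]; then
    tar -xzf ClueB-ID-DOCNO.txt.tar.gz
fi
```

The RAM and disk requirements are not checked here; `free -g` and `df -h` give a quick manual reading on most Linux systems.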
- Copy configuration template:
cp config/dataset.dist config/dataset
- Edit config/dataset and configure the following variables:
  - INDRI_INDEX_PATH - path to existing ClueWeb09B Indri index (example config)
  - FXT_INDEX_PATH - path where the Fxt index will be created
  - BOOST_INCLUDE_PATH - path to Boost headers
  - BOOST_LIBRARY_PATH - path to Boost libraries
  - INDRI_INCLUDE_PATH - path to Indri headers
  - INDRI_LIBRARY_PATH - path to Indri libraries
  - WEBGRAPH_PATH - path to ClueWeb09_WG_50m.graph-txt.gz (gzipped)
  - GRAPHPAIRS_PATH - path to ClueB-ID-DOCNO.txt (decompressed)
- Run ./src/dataset/main.sh (build may take ~32 hours)
- Dataset files: build/cikm20ltr
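Since the build runs for many hours, a quick check that every variable in `config/dataset` is set can catch mistakes early. This is a minimal sketch that assumes the config file consists of shell-style `NAME=value` assignments (an assumption; check `config/dataset.dist` for the actual format). The `unset_vars` helper is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check; assumes config/dataset holds shell-style NAME=value lines.

# Print the name of every listed variable that is unset or empty.
unset_vars() {
    local name value
    for name in "$@"; do
        eval "value=\${$name:-}"   # portable indirect lookup of $name
        [ -n "$value" ] || echo "$name"
    done
}

CONFIG="config/dataset"
if [ -f "$CONFIG" ]; then
    # Load the assignments into the current shell.
    . "$CONFIG"
    unset_vars INDRI_INDEX_PATH FXT_INDEX_PATH \
        BOOST_INCLUDE_PATH BOOST_LIBRARY_PATH \
        INDRI_INCLUDE_PATH INDRI_LIBRARY_PATH \
        WEBGRAPH_PATH GRAPHPAIRS_PATH
fi
```

Any name printed by the script still needs a value. Given the ~32 hour build time, running `./src/dataset/main.sh` under `nohup` or a terminal multiplexer may also be worth considering so that a dropped session does not interrupt it.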
The snapshot for the AlexaRank data is from 2010. This was the temporally closest working snapshot to Jan-Feb 2009 for ClueWeb09B.
The term reproduce is used as defined in the ACM artifacts policy. Note that the definitions of the terms replicate and reproduce were recently swapped (Aug 2020).
- Copy configuration template:
cp config/experiment.dist config/experiment
  - If the dataset files are in a different location than the default build/cikm20ltr, edit config/experiment and set DATASETD to the correct path
- Run ./src/experiment/main.sh
- cat the results: for i in build/result/wt??/test/eval/*.txt; do echo $i; cat $i; done
- TREC run files: build/result/wt??/test/run
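The one-line loop above can also be written as a small function that quotes paths and stays silent when no evaluation files exist yet. This is a sketch only: `show_results` is a hypothetical helper, and the directory layout is taken from the paths listed above.

```shell
#!/usr/bin/env bash
# Print each evaluation file under a results directory, preceded by its path.
# show_results is a hypothetical helper, not part of this repository.
show_results() {
    local root="${1:-build/result}" f
    for f in "$root"/wt??/test/eval/*.txt; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "== $f"
        cat "$f"
    done
}

show_results build/result
```

Passing a different directory as the first argument shows results from another build location.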
The experiment scripts should be able to reproduce the following results:
Test Queries | RBP 0.9 (score+residual) | NDCG 5 | NDCG 20 | AP |
---|---|---|---|---|
Web Track 2009 (Topics 1-50) | 0.286+0.344 | 0.298 | 0.296 | 0.219 |
Web Track 2010 (Topics 51-100) | 0.187+0.295 | 0.224 | 0.245 | 0.131 |
Web Track 2011 (Topics 101-150) | 0.132+0.139 | 0.235 | 0.199 | 0.117 |
Web Track 2012 (Topics 151-200) | 0.193+0.185 | 0.193 | 0.189 | 0.164 |