This is a very simple question answering system, which is developed as a lightweight baseline for AIO2 competition.
In the training stage, the model builds a sparse matrix of TF-IDF features from the questions in training dataset. In the inference stage, the model predicts answers of unseen questions by finding the most similar training question to the input by computing dot product scores of TF-IDF features.
Therefore, in principle, the model cannot predict answers unseen in the training data.
$ pip install -r requirements.txt
$ python train.py \
--train_file <data dir>/aio_02_train.jsonl \
--output_dir model \
--pos_list 名詞 \
--stop_words でしょ う \
--max_features 10000
$ python predict.py \
--model_dir model \
--test_file <data dir>/aio_02_dev_unlabeled_v1.0.jsonl \
--prediction_file <output dir>/predictions.jsonl
$ docker build -t aio2-tfidf-baseline .
Test locally:
$ docker run --rm -v "<data dir absolute path>:/app/input" -v "<output dir absolute path>:/app/output" aio2-tfidf-baseline bash ./submission.sh input/aio_02_dev_unlabeled_v1.0.jsonl output/predictions.jsonl
Save the docker image to file:
$ docker save aio2-tfidf-baseline | gzip > aio2-tfidf-baseline.tar.gz
The codes in this repository are open-sourced under MIT License.