mrc-level2-nlp-16

Team Wrap-up Report

notion

Project Overview

Goal
- Build an Open-Domain Question Answering (ODQA) system that finds the answer to a question in a pre-built knowledge resource
Model
- Retriever: COIL + ElasticSearch (query + context, 150 items)
- Reader: SoftVoting(conv + lstm_conv + rnn_conv + base + bert_base(conv))
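Soft voting over several readers can be illustrated with a minimal sketch. It assumes each reader emits a probability per candidate answer text; the function name `soft_vote` and the input layout are hypothetical and are not taken from this repository's ensemble code.

```python
from typing import Dict, List


def soft_vote(model_probs: List[Dict[str, float]]) -> str:
    """Average candidate-answer probabilities across readers and pick the best.

    model_probs holds one dict per reader, mapping answer text to probability.
    A candidate missing from a reader's dict counts as probability 0.
    """
    if not model_probs:
        raise ValueError("at least one reader prediction is required")

    totals: Dict[str, float] = {}
    for probs in model_probs:
        for answer, p in probs.items():
            totals[answer] = totals.get(answer, 0.0) + p

    n_models = len(model_probs)
    averaged = {answer: s / n_models for answer, s in totals.items()}
    # The answer with the highest mean probability wins the vote.
    return max(averaged, key=averaged.get)


if __name__ == "__main__":
    readers = [
        {"A": 0.6, "B": 0.4},
        {"A": 0.3, "B": 0.7},
        {"A": 0.2, "B": 0.8},
    ]
    print(soft_vote(readers))
```

Unlike hard voting, which counts one vote per reader, averaging probabilities lets a confident reader outweigh several uncertain ones.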
Data
- train_dataset: 3,952 training examples / 240 validation examples
- test_dataset: 240 public validation examples / 360 private validation examples
- Wikipedia_dataset
Result
- Public
  - Exact Match: 70.000
  - F1 Score: 79.490
- Private
  - Exact Match: 64.170
  - F1 Score: 75.810
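For readers unfamiliar with the two metrics, the sketch below computes Exact Match and a token-level F1 for a single prediction. It follows the common SQuAD-style definition with whitespace tokenization and lowercase normalization; the competition's actual normalization and tokenization are not stated here, so treat those details as assumptions.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall against the ground truth."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Seoul ", "seoul"))
    print(f1_score("the blue house", "blue house"))
```

Dataset-level scores such as those listed above are the per-example values averaged over all questions and scaled to a percentage.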
Contributors
- 김아경(github): EDA, Negative Sampling, Post-Processing
- 김현욱(github): ElasticSearch
- 김황대(github): Retriever (DPR, COIL, Retriever ensemble, ElasticSearch ensemble), Data Augmentation && Reader (Reader training after negative sampling on the train dataset), experiments with the delimiter used to join contexts
- 박상류(github): K-Fold implementation, Ensemble implementation, Pre-processing and Post-processing experiments
- 정재현(github): ElasticSearch, Addquery creation
- 최윤성(github): Implementation and experiments for Reader (custom_layer, question_token span masking), Retriever (BM25, COIL), Data Augmentation (question generation, backtranslation), Ensemble (soft voting)

Getting Started

Install requirements

  # Install the JDK required to use AEDA
  apt install default-jdk
  
  # Install requirements
  bash ./install/install_requirements.sh

Train model

python train.py --output_dir ./models/train_dataset --do_train [if use K-fold add --do_kfold]

Inference Model

python inference.py --output_dir ./outputs/test_dataset/ --dataset_name ../data/test_dataset/ --model_name_or_path ./models/train_dataset/ --do_predict [if use K-fold add --do_kfold]
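What the `--do_kfold` option refers to conceptually can be sketched as a K-fold split over example indices. This is an illustration only: the function `kfold_indices` is hypothetical, and how `train.py` actually splits the data and how many folds it uses are not described in this README.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")

    # The first (n_samples % k) folds get one extra example each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


if __name__ == "__main__":
    for fold, (train_idx, val_idx) in enumerate(kfold_indices(10, 3)):
        print(fold, len(train_idx), val_idx)
```

Each example lands in exactly one validation fold, so training one model per fold and combining their predictions at inference uses all of the training data.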

Hardware

The following specs were used to create the original solution.

Ubuntu 18.04.5 LTS
Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
NVIDIA Tesla V100-SXM2-32GB

Code Structure

├── code/      
│   ├── install/
│   │   └── install_requirements.sh
│   │
│   ├── reader/
│   │   ├── ConvModel.py
│   │   ├── LSTMConvModel.py
│   │   ├── LSTMModel.py
│   │   ├── RNNConvModel.py
│   │   └── models.py
│   │
│   ├── retriever/
│   │   ├── coil/
│   │   │   ├── data_helper/
│   │   │   │   ├── build_train_from_triplet.py
│   │   │   │   └── make_triplet.py
│   │   │   ├── retrieve/
│   │   │   │   ├── retriever_ext/
│   │   │   │   │   ├── scatter.pyx
│   │   │   │   │   └── setup.py
│   │   │   │   ├── format-query.py
│   │   │   │   ├── merger.py
│   │   │   │   ├── retriever-fast.py
│   │   │   │   └── sharding.py
│   │   │   ├── arguments.py
│   │   │   ├── macro_datasets.py
│   │   │   ├── modeling.py
│   │   │   ├── run_macro.py
│   │   │   ├── score_to_macro.py
│   │   │   ├── trainer.py
│   │   │   └── coil_tutorial.md
│   │   │
│   │   ├── elastic_search/
│   │   │   └── elasticsearch_retriever.md
│   │   │
│   │   ├── bm25.py
│   │   ├── coil.py
│   │   └── tfidf.py
│   │
│   ├── arguments.py
│   ├── argumentation.py
│   ├── inference.py
│   ├── postprocess.py
│   ├── preprocess.py
│   ├── train.py
│   └── trainer_qa.py                   
│
└── data/
    ├── train_dataset/
    ├── test_dataset/
    └── wikipedia_documents.json
