
Code for the paper 'FastDoc: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy'

Required dependencies

Please run pip install -r requirements.txt (python3 required). For fine-tuning on the TechQA Dataset, use this.

Links to models pre-trained on the EManuals Corpus

  • Proposed RoBERTa-based variants
  1. FastDocRoBERTa (hier.)
  2. FastDocRoBERTa (triplet)
  3. FastDocRoBERTa
  • Proposed BERT-based variants
  1. FastDocBERT (hier.)
  2. FastDocBERT (triplet)
  3. FastDocBERT
  • Baselines
  1. BERTBASE
  2. RoBERTaBASE
  3. Longformer
  4. EManualsBERT
  5. EManualsRoBERTa
  6. DeCLUTR
  7. ConSERT
  8. SPECTER
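A minimal loading sketch for any of the checkpoints listed above, assuming the models are publicly hosted on the HuggingFace Hub and that transformers (installed via requirements.txt) is available; the identifier shown is the FastDocRoBERTa checkpoint referenced in the SQuAD 2.0 section below:

    from transformers import AutoTokenizer, AutoModel

    # FastDocRoBERTa checkpoint referenced in the SQuAD 2.0 section below;
    # substitute the identifier of any other model listed above.
    model_name = "AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Encode a sample sentence from the E-Manuals domain and inspect the output shape.
    inputs = tokenizer("Press and hold the power button for five seconds.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)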

Links to models pre-trained in the Scientific Domain

  • Proposed Variants
  1. FastDoc(Sci.)BERT
  2. FastDoc(Sci.)BERT (hier.)
  3. FastDoc(Sci.)BERT (triplet)
  • Baseline
  1. SciBERT

Fine-tuning on SQuAD 2.0

  • To download the training set, run wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json.

  • Run python3 finetune_squad.py <MODEL_TYPE> <MODEL_PATH>

    • <MODEL_TYPE> can be bert or roberta
    • <MODEL_PATH> is the local model path or HuggingFace model name (see the example invocation below).
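For example, to fine-tune the FastDocRoBERTa checkpoint referenced in the next paragraph (assuming it is available as a HuggingFace model name):

    python3 finetune_squad.py roberta AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1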

To get a model fine-tuned on SQuAD 2.0, construct the link as https://huggingface.co/AnonymousSub/<SUBSTRING AFTER THE LAST '/' IN PRE-TRAINED MODEL LINK>_squad2.0. For example, fine-tuning FastDocRoBERTa (https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1) on SQuAD 2.0 gives https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0.
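A minimal sketch of this naming convention, assuming the fine-tuned checkpoints are public on the HuggingFace Hub and that transformers is installed:

    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    # Pre-trained model link from the lists above (FastDocRoBERTa here).
    pretrained_link = "https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1"

    # Append "_squad2.0" to the substring after the last '/' to obtain the
    # name of the corresponding SQuAD 2.0 fine-tuned model.
    repo_id = "AnonymousSub/" + pretrained_link.rsplit("/", 1)[-1] + "_squad2.0"

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForQuestionAnswering.from_pretrained(repo_id)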

Fine-tuning on TechQA Dataset

Fine-tuning on S10 QA Dataset

Fine-tuning on some of the SciBERT Paper Datasets

  • Check all notebooks here.

Fine-tuning on GLUE Benchmark Datasets

  • Check all notebooks here.
