celsofranssa/CluWords

CluWords based on Fine-tuned Transformer

1. Quick Start

# clone the project 
git clone git@github.com:celsofranssa/CluWords.git

# change directory to project folder
cd CluWords/

# create a new virtual environment by choosing a Python interpreter
# and making a ./venv directory to hold it:
virtualenv -p python3 ./venv

# activate the virtual environment using a shell-specific command:
source ./venv/bin/activate

# install dependencies
pip install -r requirements.txt

# set the Python path
export PYTHONPATH=$PYTHONPATH:<path-to-project-dir>/CluWords/

# to exit the virtual environment later (if needed):
deactivate
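As a quick sanity check, the following sketch confirms the project directory ended up on Python's search path (the helper name and the example paths are illustrative, not part of the project):

```python
import os
import sys

def project_on_path(project_name="CluWords", paths=None):
    """Return True if any search-path entry's last component is the project folder."""
    paths = sys.path if paths is None else paths
    return any(os.path.basename(p.rstrip(os.sep)) == project_name for p in paths)

# illustrative path lists rather than the live environment:
print(project_on_path(paths=["/usr/lib/python3"]))                       # False
print(project_on_path(paths=["/usr/lib/python3", "/home/u/CluWords/"]))  # True
```

Running `python -c "import sys; print(sys.path)"` inside the activated venv shows the live value the check would inspect.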

2. Datasets

Download the datasets from Kaggle Datasets (get Kaggle credentials via the Kaggle API Docs):

kaggle datasets download \
    --unzip \
    -d celsofranssa/CluWords-datasets \
    -p resource/dataset/

After the download completes, make sure the file structure is as follows:

CluWords/
├── main.py
├── requirements.txt
├── resource
│   ...
│   ├── dataset
│   │   ├── 20ng
│   │   │   ├── fold_1
│   │   │   │   ├── test.jsonl
│   │   │   │   ├── train.jsonl
│   │   │   │   └── val.jsonl
│   │   │   ...
│   │   │   └── fold_9
│   │   │       ├── test.jsonl
│   │   │       ├── train.jsonl
│   │   │       └── val.jsonl
│   │   ...
│   │   ├── yelp_2015
│   │   │   ├── fold_1
│   │   │   │   ├── test.jsonl
│   │   │   │   ├── train.jsonl
│   │   │   │   └── val.jsonl
│   │   │   ...
│   │   │   └── fold_5
│   │   │       ├── test.jsonl
│   │   │       ├── train.jsonl
│   │   │       └── val.jsonl
│   ├── log
│   ├── model_checkpoint
│   ├── prediction
│   └── stat
├── settings
│   ...
│   └── settings.yaml
└── source
    ...
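To verify the layout programmatically, a small helper can report any split files missing from the `fold_*` directories (a sketch under the assumption that every fold must contain the three split files shown in the tree above; `missing_files` is a hypothetical name, not project code). It is demonstrated here on a throwaway directory rather than the real download:

```python
import json
import tempfile
from pathlib import Path

SPLITS = ("train.jsonl", "val.jsonl", "test.jsonl")

def missing_files(dataset_dir):
    """List expected split files absent from any fold_* directory under dataset_dir."""
    missing = []
    for fold in sorted(Path(dataset_dir).glob("fold_*")):
        for split in SPLITS:
            if not (fold / split).exists():
                missing.append(str(fold / split))
    return missing

# demo on a temporary directory mimicking a single fold:
with tempfile.TemporaryDirectory() as tmp:
    fold = Path(tmp) / "fold_1"
    fold.mkdir()
    for split in ("train.jsonl", "test.jsonl"):  # val.jsonl deliberately omitted
        (fold / split).write_text(json.dumps({"text": "sample"}) + "\n")
    print(missing_files(tmp))  # prints the path of the missing val.jsonl
```

Calling `missing_files("resource/dataset/20ng")` after the Kaggle download should return an empty list.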

3. Test Run

The following bash command fine-tunes the BERT model on the 20NG dataset using batch_size=32 and a single epoch.

python main.py tasks=[train] model=BERT_NO_POOL data=20NG data.batch_size=32 trainer.max_epochs=1
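The command mixes group selections (`model=BERT_NO_POOL`) with dotted `key=value` overrides (`data.batch_size=32`), a Hydra-style syntax. As a rough illustration of how dotted overrides walk into nested settings (`apply_overrides` is a hypothetical sketch for explanation, not how the project actually parses its config):

```python
import copy

def apply_overrides(config, overrides):
    """Apply dotted key=value overrides to a nested dict (illustrative sketch)."""
    config = copy.deepcopy(config)
    for item in overrides:
        key, _, value = item.partition("=")
        node = config
        parts = key.split(".")
        for part in parts[:-1]:          # descend into (or create) nested dicts
            node = node.setdefault(part, {})
        node[parts[-1]] = value          # set the leaf value
    return config

defaults = {"data": {"batch_size": "128"}, "trainer": {"max_epochs": "10"}}
print(apply_overrides(defaults, ["data.batch_size=32", "trainer.max_epochs=1"]))
# → {'data': {'batch_size': '32'}, 'trainer': {'max_epochs': '1'}}
```

So `data.batch_size=32` on the command line replaces whatever batch size the settings file declares, without editing `settings/settings.yaml`.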

If all goes well, output similar to the following should be produced:

GPU available: True, used: True
[2020-12-31 13:44:42,967][lightning][INFO] - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
[2020-12-31 13:44:42,967][lightning][INFO] - TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2020-12-31 13:44:42,967][lightning][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type        | Params
-----------------------------------------
0 | encoder  | BertEncoder | 108 M 
1 | cls_head | Sequential  | 15.4 K
2 | loss     | NLLLoss     | 0     
3 | f1       | F1          | 0     
-----------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params


Epoch 0: 100%|███████████████████████████████████████████████████████| 5199/5199 [13:06<00:00,  6.61it/s, loss=5.57, v_num=1, val_mrr=0.041, val_loss=5.54]

Benchmark Results

(benchmark results figure: bench-results)
