
Worse performance and shorter training time for the PLT reduction when updating from VW 9.6 to 9.7 #4511

Closed
FabianKaiser opened this issue Feb 27, 2023 · 3 comments
Labels
Bug (Bug in learning semantics, critical by default)

Comments

FabianKaiser commented Feb 27, 2023

Describe the bug

The PLT reduction performs worse and trains faster in VW 9.7 than in VW 9.6.

The difference is also visible in the training logs:

  • the loss behaves very differently in the two versions
  • fewer passes are used in VW 9.7 (likely a consequence of the different loss)
  • the performance is worse in VW 9.7 (not drastic on the example data, but more pronounced on larger datasets)

reddit_data_sample.csv

VW 9.6

PLT k = 1013
kary_tree = 2
creating cache_file = train.vw.cache
Reading datafile = train.vw
num sources = 1
Num weight bits = 30
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
Enabled reductions: gd, scorer-identity, plt
Input label = MULTILABEL
Output pred = MULTILABELS
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 254 127
1.000000 1.000000 2 2.0 240 103
1.000000 1.000000 4 4.0 960 42
1.000000 1.000000 8 8.0 172 199
1.000000 1.000000 16 16.0 645 184
1.000000 1.000000 32 32.0 268 236
1.000000 1.000000 64 64.0 207 215
1.000000 1.000000 128 128.0 117 118
1.000000 1.000000 256 256.0 777 62
1.000000 1.000000 512 512.0 900 216
1.000000 1.000000 1024 1024.0 852 446
1.000000 1.000000 2048 2048.0 389 190
1.000000 1.000000 4096 4096.0 545 152
1.000000 1.000000 8192 8192.0 603 83
1.000244 1.000488 16384 16384.0 730 101
0.996429 0.996429 32768 32768.0 917 124 h
0.994506 0.992584 65536 65536.0 977 89 h
0.991005 0.987503 131072 131072.0 150 68 h
0.984825 0.978646 262144 262144.0 29 101 h
0.976190 0.967556 524288 524288.0 804 108 h

finished run
number of examples per pass = 24300
passes used = 31
weighted example sum = 753300.000000
weighted label sum = 0.000000
average loss = 0.958889 h
total feature number = 103905211

VW 9.7

PLT k = 1013
kary_tree = 2
creating cache_file = train.vw.cache
Reading datafile = train.vw
num sources = 1
Num weight bits = 30
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
Enabled reductions: gd, scorer-identity, plt
Input label = MULTILABEL
Output pred = MULTILABELS
average since example example current current current
loss last counter weight label predict features
14.55609 14.55609 1 1.0 382 81
15.16680 15.77752 2 2.0 879 146
14.76802 14.36924 4 4.0 283 65
17.27986 19.79169 8 8.0 930 119
16.27946 15.27906 16 16.0 863 113
15.91260 15.54575 32 32.0 377 185
15.79295 15.67330 64 64.0 684 102
16.23921 16.68547 128 128.0 226 66
16.49474 16.75027 256 256.0 845 89
16.48981 16.48488 512 512.0 935 56
16.35399 16.21816 1024 1024.0 0 149
15.75564 15.15729 2048 2048.0 365 101
15.00027 14.24491 4096 4096.0 471 95
14.03309 13.06592 8192 8192.0 138 141
12.93321 11.83332 16384 16384.0 925 104
0.000000 0.000000 32768 32768.0 95 79 h
0.000000 0.000000 65536 65536.0 860 85 h

finished run
number of examples per pass = 24300
passes used = 4
weighted example sum = 97200.000000
weighted label sum = 0.000000
average loss = 0.000000 h
total feature number = 13414196

How to reproduce

import os
from vowpalwabbit import Workspace
import pandas as pd
import gc
from sklearn.metrics import precision_score, recall_score, f1_score


def to_vw_format(text: str, label=None) -> str:
    if label is None:
        label = ''
    return f'{label} |text {text}'


def evaluate(labels: pd.DataFrame):
    print(f"precision: {precision_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    print(f"recall: {recall_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    print(f"f1: {f1_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")

    # Fraction of examples whose predicted label matches the target
    accuracy = (labels['pred'] == labels['target']).mean()
    print(f"accuracy: {accuracy}")


# Load the attached sample dataset (saved locally)
data = pd.read_csv('reddit_data_sample_local.csv')

target_var = 'target'
training_var = 'text'

cleaned_targets = data[target_var].dropna()
unique_labels = cleaned_targets.unique().tolist()
num_classes = len(unique_labels)
numbers = list(range(num_classes))
mapping = dict(zip(unique_labels, numbers))

training_data = data.sample(frac=0.9, random_state=25)
testing_data = data.drop(training_data.index)

training_data = training_data.dropna(subset=[training_var]).sample(frac=1).reset_index(drop=True)

vw_training_file_name = 'train.vw'

# Write the training set in VW text format (one "label |text ..." line per example)
with open(vw_training_file_name, "wb") as f:
    for text, label in zip(training_data[training_var], training_data[target_var]):
        vw_label = mapping[label]
        example = to_vw_format(text, vw_label) + ' \n'
        f.write(example.encode())

os.makedirs('model', exist_ok=True)

params = {
    'loss_function': 'logistic',
    'data': vw_training_file_name,
    'c': True,                    # use a cache file (required for multiple passes)
    'k': True,                    # overwrite any existing cache file
    'f': 'model/model.vw',        # write the final model here
    'compressed': True,
    'plt': num_classes,           # PLT reduction over num_classes labels
    'b': 30,                      # 30 bits in the weight table
    'passes': 50,
    'example_queue_limit': 256,
    'learning_rate': 0.5,
}

model = Workspace(**params)

model.finish()

del model
gc.collect()

# Reload the trained model for prediction only
params = {
    'loss_function': 'logistic',
    'predict_only_model': True,
    'i': 'model/model.vw',        # initial model to load
    'top_k': 100,                 # number of labels returned per prediction
}
model = Workspace(**params)

# Predict on the held-out split; predict() returns a list of labels, take the first one
predictions = testing_data.text.apply(lambda x: model.predict(to_vw_format(x))[0]).rename('pred')
evaluate(pd.concat([predictions, testing_data.target.apply(lambda x: mapping[x])], axis=1))

del model
gc.collect()
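
For reference, each line the script writes to train.vw has the plain VW text format produced by to_vw_format, i.e. the mapped label index followed by the text namespace. The label 42 and the post text below are made-up placeholders:

42 |text an example reddit post about cooking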

Version

9.7

OS

Linux

Language

Python

Additional context

No response

FabianKaiser added the Bug label Feb 27, 2023
mwydmuch commented Mar 6, 2023

Hello @FabianKaiser, thank you for reporting the issue and providing the code to replicate it. I will look into this by the end of this week.

mwydmuch commented Mar 16, 2023

Hi @FabianKaiser, I checked and there is no bug. The thing is that the PLT reduction in 9.7 no longer returns a loss when predicting (in 9.6 the returned value was not correct anyway), and since you use PLT with a holdout dataset, you get a loss of 0 on that set, which triggers premature early stopping of the training. If you add 'holdout_off': True to your training params, the training will proceed as in 9.6, resulting in a similar model.
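
With the training params from the reproduction script above, the workaround is a one-line addition (a minimal sketch; only the 'holdout_off' entry is new, everything else is copied from the script):

params = {
    'loss_function': 'logistic',
    'data': vw_training_file_name,
    'c': True,
    'k': True,
    'f': 'model/model.vw',
    'compressed': True,
    'plt': num_classes,
    'b': 30,
    'passes': 50,
    'example_queue_limit': 256,
    'learning_rate': 0.5,
    'holdout_off': True,  # disable the holdout set so a 0 holdout loss cannot stop training early
}
model = Workspace(**params)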

@jackgerrits There are two possible solutions to this problem: one is to disable the holdout dataset for the PLT reduction (not sure if that is possible). Alternatively, I can add a loss calculation to the predict method that would basically mimic the learn step without updating the base classifiers.

jackgerrits commented

Closing this as it should be resolved by #4534
