Squashed commit of the following:
commit adcaff2
Author: Marek Wydmuch <marek@wydmuch.poznan.pl>
Date:   Mon Mar 20 15:30:30 2023 +0100

    feat: add a training loss calculation to the predict method of PLT reduction (#4534)

    * add a training loss calculation to the predict method of PLT reduction

    * update PLT demo

    * update the tests for PLT reduction

    * disable the calculation of additional evaluation measures in PLT reduction when true labels are not available

    * apply black formating to plt_demo.py

    * remove unnecessary reset of weighted_holdout_examples variable in PLT reduction

    * revert the change of the path to the exe in plt_demo.py

    * apply black formating again to plt_demo.py

    ---------

    Co-authored-by: Jack Gerrits <jackgerrits@users.noreply.github.com>

commit f7a197e
Author: Griffin Bassman <griffinbassman@gmail.com>
Date:   Fri Mar 17 11:01:17 2023 -0400

    refactor: separate cb_to_cs_adf_mtr and cb_to_cs_adf_dr (#4532)

    * refactor: separate cb_to_cs_adf_mtr and cb_to_cs_adf_dr

    * clang

    * unused

    * remove mtr

commit e5597ae
Author: swaptr <83858160+swaptr@users.noreply.github.com>
Date:   Fri Mar 17 02:21:50 2023 +0530

    fix: fix multiline typo (#4533)

commit 301800a
Author: Eduardo Salinas <edus@microsoft.com>
Date:   Wed Mar 15 12:25:55 2023 -0400

    test: [automl] improve runtest and test changes (#4531)

commit 258731c
Author: Griffin Bassman <griffinbassman@gmail.com>
Date:   Tue Mar 14 13:28:03 2023 -0400

    chore: Update Version to 9.5.0 (#4529)

commit 49131be
Author: Eduardo Salinas <edus@microsoft.com>
Date:   Tue Mar 14 11:03:24 2023 -0400

    fix: [automl] avoid ccb pulling in generate_interactions (#4524)

    * fix: [automl] avoid ccb pulling in generate_interactions

    * same features in one line minimal repro

    * add assert of reserve size

    * update test file

    * remove include and add comment

    * temp print

    * sorting interactions matters

    * update temp print

    * fix by accounting for slot ns

    * remove prints

    * change comment and remove commented code

    * add sort to test

    * update runtests

    * Squashed commit of the following:

    commit 322a2b1
    Author: Eduardo Salinas <edus@microsoft.com>
    Date:   Mon Mar 13 21:51:49 2023 +0000

        possibly overwrite vw brought in by vw-executor

    commit 0a6baa0
    Author: Eduardo Salinas <edus@microsoft.com>
    Date:   Mon Mar 13 21:25:46 2023 +0000

        add check for metrics

    commit 469cebe
    Author: Eduardo Salinas <edus@microsoft.com>
    Date:   Mon Mar 13 21:22:38 2023 +0000

        update test

    commit 7c0b212
    Author: Eduardo Salinas <edus@microsoft.com>
    Date:   Mon Mar 13 21:11:45 2023 +0000

        format and add handler none

    commit 533e067
    Author: Eduardo Salinas <edus@microsoft.com>
    Date:   Mon Mar 13 20:56:07 2023 +0000

        test: [automl] add ccb test that checks for ft names

    * update python test

    * Update automl_oracle.cc

commit 37f4b19
Author: Griffin Bassman <griffinbassman@gmail.com>
Date:   Fri Mar 10 17:38:02 2023 -0500

    refactor: remove resize in gd setup (#4526)

    * refactor: remove resize in gd setup

    * rm resize

commit 009831b
Author: Griffin Bassman <griffinbassman@gmail.com>
Date:   Fri Mar 10 16:57:53 2023 -0500

    fix: multi-model state for cb_adf (#4513)

    * switch to vector

    * fix aml and ep_dec

    * clang

    * reorder

    * clang

    * reorder

commit a31ef14
Author: Griffin Bassman <griffinbassman@gmail.com>
Date:   Fri Mar 10 14:52:50 2023 -0500

    refactor: rename wpp, ppw, ws, params_per_problem, problem_multiplier, num_learners, increment -> feature_width (#4521)

    * refactor: rename wpp, ppw, ws, params_per_problem, problem_multiplier, num_learners, increment -> interleaves

    * clang

    * clang

    * settings

    * make bottom interleaves the same

    * remove bottom_interleaves

    * fix test

    * feature width

    * clang

commit 8390f48
Author: Griffin Bassman <griffinbassman@gmail.com>
Date:   Fri Mar 10 12:25:12 2023 -0500

    refactor: dedup dict const (#4525)

    * refactor: dedup dict const

    * clang

commit 2238d70
Author: Jack Gerrits <jackgerrits@users.noreply.github.com>
Date:   Thu Mar 9 13:51:35 2023 -0500

    refactor: add api to set data object associated with learner (#4523)

    * refactor: add api to set data object associated with learner

    * add shared ptr func

commit b622540
Author: Griffin Bassman <griffinbassman@gmail.com>
Date:   Tue Mar 7 12:16:39 2023 -0500

    fix: cbzo ppw fix (#4519)

commit f83cb7f
Author: Jack Gerrits <jackgerrits@users.noreply.github.com>
Date:   Tue Mar 7 11:21:51 2023 -0500

    refactor: automatically set label parser after stack created (#4471)

    * refactor: automatically set label parser after stack created

    * a couple of fixes

    * Put in hack to keep search working

    * formatting

commit 64e5920
Author: olgavrou <olgavrou@gmail.com>
Date:   Fri Mar 3 16:20:05 2023 -0500

    feat: [LAS] with CCB (#4520)

commit 69bf346
Author: Jack Gerrits <jackgerrits@users.noreply.github.com>
Date:   Fri Mar 3 15:17:29 2023 -0500

    refactor: make flat_example an implementation detail of ksvm (#4505)

    * refactor!: make flat_example an implementation detail of ksvm

    * Update memory_tree.cc

    * Absorb flat_example into svm_example

    * revert "Absorb flat_example into svm_example"

    This reverts commit b063feb.

commit f08f1ec
Author: Jack Gerrits <jackgerrits@users.noreply.github.com>
Date:   Fri Mar 3 14:04:48 2023 -0500

    test: fix pytype issue in test runner and utl (#4517)

    * test: fix pytype issue in test runner

    * fix version_number.py type checker issues

commit a8b1d91
Author: Eduardo Salinas <edus@microsoft.com>
Date:   Fri Mar 3 12:59:26 2023 -0500

    fix: [epsdecay] return champ prediction always (#4518)

commit b2276c1
Author: olgavrou <olgavrou@gmail.com>
Date:   Thu Mar 2 20:18:23 2023 -0500

    chore: [LAS] don't force mtr with LAS (#4516)

commit c0ba180
Author: olgavrou <olgavrou@gmail.com>
Date:   Tue Feb 28 11:27:36 2023 -0500

    feat: [LAS] add example ft hash and cache and re-use rows of matrix if actions do not change (#4509)

commit e1a9363
Author: Eduardo Salinas <edus@microsoft.com>
Date:   Mon Feb 27 16:35:09 2023 -0500

    feat: [gd] persist ppw extra state (#4023)

    * feat: [gd] persist ppm state

    * introduce resize_ppw_state

    * wip: move logic down to gd, respect incoming ft_offset

    * replace assert with status quo behaviour

    * implement writing/reading to modelfile

    * remove from predict

    * update test 351 and 411

    * update sensitivity and update

    * remove debug prints

    * update all tests

    * apply fix of other pr

    * use .at() for bounds checking

    * add max_ft_offset and add asserts

    * comment extra assert that is failing

    * remove files

    * fix automl tests

    * more tests

    * tests

    * tests

    * clang

    * fix for predict_only_model automl

    * comment

    * fix ppm printing

    * temporarily remove tests 50 and 68

    * address comments

    * expand width for search

    * fix tests

    * merge

    * revert cb_adf

    * merge

    * fix learner

    * clang

    * search fix

    * clang

    * fix unit tests

    * bump 9.7.1 for version CIs

    * revert to 9.7.0

    * stop search from learning out of bounds

    * expand search num_learners

    * fix search cs test

    * comment

    * revert ext_libs

    * clang

    * comment out saveresume tests

    * pylint

    * comment

    * fix with search

    * fix search

    * clang

    * unused

    * unused

    * commnets

    * fix scope_exit

    * fix cs test

    * revert automl test update

    * remove resize

    * clang

    ---------

    Co-authored-by: Griffin Bassman <griffinbassman@gmail.com>
lalo committed Mar 20, 2023
1 parent a191552 commit b7e1c4b
Showing 216 changed files with 5,508 additions and 4,939 deletions.
18 changes: 9 additions & 9 deletions .github/workflows/python_wheels.yml
@@ -70,8 +70,8 @@ jobs:
shell: bash
run: |
pip install -r requirements.txt
-pip install pytest twine
-pip install built_wheel/*.whl
+pip install pytest vw-executor twine
+pip install --force-reinstall built_wheel/*.whl
twine check built_wheel/*.whl
python -m pytest ./python/tests/
python ./python/tests/run_doctest.py
@@ -133,7 +133,7 @@ jobs:
shell: bash
run: |
pip install -r requirements.txt
-pip install pytest
+pip install pytest vw-executor
- name: Run unit tests
shell: bash
run: |
@@ -212,8 +212,8 @@ jobs:
source .env/bin/activate && \
pip install --upgrade pip && \
pip install -r requirements.txt && \
-pip install pytest twine && \
-pip install built_wheel/*.whl && \
+pip install pytest vw-executor twine && \
+pip install --force-reinstall built_wheel/*.whl && \
twine check built_wheel/*.whl && \
python --version && \
python -m pytest ./python/tests/ && \
@@ -272,8 +272,8 @@ jobs:
shell: bash
run: |
pip install -r requirements.txt
-pip install pytest twine
-pip install built_wheel/*.whl
+pip install pytest vw-executor twine
+pip install --force-reinstall built_wheel/*.whl
twine check built_wheel/*.whl
python -m pytest ./python/tests/
python ./python/tests/run_doctest.py
@@ -361,8 +361,8 @@ jobs:
export wheel_file="${wheel_files[0]}"
echo Installing ${wheel_file}...
pip install -r requirements.txt
-pip install pytest twine
-pip install ${wheel_file}
+pip install pytest vw-executor twine
+pip install --force-reinstall ${wheel_file}
twine check ${wheel_file}
python -m pytest .\\python\\tests\\
python .\\python\\tests\\run_doctest.py
2 changes: 1 addition & 1 deletion cs/unittest/TestSearch.cs
@@ -146,7 +146,7 @@ void RunSearchPredict(Multiline multiline)

rawvw.EndOfPass();
driver.Reset();
-} while (remainingPasses-- > 0);
+} while (--remainingPasses > 0);

driver.Reset();
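
The one-line change in this hunk swaps a post-decrement for a pre-decrement in the `do ... while` condition, which changes how many times the loop body runs. As a rough illustration (written in Python rather than C#, with hypothetical helper names that do not exist in the repository), the two conditions can be emulated like this:

```python
def iterations_post_decrement(remaining_passes: int) -> int:
    """Emulate `do { body } while (remainingPasses-- > 0);`."""
    count = 0
    while True:
        count += 1  # the loop body runs once per iteration
        old_value = remaining_passes  # post-decrement compares the old value
        remaining_passes -= 1
        if not old_value > 0:
            break
    return count


def iterations_pre_decrement(remaining_passes: int) -> int:
    """Emulate `do { body } while (--remainingPasses > 0);`."""
    count = 0
    while True:
        count += 1  # the loop body runs once per iteration
        remaining_passes -= 1  # pre-decrement compares the new value
        if not remaining_passes > 0:
            break
    return count


if __name__ == "__main__":
    for n in (0, 1, 2, 3):
        print(n, iterations_post_decrement(n), iterations_pre_decrement(n))
```

Under this emulation, a starting value of 2 gives three iterations with the old condition and two with the new one, so for positive starting values the old form runs the body one extra time.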

11 changes: 7 additions & 4 deletions demo/plt/README.md
@@ -1,5 +1,5 @@
Probabilistic Label Tree demo
--------------------------------
+-----------------------------

This demo presents PLT for applications of logarithmic time multilabel classification.
It uses Mediamill dataset from the [LIBLINEAR datasets repository](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html)
@@ -12,19 +12,22 @@ The datasets and paremeters can be easliy edited in the script. The script requi
## PLT options
```
--plt Probabilistic Label Tree with <k> labels
---kary_tree use <k>-ary tree. By default = 2 (binary tree)
---threshold predict labels with conditional marginal probability greater than <thr> threshold"
+--kary_tree use <k>-ary tree. By default = 2 (binary tree),
+higher values usually give better results, but increase training time
+--threshold predict labels with conditional marginal probability greater than <thr> threshold
--top_k predict top-<k> labels instead of labels above threshold
```


## Tips for using PLT
PLT accelerates training and prediction for a large number of classes,
if you have less than 10000 classes, you should probably use OAA.
If you have a huge number of labels and features at the same time,
you will need as many bits (`-b`) as can afford computationally for the best performance.
You may also consider using `--sgd` instead of default adaptive, normalized, and invariant updates to
gain more memory for feature weights what may lead to better performance.
-You may also consider using `--holdout_off` if you have many rare labels in your data.
+If you have many rare labels in your data, you should train with `--holdout_off`, that disables usage of holdout (validation) dataset for early stopping.
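
The tips in this section can be turned into a small command builder. The following is a sketch only: `choose_reduction` and `build_train_cmd` are hypothetical helpers, the 10000-label cutoff is taken from the tip in this README, `multilabel_oaa` is assumed to be the OAA alternative because it is the other reduction named in `plt_demo.py`, and the flags are limited to those that appear in this README and in the demo script.

```python
import shlex


def choose_reduction(num_labels: int, plt_cutoff: int = 10000) -> str:
    """Pick PLT for large label spaces, otherwise the OAA-style reduction."""
    return "plt" if num_labels >= plt_cutoff else "multilabel_oaa"


def build_train_cmd(
    train_data: str,
    num_labels: int,
    bits: int,
    model_path: str,
    kary_tree: int = 16,
    learning_rate: float = 0.5,
    passes: int = 5,
    holdout_off: bool = True,
    use_sgd: bool = False,
) -> str:
    """Assemble a `vw` training command line as a single shell-safe string."""
    reduction = choose_reduction(num_labels)
    args = [
        "vw", train_data, "-c",
        f"--{reduction}", str(num_labels),
        "--loss_function", "logistic",
        "-l", str(learning_rate),
        "--passes", str(passes),
        "-b", str(bits),
        "-f", model_path,
    ]
    if reduction == "plt":
        # The tree arity only makes sense for the PLT reduction.
        args += ["--kary_tree", str(kary_tree)]
    if use_sgd:
        # Plain SGD updates, suggested above to free memory for feature weights.
        args.append("--sgd")
    if holdout_off:
        # Suggested above for data with many rare labels.
        args.append("--holdout_off")
    return shlex.join(args)


if __name__ == "__main__":
    # File names and sizes here are placeholders, not values from the demo.
    print(build_train_cmd("train.vw", num_labels=50000, bits=28, model_path="plt.model"))
```

Returning a single string keeps the sketch close to how the demo script formats its own command; passing the `args` list to `subprocess.run` directly would be an equally reasonable choice.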


## References

24 changes: 18 additions & 6 deletions demo/plt/plt_demo.py
@@ -6,7 +6,7 @@
# This is demo example that demonstrates usage of PLT reduction on few popular multilabel datasets.

# Select dataset
-dataset = "mediamill_exp1" # should be in ["mediamill_exp1", "eurlex", "rcv1x", "wiki10", "amazonCat"]
+dataset = "eurlex" # should be in ["mediamill_exp1", "eurlex", "rcv1x", "wiki10", "amazonCat"]

# Select reduction
reduction = "plt" # should be in ["plt", "multilabel_oaa"]
@@ -18,10 +18,16 @@
output_model = f"{dataset}_{reduction}_model"

# Parameters
-kary_tree = 16
-l = 0.5
-passes = 3
-other_training_params = "--holdout_off"
+kary_tree = 16 # arity of the tree,
+# higher values usually give better results, but increase training time
+
+passes = 5 # number of passes over the dataset,
+# for some datasets you might want to change number of passes
+
+l = 0.5 # learning rate
+
+other_training_params = "--holdout_off" # because these datasets have many rare labels,
+# disabling the holdout set improves the final performance

# dict with params for different datasets (k and b)
params_dict = {
@@ -34,14 +40,20 @@
if dataset in params_dict:
k, b = params_dict[dataset]
else:
-print(f"Dataset {dataset} is not supported for this demo.")
+print(f"Dataset {dataset} is not supported by this demo.")

# Download dataset (source: http://manikvarma.org/downloads/XC/XMLRepository.html)
+# Datasets were transformed to VW's format,
+# and features values were normalized (this helps with performance).
if not os.path.exists(train_data):
+print("Downloading train dataset:")
os.system("wget http://www.cs.put.poznan.pl/mwydmuch/data/{}".format(train_data))

if not os.path.exists(test_data):
+print("Downloading test dataset:")
os.system("wget http://www.cs.put.poznan.pl/mwydmuch/data/{}".format(test_data))


print(f"\nTraining Vowpal Wabbit {reduction} on {dataset} dataset:\n")
start = time.time()
train_cmd = f"vw {train_data} -c --{reduction} {k} --loss_function logistic -l {l} --passes {passes} -b {b} -f {output_model} {other_training_params}"
@@ -39,9 +39,9 @@ public void probs() throws IOException {
float[][] expected = new float[][]{
new float[]{0.333333f, 0.333333f, 0.333333f},
new float[]{0.475999f, 0.262000f, 0.262000f},
-new float[]{0.373369f, 0.345915f, 0.280716f},
-new float[]{0.360023f, 0.415352f, 0.224625f},
-new float[]{0.340208f, 0.355738f, 0.304054f}
+new float[]{0.374730f, 0.344860f, 0.280410f},
+new float[]{0.365531f, 0.411037f, 0.223432f},
+new float[]{0.341800f, 0.354783f, 0.303417f}
};
assertEquals(expected.length, pred.length);
for (int i=0; i<expected.length; ++i)
5 changes: 3 additions & 2 deletions python/pylibvw.cc
@@ -841,7 +841,7 @@ void unsetup_example(vw_ptr vwP, example_ptr ae)
}
}

-uint32_t multiplier = all.reduction_state.wpp << all.weights.stride_shift();
+uint32_t multiplier = all.reduction_state.total_feature_width << all.weights.stride_shift();
if (multiplier != 1) // make room for per-feature information.
for (auto ns : ae->indices)
for (auto& idx : ae->feature_space[ns].indices) idx /= multiplier;
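
The arithmetic in this hunk can be sketched in isolation. The Python below mirrors only the shift and the division shown in the C++ lines; the function name and the sample values are made up for illustration, and integer division stands in for the C++ division on unsigned indices.

```python
from typing import List


def unscale_indices(indices: List[int], total_feature_width: int, stride_shift: int) -> List[int]:
    """Divide each feature index by `total_feature_width << stride_shift`."""
    multiplier = total_feature_width << stride_shift
    if multiplier == 1:
        # Nothing to undo when the multiplier is 1.
        return list(indices)
    return [idx // multiplier for idx in indices]


if __name__ == "__main__":
    # Hypothetical values: a width of 3 and a stride shift of 2 give a multiplier of 12.
    print(unscale_indices([0, 12, 96], total_feature_width=3, stride_shift=2))
```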
@@ -1657,7 +1657,8 @@ BOOST_PYTHON_MODULE(pylibvw)

py::class_<Search::search, search_ptr>("search")
.def("set_options", &Search::search::set_options, "Set global search options (auto conditioning, etc.)")
-//.def("set_num_learners", &Search::search::set_num_learners, "Set the total number of learners you want to
+//.def("set_total_feature_width", &Search::search::set_total_feature_width, "Set the total number of learners you
+// want to
// train")
.def("get_history_length", &Search::search::get_history_length,
"Get the value specified by --search_history_length")
135 changes: 135 additions & 0 deletions python/tests/test_ccb.py
@@ -122,3 +122,138 @@ def test_ccb_non_slot_none_outcome():
# CCB label is set to UNSET by default.
assert label.type == vowpalwabbit.CCBLabelType.UNSET
assert label.outcome is None
+
+
+def test_ccb_and_automl():
+    import random, json, os, shutil
+    import numpy as np
+    from vw_executor.vw import Vw
+
+    people_ccb = ["Tom", "Anna"]
+    topics_ccb = ["sports", "politics", "music"]
+
+    def my_ccb_simulation(n=10000, swap_after=5000, variance=0, bad_features=0, seed=0):
+        random.seed(seed)
+        np.random.seed(seed)
+
+        envs = [[[0.8, 0.4], [0.2, 0.4]]]
+        offset = 0
+
+        animals = [
+            "cat",
+            "dog",
+            "bird",
+            "fish",
+            "horse",
+            "cow",
+            "pig",
+            "sheep",
+            "goat",
+            "chicken",
+        ]
+        colors = [
+            "red",
+            "green",
+            "blue",
+            "yellow",
+            "orange",
+            "purple",
+            "black",
+            "white",
+            "brown",
+            "gray",
+        ]
+
+        for i in range(1, n):
+            person = random.randint(0, 1)
+            chosen = [int(i) for i in np.random.permutation(2)]
+            rewards = [envs[offset][person][chosen[0]], envs[offset][person][chosen[1]]]
+
+            for i in range(len(rewards)):
+                rewards[i] += np.random.normal(0.5, variance)
+
+            current = {
+                "c": {
+                    "shared": {"name": people_ccb[person]},
+                    "_multi": [{"a": {"topic": topics_ccb[i]}} for i in range(2)],
+                    "_slots": [{"_id": i} for i in range(2)],
+                },
+                "_outcomes": [
+                    {
+                        "_label_cost": -min(rewards[i], 1),
+                        "_a": chosen[i:],
+                        "_p": [1.0 / (2 - i)] * (2 - i),
+                    }
+                    for i in range(2)
+                ],
+            }
+
+            current["c"]["shared"][random.choice(animals)] = random.random()
+            current["c"]["shared"][random.choice(animals)] = random.random()
+
+            current["c"]["_multi"][random.choice(range(len(current["c"]["_multi"])))][
+                random.choice(colors)
+            ] = random.random()
+            current["c"]["_multi"][random.choice(range(len(current["c"]["_multi"])))][
+                random.choice(colors)
+            ] = random.random()
+
+            if random.random() < 0.50:
+                current["c"]["_multi"].append(
+                    {random.choice(colors): {random.choice(animals): 0.6666}}
+                )
+
+            current["c"]["_slots"][random.choice(range(len(current["c"]["_slots"])))][
+                random.choice(colors)
+            ] = random.random()
+            current["c"]["_slots"][random.choice(range(len(current["c"]["_slots"])))][
+                random.choice(colors)
+            ] = random.random()
+
+            yield current
+
+    def save_examples(examples, path):
+        with open(path, "w") as f:
+            for ex in examples:
+                f.write(f'{json.dumps(ex, separators=(",", ":"))}\n')
+
+    input_file = "ccb.json"
+    cache_dir = ".cache"
+    save_examples(
+        my_ccb_simulation(n=10, variance=0.1, bad_features=1, seed=0), input_file
+    )
+
+    assert os.path.exists(input_file)
+
+    vw = Vw(cache_dir, handler=None)
+    q = vw.train(
+        input_file, "-b 18 -q :: --ccb_explore_adf --dsjson", ["--invert_hash"]
+    )
+    automl = vw.train(
+        input_file,
+        "-b 20 --ccb_explore_adf --log_output stderr --dsjson --automl 4 --oracle_type one_diff --verbose_metrics",
+        ["--invert_hash", "--extra_metrics"],
+    )
+
+    automl_metrics = json.load(open(automl.outputs["--extra_metrics"][0]))
+
+    # we need champ switches to be zero for this to work
+    assert "total_champ_switches" in automl_metrics
+    assert automl_metrics["total_champ_switches"] == 0
+
+    q_weights = q[0].model9("--invert_hash").weights.sort_index()
+    automl_weights = automl[0].model9("--invert_hash").weights.sort_index()
+    automl_champ_weights = automl_weights[~automl_weights.index.str.contains("\[")]
+
+    fts_names_q = set([n for n in q_weights.index])
+    fts_names_automl = set([n for n in automl_weights.index if "[" not in n])
+
+    assert len(fts_names_q) == len(fts_names_automl)
+    assert fts_names_q == fts_names_automl
+    # since there is no champ switch automl_champ should be q:: and be equal to the non-automl q:: instance
+    assert q_weights.equals(automl_champ_weights)
+
+    os.remove(input_file)
+    shutil.rmtree(cache_dir)
+    assert not os.path.exists(input_file)
+    assert not os.path.exists(cache_dir)
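
The `_outcomes` list built in `my_ccb_simulation` gives each slot the actions that are still available at that point, with a uniform probability over them. A minimal sketch of that pattern for an arbitrary number of slots follows; `uniform_slot_outcomes` is a hypothetical helper written for illustration and is not part of the test or of VW's API.

```python
from typing import Dict, List


def uniform_slot_outcomes(chosen: List[int], rewards: List[float]) -> List[Dict]:
    """Build one outcome per slot, as the test does for two slots.

    Slot i sees the actions chosen[i:] (the action taken first, then the ones
    left over) and assigns each of them probability 1 / (n - i).
    """
    n = len(chosen)
    return [
        {
            "_label_cost": -min(rewards[i], 1),  # reward capped at 1, negated into a cost
            "_a": chosen[i:],
            "_p": [1.0 / (n - i)] * (n - i),
        }
        for i in range(n)
    ]


if __name__ == "__main__":
    for outcome in uniform_slot_outcomes([1, 0], [0.8, 1.4]):
        print(outcome)
```

With two slots this reproduces the shape used in the test: the first slot has two candidate actions at probability 0.5 each, and the second slot has the single remaining action at probability 1.0.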
2 changes: 1 addition & 1 deletion python/vowpalwabbit/pyvw.py
@@ -640,7 +640,7 @@ def learn(self, ec: Union["Example", List["Example"], str, List[str]]) -> None:

elif isinstance(ec, list):
if not self._is_multiline():
-raise TypeError("Expecting a mutiline Learner.")
+raise TypeError("Expecting a multiline learner.")
if len(ec) == 0:
raise ValueError("An empty list is invalid")
if isinstance(ec[0], str):
12 changes: 6 additions & 6 deletions python/vowpalwabbit/sklearn.py
@@ -738,12 +738,12 @@ class would be predicted.
>>> model = VWMultiClassifier(oaa=3, loss_function='logistic')
>>> _ = model.fit(X, y)
>>> model.predict_proba(X)
-array([[0.38928846, 0.30534211, 0.30536944],
-[0.40664235, 0.29666999, 0.29668769],
-[0.52324486, 0.23841164, 0.23834346],
-[0.5268591 , 0.23660533, 0.23653553],
-[0.65397811, 0.17312808, 0.17289382],
-[0.61190444, 0.19416356, 0.19393198]])
+array([[0.38924146, 0.30537927, 0.30537927],
+[0.40661219, 0.29669389, 0.29669389],
+[0.52335149, 0.23832427, 0.23832427],
+[0.52696788, 0.23651604, 0.23651604],
+[0.65430814, 0.17284594, 0.17284594],
+[0.61224216, 0.19387889, 0.19387889]])
"""
return VW.predict(self, X=X)

16 changes: 14 additions & 2 deletions test/core.vwtest.json
@@ -5082,12 +5082,13 @@
{
"id": 394,
"desc": "Test using automl with ccb",
-"vw_command": "--ccb_explore_adf --example_queue_limit 7 -d train-sets/ccb_reuse_small.data --automl 3 --verbose_metrics --extra_metrics aml_ccb_metrics.json --oracle_type one_diff",
+"vw_command": "--ccb_explore_adf --example_queue_limit 7 -d train-sets/ccb_automl.dsjson --automl 4 --dsjson --verbose_metrics --extra_metrics aml_ccb_metrics.json --oracle_type one_diff",
"diff_files": {
+"stderr": "train-sets/ref/automl_ccb.stderr",
"aml_ccb_metrics.json": "test-sets/ref/aml_ccb_metrics.json"
},
"input_files": [
-"train-sets/ccb_reuse_small.data"
+"train-sets/ccb_automl.dsjson"
]
},
{
@@ -5884,5 +5885,16 @@
"input_files": [
"train-sets/automl_spin_off.txt"
]
+},
+{
+"id": 455,
+"desc": "large action spaces with cb_explore_adf epsilon greedy and ips",
+"vw_command": "--cb_explore_adf -d train-sets/las_100_actions.txt --noconstant --large_action_space --cb_type ips",
+"diff_files": {
+"stderr": "train-sets/ref/las_egreedy_ips.stderr"
+},
+"input_files": [
+"train-sets/las_100_actions.txt"
+]
}
]