enable tokenizer customization in HFDetector #855
Conversation
* promote `tokenizer_kwargs` as DEFAULT_PARAM of parent class * remove `__init__()` overrides no longer needed Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
garak/detectors/base.py (Outdated)

-DEFAULT_PARAMS = Detector.DEFAULT_PARAMS | {"hf_args": {"device": "cpu"}}
+DEFAULT_PARAMS = Detector.DEFAULT_PARAMS | {
+    "hf_args": {"device": "cpu"},
+    "tokenizer_kwargs": {"padding": True, "truncation": True, "max_length": 512},
+}
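As a minimal sketch of what this diff does, the class-level defaults compose via PEP 584 dict union, so subclasses inherit the parent's params and layer their own on top. The `_DETECTOR_DEFAULTS` stand-in below is an assumption; it is not garak's actual `Detector.DEFAULT_PARAMS`.

```python
# Stand-in for Detector.DEFAULT_PARAMS (illustrative only, not garak's real dict)
_DETECTOR_DEFAULTS = {"skip": False}

# Mirrors the diff: child defaults = parent defaults | child-specific additions
DEFAULT_PARAMS = _DETECTOR_DEFAULTS | {
    "hf_args": {"device": "cpu"},
    "tokenizer_kwargs": {"padding": True, "truncation": True, "max_length": 512},
}

# Parent keys survive, child keys are added; a right-hand operand wins on conflict
print(DEFAULT_PARAMS["skip"])  # False
print(DEFAULT_PARAMS["tokenizer_kwargs"]["max_length"])  # 512
```

Promoting `tokenizer_kwargs` into the parent's defaults is what lets subclasses drop their `__init__()` overrides: the merge happens declaratively at class definition.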
I suspect `max_length` might want to come from model ctx_len - but this is out of scope for this PR.
I'd actually exclude `max_length`, since it's very specifically model/config dependent and this might override defaults. If the model's `max_length` is, for some reason, <512, this default could throw errors.
On the other hand, running without a length cap also risks breaking models (this happens regularly on e.g. some NVCF instances). I think it's OK to offer a default control over this with a mild value that balances both being too high and too low. 512 seems reasonable (bearing in mind that detectors tend to be smaller models - this is a side effect of garak needing to run them at scale, but sometimes having primary compute allocated to the target model, reducing detector resources).
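One way to reconcile the two concerns above (a runaway input length vs. a default that exceeds a small model's context) would be to clamp the configured cap to the model's own limit. The helper below is purely illustrative - the function name and surrounding structure are assumptions, not garak code - though `model_max_length` is a real attribute on Hugging Face tokenizers.

```python
def effective_tokenizer_kwargs(tokenizer_kwargs: dict, model_max_length: int) -> dict:
    """Clamp a configured max_length to the model's own limit (sketch only)."""
    kwargs = dict(tokenizer_kwargs)  # copy: don't mutate the shared class default
    configured = kwargs.get("max_length")
    if configured is not None:
        kwargs["max_length"] = min(configured, model_max_length)
    return kwargs

# A 512 default gets clamped for a model with a 256-token context
print(effective_tokenizer_kwargs(
    {"padding": True, "truncation": True, "max_length": 512}, 256
))
```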
LGTM
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Commonize parameters utilized in `HFDetector`; existing detectors currently use common arguments.

* promote `tokenizer_kwargs` as DEFAULT_PARAM of parent class
* remove `__init__()` overrides no longer needed

Expands on earlier exposure of model selection for HFDetectors in #810.

These values seem independent from `hf_args` used to load the model.

Some additional code review suggests that `misleading.MustContradictNLI` does not actually consume these values, since it overrides `detect()` with a custom method. Should this PR be expanded to consume values from `self.tokenizer_kwargs` for `max_length` and `truncation`? The values are currently hardcoded in `detect()` as local values. It does not look like `padding` is ever consumed in `misleading.MustContradictNLI`, and it could be suppressed with instance defaults for that class.
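The suggestion above - reading tokenizer settings from `self.tokenizer_kwargs` instead of hardcoding them inside `detect()` - could look roughly like the sketch below. Everything here is hypothetical scaffolding (`FakeTokenizer`, `SketchDetector`), not garak's actual class hierarchy; it only illustrates the pattern of routing call-time kwargs through the instance config.

```python
class FakeTokenizer:
    """Stand-in for a Hugging Face tokenizer; records the kwargs it was given."""
    def __call__(self, text, **kwargs):
        return {"text": text, "kwargs": kwargs}

class SketchDetector:
    # Class-level default, as promoted to the parent class in this PR
    tokenizer_kwargs = {"padding": True, "truncation": True, "max_length": 512}

    def __init__(self):
        self.tokenizer = FakeTokenizer()

    def detect(self, outputs):
        encoded = []
        for text in outputs:
            # Pull settings from instance config rather than local literals,
            # so per-run overrides of tokenizer_kwargs take effect here too
            encoded.append(self.tokenizer(text, **self.tokenizer_kwargs))
        return encoded

results = SketchDetector().detect(["hello"])
print(results[0]["kwargs"]["max_length"])  # 512
```

With this shape, a subclass like the one discussed could suppress `padding` simply by shadowing `tokenizer_kwargs` in its own defaults, with no `detect()` edits needed.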