
moverscore fails #6

Open
michaelmior opened this issue Sep 19, 2023 · 1 comment

Comments

@michaelmior

Describe the bug
The MoverScore calculation fails when it uses the DistilBert tokenizer: `moverscore_v2.py` accesses `tokenizer.max_len`, an attribute that no longer exists in recent transformers versions.

To Reproduce

>>> import nlgmetricverse
>>> scorer = nlgmetricverse.NLGMetricverse(metrics=['moverscore'])
>>> scorer(predictions=['foo'], references=['bar'])
# Expected score value

Exception Traceback (if available)

>>> import nlgmetricverse
2023-09-19 13:19:51.367801: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> scorer = nlgmetricverse.NLGMetricverse(metrics=['moverscore'])
/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/transformers/generation_utils.py:24: FutureWarning: Importing `GenerationMixin` from `src/transformers/generation_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import GenerationMixin` instead.
  warnings.warn(
/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/transformers/generation_tf_utils.py:24: FutureWarning: Importing `TFGenerationMixin` from `src/transformers/generation_tf_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import TFGenerationMixin` instead.
  warnings.warn(
loading configuration file config.json from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_attentions": true,
  "output_hidden_states": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.33.2",
  "vocab_size": 30522
}

loading file vocab.txt from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/tokenizer_config.json
loading configuration file config.json from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.33.2",
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/model.safetensors
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of DistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertModel for predictions without further training.
>>> scorer(predictions=['foo'], references=['bar'])
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/moverscore_v2.py", line 30, in process
    a = ["[CLS]"]+truncate(tokenizer.tokenize(a))+["[SEP]"]
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/moverscore_v2.py", line 25, in truncate
    if len(tokens) > tokenizer.max_len - 2:
AttributeError: 'DistilBertTokenizer' object has no attribute 'max_len'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/core.py", line 86, in __call__
    score = self._compute_single_score(inputs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/core.py", line 208, in _compute_single_score
    score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/evaluate/module.py", line 444, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/metrics/_core/base.py", line 362, in _compute
    result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/metrics/_core/base.py", line 302, in evaluate
    return eval_fn(predictions=predictions, references=references, **kwargs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/metrics/moverscore/moverscore_planet.py", line 132, in _compute_single_pred_single_ref
    idf_dict_ref = moverscore_v2.get_idf_dict(references)  # idf_dict_ref = defaultdict(lambda: 1.)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/moverscore_v2.py", line 42, in get_idf_dict
    idf_count.update(chain.from_iterable(p.map(process_partial, arr)))
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AttributeError: 'DistilBertTokenizer' object has no attribute 'max_len'

Environment Information:

  • OS: Ubuntu 20.04.6 LTS
  • nlgmetricverse version: 0.9.9
  • evaluate version: 0.4.0
  • datasets version: 2.9.0
  • moverscore version: 1.0.3
@Taghreed7878

You could modify this line in the script as follows:

if len(tokens) > tokenizer.model_max_length - 2:

And make sure that any other occurrences of `tokenizer.max_len` in the script are updated the same way too.
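For reference, here is a minimal sketch of the corrected `truncate` logic. A dummy stand-in class is used instead of the real `DistilBertTokenizer` so it runs without `transformers` installed; in recent transformers releases the old `max_len` attribute was renamed to `model_max_length`:

```python
# Sketch of moverscore_v2's truncate() with the rename applied.
# DummyTokenizer is a hypothetical stand-in for DistilBertTokenizer.
class DummyTokenizer:
    model_max_length = 512  # DistilBERT's usual maximum sequence length


def truncate(tokenizer, tokens):
    # Reserve 2 positions for the [CLS] and [SEP] special tokens.
    if len(tokens) > tokenizer.model_max_length - 2:
        tokens = tokens[: tokenizer.model_max_length - 2]
    return tokens


tok = DummyTokenizer()
print(len(truncate(tok, ["tok"] * 600)))  # 510
```

The same one-line rename is all that `moverscore_v2.py` needs, since `model_max_length` plays the same role `max_len` did in older transformers versions.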
