
moverscore fails #6

Open
michaelmior opened this issue Sep 19, 2023 · 1 comment

Comments

@michaelmior

Describe the bug
The MoverScore calculation fails when it uses the DistilBert tokenizer: `moverscore_v2.py` accesses `tokenizer.max_len`, an attribute that no longer exists in recent transformers versions.

To Reproduce

>>> import nlgmetricverse
>>> scorer = nlgmetricverse.NLGMetricverse(metrics=['moverscore'])
>>> scorer(predictions=['foo'], references=['bar'])
# Expected score value

Exception Traceback (if available)

>>> import nlgmetricverse
2023-09-19 13:19:51.367801: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> scorer = nlgmetricverse.NLGMetricverse(metrics=['moverscore'])
/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/transformers/generation_utils.py:24: FutureWarning: Importing `GenerationMixin` from `src/transformers/generation_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import GenerationMixin` instead.
  warnings.warn(
/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/transformers/generation_tf_utils.py:24: FutureWarning: Importing `TFGenerationMixin` from `src/transformers/generation_tf_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import TFGenerationMixin` instead.
  warnings.warn(
loading configuration file config.json from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_attentions": true,
  "output_hidden_states": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.33.2",
  "vocab_size": 30522
}

loading file vocab.txt from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/tokenizer_config.json
loading configuration file config.json from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.33.2",
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /home/mmior/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/6cdc0aad91f5ae2e6712e91bc7b65d1cf5c05411/model.safetensors
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of DistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertModel for predictions without further training.
>>> scorer(predictions=['foo'], references=['bar'])
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/moverscore_v2.py", line 30, in process
    a = ["[CLS]"]+truncate(tokenizer.tokenize(a))+["[SEP]"]
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/moverscore_v2.py", line 25, in truncate
    if len(tokens) > tokenizer.max_len - 2:
AttributeError: 'DistilBertTokenizer' object has no attribute 'max_len'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/core.py", line 86, in __call__
    score = self._compute_single_score(inputs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/core.py", line 208, in _compute_single_score
    score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/evaluate/module.py", line 444, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/metrics/_core/base.py", line 362, in _compute
    result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/metrics/_core/base.py", line 302, in evaluate
    return eval_fn(predictions=predictions, references=references, **kwargs)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/nlgmetricverse/metrics/moverscore/moverscore_planet.py", line 132, in _compute_single_pred_single_ref
    idf_dict_ref = moverscore_v2.get_idf_dict(references)  # idf_dict_ref = defaultdict(lambda: 1.)
  File "/home/mmior/.local/share/virtualenvs/annotate-schema-yEyO5xw6/lib/python3.8/site-packages/moverscore_v2.py", line 42, in get_idf_dict
    idf_count.update(chain.from_iterable(p.map(process_partial, arr)))
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/mmior/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AttributeError: 'DistilBertTokenizer' object has no attribute 'max_len'

Environment Information:

  • OS: Ubuntu 20.04.6 LTS
  • nlgmetricverse version: 0.9.9
  • evaluate version: 0.4.0
  • datasets version: 2.9.0
  • moverscore version: 1.0.3
@Taghreed7878

You could modify this line in the script as follows:

if len(tokens) > tokenizer.model_max_length - 2:

And make sure that any other occurrences of `tokenizer.max_len` in the script are updated the same way too.
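For reference, here is a minimal sketch of the corrected `truncate` logic. A dummy stand-in class is used instead of the real `DistilBertTokenizer` so it runs without `transformers` installed; in recent transformers releases the old `max_len` attribute was renamed to `model_max_length`:

```python
# Sketch of moverscore_v2's truncate() with the rename applied.
# DummyTokenizer is a hypothetical stand-in for DistilBertTokenizer.
class DummyTokenizer:
    model_max_length = 512  # DistilBERT's usual maximum sequence length


def truncate(tokenizer, tokens):
    # Reserve 2 positions for the [CLS] and [SEP] special tokens.
    if len(tokens) > tokenizer.model_max_length - 2:
        tokens = tokens[: tokenizer.model_max_length - 2]
    return tokens


tok = DummyTokenizer()
print(len(truncate(tok, ["tok"] * 600)))  # 510
```

The same one-line rename is all that `moverscore_v2.py` needs, since `model_max_length` plays the same role `max_len` did in older transformers versions.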
