QuestionAnsweringPipeline returns full context in Japanese #17706

Closed
KoichiYasuoka opened this issue Jun 15, 2022 · 6 comments · Fixed by #18010

@KoichiYasuoka
Contributor

System Info

- `transformers` version: 4.19.4
- Platform: Linux-5.10.0-13-amd64-x86_64-with-glibc2.31
- Python version: 3.9.2
- Huggingface_hub version: 0.1.0
- PyTorch version (GPU?): 1.11.0+cu102 (False)

Who can help?

@Narsil @sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

QuestionAnsweringPipeline (almost always) returns the full context in Japanese, for example:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, QuestionAnsweringPipeline
tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora-ud-head")
model = AutoModelForQuestionAnswering.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora-ud-head")
qap = QuestionAnsweringPipeline(tokenizer=tokenizer, model=model)
print(qap(question="国語", context="全学年にわたって小学校の国語の教科書に挿し絵が用いられている"))

returns {'score': 0.9999955892562866, 'start': 0, 'end': 30, 'answer': '全学年にわたって小学校の国語の教科書に挿し絵が用いられている'}. On the other hand, directly with torch.argmax

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora-ud-head")
model = AutoModelForQuestionAnswering.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora-ud-head")
question = "国語"
context = "全学年にわたって小学校の国語の教科書に挿し絵が用いられている"
inputs = tokenizer(question, context, return_tensors="pt", return_offsets_mapping=True)
offsets = inputs.pop("offset_mapping").tolist()[0]
outputs = model(**inputs)
# pick the highest-scoring start/end tokens and map them back to character offsets
start, end = torch.argmax(outputs.start_logits), torch.argmax(outputs.end_logits)
print(context[offsets[start][0]:offsets[end][-1]])

the model returns the answer "教科書" correctly.

Expected behavior

Return the right answer "教科書" instead of the full context.
@KoichiYasuoka
Contributor Author

I suspect that the "encoding" for Japanese models does not work at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/question_answering.py#L452
but I'm not sure how to fix it.
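
For what it's worth, here is a minimal check (my own sketch, not the pipeline's code) that illustrates the suspicion, assuming the fast tokenizer exposes word_ids(): if every context token maps to the same word index, realigning the predicted span to word boundaries would expand it to the whole context.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora-ud-head")
# hypothetical check: how does the tokenizer group the context tokens into "words"?
enc = tokenizer("国語", "全学年にわたって小学校の国語の教科書に挿し絵が用いられている")
# if all context tokens share a single word id, word-level alignment of the
# predicted span covers the full context instead of "教科書"
print(enc.word_ids())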

@gante
Member

gante commented Jun 15, 2022

Hi @KoichiYasuoka 👋 As per our issues guidelines, we reserve GitHub issues for bugs in the repository and/or feature requests. For any other requests, we'd like to invite you to use our forum 🤗

(Since the issue is about the quality of the output, it's probably model-related, and not a bug per se. In any case, if you suspect it is due to a bug in transformers, please add more information here)

@Narsil
Contributor

Narsil commented Jul 4, 2022

Hi @KoichiYasuoka ,

This seems to be linked to the pipeline's attempt to align answers on "words". The problem is that this Japanese tokenizer never cuts on "words", so the whole context is a single word, and the realignment therefore forgets all about the actual answer, which is a bit sad.

I created a PR that adds a new parameter to disable this behaviour so it can work on your use case (I personally think it should be the default, but we cannot change that because of backward compatibility).
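
A sketch of what the reproduction above could look like once the parameter lands (assuming align_to_words is accepted as a call-time argument, as the PR proposes):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, QuestionAnsweringPipeline
tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora-ud-head")
model = AutoModelForQuestionAnswering.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora-ud-head")
qap = QuestionAnsweringPipeline(tokenizer=tokenizer, model=model)
# align_to_words=False skips the word-boundary realignment, so the predicted
# character span is returned as-is (expected answer: "教科書")
print(qap(question="国語", context="全学年にわたって小学校の国語の教科書に挿し絵が用いられている", align_to_words=False))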

@KoichiYasuoka
Contributor Author

Thank you @Narsil for creating the new PR with the align_to_words=False option. Well, can I use the option in the widget on the deberta-base-japanese-aozora-ud-head page?

@Narsil
Contributor

Narsil commented Jul 15, 2022

Hi, the PR is not merged yet, and it will take a few days before it lands on the API (the API doesn't run master).

Afterwards, although the parameter will be undocumented and could therefore be deactivated at any time (though we rarely do this), you could send align_to_words: false within the parameters part of your query to the API (see the sketch below).

Unfortunately the widget itself will not use parameters.

Does that answer your question?
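
A sketch of what such a request could look like, assuming the standard hosted Inference API endpoint and that entries under "parameters" are forwarded to the pipeline call (the token is a placeholder):

import requests
API_URL = "https://api-inference.huggingface.co/models/KoichiYasuoka/deberta-base-japanese-aozora-ud-head"
headers = {"Authorization": "Bearer <your_api_token>"}  # placeholder token
payload = {
    "inputs": {"question": "国語", "context": "全学年にわたって小学校の国語の教科書に挿し絵が用いられている"},
    "parameters": {"align_to_words": False},  # undocumented parameter discussed above
}
print(requests.post(API_URL, headers=headers, json=payload).json())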

@github-actions

github-actions bot commented Aug 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
