Question on AraBERT-Trainer-HyperParameterOpt-NER Notebok #179

ReemAlJunaid · 2022-10-29T02:22:05Z

ReemAlJunaid
Oct 29, 2022

Hi everyone!

I'm training an AraBERT model for NER, exactly as what you did in this notebook. After training and saving the model, I would like to see the predictions for some samples, I started by writing the following code:

from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor
import torch

def predict(sample_text):
encoded_text = tokenizer.encode_plus(
sample_text,
max_length=138,
add_special_tokens=True,
return_token_type_ids=False,
pad_to_max_length=True,
return_attention_mask=True,
return_tensors='pt',
)
input_ids = encoded_text ['input_ids']
attention_mask = encoded_text ['attention_mask']
output = arabert_model(input_ids, attention_mask)
label_indices = np.argmax(output, axis=2)
print(f'Text: {text_preprocessed}')
print(f'Tags : {label_indices }')
return prediction

arabert_model = AutoModel.from_pretrained('/gdrive/MyDrive/AraBERT Model Config')

text = "محمد ذهب إلى أمريكا للحصول على شهادة الماجستير. "

arabert_prep = ArabertPreprocessor(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text_preprocessed=arabert_prep.preprocess(text)

predict(text_preprocessed)

I don't know if my mapping is correct or not and also I want the output to be in the form of: [B-PER, O, O, B-LOC, O, O, O, O]

How can I accomplish that?

Thank you

WissamAntoun · 2022-10-29T14:12:33Z

WissamAntoun
Oct 29, 2022
Maintainer

for predictions i suggest using the pipeline class from huggingface transformers, and just giving it the model name instead of doing the manual prediction.

Also of you model is based on arabertv2 with presegmentation it might cause issues, but I'm not sure

0 replies

ReemAlJunaid · 2022-10-29T23:39:04Z

ReemAlJunaid
Oct 29, 2022
Author

I tried to do that using this code:

from transformers import pipeline, AutoModel, AutoModelForTokenClassification, AutoTokenizer

model_name = 'aubmindlab/bert-base-arabertv02'
arabert_model = AutoModel.from_pretrained('/gdrive/MyDrive/AraBERT Model Config')
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "محمد ذهب إلى أمريكا للحصول على شهادة الماجستير."

pipe = pipeline("ner", model=arabert_model, tokenizer=tokenizer)
pipe(text)

It shows me this error:

Some weights of the model checkpoint at /gdrive/MyDrive/LearningSpacy were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']

This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at /gdrive/MyDrive/LearningSpacy and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The model 'BertModel' is not supported for ner. Supported models are ['AlbertForTokenClassification', 'BertForTokenClassification', 'BigBirdForTokenClassification', 'BloomForTokenClassification', 'CamembertForTokenClassification', 'CanineForTokenClassification', 'ConvBertForTokenClassification', 'Data2VecTextForTokenClassification', 'DebertaForTokenClassification', 'DebertaV2ForTokenClassification', 'DistilBertForTokenClassification', 'ElectraForTokenClassification', 'ErnieForTokenClassification', 'EsmForTokenClassification', 'FlaubertForTokenClassification', 'FNetForTokenClassification', 'FunnelForTokenClassification', 'GPT2ForTokenClassification', 'IBertForTokenClassification', 'LayoutLMForTokenClassification', 'LayoutLMv2ForTokenClassification', 'LayoutLMv3ForTokenClassification', 'LiltForTokenClassification', 'LongformerForTokenClassification', 'LukeForTokenClassification', 'MarkupLMForTokenClassification', 'MegatronBertForTokenClassification', 'MobileBertForTokenClassification', 'MPNetForTokenClassification', 'NezhaForTokenClassification', 'NystromformerForTokenClassification', 'QDQBertForTokenClassification', 'RemBertForTokenClassification', 'RobertaForTokenClassification', 'RoFormerForTokenClassification', 'SqueezeBertForTokenClassification', 'XLMForTokenClassification', 'XLMRobertaForTokenClassification', 'XLMRobertaXLForTokenClassification', 'XLNetForTokenClassification', 'YosoForTokenClassification'].

KeyError Traceback (most recent call last)
in
12
13 pipe = pipeline("ner", model=arabert_model, tokenizer=tokenizer)
---> 14 pipe(text)

4 frames
/usr/local/lib/python3.7/dist-packages/transformers/pipelines/token_classification.py in aggregate(self, pre_entities, aggregation_strategy)
321 score = pre_entity["scores"][entity_idx]
322 entity = {
--> 323 "entity": self.model.config.id2label[entity_idx],
324 "score": score,
325 "index": pre_entity["index"],

KeyError: 410

3 replies

WissamAntoun Oct 30, 2022
Maintainer

this is weird!
can you copy paste the pretrained model config please?

ReemAlJunaid Oct 31, 2022
Author

{
"_name_or_path": "aubmindlab/bert-base-arabertv02",
"architectures": [
"BertForTokenClassification"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2",
"3": "LABEL_3",
"4": "LABEL_4",
"5": "LABEL_5",
"6": "LABEL_6",
"7": "LABEL_7",
"8": "LABEL_8"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2,
"LABEL_3": 3,
"LABEL_4": 4,
"LABEL_5": 5,
"LABEL_6": 6,
"LABEL_7": 7,
"LABEL_8": 8
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.24.0.dev0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 64000
}

ReemAlJunaid Oct 31, 2022
Author

It's working now, I just change AutoModel to AutoModelForTokenClassification.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question on AraBERT-Trainer-HyperParameterOpt-NER Notebok #179

{{title}}

Replies: 2 comments 3 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Question on AraBERT-Trainer-HyperParameterOpt-NER Notebok #179

ReemAlJunaid Oct 29, 2022

Replies: 2 comments · 3 replies

WissamAntoun Oct 29, 2022 Maintainer

ReemAlJunaid Oct 29, 2022 Author

WissamAntoun Oct 30, 2022 Maintainer

ReemAlJunaid Oct 31, 2022 Author

ReemAlJunaid Oct 31, 2022 Author

ReemAlJunaid
Oct 29, 2022

Replies: 2 comments 3 replies

WissamAntoun
Oct 29, 2022
Maintainer

ReemAlJunaid
Oct 29, 2022
Author

WissamAntoun Oct 30, 2022
Maintainer

ReemAlJunaid Oct 31, 2022
Author

ReemAlJunaid Oct 31, 2022
Author