Improving my NER model's accuracy #13244
-
I am building an NER model with two labels: ACQUIREE_COMPANY and ACQUIROR_COMPANY. The training data is drawn from press releases announcing mergers and acquisitions of acquiree and acquiror companies. I annotated roughly 18,000 examples using ChatGPT-4, then trained the model with Prodigy using an 80% (training) / 20% (eval) split, both with a base model (en_core_web_lg) and without one. I am not getting above roughly 70% accuracy with the base model and 67% without it.

The stats for the training run without a base model were:

============================= Training pipeline =============================
...

The stats for the training run with en_core_web_lg as base model were:

============================= Training pipeline =============================
...

My spacy info:

============================== Info about spaCy ==============================
spaCy version 3.7.2

Some guidance on how to improve this accuracy would be greatly appreciated. Thanks.
-
That is indeed a relatively low accuracy for an NER challenge, and the reason might be your data model. In general, we advise against having sublabels of common entity types, as it will probably be challenging for an NER model to see the distinction between "ACQUIREE_COMPANY" and "ACQUIROR_COMPANY". On the other hand, simply identifying entities that are a "COMPANY" is something that an NER model is quite good at. So, my personal approach to an NLP solution would probably look a bit more like this:

1. Train an NER model to recognize any company name in text (or use/evaluate an existing one, e.g. from one of our pretrained pipelines)
2. Identify sentences that talk about acquisition (e.g. with a textcat component)
3. Implement a specific component that determines whether a given "COMPANY" entity mentioned in the sentence is the acquirer, the acquiree, or neither. This component could be implemented with e.g. a REL <https://github.com/explosion/projects/tree/v3/tutorials/rel_component> approach (note: the provided link is just a tutorial, not a production-ready solution) or something like the DependencyMatcher <https://spacy.io/usage/rule-based-matching#dependencymatcher>.
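To make the last step concrete, here is a minimal sketch of what a DependencyMatcher-based role assignment could look like. The sentence, the hand-built parse, and the ORTH-based pattern are all invented for illustration; with a real trained pipeline you would run the text through it (so the doc gets actual dependencies and lemmas) and anchor the pattern on LEMMA instead:

```python
import spacy
from spacy.tokens import Doc
from spacy.matcher import DependencyMatcher

nlp = spacy.blank("en")

# Hand-built parse for illustration; in practice, process the text with a
# trained pipeline (e.g. en_core_web_lg) to get real dependencies/lemmas.
words = ["Microsoft", "acquired", "Activision"]
doc = Doc(nlp.vocab, words=words, heads=[1, 1, 1],
          deps=["nsubj", "ROOT", "dobj"])

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    # Anchor on the acquisition verb (ORTH here; use LEMMA with a real parse).
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "acquired"}},
    # Its nominal subject is a candidate acquirer ...
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "acquirer",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    # ... and its direct object is a candidate acquiree.
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "acquiree",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("ACQUISITION", [pattern])

# token_ids come back in the same order the pattern nodes were declared:
# (verb, acquirer, acquiree)
for match_id, token_ids in matcher(doc):
    verb, acquirer, acquiree = (doc[i] for i in token_ids)
    print(f"{acquirer.text} acquires {acquiree.text}")
```

Real press releases will need several patterns (passive voice, "was acquired by", "merger with", etc.), but the same structure applies.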
Further, I would spend some time manually evaluating the ChatGPT results to verify that those 18,000 instances are of high enough quality. You're probably better off with 2,000 manually curated examples than with 18,000 automatically generated ones (but maybe you already did such a manual curation).
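When checking where the accuracy is lost, per-label scores on the eval split are often more informative than the overall figure (e.g. one of the two labels may be dragging the average down). A small sketch using spaCy's Scorer; the text, spans, and the deliberately wrong prediction are made up, and in practice the predicted docs would come from running your trained pipeline over the held-out 20%:

```python
import spacy
from spacy.scorer import Scorer
from spacy.training import Example

nlp = spacy.blank("en")

text = "Alpha Corp will buy Beta Systems."

# Pretend prediction: the model found only the acquirer.
pred_doc = nlp.make_doc(text)
pred_doc.ents = [pred_doc.char_span(0, 10, label="ACQUIROR_COMPANY")]

# Gold annotation has both entities (char offsets into `text`).
example = Example.from_dict(
    pred_doc,
    {"entities": [(0, 10, "ACQUIROR_COMPANY"),
                  (20, 32, "ACQUIREE_COMPANY")]},
)

# "ents_per_type" breaks precision/recall/F down per entity label.
scores = Scorer().score([example])
for label, prf in scores["ents_per_type"].items():
    print(label, prf)
```

A large per-label gap would point at noisy annotations for that label rather than at the model or training setup.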
-
Thanks for your informative reply - I will put your suggestions to use and hopefully they will improve the accuracy of the model.

In the meantime, I would like to ask about possible consulting services provided by spaCy - do you offer any such thing, and if so, what are the associated costs?