Improving my NER model's accuracy #13244
-
I am building an NER model with two labels: ACQUIREE_COMPANY and ACQUIROR_COMPANY. The training data is drawn from press releases announcing mergers and acquisitions of acquiree and acquiror companies. I annotated roughly 18,000 examples using ChatGPT-4, then trained the model with Prodigy using an 80% (training) / 20% (eval) split, both with a base model (en_core_web_lg) and without one. I am not getting above roughly 70% accuracy with the base model and 67% without it.

The stats for the training run without a base model were:

============================= Training pipeline =============================
...

The stats for the training run with en_core_web_lg as base model were:

============================= Training pipeline =============================
...

My spacy info:

============================== Info about spaCy ==============================
spaCy version 3.7.2

Some guidance on how to improve this accuracy would be greatly appreciated. Thanks.
-
That is indeed a relatively low accuracy for an NER challenge, and the reason might be your data model. In general, we advise against having sublabels of common entity types, as it will probably be challenging for an NER model to see the distinction between "ACQUIREE_COMPANY" and "ACQUIROR_COMPANY". On the other hand, simply identifying entities that are a "COMPANY" is something that an NER model is quite good at. So, my personal approach to an NLP solution would probably look a bit more like this:

1. Train an NER model to recognize any company name in text (or use/evaluate an existing one, e.g. from one of our pretrained pipelines)
2. Identify sentences that talk about acquisition (e.g. with a textcat component)
3. Implement a specific component that determines whether a given "COMPANY" entity mentioned in the sentence is the acquirer, the acquiree, or neither. This component could be implemented with e.g. a REL <https://github.com/explosion/projects/tree/v3/tutorials/rel_component> approach (note: the provided link is just a tutorial, not a production-ready solution) or something like the DependencyMatcher <https://spacy.io/usage/rule-based-matching#dependencymatcher>.
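To make the last step concrete, here is a minimal sketch of what a DependencyMatcher-based role assignment could look like. The sentence, the hand-built parse, and the ORTH-based pattern are all invented for illustration; with a real trained pipeline you would run the text through it (so the doc gets actual dependencies and lemmas) and anchor the pattern on LEMMA instead:

```python
import spacy
from spacy.tokens import Doc
from spacy.matcher import DependencyMatcher

nlp = spacy.blank("en")

# Hand-built parse for illustration; in practice, process the text with a
# trained pipeline (e.g. en_core_web_lg) to get real dependencies/lemmas.
words = ["Microsoft", "acquired", "Activision"]
doc = Doc(nlp.vocab, words=words, heads=[1, 1, 1],
          deps=["nsubj", "ROOT", "dobj"])

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    # Anchor on the acquisition verb (ORTH here; use LEMMA with a real parse).
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "acquired"}},
    # Its nominal subject is a candidate acquirer ...
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "acquirer",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    # ... and its direct object is a candidate acquiree.
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "acquiree",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("ACQUISITION", [pattern])

# token_ids come back in the same order the pattern nodes were declared:
# (verb, acquirer, acquiree)
for match_id, token_ids in matcher(doc):
    verb, acquirer, acquiree = (doc[i] for i in token_ids)
    print(f"{acquirer.text} acquires {acquiree.text}")
```

Real press releases will need several patterns (passive voice, "was acquired by", "merger with", etc.), but the same structure applies.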
Further, I would spend some time manually evaluating the ChatGPT results to verify that those 18,000 instances are of high enough quality. You're probably better off with 2,000 manually curated examples than with 18,000 automatically generated ones (but maybe you already did such a manual curation).
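When checking where the accuracy is lost, per-label scores on the eval split are often more informative than the overall figure (e.g. one of the two labels may be dragging the average down). A small sketch using spaCy's Scorer; the text, spans, and the deliberately wrong prediction are made up, and in practice the predicted docs would come from running your trained pipeline over the held-out 20%:

```python
import spacy
from spacy.scorer import Scorer
from spacy.training import Example

nlp = spacy.blank("en")

text = "Alpha Corp will buy Beta Systems."

# Pretend prediction: the model found only the acquirer.
pred_doc = nlp.make_doc(text)
pred_doc.ents = [pred_doc.char_span(0, 10, label="ACQUIROR_COMPANY")]

# Gold annotation has both entities (char offsets into `text`).
example = Example.from_dict(
    pred_doc,
    {"entities": [(0, 10, "ACQUIROR_COMPANY"),
                  (20, 32, "ACQUIREE_COMPANY")]},
)

# "ents_per_type" breaks precision/recall/F down per entity label.
scores = Scorer().score([example])
for label, prf in scores["ents_per_type"].items():
    print(label, prf)
```

A large per-label gap would point at noisy annotations for that label rather than at the model or training setup.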
-
Thanks for your informative reply - I will put your suggestions to use and hopefully they will improve the accuracy of the model.

In the meantime, I would like to ask about possible consulting services provided by spaCy - do you offer any such thing, and if so, what are the associated costs?