NER fine-tuning keeps failing (sometimes weird results, sometimes malloc errors) #910

Closed
tmbo opened this issue Mar 23, 2017 · 11 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@tmbo

tmbo commented Mar 23, 2017

There is no consistent behaviour when fine-tuning a spaCy NER model ("en" in this case).

Sometimes the newly trained model will annotate every word in the test sentence as an entity; sometimes "I am looking for a restaurant" is recognized as TIME. The minimal failing example below shows the behaviour.

Info about spaCy

  • Python version: 2.7.12
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • spaCy version: 1.7.2
  • Installed models: cache, de, de-1.0.0, en, en-1.1.0, en_glove_cc_300_1m_vectors-1.0.0

Code to reproduce:

import json
import os
import random

import pathlib
import spacy
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

if __name__ == '__main__':
    nlp = spacy.load("en")
    ner = nlp.entity

    train_data = [["hey",[]],["howdy",[]],["hey there",[]],["hello",[]],["hi",[]],["i'm looking for a place to eat",[]],["i'm looking for a place in the north of town",[[31,36,"location"]]],["show me chinese restaurants",[[8,15,"cuisine"]]],["show me chines restaurants",[[8,14,"cuisine"]]],["yes",[]],["yep",[]],["yeah",[]],["show me a mexican place in the centre",[[31,37,"location"],[10,17,"cuisine"]]],["bye",[]],["goodbye",[]],["good bye",[]],["stop",[]],["end",[]],["i am looking for an indian spot",[[20,26,"cuisine"]]],["search for restaurants",[]],["anywhere in the west",[[16,20,"location"]]],["central indian restaurant",[[0,7,"location"],[8,14,"cuisine"]]],["indeed",[]],["that's right",[]],["ok",[]],["great",[]]]
    additional_entity_types = [u'cuisine', u'location']

    # Fine tune the ner model
    for entity_type in additional_entity_types:
        if entity_type not in ner.cfg['actions']['1']:
            ner.add_label(entity_type)

    for itn in range(5):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(unicode(raw_text))
            gold = GoldParse(doc, entities=entity_offsets)
            ner.update(doc, gold)

    # store the fine tuned model
    if not os.path.exists("fine_tuned_ner_model"):
        os.mkdir("fine_tuned_ner_model")
    with open("fine_tuned_ner_model/config.json", 'w') as f:
        json.dump(ner.cfg, f)
    ner.model.dump("fine_tuned_ner_model/model")

    # load the fine tuned model
    ner = None
    ner = EntityRecognizer.load(pathlib.Path("fine_tuned_ner_model"), nlp.vocab)

    # test the model
    s = u"I am looking for a restaurant in Berlin"
    print("Test sentence: '{}'".format(s))
    doc = nlp(s, entity=False)
    ner(doc)
    print("Entities on fine tuned NER:")
    for e in doc.ents:
        print("\t'{}': {}".format(e.text, e.label_))

    print("Entities on plain spacy NER:")
    spacy_doc = nlp(s)
    for e in spacy_doc.ents:
        print("\t'{}': {}".format(e.text, e.label_))

Sometimes the result is a malloc error; sometimes I get:

Entities on fine tuned NER:
	'I': TIME
	'am looking': LANGUAGE
	'for': ORG
	'a restaurant': DATE
	'in Berlin': ORG
Entities on plain spacy NER:

sometimes it's:

Entities on fine tuned NER:
	'I am looking for a restaurant': TIME
	'in Berlin': ORG
Entities on plain spacy NER:

and when I'm unlucky I run into this:

python2.7(64385,0x7fffb1dbd3c0) malloc: *** error for object 0x7ff413a6ef68: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
@honnibal
Member

I haven't run this code yet, but I think I see the problem. I doubt that the labels are being added to ner.cfg. This means the model's transition system is out of sync with the classes in the model.

The workaround for now would be to re-add the labels after loading. The better fix will be to add them to the cfg during add_label.
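
A minimal sketch of that workaround, reusing the additional_entity_types list and the save directory from the reproduction code above (those names come from that snippet, not from a fixed API):

import pathlib
from spacy.pipeline import EntityRecognizer

# Load the fine-tuned model, then re-add the custom labels so the
# transition system stays in sync with the classes the model predicts.
ner = EntityRecognizer.load(pathlib.Path("fine_tuned_ner_model"), nlp.vocab)
for entity_type in additional_entity_types:
    ner.add_label(entity_type)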

@tmbo
Author

tmbo commented Mar 23, 2017

Might that also be the cause of the malloc error, or is that likely to be unrelated?

@honnibal
Member

It would make sense. The model would produce a class that's out-of-bounds for the transition system. I doubt I'm checking for that, so it would cause a memory error.

honnibal added the bug label on Mar 23, 2017
@honnibal
Member

Two further issues here, one with your code and one with the library:

  1. You're missing a call to nlp.tagger(doc) in your training loop. This means the tagger features are missing during fine-tuning, so the features won't match (a corrected loop is sketched below).
  2. In the current code, the parser model doesn't respect a learning rate for the perceptron update. This puts the weight updates out of scale with the existing weights, so the resulting model is quite bad.
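
For point 1, a minimal sketch of the corrected loop, reusing nlp, ner, GoldParse, and train_data from the reproduction code above (assumed context, not a standalone script):

import random

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(unicode(raw_text))
        nlp.tagger(doc)  # assign POS tags so training sees the same tagger features as runtime
        gold = GoldParse(doc, entities=entity_offsets)
        ner.update(doc, gold)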

@tmbo
Author

tmbo commented Mar 24, 2017

Thanks for letting me know. You should probably change your example code then (or remove it if it is not up to date) https://github.com/explosion/spaCy/blob/master/examples/training/train_ner.py#L26

@honnibal
Member

Closing this because the specific bug has been fixed. Still need to fix the docs and the save/load process, but that's covered in other issues.

@fgadaleta

I've got exactly the same problem, even after upgrading to 1.7.3. The resumed model is inconsistent.

@tmbo
Author

tmbo commented Mar 31, 2017

If you have added new labels during fine-tuning, you need to add them again after loading the model from disk (if you are saving the model between training and use); see the workaround sketch above.

@fgadaleta

The rationale is to update nlp.entity with the code below (where train_data is a list of (raw_text, [start_pos, end_pos, ENT_TYPE]) pairs), save it back to disk, and then reload it another time:

for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text.decode())
    nlp.tagger(doc)
    gold = GoldParse(doc, entities=entity_offsets)
    ner.update(doc, gold)

@fgadaleta

It still gets messed up even without adding any new entity types.

honnibal added a commit that referenced this issue Apr 23, 2017