Need help to add new language (Indonesian) #2588

Closed
screamfest opened this issue Jul 24, 2018 · 6 comments
Labels
lang / id Indonesian language data and models training Training and updating models

Comments

screamfest commented Jul 24, 2018

Hello,
I'm new to NLP and spaCy. I've read the documentation, but I'm still having trouble adding a new language (Indonesian) to spaCy.

I've realized that I need to train several steps: tokenization, POS tagging, NER, etc. So far I have trained new models for Indonesian for itn-30 and itn-100 using Indonesian-GSD from Universal Dependencies. The result is Token 100%, but no named entities are tagged in the output.

What steps should I follow to actually use the final Indonesian language model in spaCy?
How do I prepare a training dataset to train NER?

Note:

I have this from Geovedi's GitHub about the Indonesian Treebank.

Can I use this?
@ines I really need help, because I can't get any further than training from the UD data.

Thank you in advance

| UD | ITB | First position | Second position | Third position |
|---|---|---|---|---|
| PROPN | E-- | E (Proper Noun) | | |
| NOUN | NPF | N (Noun) | P (Plural) | F (Feminine) |
| NOUN | NPM | N (Noun) | P (Plural) | M (Masculine) |
| NOUN | NPD | N (Noun) | P (Plural) | D (Non-Specified) |
| NOUN | NSF | N (Noun) | S (Singular) | F (Feminine) |
| NOUN | NSM | N (Noun) | S (Singular) | M (Masculine) |
| NOUN | NSD | N (Noun) | S (Singular) | D (Non-Specified) |
| PRON | PP1 | P (Personal Pronoun) | P (Plural) | 1 (First Person) |
| PRON | PP2 | P (Personal Pronoun) | P (Plural) | 2 (Second Person) |
| PRON | PP3 | P (Personal Pronoun) | P (Plural) | 3 (Third Person) |
| PRON | PS1 | P (Personal Pronoun) | S (Singular) | 1 (First Person) |
| PRON | PS2 | P (Personal Pronoun) | S (Singular) | 2 (Second Person) |
| PRON | PS3 | P (Personal Pronoun) | S (Singular) | 3 (Third Person) |
| VERB | VPA | V (Verb) | P (Plural) | A (Active Voice) |
| VERB | VPP | V (Verb) | P (Plural) | P (Passive Voice) |
| VERB | VSA | V (Verb) | S (Singular) | A (Active Voice) |
| VERB | VSP | V (Verb) | S (Singular) | P (Passive Voice) |
| NUM | CC- | C (Numeral) | C (Cardinal Numeral) | |
| NUM | CO- | C (Numeral) | O (Ordinal Numeral) | |
| NUM | CD- | C (Numeral) | D (Collective Numeral) | |
| ADJ | APP | A (Adjective) | P (Plural) | P (Positive) |
| ADJ | APS | A (Adjective) | P (Plural) | S (Superlative) |
| ADJ | ASP | A (Adjective) | S (Singular) | P (Positive) |
| ADJ | ASS | A (Adjective) | S (Singular) | S (Superlative) |
| CONJ | H-- | H (Coordinating Conjunction) | | |
| SCONJ | S-- | S (Subordinating Conjunction) | | |
| X | F-- | F (Foreign Word) | | |
| ADP | R-- | R (Preposition) | | |
| AUX | M-- | M (Modal) | | |
| DET | B-- | B (Determiner) | | |
| ADV | D-- | D (Adverb) | | |
| PART | T-- | T (Particle) | | |
| PART | G-- | G (Negation) | | |
| INTJ | I-- | I (Interjection) | | |
| VERB | O-- | O (Copula) | | |
| PRON | WP- | W (Question) | P (Pronoun) | |
| ADV | WD- | W (Question) | D (Adverb) | |
| DET | WB- | W (Question) | B (Determiner) | |
| X | X-- | X (Unknown) | | |
| PUNCT | Z-- | Z (Punctuation) | | |

*) For W--: as interrogative pronouns (who; siapa, siapakah) it becomes PRON; as interrogative adverbs (where, when, how, why; di mana, kapan, bilamana, bagaimana, mengapa) it becomes ADV; as interrogative determiners (which; yang mana) it becomes DET.

**) The PROPN class has not been settled yet. MorphInd uses the classes X-- and F-- for it.
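
As a rough illustration of how the mapping above could be applied in code, here is a minimal Python sketch. The names `itb_to_ud`, `FIRST_POSITION_TO_UD` and `QUESTION_TO_UD` are hypothetical and not part of MorphInd or spaCy. The sketch keys on the first character of the ITB tag, resolves W-- tags through their second character as described in the first footnote, and falls back to X for anything unknown (the fallback is my own assumption).

```python
# Hypothetical helper: map an Indonesian Treebank (ITB) tag to a UD POS tag,
# following the table above. Not part of MorphInd or spaCy.

# UD tag determined by the first position of the ITB tag.
FIRST_POSITION_TO_UD = {
    "E": "PROPN", "N": "NOUN", "P": "PRON", "V": "VERB", "C": "NUM",
    "A": "ADJ", "H": "CONJ", "S": "SCONJ", "F": "X", "R": "ADP",
    "M": "AUX", "B": "DET", "D": "ADV", "T": "PART", "G": "PART",
    "I": "INTJ", "O": "VERB", "X": "X", "Z": "PUNCT",
}

# Question words (W--) depend on the second position.
QUESTION_TO_UD = {"P": "PRON", "D": "ADV", "B": "DET"}


def itb_to_ud(itb_tag):
    """Return the UD POS tag for an ITB tag such as 'NSD' or 'WD-'."""
    first = itb_tag[:1]
    if first == "W":
        # Assumption: unknown question subtypes fall back to X.
        return QUESTION_TO_UD.get(itb_tag[1:2], "X")
    # Assumption: tags not listed in the table fall back to X.
    return FIRST_POSITION_TO_UD.get(first, "X")


for tag in ["NSD", "PS3", "WD-", "Z--"]:
    print(tag, "->", itb_to_ud(tag))
```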

Lemma Tagset

| Tag | Word class |
|---|---|
| n | Noun |
| p | Personal Pronoun |
| v | Verb |
| c | Numeral |
| q | Adjective |
| h | Coordinating Conjunction |
| s | Subordinating Conjunction |
| f | Foreign Word |
| r | Preposition |
| m | Modal |
| b | Determiner |
| d | Adverb |
| t | Particle |
| g | Negation |
| i | Interjection |
| o | Copula |
| w | Question |
| x | Unknown |
| z | Punctuation |

ines added lang / id (Indonesian language data and models) and training (Training and updating models) labels Jul 24, 2018

ines (Member) commented Jul 24, 2018

> What steps I should proceed to actually use the final Indonesian language model for spacy?
> How to actually prepare training dataset to train NER?

What you did all looks good so far – I think the problem is that the Indonesian Treebank just doesn't have NER data. You might find existing datasets for this online and then train the entity recognizer separately afterwards. The dataset you use for training should be similar to the data you want your model to process later on, otherwise you won't see very good results. This is also part of what makes it difficult to train new language models, especially for named entity recognition.

This section in the docs has some background on how to get problem-specific training data using different approaches: https://spacy.io/usage/training#training-data

screamfest (Author) commented Jul 24, 2018

Thank you for the fast reply, @ines

Now I understand the problem. I'll keep reading the documentation and try to set up the entities and intents myself.
Okay, so I need to prepare my own datasets to train the NER.
How do I update the 'id' language models in the spacy/lang/id directory? Or can I use them to train the new models?
And what do "start" and "end" mean in train_data.json?
Which one is the right format for NER training? (I see tons of formats from spaCy and tracy.)
Can I use tracy instead and convert its output to spaCy's format? Is that possible?
(Sorry for asking so much, my curiosity is killing me right now.)

Thank you for your help.

PS:
I found this train_data.json format, but it seems like a bit much. I don't understand it.

[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }]
    }]
}]

ines (Member) commented Jul 24, 2018

> How to update the 'id' language models in the spacy/lang/id directory? or can I use it to train the new models?

The resources in spacy/lang/id are just the language data – for example, tokenization rules, stop words etc. All of this is static and will be used in the model, but it won't be changed when you train a model. However, if you've found something that you want to change, you can always edit the code and re-build spaCy and/or submit a pull request (if you've found a mistake or want to improve the data).

> which one is the right format for NER training?

To start with the more abstract explanation, there are mainly two ways of providing entity annotations:

  1. character offsets relative to the text, for example "Apple is a company" and (0, 5, 'ORG') (to describe the span "Apple").
  2. token-based BILUO tags, for example "Tim Cook is the CEO of Apple" and ['B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-ORG']. B means beginning of an entity, I means inside an entity, L means last token of an entity, U means entity unit (single-token entity) and O means outside an entity.
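
To make the relationship between the two formats concrete, here is a minimal, standalone Python sketch that turns character offsets into BILUO tags. It also shows what the start and end offsets mean: they are the character positions you would use to slice the entity out of the text. The function name `offsets_to_biluo` is hypothetical, and the sketch assumes plain whitespace tokenization, which is a simplification compared to a real tokenizer.

```python
def offsets_to_biluo(text, entities):
    """Convert (start, end, label) character offsets to per-token BILUO tags.

    Assumption: tokens are split on whitespace only.
    """
    # Record every token together with its character span in the text.
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end

    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie fully inside the entity span.
        inside = [
            i for i, (_, s, e) in enumerate(tokens)
            if s >= ent_start and e <= ent_end
        ]
        if not inside:
            continue  # offsets do not line up with any token
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            tags[inside[-1]] = "L-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
    return [word for word, _, _ in tokens], tags


text = "Tim Cook is the CEO of Apple"
entities = [(0, 8, "PERSON"), (23, 28, "ORG")]
print(text[0:8], "|", text[23:28])  # the spans the offsets point at
words, tags = offsets_to_biluo(text, entities)
print(list(zip(words, tags)))
```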

The full JSON format is what you use when training with the spacy train command. It's also the format that the .conllu treebank files will be converted to before you train from them. The relevant line for NER is this one:

"ner": string       # BILUO label, e.g. "O" or "B-ORG"

So if your data already contains lots of entities, you could go through it and add the BILUO string for each token (e.g. "B-PERSON"). If you already have the POS tags, you also already know what the noun phrases are – so this might help a lot.
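
As a sketch of what adding the BILUO string for each token could look like, the Python snippet below builds one document in the shape of the JSON format quoted earlier in the thread. The helper name `make_training_doc` is hypothetical. It fills in only the "id", "orth", "ner" and (optionally) "tag" fields, and joins the words with spaces for "raw". Both are assumptions for illustration; whether `spacy train` accepts tokens without "dep" and "head" is not something this sketch confirms.

```python
import json


def make_training_doc(doc_id, words, ner_tags, pos_tags=None):
    """Build one document dict following the JSON schema quoted in the thread.

    Assumption: one paragraph with one sentence, words joined by spaces.
    """
    if len(words) != len(ner_tags):
        raise ValueError("words and ner_tags must have the same length")
    if pos_tags is not None and len(pos_tags) != len(words):
        raise ValueError("pos_tags must have the same length as words")

    tokens = []
    for i, (word, ner) in enumerate(zip(words, ner_tags)):
        token = {"id": i, "orth": word, "ner": ner}
        if pos_tags is not None:
            token["tag"] = pos_tags[i]
        tokens.append(token)

    return {
        "id": doc_id,
        "paragraphs": [{
            "raw": " ".join(words),
            "sentences": [{"tokens": tokens, "brackets": []}],
        }],
    }


words = ["Tim", "Cook", "is", "the", "CEO", "of", "Apple"]
ner_tags = ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "U-ORG"]
doc = make_training_doc(0, words, ner_tags)
print(json.dumps([doc], indent=2))
```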

> can I use tracy instead and convert it to spacy's format? is it possible?

What's tracy? 😃

screamfest (Author) commented Jul 27, 2018

This one is tracy.
I thought it would be good enough to create, split, and add more knowledge for intents and entities (but it's not.. haha)

> To start with the more abstract explanation, there are mainly two ways of providing entity annotations:
>
>   1. character offsets relative to the text, for example "Apple is a company" and (0, 5, 'ORG') (to describe the span "Apple").
>   2. token-based BILUO tags, for example "Tim Cook is the CEO of Apple" and ['B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-ORG']. B means beginning of an entity, I means inside an entity, L means last token of an entity, U means entity unit (single-token entity) and O means outside an entity.

I think I kind of understand this now. Thank you @ines for the great explanation of the format.

I've already trained on the corpus data from Universal Dependencies (this seems to be training for the tagger & parser, right?), but I can't upgrade it.

How do I upgrade the language model and add to my previous one? I'm a bit lost and had to start all over again because the files got overwritten in the process.

oonid commented Nov 28, 2018

Hi @screamfest,
have you checked #2752?
That way you can continue from the latest integration of the Indonesian model, and then we can close this issue.
What do you think?

ines closed this as completed Dec 6, 2018

lock bot commented Jan 5, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jan 5, 2019