Need help to add new language (Indonesian) #2588

screamfest · 2018-07-24T10:00:15Z

Hello,
I'm new to NLP and spaCy. I've read the documentation but still have problems adding a new language (Indonesian) to spaCy.

I've realized that there are several things to train: tokenization, POS tagging, NER, etc. I have now trained new models for Indonesian for itn-30 and itn-100 using Indonesian-GSD from Universal Dependencies. The result shows Token 100%, but no NER tags appear in the output.

What steps should I take to actually use the final Indonesian language model for spaCy?
How do I actually prepare a training dataset to train NER?

Note:

I have this from Geovedi's GitHub about the Indonesian Treebank.

Can I use this?
@ines I really need help because I can't make any progress beyond training from the UD data.

Thank you in advance

UD	ITB	First position	Second position	Third position
`PROPN`	`E--`	`E` (Proper Noun)
`NOUN`	`NPF`	`N` (Noun)	`P` (Plural)	`F` (Feminine)
`NOUN`	`NPM`	`N` (Noun)	`P` (Plural)	`M` (Masculine)
`NOUN`	`NPD`	`N` (Noun)	`P` (Plural)	`D` (Non-Specified)
`NOUN`	`NSF`	`N` (Noun)	`S` (Singular)	`F` (Feminine)
`NOUN`	`NSM`	`N` (Noun)	`S` (Singular)	`M` (Masculine)
`NOUN`	`NSD`	`N` (Noun)	`S` (Singular)	`D` (Non-Specified)
`PRON`	`PP1`	`P` (Personal Pronoun)	`P` (Plural)	`1` (First Person)
`PRON`	`PP2`	`P` (Personal Pronoun)	`P` (Plural)	`2` (Second Person)
`PRON`	`PP3`	`P` (Personal Pronoun)	`P` (Plural)	`3` (Third Person)
`PRON`	`PS1`	`P` (Personal Pronoun)	`S` (Singular)	`1` (First Person)
`PRON`	`PS2`	`P` (Personal Pronoun)	`S` (Singular)	`2` (Second Person)
`PRON`	`PS3`	`P` (Personal Pronoun)	`S` (Singular)	`3` (Third Person)
`VERB`	`VPA`	`V` (Verb)	`P` (Plural)	`A` (Active Voice)
`VERB`	`VPP`	`V` (Verb)	`P` (Plural)	`P` (Passive Voice)
`VERB`	`VSA`	`V` (Verb)	`S` (Singular)	`A` (Active Voice)
`VERB`	`VSP`	`V` (Verb)	`S` (Singular)	`P` (Passive Voice)
`NUM`	`CC-`	`C` (Numeral)	`C` (Cardinal Numeral)
`NUM`	`CO-`	`C` (Numeral)	`O` (Ordinal Numeral)
`NUM`	`CD-`	`C` (Numeral)	`D` (Collective Numeral)
`ADJ`	`APP`	`A` (Adjective)	`P` (Plural)	`P` (Positive)
`ADJ`	`APS`	`A` (Adjective)	`P` (Plural)	`S` (Superlative)
`ADJ`	`ASP`	`A` (Adjective)	`S` (Singular)	`P` (Positive)
`ADJ`	`ASS`	`A` (Adjective)	`S` (Singular)	`S` (Superlative)
`CONJ`	`H--`	`H` (Coordinating Conjunction)
`SCONJ`	`S--`	`S` (Subordinating Conjunction)
`X`	`F--`	`F` (Foreign Word)
`ADP`	`R--`	`R` (Preposition)
`AUX`	`M--`	`M` (Modal)
`DET`	`B--`	`B` (Determiner)
`ADV`	`D--`	`D` (Adverb)
`PART`	`T--`	`T` (Particle)
`PART`	`G--`	`G` (Negation)
`INTJ`	`I--`	`I` (Interjection)
`VERB`	`O--`	`O` (Copula)
`PRON`	`WP-`	`W` (Question)	`P` (Pronoun)
`ADV`	`WD-`	`W` (Question)	`D` (Adverb)
`DET`	`WB-`	`W` (Question)	`B` (Determiner)
`X`	`X--`	`X` (Unknown)
`PUNCT`	`Z--`	`Z` (Punctuation)

*) W-- as interrogative pronouns (who; siapa, siapakah) becomes PRON; W-- as interrogative adverbs (where, when, how, why; di mana, kapan, bilamana, bagaimana, mengapa) becomes ADV; W-- as interrogative determiners (which; yang mana) becomes DET.

**) ~~The PROPN class has not yet been determined. In MorphInd, the classes used are X-- and F--.~~
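
As an illustration of how the table above could be applied, here is a minimal Python sketch that derives the UD tag from the first position of an ITB tag, with the question class `W` resolved by its second position as described in the first footnote. The names `FIRST_POSITION_TO_UD`, `QUESTION_TO_UD` and `itb_to_ud` are hypothetical and not part of spaCy or MorphInd, and falling back to `X` for anything unlisted is an assumption.

```python
# First position of the ITB tag -> UD tag, read off the table above.
FIRST_POSITION_TO_UD = {
    "E": "PROPN", "N": "NOUN", "P": "PRON", "V": "VERB", "C": "NUM",
    "A": "ADJ", "H": "CONJ", "S": "SCONJ", "F": "X", "R": "ADP",
    "M": "AUX", "B": "DET", "D": "ADV", "T": "PART", "G": "PART",
    "I": "INTJ", "O": "VERB", "X": "X", "Z": "PUNCT",
}

# For question words (W), the second position decides the UD tag.
QUESTION_TO_UD = {"P": "PRON", "D": "ADV", "B": "DET"}


def itb_to_ud(itb_tag):
    """Map a three-character ITB tag such as 'NSD' or 'WD-' to a UD tag."""
    first = itb_tag[:1]
    if first == "W":
        # Assumption: unknown question subtypes fall back to X.
        return QUESTION_TO_UD.get(itb_tag[1:2], "X")
    # Assumption: tags not covered by the table fall back to X.
    return FIRST_POSITION_TO_UD.get(first, "X")


for tag in ["NSD", "PS1", "WD-", "O--", "Z--"]:
    print(tag, "->", itb_to_ud(tag))
```

Keying on the first position keeps the mapping short, because number, gender, person and voice in the second and third positions do not change the UD class in this table.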

Lemma Tagset

.	Tagset
`n`	Noun
`p`	Personal Pronoun
`v`	Verb
`c`	Numeral
`q`	Adjective
`h`	Coordinating Conjunction
`s`	Subordinating Conjunction
`f`	Foreign Word
`r`	Preposition
`m`	Modal
`b`	Determiner
`d`	Adverb
`t`	Particle
`g`	Negation
`i`	Interjection
`o`	Copula
`w`	Question
`x`	Unknown
`z`	Punctuation

ines · 2018-07-24T16:59:43Z

> What steps should I take to actually use the final Indonesian language model for spaCy? How do I actually prepare a training dataset to train NER?

What you did all looks good so far – I think the problem is that the Indonesian Treebank just doesn't have NER data. You might find existing datasets for this online and then train the entity recognizer separately afterwards. The dataset you use for training should be similar to the data you want your model to process later on, otherwise you won't see very good results. This is also part of what makes it difficult to train new language models, especially for named entity recognition.

This section in the docs has some background on how to get problem-specific training data using different approaches: https://spacy.io/usage/training#training-data

screamfest · 2018-07-24T17:13:06Z

Thank you for the fast reply, @ines

Now I understand the problem. I will continue reading the documentation and try to set up the entities and intents myself.
Okay, so I need to prepare my own datasets to train the NER.
How do I update the 'id' language models in the spacy/lang/id directory? Or can I use it to train the new models?
And what do "start" and "end" mean in train_data.json?
Which one is the right format for NER training? (I see tons of formats from spaCy and tracy.)
Can I use tracy instead and convert it to spaCy's format? Is it possible?
(Sorry for asking so much; my curiosity is killing me right now.)

Thank you for your help.

PS:
I found this train_data.json format, but it seems like a bit too much. I don't understand it.

[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }]
    }]
}]

ines · 2018-07-24T17:30:43Z

> How do I update the 'id' language models in the spacy/lang/id directory? Or can I use it to train the new models?

The resources in spacy/lang/id are just the language data – for example, tokenization rules, stop words etc. All of this is static and will be used in the model, but it won't be changed when you train a model. However, if you've found something that you want to change, you can always edit the code and re-build spaCy and/or submit a pull request (if you've found a mistake or want to improve the data).

> Which one is the right format for NER training?

To start with the more abstract explanation, there are mainly two ways of providing entity annotations:

- character offsets relative to the text, for example "Apple is a company" and (0, 5, 'ORG') (to describe the span "Apple").
- token-based BILUO tags, for example "Tim Cook is the CEO of Apple" and ['B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-ORG']. B means beginning of an entity, I means inside an entity, L means last token of an entity, U means entity unit (single-token entity) and O means outside an entity.
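
To make the relationship between the two annotation styles concrete, here is a minimal pure-Python sketch that turns character offsets into BILUO tags. It assumes simple whitespace tokenization, which is not how spaCy tokenizes, and the helper names `whitespace_tokens` and `offsets_to_biluo` are hypothetical. spaCy has its own helper for this conversion that works on its real tokenization; the sketch below is not that helper.

```python
def whitespace_tokens(text):
    """Return (word, start, end) triples for whitespace-separated tokens."""
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)  # character offset of this token
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end
    return tokens


def offsets_to_biluo(text, entities):
    """Convert (start_char, end_char, label) spans to one BILUO tag per token."""
    tokens = whitespace_tokens(text)
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Only tokens that lie fully inside the span are counted;
        # a span that matches no whole token is silently skipped.
        covered = [
            i for i, (_, start, end) in enumerate(tokens)
            if start >= ent_start and end <= ent_end
        ]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = "U-" + label      # single-token entity
        else:
            tags[covered[0]] = "B-" + label      # first token
            tags[covered[-1]] = "L-" + label     # last token
            for i in covered[1:-1]:
                tags[i] = "I-" + label           # tokens in between
    return tags


print(offsets_to_biluo("Apple is a company", [(0, 5, "ORG")]))
print(offsets_to_biluo("Tim Cook is the CEO of Apple",
                       [(0, 8, "PERSON"), (23, 28, "ORG")]))
```

With whitespace tokens, the second call reproduces the tag list from the "Tim Cook" example above.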

The full JSON format is what you use when training with the spacy train command. It's also the format that the .conllu treebank files will be converted to before you train from them. The relevant line for NER is this one:

"ner": string       # BILUO label, e.g. "O" or "B-ORG"

So if your data already contains lots of entities, you could go through it and add the BILUO string for each token (e.g. "B-PERSON"). If you already have the POS tags, you also already know what the noun phrases are, so this might help a lot.
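
As a rough sketch of what adding the BILUO string per token could look like, the following builds one document in the JSON layout quoted earlier in this thread. The function name `build_training_doc` is hypothetical, and only `id`, `orth` and `ner` are filled in per token. In real data the `tag`, `dep` and `head` fields would come from the treebank, and whether `spacy train` accepts tokens without them is an assumption to check against your spaCy version.

```python
import json


def build_training_doc(doc_id, raw_text, words, ner_tags):
    """Build one document with per-token BILUO labels (NER fields only)."""
    if len(words) != len(ner_tags):
        raise ValueError("need exactly one BILUO tag per token")
    tokens = [
        {"id": i, "orth": word, "ner": tag}
        for i, (word, tag) in enumerate(zip(words, ner_tags))
    ]
    return {
        "id": doc_id,
        "paragraphs": [{
            "raw": raw_text,
            "sentences": [{"tokens": tokens, "brackets": []}],
        }],
    }


# The example sentence and tags from the explanation above.
words = "Tim Cook is the CEO of Apple".split()
tags = ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "U-ORG"]
train_data = [build_training_doc(0, " ".join(words), words, tags)]

print(json.dumps(train_data, indent=2))
```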

> Can I use tracy instead and convert it to spaCy's format? Is it possible?

What's tracy? 😃

screamfest · 2018-07-27T22:05:57Z

This one is tracy.
I thought it would be good enough to create, divide, and add more knowledge for intents and entities (but it's not, haha).

> To start with the more abstract explanation, there are mainly two ways of providing entity annotations:
>
> - character offsets relative to the text, for example "Apple is a company" and (0, 5, 'ORG') (to describe the span "Apple").
> - token-based BILUO tags, for example "Tim Cook is the CEO of Apple" and ['B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-ORG']. B means beginning of an entity, I means inside an entity, L means last token of an entity, U means entity unit (single-token entity) and O means outside an entity.

I think I kind of understand this now. Thank you, @ines, for the great explanation of the format.

I've already trained on the corpus data from Universal Dependencies (this seems to be training for the tagger and parser, right?), but I can't upgrade it.

How do I upgrade the language model and add it to my previous one? I'm a bit lost and had to start all over again because the files got overwritten in the process.

oonid · 2018-11-28T15:36:04Z

Hi @screamfest,
have you checked #2752?
You could continue from the latest integration of the Indonesian model, and then we can close this issue.
What do you think?

lock · 2019-01-05T15:55:32Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the `lang / id` (Indonesian language data and models) and `training` (Training and updating models) labels on Jul 24, 2018

ines closed this as completed Dec 6, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 5, 2019
