Need help to add new language (Indonesian) #2588
Comments
What you did all looks good so far. I think the problem is that the Indonesian Treebank just doesn't have NER data. You might find existing datasets for this online and then train the entity recognizer separately afterwards. The dataset you use for training should be similar to the data you want your model to process later on, otherwise you won't see very good results. This is also part of what makes it difficult to train new language models, especially for named entity recognition. This section in the docs has some background on how to get problem-specific training data using different approaches: https://spacy.io/usage/training#training-data
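To illustrate why training on the treebank alone yields no entities: Universal Dependencies treebanks are distributed in CoNLL-U format, whose ten columns cover word forms, lemmas, tags, morphology and dependencies, with no dedicated column for named entities. The following is a minimal sketch that splits one token line into those fields. The sample line is made up for illustration and is not taken from Indonesian-GSD, and the helper name `parse_conllu_line` is hypothetical.

```python
# The ten tab-separated columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


# Hypothetical sample line (not from the real treebank).
sample = "1\tJakarta\tJakarta\tPROPN\t_\t_\t0\troot\t_\t_"
token = parse_conllu_line(sample)

# The tagger and parser can learn from upos/head/deprel,
# but no column carries an entity label such as PERSON or GPE.
print(token["form"], token["upos"], token["deprel"])
```

Entity annotations therefore have to come from a separate dataset.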
Thank you for the fast reply, @ines. Now I understand the problem. I will keep reading the documentation and try to set up the entities and intents myself. Thank you for your help. PS:
The resources in
To start with the more abstract explanation, there are mainly two ways of providing entity annotations:
The full JSON format is what you use when training with the
So if your data already contains lots of entities, you could go through it and add the BILUO string for each token (e.g.
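As a rough illustration of the BILUO scheme mentioned above (B = beginning, I = inside, L = last, U = single-token unit, O = outside), here is a minimal sketch that turns character-offset entity spans into one BILUO tag per token. It assumes naive whitespace tokenization, which is not how spaCy tokenizes, and the function names, the example sentence and the `PERSON`/`GPE` labels are all chosen for illustration only.

```python
def whitespace_tokens(text):
    """Return (token, start, end) triples using naive whitespace splitting."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens


def offsets_to_biluo(text, entities):
    """Convert (start, end, label) character spans to per-token BILUO tags."""
    tokens = whitespace_tokens(text)
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of tokens that lie fully inside the entity span.
        covered = [i for i, (_, s, e) in enumerate(tokens)
                   if s >= ent_start and e <= ent_end]
        if (not covered
                or tokens[covered[0]][1] != ent_start
                or tokens[covered[-1]][2] != ent_end):
            raise ValueError(f"span {(ent_start, ent_end)} does not align with tokens")
        if len(covered) == 1:
            tags[covered[0]] = "U-" + label
        else:
            tags[covered[0]] = "B-" + label
            tags[covered[-1]] = "L-" + label
            for i in covered[1:-1]:
                tags[i] = "I-" + label
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


text = "Joko Widodo lahir di Surakarta"
entities = [(0, 11, "PERSON"), (21, 30, "GPE")]  # illustrative labels
result = offsets_to_biluo(text, entities)
print(result)
```

Raising an error on misaligned spans is a deliberate choice in this sketch: an entity boundary that falls inside a token usually signals an annotation or tokenization mismatch that is better fixed in the data.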
What's tracy? 😃
This one is tracy.
I think I understand this now. Thank you, @ines, for the great explanation of the format. I've already trained on the corpus data from Universal Dependencies (this seems to train only the tagger and parser, right?), but I can't upgrade it. How do I upgrade the language model and add new training on top of my previous one? I'm a bit lost and had to start all over again because the files got overwritten in the process.
Hi @screamfest
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hello,
I'm new to NLP and spaCy. I've read the documentation but still have problems adding a new language (Indonesian) to spaCy.
I've realized that there are several steps to train: tokenization, POS tagging, NER, etc. So far, I have trained new models for Indonesian with itn-30 and itn-100 using Indonesian-GSD from Universal Dependencies. The result is Token 100%, but no named entities are tagged in the output.
What steps should I take next to actually use the final Indonesian language model in spaCy?
How do I actually prepare a training dataset to train NER?
Note:
I have this from Geovedi's GitHub about the Indonesian Treebank.
Can I use this?
@ines I really need help because I can't make progress beyond training from the UD data.
Thank you in advance.
| UD POS | Tag | 1st character | 2nd character | 3rd character |
| --- | --- | --- | --- | --- |
| PROPN | E-- | E (Proper Noun) | | |
| NOUN | NPF | N (Noun) | P (Plural) | F (Feminine) |
| NOUN | NPM | N (Noun) | P (Plural) | M (Masculine) |
| NOUN | NPD | N (Noun) | P (Plural) | D (Non-Specified) |
| NOUN | NSF | N (Noun) | S (Singular) | F (Feminine) |
| NOUN | NSM | N (Noun) | S (Singular) | M (Masculine) |
| NOUN | NSD | N (Noun) | S (Singular) | D (Non-Specified) |
| PRON | PP1 | P (Personal Pronoun) | P (Plural) | 1 (First Person) |
| PRON | PP2 | P (Personal Pronoun) | P (Plural) | 2 (Second Person) |
| PRON | PP3 | P (Personal Pronoun) | P (Plural) | 3 (Third Person) |
| PRON | PS1 | P (Personal Pronoun) | S (Singular) | 1 (First Person) |
| PRON | PS2 | P (Personal Pronoun) | S (Singular) | 2 (Second Person) |
| PRON | PS3 | P (Personal Pronoun) | S (Singular) | 3 (Third Person) |
| VERB | VPA | V (Verb) | P (Plural) | A (Active Voice) |
| VERB | VPP | V (Verb) | P (Plural) | P (Passive Voice) |
| VERB | VSA | V (Verb) | S (Singular) | A (Active Voice) |
| VERB | VSP | V (Verb) | S (Singular) | P (Passive Voice) |
| NUM | CC- | C (Numeral) | C (Cardinal Numeral) | |
| NUM | CO- | C (Numeral) | O (Ordinal Numeral) | |
| NUM | CD- | C (Numeral) | D (Collective Numeral) | |
| ADJ | APP | A (Adjective) | P (Plural) | P (Positive) |
| ADJ | APS | A (Adjective) | P (Plural) | S (Superlative) |
| ADJ | ASP | A (Adjective) | S (Singular) | P (Positive) |
| ADJ | ASS | A (Adjective) | S (Singular) | S (Superlative) |
| CONJ | H-- | H (Coordinating Conjunction) | | |
| SCONJ | S-- | S (Subordinating Conjunction) | | |
| X | F-- | F (Foreign Word) | | |
| ADP | R-- | R (Preposition) | | |
| AUX | M-- | M (Modal) | | |
| DET | B-- | B (Determiner) | | |
| ADV | D-- | D (Adverb) | | |
| PART | T-- | T (Particle) | | |
| PART | G-- | G (Negation) | | |
| INTJ | I-- | I (Interjection) | | |
| VERB | O-- | O (Copula) | | |
| PRON | WP- | W (Question) | P (Pronoun) | |
| ADV | WD- | W (Question) | D (Adverb) | |
| DET | WB- | W (Question) | B (Determiner) | |
| X | X-- | X (Unknown) | | |
| PUNCT | Z-- | Z (Punctuation) | | |

*) W-- as interrogative pronouns (who; siapa, siapakah) becomes PRON, W-- as interrogative adverbs (where, when, how, why; di mana, kapan, bilamana, bagaimana, mengapa) becomes ADV, W-- as interrogative determiners (which; yang mana) becomes DET.

**) The PROPN class has not been determined yet. In MorphInd, the classes used are X-- and F--.
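The tag-to-POS correspondence in the table above can be expressed as a simple lookup. The sketch below copies the mapping from the table; the names `MORPHIND_TO_UD` and `morphind_to_ud` are hypothetical, and falling back to `X` for unlisted tags is an assumption of this example rather than something stated in the tagset.

```python
# UD POS -> three-character tags, copied from the table above.
_TAGS_BY_UD_POS = {
    "PROPN": ["E--"],
    "NOUN": ["NPF", "NPM", "NPD", "NSF", "NSM", "NSD"],
    "PRON": ["PP1", "PP2", "PP3", "PS1", "PS2", "PS3", "WP-"],
    "VERB": ["VPA", "VPP", "VSA", "VSP", "O--"],
    "NUM": ["CC-", "CO-", "CD-"],
    "ADJ": ["APP", "APS", "ASP", "ASS"],
    "CONJ": ["H--"],
    "SCONJ": ["S--"],
    "X": ["F--", "X--"],
    "ADP": ["R--"],
    "AUX": ["M--"],
    "DET": ["B--", "WB-"],
    "ADV": ["D--", "WD-"],
    "PART": ["T--", "G--"],
    "INTJ": ["I--"],
    "PUNCT": ["Z--"],
}

# Invert into a flat tag -> UD POS dictionary for direct lookup.
MORPHIND_TO_UD = {
    tag: pos for pos, tags in _TAGS_BY_UD_POS.items() for tag in tags
}


def morphind_to_ud(tag, default="X"):
    """Look up the UD POS for a tag; unknown tags fall back to `default` (an assumption)."""
    return MORPHIND_TO_UD.get(tag, default)


mapped = [morphind_to_ud(t) for t in ["NSD", "VSA", "WD-", "Z--"]]
print(mapped)  # ['NOUN', 'VERB', 'ADV', 'PUNCT']
```

A bare `W--` tag is not in the dictionary because, per the first footnote, its UD POS depends on whether the word is an interrogative pronoun, adverb or determiner.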
Lemma Tagset

n, p, v, c, q, h, s, f, r, m, b, d, t, g, i, o, w, x, z