Single words tend to be over-segmented in Spanish, resulting in non-word tokens #1410
Comments
Can confirm this is a problem. Thank you for reporting. The core of this problem is that the tokenizer expects a document to end in sentence-final punctuation, and it will posit a sentence-final punctuation even if that makes no sense. I tried to add a mechanism where the tokenizer would sometimes drop the final punctuation at the end of a document. It seemed to help with certain clitic pronouns, but clearly there are cases for which this does not fix the problem. One thing we can do to ameliorate this is to add the words in question as fake "sentences" specifically to the tokenizer training data. Can you confirm that the words whose endings resemble a clitic are themselves valid words?
All the words (the whole words before segmentation) are valid words from a linguistic database, although some may be rare: as you have said, anseriforme is an order of birds (actually Latin), carambolo is a star fruit etc. So the important question is whether the words after segmentation would be valid words or not. I do not know for sure (I'm also not a Spanish expert and I haven't used a dictionary), but the probability of them being valid words is extremely low:
And the words from the first list (ending in -oso) are valid words from SPALEX as well. In their case it's clear that there is no clitic. In fact, the tokenizer in most cases splits the final "o" as a CCONJ. I don't know much about Spanish orthography, but I thought that conjunctions are always separated by spaces.
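For reference, the tags assigned to the split pieces can be inspected by adding pos to the pipeline. A minimal sketch, assuming the standard tokenize,mwt,pos processors (the word used is just one of the reported examples):

```python
import stanza

# Sketch: inspect how a bare adjective gets segmented and tagged.
nlp = stanza.Pipeline(lang="es", processors="tokenize,mwt,pos")

doc = nlp("abundoso")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos)
# Reported behaviour: "abundos" tagged NOUN and "o" tagged CCONJ instead of a single ADJ.
```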
I understand that this is a model, and it's only as good as its training data. At the same time, tweets, comments, subtitles, various user inputs etc. do not always end with punctuation, and NLP tools should still be able to process them correctly. So I appreciate that you are looking into how it could be fixed!
Alright, sounds good. Are all of those words adjectives? I find that building a tokenizer with only half of the words isn't sufficient - the others all wind up being incorrectly chopped up. So, I built one with all of the words included. Let me know about the other words, or perhaps I can automate it in some way (but easier for me if you already have the answer). For the record, the pos & depparse are separate from the tokenizer, so the way the split pieces get tagged is a separate issue.
… other words that shouldn't be split. #1410
I went through some of the words myself, found in the process that there is no free version of Spanish WordNet available, and gave up on giving them all set tags. I pushed the newest version of the tokenizer with those words added as non-tokenized segments, so hopefully that works better for you.
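A hedged sketch of how a user might pick up the updated models, assuming the new tokenizer has been published to the default model repository (it may instead only be available in the dev resources):

```python
import stanza

# Re-download the Spanish models so the updated tokenizer is used
# (assumes it has been published to the default model repository).
stanza.download("es")

nlp = stanza.Pipeline(lang="es", processors="tokenize,mwt")
doc = nlp("abundoso")
print([word.text for sentence in doc.sentences for word in sentence.words])  # ideally ['abundoso']
```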
Sorry for taking time to reply. According to the SPALEX paper mentioned earlier, all the words are extracted either from EsPal or BuscaPalabras. The latter seems to be defunct, but the former is still available and has a page for searching for word POS, lemma etc. I found all words from both of the lists in EsPal.
Yes, all of them are adjectives. Some can also be interpreted as nouns (see below), but considering them adjectives when there is no other context probably makes more sense. Below I list all of the words with the POS (comma-separated) found in EsPal (not UD POS). I'm omitting cases where the token can also be an inflected form of another word (i.e. in all the cases below the words are in their lemma form):
Additionally, the following tokens can be interpreted as forms of different lemmas:
If you need more information, you can use the EsPal page directly. Thank you for your work!
Thank you for doing that. It will save us the time of tracking that down ourselves. Have you had a chance to look at the new tokenizer? Hopefully it performs much better on the words you listed without sacrificing quality elsewhere.
Describe the bug
Single words tend to be over-segmented in Spanish, e.g. "abundoso" is split into "abundos" (noun) + "o" (conjunction) if it is the only input for the Spanish tokenize,mwt pipeline. What's particularly problematic is that the first token is typically a non-existing word.
To Reproduce
Run the following code:
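The original snippet is not reproduced here. A minimal sketch of the kind of check described, assuming `words` holds the 394 single words from the experiment (the sample entries below are illustrative):

```python
from collections import Counter
import stanza

# Hypothetical sample; the issue uses 394 single words (e.g. the '-oso' adjectives from SPALEX).
words = ["abundoso", "carambolo", "anseriforme", "mayonesa"]

nlp = stanza.Pipeline(lang="es", processors="tokenize,mwt")

# Count how many words (after MWT expansion) each single input is split into.
counts = Counter(
    sum(len(sentence.words) for sentence in nlp(word).sentences)
    for word in words
)
print(counts)  # expected: Counter({1: len(words)}); over-segmented inputs show up under 2 or more
```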
Expected behavior
All inputs are single words and should therefore each be segmented as a single token. As a result we should get this output:
Counter({1: 394})
Environment (please complete the following information):
OS: Linux and MacOS (reproducible on both)
Python version: python 3.11.9 (hb806964_0_cpython conda-forge)
Stanza version: current (dev)
torch: 2.3.1
Additional context
I'm doing an NLP experiment where I need to tokenize/lemmatize words without context. The data are from a psycholinguistic task where no context was provided. I've found that adding a period to the words works as a workaround for most of them, but I believe the tokenizer should work reasonably well for single words as well. (Exceptions from the words listed, where adding a period doesn't help, are 'estruendoso', 'fachendoso', 'hacendoso'.)
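A minimal sketch of that period workaround, with a hypothetical helper that appends a period and then drops the punctuation token again:

```python
import stanza

nlp = stanza.Pipeline(lang="es", processors="tokenize,mwt")

def tokenize_single_word(word):
    # Hypothetical workaround helper: append a period so the tokenizer sees
    # sentence-final punctuation, then drop the punctuation token again.
    doc = nlp(word + ".")
    return [w.text for s in doc.sentences for w in s.words if w.text != "."]

print(tokenize_single_word("abundoso"))  # ideally ['abundoso']
```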
This issue affects other single words too, but with adjectives ending in "-oso" it seems very prominent and consistent. Other susceptible words often end with -lo (címbalo, crocodilo), -eo (machaqueo, maniqueo), -la (garla, hortícola), -le (diástole), -me (cuneiforme, adarme), -sa (mayonesa, galactosa). Again, the first resulting token is typically a non-existing word (though there are some exceptions, e.g. "machaque" + "o"). Here is a longer list of examples where the resulting tokens seem to contain a non-word (or at least a very rare word), sorted by the last two characters: