Tokenization and morphology #700
(Continuing the discussion from issue #42.)

First I need to make it clear what these […]

Now we would like to extend breaking words to more than two tokens. One "natural" way to enable a word […] However, I found it more useful (for tokenization) to extend it in another way, of which the original one is a special case: […]

Due to that I marked the Hebrew "prefix tokens" as […] (BTW, any character that cannot occur in a word may be used instead of this […])

In my posts I mentioned the idea of defining in the affix file something like […] In particular, I mentioned that I don't like the stem mark […] It may still be that some tokens should be marked as […]

We have never talked about the derivational vs. inflectional use of morphemes. For the purposes of the current LG, it seems to me there is no need to support breaking into derivational morphemes, because the dict (and even the current unsupervised language-learning idea) cannot deal with that: their combinations don't necessarily have an exact predefined usage/meaning, by definition.
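For illustration only, a sketch of marking every non-final token of a split as a "prefix token" with a reserved character. Here `=` stands in for whatever reserved mark is chosen, and the three-token split is an invented transliterated Hebrew-like example; neither is taken from the actual dict or affix file:

```c
#include <stdio.h>

/* Mark every non-final token of a word split with a reserved character
 * so the dict can tell "prefix tokens" from whole words.  '=' is
 * assumed here; any character that cannot occur inside a word would do. */
static void print_marked_split(const char *tokens[], int n)
{
    for (int i = 0; i < n; i++)
        printf(i < n - 1 ? "%s= " : "%s\n", tokens[i]);
}

int main(void)
{
    /* Hypothetical 3-token split: two prefix tokens, then the base word. */
    const char *split[] = { "w", "kshe", "halakhti" };
    print_marked_split(split, 3);   /* prints: w= kshe= halakhti */
    return 0;
}
```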
I actually tested that, using a "trivial" tokenizer, basically as follows:
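In sketch form, assuming a hypothetical `in_dict()` lookup and a toy hard-coded word list in place of the real dictionary (an illustration of the idea, not the code that was actually tested):

```c
#include <stdio.h>
#include <string.h>

/* Toy word list standing in for the dictionary; invented for illustration. */
static const char *words[] = { "un", "break", "able", "unbreakable" };

static int in_dict(const char *s, size_t len)
{
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++)
        if (strlen(words[i]) == len && strncmp(words[i], s, len) == 0)
            return 1;
    return 0;
}

/* Try every way of cutting the remaining string into dictionary tokens;
 * print each complete split.  "acc" holds the tokens found so far. */
static void split(const char *rest, const char *acc)
{
    size_t n = strlen(rest);
    if (n == 0) { puts(acc); return; }
    for (size_t i = 1; i <= n; i++) {
        if (in_dict(rest, i)) {
            char next[256];
            snprintf(next, sizeof next, "%s%s%.*s",
                     acc, *acc ? " " : "", (int)i, rest);
            split(rest + i, next);
        }
    }
}

int main(void)
{
    split("unbreakable", "");   /* prints "un break able" and "unbreakable" */
    return 0;
}
```

Even this tiny dictionary yields two competing splits for a single word, which is exactly the "extra splitting" problem described next.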
The problem with this tokenizer was "extra splitting". For example, […] Such a simple implementation has a heavy overhead; an efficient implementation is much more complex.

EDIT: Fixed a markdown error that caused a problem.
I just added the following text to the README; it seems relevant:

Handling zero/phantom words as re-write rules. A more principled approach to fixing the phantom-word issue is to […]

Exactly how to implement this is unclear. However, it seems to open […]

Another interesting possibility arises with regard to tokenization.
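As a rough illustration only (the rule table, the trigger/insert pair, and the bracket notation for the phantom word are all invented here, since the README leaves the implementation open), a zero-word re-write rule could be as simple as a trigger token plus a phantom token to insert after it:

```c
#include <stdio.h>
#include <string.h>

/* One guess at what a zero-word re-write rule could look like:
 * when the trigger token is seen, insert the phantom token after it.
 * Not the scheme the README actually specifies. */
struct rule { const char *trigger; const char *insert; };

static const struct rule rules[] = {
    { "know", "[that]" },   /* "I know she left" -> "I know [that] she left" */
};

int main(void)
{
    const char *sent[] = { "I", "know", "she", "left" };
    size_t n = sizeof sent / sizeof sent[0];

    for (size_t i = 0; i < n; i++) {
        printf("%s ", sent[i]);
        for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++)
            if (strcmp(sent[i], rules[r].trigger) == 0)
                printf("%s ", rules[r].insert);
    }
    printf("\n");   /* prints: I know [that] she left */
    return 0;
}
```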
There is some problem with that and classic parsing: […]
On my to-do list is providing a feedback infrastructure.
This has to somehow be done probabilistically -- one has to recognize likely words to insert, and not waste time trying all possibilities, most of which would be absurdly unlikely. Hmm. Like the punch-line of a joke is unexpected.... Solving this will require some kind of deep theoretical insight or inspiration. I've got only the vaguest glimmer right now.
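A toy sketch of the pruning idea, with invented candidate words and probabilities standing in for real language-model scores (the threshold value is likewise arbitrary):

```c
#include <stdio.h>

/* Prune phantom-word candidates probabilistically instead of trying
 * them all: keep only insertions whose (made-up) prior probability
 * clears a threshold. */
struct candidate { const char *word; double prob; };

int main(void)
{
    const struct candidate cands[] = {
        { "that", 0.12 }, { "is", 0.07 }, { "aardvark", 1e-9 },
    };
    const double threshold = 0.01;   /* drop absurdly unlikely insertions */

    for (size_t i = 0; i < sizeof cands / sizeof cands[0]; i++) {
        if (cands[i].prob >= threshold)
            printf("try inserting \"%s\"\n", cands[i].word);
        else
            printf("skip \"%s\" (too unlikely)\n", cands[i].word);
    }
    return 0;
}
```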
Say we have a perfectly correct sentence. […]
An ongoing discussion of tokenization and its relation to morphology is happening in issue #42. It really should be happening here.