
Tokenization and morphology #700

linas opened this issue Mar 13, 2018 · 6 comments

@linas
Member

linas commented Mar 13, 2018

An ongoing discussion of tokenization and its relation to morphology is happening in issue #42. It really should be happening here.

@ampli
Member

ampli commented Mar 14, 2018

(Continuing the discussion from issue #42.)

> Hebrew: "LPUNC* (WORD=+ | WORD=* WORD)? RPUNC*"

First I need to make clear what these = marks mean.
Originally (in the Russian dict) they marked on which side you can join the tokens into which a word (a sentence string) can be broken. In particular, you cannot join them on the side that has no = mark, and there cannot be whitespace after/before the = mark.

Now we would like to extend word breaking to more than two tokens.

One "natural" way to enable a word ABC to break into A B C is to have 'A=, =B=and=C'.

However, I found it more useful (for tokenization) to extend it in another way, of which the original one is a special case:
Let the = denote where you can cut the word. This process is done iteratively, one token at a time.
The above whitespace rule still applies.

Due to that, I marked the Hebrew "prefix tokens" as x= even though they can appear in the middle of a sentence string, because during the tokenization process they are "stripped off" one by one at the marked cut point.

(BTW, any character that cannot appear in a word may be used instead of this =, as long as the same one is used in the dict, and it may be useful to use several different such notations.)

In my posts I mentioned the idea of defining in the affix file something like PRENUM_MAX (I most probably used another name for it, but I would like to save time by not looking through old posts now). For Russian it is 1, for Hebrew it is 4 or 5 (now defined in the source code as 5), and for Hungarian maybe 2.
The same idea applies to =x (at the end of words). Note that PRE in PRENUM is not short for a linguistic prefix, even though for some languages it can be the same.
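
A minimal sketch of this iterative cutting, with a per-language cap in the spirit of PRENUM_MAX (Python, with made-up ASCII stand-ins for the Hebrew prefix letters; the function and its names are hypothetical, not the actual tokenizer code):

```python
def prefix_splits(word, prefixes, max_strips):
    """Yield all ways to peel off prefix tokens (listed as 'x=' in the dict)
    from the front of `word`, cutting one token at a time,
    at most `max_strips` times (a PRENUM_MAX-like cap)."""
    yield [word]                        # the unsplit alternative
    if max_strips == 0:
        return
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            for rest in prefix_splits(word[len(p):], prefixes, max_strips - 1):
                yield [p + "="] + rest  # keep the cut mark on the stripped token

# Example with invented ASCII stand-ins for one-letter Hebrew prefixes,
# and a cap of 5 (as for Hebrew; it would be 1 for Russian):
for alt in prefix_splits("wbbyt", ["w", "b"], max_strips=5):
    print(alt)
# ['wbbyt']
# ['w=', 'bbyt']
# ['w=', 'b=', 'byt']
# ['w=', 'b=', 'b=', 'yt']
```

A dict lookup (not shown here) would then discard the alternatives whose remaining token is not a word, e.g. the last one.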

In particular, I mentioned that I don't like the stem mark .= because it seems to me like a kind of linguistic mark, while we are dealing with tokenization. In my scheme, stems can be denoted by x=.POS, where POS may be a subscript that denotes a stem.

It still may be that some tokens should be marked as =x= (just show me an example in which x= and =x are not enough, and we can think of the needed extension).

We have never talked about derivational vs. inflectional use of morphemes. For the purposes of the current LG, it seems to me there is no need to support breaking into derivational morphemes, because the dict (and even the current unsupervised language-learning idea) cannot deal with that, as their combinations don't necessarily have an exact predefined usage/meaning, by definition.

@ampli
Member

ampli commented Mar 14, 2018

But it occurred to me (and I also posted about this in detail) that even these terms are not needed, if we just mark the possible internal links somehow, because the tokenizer can then infer everything from the dict. And as usual, for many things that I said, including this, I also wrote (before I said...) a demo program to check them.

I actually tested that, using a "trivial" tokenizer, basically as follows (a rough sketch appears below):

  1. Call anysplit(), with the number of possible splits set high (say >10).
  2. Discard splits for which not all the tokens are in the dict:
    -- Handle the current = marks as expected (maybe not implemented in my demo - need to check).
    -- Don't allow more than one token that doesn't include punctuation.
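
A rough sketch of that filter (Python; all_splits() merely stands in for the real anysplit(), and the toy dict is invented for illustration; the '=' mark and punctuation rules are omitted):

```python
from itertools import combinations

def all_splits(word, max_parts):
    """Stand-in for anysplit(): every way to cut `word` into up to
    `max_parts` contiguous pieces."""
    for n in range(1, max_parts + 1):
        for cuts in combinations(range(1, len(word)), n - 1):
            bounds = (0,) + cuts + (len(word),)
            yield [word[bounds[i]:bounds[i + 1]] for i in range(n)]

def dict_filter(word, dictionary, max_parts=5):
    """Step 2 above: keep only the splits in which every token is in the dict."""
    return [s for s in all_splits(word, max_parts)
            if all(tok in dictionary for tok in s)]

# Toy dict, just for illustration:
toy_dict = {"10", "$", "10$"}
print(dict_filter("10$", toy_dict))
# -> [['10$'], ['10', '$']]
```

The second alternative in the output is exactly the "extra splitting" described next.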

The problem with this tokenizer was "extra splitting". For example, 10$ got split into 10 $.
As a result, unparsable sentences got strange parses.
If $ were marked as e.g. $=, or the link between $ and NUMBER were marked as an "internal link", then this extra split could be avoided. Note that $ and NUMBER may have a link that is marked as both external and internal (or an OR of two link types) if supporting $ 10 (with a blank) is desired. When reading the dict, the tokens can be marked according to their possible types of linkage (but I have never tried that). Using link types is more flexible, but reading the dict and marking the words for tokenization can be tricky.
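
A hypothetical illustration of that idea (the link tables and their entries are invented, not actual dict syntax): a split of a string that contained no whitespace would be accepted only if each adjacent token pair can be joined by a link marked as internal.

```python
# Invented tables standing in for '$=' / link-type marks read from the dict.
INTERNAL_LINKS = {("10", "$")}   # '10$' may be cut between NUMBER and $
EXTERNAL_LINKS = {("$", "10")}   # '$ 10' (with a blank) is linked externally

def internal_split_ok(tokens):
    """True if every adjacent pair can be joined by an internal link,
    i.e. the cut does not invent whitespace that wasn't in the input."""
    return all((a, b) in INTERNAL_LINKS for a, b in zip(tokens, tokens[1:]))

print(internal_split_ok(["10", "$"]))   # True
print(internal_split_ok(["$", "10"]))   # False: only an external link exists
```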

Such a simple implementation has a heavy overhead. An efficient implementation is much more complex.

EDIT: Fix a markdown error that caused a problem.

@linas
Member Author

linas commented Apr 27, 2018

I just added the following text to the README; it seems relevant:

Handling zero/phantom words as re-write rules.

A more principled approach to fixing the phantom-word issue is to
borrow the idea of re-writing from the theory of
operator grammar.
That is, certain phrases and constructions can be (should be)
re-written into their "proper form", prior to parsing. The re-writing
step would insert the missing words, then the parsing proceeds. One
appeal of such an approach is that re-writing can also handle other
"annoying" phenomena, such as typos (missing apostrophes, e.g. "lets"
vs. "let's", "its" vs. "it's") as well as multi-word rewrites (e.g.
"let's" vs. "let us", or "it's" vs. "it is").

Exactly how to implement this is unclear. However, it seems to open
the door to more abstract, semantic analysis. Thus, for example, in
Meaning-Text Theory (MTT), one must move between SSynt and DSynt
structures. Such changes require a graph re-write from the surface
syntax parse (e.g. provided by link-grammar) to the deep-syntactic
structure. By contrast, handling phantom words by graph re-writing
prior to parsing inverts the order of processing. This suggests that
a more holistic approach is needed to graph rewriting: it must somehow
be performed "during" parsing, so that parsing can both guide the
insertion of the phantom words, and, simultaneously guide the deep
syntactic rewrites.

Another interesting possibility arises with regards to tokenization.
The current tokenizer is clever, in that it splits not only on
whitespace, but can also strip off prefixes, suffixes, and perform
certain limited kinds of morphological splitting. That is, it currently
has the ability to re-write single words into sequences of words. It
currently does so in a conservative manner; the letters that compose
a word are preserved, with a few exceptions, such as making spelling
correction suggestions. The above considerations suggest that the
boundary between tokenization and parsing needs to become both more
fluid, and more tightly coupled.
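
A minimal sketch of what such a pre-parse re-write step could look like (the rule table and the rewrite() function are purely illustrative; nothing like this currently exists in link-grammar):

```python
import re

# Illustrative re-write rules applied to the raw sentence before parsing;
# each entry is (pattern, replacement). The table itself is invented here.
REWRITES = [
    (r"\blets\b", "let's"),     # missing apostrophe
    (r"\blet's\b", "let us"),   # multi-word expansion
    (r"\bit's\b", "it is"),
]

def rewrite(sentence):
    """Apply the re-write rules in order, producing the 'proper form'
    that would then be handed to the parser."""
    for pattern, repl in REWRITES:
        sentence = re.sub(pattern, repl, sentence)
    return sentence

print(rewrite("lets go home"))   # -> "let us go home"
```

As the README text above notes, a fixed table like this cannot be guided by the parse itself, which is exactly the coupling problem it raises.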

@ampli
Member

ampli commented Apr 27, 2018

> This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words

There is a problem with that and classic parsing:
Classic parsing is done by attempting to parse ranges of words.
Even for perfectly correct sentences, the vast majority of these ranges are unparsable, but often they can be corrected by inserting some words...

> The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.

On my to-do list is providing a feedback infrastructure.

@linas
Member Author

linas commented Apr 27, 2018

> often they can be corrected by inserting some words...

This has to somehow be done probabilistically -- one has to recognize likely words to insert, and not waste time trying all possibilities, most of which would be absurdly unlikely. Hmm. Like the punch-line of a joke is unexpected....

Solving this will require some kind of deep theoretical insight or inspiration. I've got only the vaguest glimmer right now.

@ampli
Member

ampli commented Apr 27, 2018

Say we have a perfectly correct sentence.
Suppose that the parse explores 10M ranges, and 99% of them are unparsable.
So we have about 10M unparsable ranges, and in many of them a "likely word" will be able to make them parsable. We will get many fix suggestions for a sentence that is not even incorrect.
The problem is that each parse of a limited word range doesn't see the whole picture.
