
Tokenization and morphology #700

linas opened this issue Mar 13, 2018 · 6 comments

@linas
Member

linas commented Mar 13, 2018

An ongoing discussion of tokenization and its relation to morphology is happening in issue #42. It really should be happening here.

@ampli
Member

ampli commented Mar 14, 2018

(Continuing the discussion from issue #42.)

> Hebrew: "LPUNC* (WORD=+ | WORD=* WORD)? RPUNC*"

First I need to make clear what these = marks mean.
Originally (in the Russian dict) they marked on which side you can join the tokens into which a word (a sentence string) can be broken. In particular, you cannot join them on the side that has no = mark, and there cannot be whitespace after/before the = mark.

Now we would like to extend word breaking to more than two tokens.

One "natural" way to enable a word ABC to break into A B C is to have 'A=, =B=and=C'.

However, I found it more useful (for tokenization) to extend it in another way, of which the original one is a special case:
Let the = denote where you can cut the word. This process is done iteratively, one token at a time.
The above whitespace rule still applies.

Due to that, I marked the Hebrew "prefix tokens" as x= even though they can appear in the middle of a sentence string, because during the tokenization process they are "stripped off" one by one at the marked cut point.

(BTW, any character that cannot appear in a word may be used instead of this =, as long as the same one is used in the dict, and it may be useful to use several different such notations.)

In my posts I mentioned the idea of defining in the affix file something like PRENUM_MAX (I most probably used another name for it, but I would like to save time by not looking through old posts now). For Russian it is 1, for Hebrew it is 4 or 5 (now defined in the source code as 5), and for Hungarian maybe 2.
The same idea applies to =x (at the end of words). Note that PRE in PRENUM is not short for a linguistic prefix, even though for some languages it can be the same.
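
A minimal sketch of this iterative cutting, with a per-language cap in the spirit of PRENUM_MAX (Python, with made-up ASCII stand-ins for the Hebrew prefix letters; the function and its names are hypothetical, not the actual tokenizer code):

```python
def prefix_splits(word, prefixes, max_strips):
    """Yield all ways to peel off prefix tokens (listed as 'x=' in the dict)
    from the front of `word`, cutting one token at a time,
    at most `max_strips` times (a PRENUM_MAX-like cap)."""
    yield [word]                        # the unsplit alternative
    if max_strips == 0:
        return
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            for rest in prefix_splits(word[len(p):], prefixes, max_strips - 1):
                yield [p + "="] + rest  # keep the cut mark on the stripped token

# Example with invented ASCII stand-ins for one-letter Hebrew prefixes,
# and a cap of 5 (as for Hebrew; it would be 1 for Russian):
for alt in prefix_splits("wbbyt", ["w", "b"], max_strips=5):
    print(alt)
# ['wbbyt']
# ['w=', 'bbyt']
# ['w=', 'b=', 'byt']
# ['w=', 'b=', 'b=', 'yt']
```

A dict lookup (not shown here) would then discard the alternatives whose remaining token is not a word, e.g. the last one.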

In particular, I mentioned that I don't like the stem mark .= because it seems to me like a kind of linguistic mark, while we are dealing with tokenization. In my scheme, stems can be denoted by x=.POS, where POS may be a subscript that denotes a stem.

It still may be that some tokens should be marked as =x= (just show me an example in which x= and =x are not enough, and we can think of the needed extension).

We have never talked about derivational vs. inflectional use of morphemes. For the purposes of the current LG, it seems to me there is no need to support breaking into derivational morphemes, because the dict (and even the current unsupervised language-learning idea) cannot deal with that, as their combinations don't necessarily have an exact predefined usage/meaning, by definition.

@ampli
Member

ampli commented Mar 14, 2018

But it occurred to me (and I also posted about this in detail) that even these terms are not needed, if we just mark the possible internal links somehow, because the tokenizer can then infer everything from the dict. And as usual, for many things that I said, including this, I also wrote (before I said...) a demo program to check them.

I actually tested that, using a "trivial" tokenizer, basically as follows (a rough sketch appears below):

  1. Call anysplit(), with the number of possible splits set high (say >10).
  2. Discard splits for which not all the tokens are in the dict:
    -- Handle the current = marks as expected (maybe not implemented in my demo - need to check).
    -- Don't allow more than one token that doesn't include punctuation.
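
A rough sketch of that filter (Python; all_splits() merely stands in for the real anysplit(), and the toy dict is invented for illustration; the '=' mark and punctuation rules are omitted):

```python
from itertools import combinations

def all_splits(word, max_parts):
    """Stand-in for anysplit(): every way to cut `word` into up to
    `max_parts` contiguous pieces."""
    for n in range(1, max_parts + 1):
        for cuts in combinations(range(1, len(word)), n - 1):
            bounds = (0,) + cuts + (len(word),)
            yield [word[bounds[i]:bounds[i + 1]] for i in range(n)]

def dict_filter(word, dictionary, max_parts=5):
    """Step 2 above: keep only the splits in which every token is in the dict."""
    return [s for s in all_splits(word, max_parts)
            if all(tok in dictionary for tok in s)]

# Toy dict, just for illustration:
toy_dict = {"10", "$", "10$"}
print(dict_filter("10$", toy_dict))
# -> [['10$'], ['10', '$']]
```

The second alternative in the output is exactly the "extra splitting" described next.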

The problem with this tokenizer was "extra splitting". For example, 10$ got split into 10 $.
As a result, unparsable sentences got strange parses.
If $ were marked as e.g. $=, or the link between $ and NUMBER were marked as an "internal link", then this extra split could be avoided. Note that $ and NUMBER may have a link that is marked as both external and internal (or an OR of two link types) if supporting $ 10 (with a blank) is desired. When reading the dict, the tokens can be marked according to their possible types of linkage (but I have never tried that). Using link types is more flexible, but reading the dict and marking the words for tokenization can be tricky.
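
A hypothetical illustration of that idea (the link tables and their entries are invented, not actual dict syntax): a split of a string that contained no whitespace would be accepted only if each adjacent token pair can be joined by a link marked as internal.

```python
# Invented tables standing in for '$=' / link-type marks read from the dict.
INTERNAL_LINKS = {("10", "$")}   # '10$' may be cut between NUMBER and $
EXTERNAL_LINKS = {("$", "10")}   # '$ 10' (with a blank) is linked externally

def internal_split_ok(tokens):
    """True if every adjacent pair can be joined by an internal link,
    i.e. the cut does not invent whitespace that wasn't in the input."""
    return all((a, b) in INTERNAL_LINKS for a, b in zip(tokens, tokens[1:]))

print(internal_split_ok(["10", "$"]))   # True
print(internal_split_ok(["$", "10"]))   # False: only an external link exists
```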

Such a simple implementation has a heavy overhead. An efficient implementation is much more complex.

EDIT: Fix a markdown error that caused a problem.

@linas
Member Author

linas commented Apr 27, 2018

I just added the following text to the README; it seems relevant:

Handling zero/phantom words as re-write rules.

A more principled approach to fixing the phantom-word issue is to
borrow the idea of re-writing from the theory of
operator grammar.
That is, certain phrases and constructions can be (should be)
re-written into their "proper form", prior to parsing. The re-writing
step would insert the missing words, then the parsing proceeds. One
appeal of such an approach is that re-writing can also handle other
"annoying" phenomena, such as typos (missing apostrophes, e.g. "lets"
vs. "let's", "its" vs. "it's") as well as multi-word rewrites (e.g.
"let's" vs. "let us", or "it's" vs. "it is").

Exactly how to implement this is unclear. However, it seems to open
the door to more abstract, semantic analysis. Thus, for example, in
Meaning-Text Theory (MTT), one must move between SSynt and DSynt
structures. Such changes require a graph re-write from the surface
syntax parse (e.g. provided by link-grammar) to the deep-syntactic
structure. By contrast, handling phantom words by graph re-writing
prior to parsing inverts the order of processing. This suggests that
a more holistic approach is needed to graph rewriting: it must somehow
be performed "during" parsing, so that parsing can both guide the
insertion of the phantom words, and, simultaneously guide the deep
syntactic rewrites.

Another interesting possibility arises with regards to tokenization.
The current tokenizer is clever, in that it splits not only on
whitespace, but can also strip off prefixes, suffixes, and perform
certain limited kinds of morphological splitting. That is, it currently
has the ability to re-write single words into sequences of words. It
currently does so in a conservative manner; the letters that compose
a word are preserved, with a few exceptions, such as making spelling
correction suggestions. The above considerations suggest that the
boundary between tokenization and parsing needs to become both more
fluid, and more tightly coupled.
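
A minimal sketch of what such a pre-parse re-write step could look like (the rule table and the rewrite() function are purely illustrative; nothing like this currently exists in link-grammar):

```python
import re

# Illustrative re-write rules applied to the raw sentence before parsing;
# each entry is (pattern, replacement). The table itself is invented here.
REWRITES = [
    (r"\blets\b", "let's"),     # missing apostrophe
    (r"\blet's\b", "let us"),   # multi-word expansion
    (r"\bit's\b", "it is"),
]

def rewrite(sentence):
    """Apply the re-write rules in order, producing the 'proper form'
    that would then be handed to the parser."""
    for pattern, repl in REWRITES:
        sentence = re.sub(pattern, repl, sentence)
    return sentence

print(rewrite("lets go home"))   # -> "let us go home"
```

As the README text above notes, a fixed table like this cannot be guided by the parse itself, which is exactly the coupling problem it raises.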

@ampli
Member

ampli commented Apr 27, 2018

> This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words

There is a problem with that and classic parsing:
Classic parsing is done by attempting to parse ranges of words.
Even for perfectly correct sentences, the vast majority of these ranges are unparsable, but often they can be corrected by inserting some words...

> The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.

On my to-do list is providing a feedback infrastructure.

@linas
Member Author

linas commented Apr 27, 2018

> often they can be corrected by inserting some words...

This has to somehow be done probabilistically -- one has to recognize likely words to insert, and not waste time trying all possibilities, most of which would be absurdly unlikely. Hmm. Like the punch-line of a joke is unexpected....

Solving this will require some kind of deep theoretical insight or inspiration. I've got only the vaguest glimmer right now.

@ampli
Member

ampli commented Apr 27, 2018

Say we have a perfectly correct sentence.
Suppose that the parse explores 10M ranges, and 99% of them are unparsable.
So we have about 10M unparsable ranges, and in many of them a "likely word" will be able to make them parsable. We will get many fix suggestions for a sentence that is not even incorrect.
The problem is that each parse of a limited word range doesn't see the whole picture.
