bug: khmer model custom wordbreaker issues #230

MakaraSok · 2023-05-11T08:58:44Z

Describe the bug

The crash happened after this activity was done. See the crash in action:

predictive.text.crashes.mov

Reproduce the bug

No response

Expected behavior

No response

Related issues

No response

Keyman apps

Keyman version

17.0.104-alpha

Operating system

iOS 16.4

Device

iPhone Pro Max Simulator

Target application

No response

Browser

No response

Keyboard name

sil_jarai

Keyboard version

1.0

Language name

Jarai

Additional context

https://keyman.com/keyboards/sil_jarai?bcp47=jra-khmr

jahorton · 2023-12-04T02:11:25Z

This is essentially the same issue seen at keymanapp/keyman#6900, but conflated with issues that arise when handling Khmer script.

Relevant codeblock from the corresponding lexical model:

lexical-models/release/sil/sil.jra-khmr.jarai/source/sil.jra-khmr.jarai.model.ts

Lines 12 to 21 in 0491eb0

    
  wordBreaker: function(str: string) {
    return str.split(/\s/).map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });

For starters, note that this "wordbreaker" was always intended to be something of a stand-in until we develop a better way to handle cases with scripts that don't normally do wordbreaking. (The majority language for the script is Khmer, which doesn't... even if Jarai itself does.)

Furthermore, this wordbreaker is not aware of any implicit meaning behind any punctuation marks in the script - it only breaks on spaces and nothing else. Thus, the guillemets (the double angle-brackets acting as quotation marks) are considered the same as letters and thus part of the same word.
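To make that concrete, here is a minimal sketch that runs the same logic as a standalone function. The name `spaceOnlyWordBreaker` and the placeholder text `a «bc» d` are only for illustration; they are not from the model or from the reported text.

```typescript
// Hypothetical standalone copy of the model's wordBreaker, for illustration only.
function spaceOnlyWordBreaker(str: string) {
  return str.split(/\s/).map(function(token) {
    return {
      left: str.indexOf(token),
      start: str.indexOf(token),
      right: str.indexOf(token) + token.length,
      end: str.indexOf(token) + token.length,
      text: token
    };
  });
}

// Placeholder input: the guillemets stay attached to the word between them.
const spaceOnlySpans = spaceOnlyWordBreaker('a «bc» d');
console.log(spaceOnlySpans.map(function(span) { return span.text; }));
// [ 'a', '«bc»', 'd' ]
```

The middle token spans the guillemets as well as the letters, so accepting a suggestion for that "word" replaces all of it.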

Refer to the video associated with keymanapp/keyman#6900:

lm.replace.quote.and.character.typed.with.the.selected.suggestion.mov

The guillemets are replaced because, as far as the system knows, they are part of the word, not separate. This, in turn, naturally has a strong knock-on effect of making predictions a lot more difficult. No Khmer word actually starts with a left-guillemet («), after all.

With my current attempts at reproducing it, the engine actually does recover on the first post-guillemet keystroke most of the time. Selecting such a suggestion also erases the guillemet due to the details noted above re: the model's wordbreaker. It also recovers instantly when starting a new word. Thus, it's not "crashing" - just "failing to find any suggestions."

Finally, note that the predictive-text engine will only allow so many corrections before it stops looking. Having to outright delete the « in order to make good suggestions for the text after it is quite costly, and that doesn't reset within the word at present. So, even when "working", corrections will seem markedly more limited in this context.

jahorton · 2023-12-04T02:25:53Z

Looking back through related issue and PR history, this thread seems particularly relevant: https://github.com/keymanapp/keyman/pull/6574/files#r861500917

If we did allow character-class overrides, that'd provide a way to avoid writing a complex custom wordbreaker. But, for now, perhaps I should just tweak this hacky would-be wordbreaker to hack off the « from the actual word.

jahorton · 2023-12-04T02:51:23Z

Here's my first-pass prototype at resolving this.

  wordBreaker: function(str: string) {
    const tokens = str.split(/\s/);
  
    for(let i=0; i < tokens.length; i++) {
      const token = tokens[i];
      if(token.length == 1) {
        continue;
      }
  
      // Opening quotes should be considered a separate token from the word they're next to.
      const punctuation = '«';
      let splitPoint = token.indexOf(punctuation);
      if(splitPoint > -1) {
        const left = token.substring(0, splitPoint);  // (0, -1) => ''
        const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''
  
        if(left) {
          tokens.splice(i++, 0, left);
        }
        tokens.splice(i++, 1, punctuation);
        if(right) {
          tokens.splice(i, 0, right);
        }
        // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
        // there was a `right` portion, as it may have extra marks that also need to be spun off.
        i--; 
      }
    }
  
    return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });

If there are other punctuation marks worth splitting off, I can extend it further, though there will be a bit of extra complexity needed: marks with earlier indices within a token should be processed before later indices for that same token. A bit of an edge case, to be sure, but it could matter at some point.

This suggestion has been tested locally with punctuation = "'" and the string "The quick brown 'fox' jumped over the lazy dog. 'qu'ot'at'i'o'n'". (The mangled 'qu'ot'at'i'o'n' was there to stress-test things.)

'fox' => ', fox, '
'qu'ot'at'i'o'n' => ', qu, ', ot, ', at, ', i, ', o, ', n, '

full output:

[
  'The', 'quick', 'brown',  "'",
  'fox', "'",     'jumped', 'over',
  'the', 'lazy',  'dog.',   '',
  "'",   'qu',    "'",      'ot',
  "'",   'at',    "'",      'i',
  "'",   'o',     "'",      'n',
  "'"
]

Note that dog. remains because this code isn't checking for . - just '.
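For anyone wanting to repeat that test, here is a rough standalone sketch of just the token-splitting loop. `splitOffPunctuation` is a hypothetical name, not something in the model file. It also assumes the test string had two spaces after "dog.", which is inferred from the empty '' token in the output above, and that the punctuation mark is a single UTF-16 code unit.

```typescript
// Hypothetical standalone extraction of the prototype's token-splitting loop.
// Assumes `punctuation` is a single UTF-16 code unit (true for ' and «).
function splitOffPunctuation(str: string, punctuation: string): string[] {
  const tokens = str.split(/\s/);

  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (token.length == 1) {
      continue;
    }

    const splitPoint = token.indexOf(punctuation);
    if (splitPoint > -1) {
      const left = token.substring(0, splitPoint);
      const right = token.substring(splitPoint + 1);

      if (left) {
        tokens.splice(i++, 0, left);       // insert `left` before the original token
      }
      tokens.splice(i++, 1, punctuation);  // replace the original token with the mark
      if (right) {
        tokens.splice(i, 0, right);        // insert `right` after the mark
      }
      i--;                                 // revisit `right` on the next iteration
    }
  }
  return tokens;
}

// Assumption: two spaces after "dog." (inferred from the '' token in the output).
const sampleText = "The quick brown 'fox' jumped over the lazy dog.  'qu'ot'at'i'o'n'";
console.log(splitOffPunctuation(sampleText, "'"));
```

With that input it should print the same 25 tokens as the full output above.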

jahorton · 2024-01-26T07:55:15Z

Enhancing this to allow splitting off multiple punctuation marks, rather than just one...

  wordBreaker: function(str: string) {
    const tokens = str.split(/\s/);

    for(let i=0; i < tokens.length; i++) {
      const token = tokens[i];
      if(token.length == 1) {
        continue;
      }

      // Certain punctuation marks should be considered a separate token from the word they're next to.
      const punctuationMarks = ['«', '»' /* add extras here */];
      const punctSplitIndices: number[] = [];
      // Find if and where each mark exists within the token.
      // (Uses `j` so that the outer loop's `i` is not shadowed.)
      for(let j = 0; j < punctuationMarks.length; j++) {
        const split = token.indexOf(punctuationMarks[j]);
        if(split >= 0) {
          punctSplitIndices.push(split);
        }
      }
      // Sort numerically (the default sort compares as strings) and pick the earliest
      // mark's location.  If none exists, use -1.  Note that 0 is a valid location,
      // so `|| -1` cannot be used here.
      punctSplitIndices.sort((a, b) => a - b);
      const splitPoint = punctSplitIndices.length > 0 ? punctSplitIndices[0] : -1;

      if(splitPoint > -1) {
        // The mark that was actually found at the split point.
        const punctuation = token.charAt(splitPoint);
        const left = token.substring(0, splitPoint);  // splitPoint == 0 => ''
        const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''

        if(left) {
          tokens.splice(i++, 0, left);
        }
        tokens.splice(i++, 1, punctuation);
        if(right) {
          tokens.splice(i, 0, right);
        }
        // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
        // there was a `right` portion, as it may have extra marks that also need to be spun off.
        i--;
      }
    }

    return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });

As a reminder, this is a custom wordbreaker used within lexical-model projects. Anywhere you've used this one:

lexical-models/release/sil/sil.jra-khmr.jarai/source/sil.jra-khmr.jarai.model.ts

Lines 12 to 21 in 0491eb0

    
  wordBreaker: function(str: string) {
    return str.split(/\s/).map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });

This new one is an enhancement of that: it also splits off whatever specific punctuation marks you list in the punctuationMarks array, at the /* add extras here */ comment.
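One caveat that applies to both the old and the new version: `str.indexOf(token)` always returns the first occurrence of the token, so once punctuation is split off, repeated tokens (a second «, or the same word twice) all report the same span. A minimal sketch of one possible way around that, assuming the tokens appear in the string in order, is to search from a moving cursor. `tokensToSpans` and `Span` are hypothetical names, not part of the model or of Keyman's API.

```typescript
// Hypothetical span shape, mirroring the fields returned by the wordbreaker above.
interface Span {
  left: number;
  start: number;
  right: number;
  end: number;
  text: string;
}

// Hypothetical helper: maps in-order tokens to spans, searching from where the
// previous token ended so that repeated tokens get distinct positions.
function tokensToSpans(str: string, tokens: string[]): Span[] {
  let cursor = 0;
  return tokens.map(function(token) {
    const start = str.indexOf(token, cursor);
    const end = start + token.length;
    cursor = end;
    return { left: start, start: start, right: end, end: end, text: token };
  });
}

const repeatedSpans = tokensToSpans('«a» «a»', ['«', 'a', '»', '«', 'a', '»']);
console.log(repeatedSpans.map(function(span) { return span.start; }));
// [ 0, 1, 2, 4, 5, 6 ]
```

With plain `str.indexOf(token)` the same tokens would start at 0, 1, 2, 0, 1, 2.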

MakaraSok added the bug Something isn't working label May 11, 2023

trosel mentioned this issue May 17, 2023

bug (iOS): Shavian keyboard freezes when typing third character keymanapp/keyman#8786

Open

8 tasks

mcdurdin added this to the A17S21 milestone Jul 7, 2023

mcdurdin removed this from the A17S21 milestone Jul 31, 2023

sgschantz added this to the A17S24 milestone Aug 3, 2023

jahorton modified the milestones: A17S24, A17S25 Oct 16, 2023

sgschantz modified the milestones: A17S25, A17S26 Oct 30, 2023

mcdurdin assigned jahorton Nov 8, 2023

jahorton assigned jahorton and unassigned jahorton Nov 23, 2023

darcywong00 modified the milestones: A17S26, A17S27 Nov 27, 2023

jahorton mentioned this issue Dec 4, 2023

bug(common/models): LM replaces left quote and character(s) with the selected option keymanapp/keyman#6900

Closed

jahorton transferred this issue from keymanapp/keyman Dec 8, 2023

jahorton changed the title ~~bug(ios): predictive text crashes~~ bug: khmer model custom wordbreaker issues Dec 8, 2023

mcdurdin modified the milestones: A17S27, A17S28 Dec 8, 2023

jahorton assigned MakaraSok Dec 12, 2023

mcdurdin modified the milestones: A17S28, A17S29 Dec 30, 2023

mcdurdin modified the milestones: A17S29, A17S30 Jan 6, 2024

mcdurdin modified the milestones: A17S30, A17S31 Jan 20, 2024

darcywong00 modified the milestones: B17S3, B17S4 Mar 16, 2024

mcdurdin modified the milestones: B17S4, B17S5 Mar 30, 2024

darcywong00 modified the milestones: B17S5, B17S6 Apr 12, 2024

darcywong00 modified the milestones: B17S6, A18S1 Apr 28, 2024

darcywong00 modified the milestones: A18S1, A18S2 May 11, 2024

mcdurdin modified the milestones: A18S2, A18S3 May 24, 2024

mcdurdin modified the milestones: A18S3, A18S4 Jun 7, 2024

darcywong00 modified the milestones: A18S4, A18S5 Jun 21, 2024

darcywong00 modified the milestones: A18S5, A18S6 Jul 5, 2024

darcywong00 modified the milestones: A18S6, A18S7 Jul 19, 2024

darcywong00 modified the milestones: A18S7, A18S8 Aug 2, 2024

darcywong00 modified the milestones: A18S8, A18S9 Aug 17, 2024

darcywong00 modified the milestones: A18S9, A18S10 Aug 31, 2024

darcywong00 modified the milestones: A18S10, A18S11 Sep 14, 2024

darcywong00 modified the milestones: A18S11, A18S12 Sep 28, 2024
