word tokenizer works only for 7 bit ascii #29

plasmagunman · 2014-07-28T16:11:42Z

the word tokenizer for WordDiff and WordWithSpaceDiff uses \b in its regular expression. that considers word characters as [a-zA-Z0-9_], which fails on anything beyond 7 bit.

f.e. the german phrase "wir üben" splits to:

'wir üben'.split(/\b/);
-> ["wir", " ü", "ben"]

replacing the tokenizer with value.split(/(\s+)/) is sufficient in my use-case, but i don't have newlines in my text. some further testing needed, i think.

further reading:
http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters/10590620#10590620

The text was updated successfully, but these errors were encountered:

matanox · 2014-08-06T19:06:58Z

Handling this may be related to handling #30.
Regular expressions don't really shine in this, but at least the ability to override them could be provided, to provide users more control over word tokenization..

A clean solution would require ditching the use of regex, as it is inflexible in the definition of what is a word and what is not and difficult to tinker for that. Will non-regex comparisons make things much slower??

papandreou · 2014-09-29T20:47:36Z

@matanster Hmm, what exactly is wrong with regexps in this regard?

A couple of these could come in handy btw.: https://github.com/One-com/unicoderegexp/blob/master/lib/unicodeRegExp.js#L16-L23

matanox · 2014-09-29T21:53:42Z

@papandreou I have moved on since, but well regex is very opinionated about what is a word character, which simply doesn't cut it in some cases. It's also quick-and-dirty, my sentiment on that is similar to that expressed at https://github.com/sirthias/parboiled/wiki/RegEx-vs.-parboiled-vs.-Parser-Generators.

papandreou · 2014-09-29T22:02:12Z

Yes, \w is opinionated in JavaScript regexps, but you can also spell out exactly which characters you want (see the link I gave).

matanox · 2014-09-29T22:11:09Z

Yes, thanks for the link! but regex is only poorly composable - you would not engineer any other part of your software in the arcane messy way that regex are crafted, unless you write assembly code for fun. Yes that link is a good example of trying to be modular about regex.... in a sense likewise to https://github.com/VerbalExpressions/JSVerbalExpressions. Since Javascript is so nice with function passing, why not have jsdiff let you provide a matching function rather than a regex? it's a standard hallmark of some other javascript libraries (e.g. d3.js to name one) where flexibility has been held in high regard.

papandreou · 2014-09-30T05:52:34Z

Fair enough to dislike regular expressions, but I'm not sure that I understand the alternative you're proposing? If the job at hand is to make a word tokenizer, and you need to distinguish between all (unicode) letters and non-letters, I don't see how you could do much better than to have something like unicodeRegExp.letter (https://github.com/One-com/unicoderegexp/blob/master/lib/unicodeRegExp.js#L17) in a var, then build a tokenizer regexp using unicodeRegExp.letter.source and new RegExp. That's why I pasted the link.

If your building blocks are functions, you cannot do that trick.

papandreou · 2014-09-30T05:56:11Z

An alternative is to use xregexp with the unicode addon. Then you can just use \p{Letter} and \p{^Letter} inside your regexp. Incidentally, that's where those huge character classes in unicodeRegExp are lifted from :)

Hopefully these features will make it into JavaScript proper so that we'll have real Unicode support in that area as well.

kpdecker · 2015-08-07T20:34:42Z

Released in 2.0.0

kpdecker mentioned this issue Aug 4, 2015

Special characters #58

Closed

kpdecker closed this as completed in 7c9a1fe Aug 6, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

word tokenizer works only for 7 bit ascii #29

word tokenizer works only for 7 bit ascii #29

plasmagunman commented Jul 28, 2014

matanox commented Aug 6, 2014

papandreou commented Sep 29, 2014

matanox commented Sep 29, 2014

papandreou commented Sep 29, 2014

matanox commented Sep 29, 2014

papandreou commented Sep 30, 2014

papandreou commented Sep 30, 2014

kpdecker commented Aug 7, 2015

word tokenizer works only for 7 bit ascii #29

word tokenizer works only for 7 bit ascii #29

Comments

plasmagunman commented Jul 28, 2014

matanox commented Aug 6, 2014

papandreou commented Sep 29, 2014

matanox commented Sep 29, 2014

papandreou commented Sep 29, 2014

matanox commented Sep 29, 2014

papandreou commented Sep 30, 2014

papandreou commented Sep 30, 2014

kpdecker commented Aug 7, 2015