Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

word tokenizer works only for 7 bit ascii #29

Closed
plasmagunman opened this issue Jul 28, 2014 · 8 comments
Closed

word tokenizer works only for 7 bit ascii #29

plasmagunman opened this issue Jul 28, 2014 · 8 comments

Comments

@plasmagunman
Copy link

the word tokenizer for WordDiff and WordWithSpaceDiff uses \b in its regular expression. that considers word characters as [a-zA-Z0-9_], which fails on anything beyond 7 bit.

f.e. the german phrase "wir üben" splits to:

'wir üben'.split(/\b/);
-> ["wir", " ü", "ben"]

replacing the tokenizer with value.split(/(\s+)/) is sufficient in my use-case, but i don't have newlines in my text. some further testing needed, i think.

further reading:
http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters/10590620#10590620

@matanox
Copy link

matanox commented Aug 6, 2014

Handling this may be related to handling #30.
Regular expressions don't really shine in this, but at least the ability to override them could be provided, to provide users more control over word tokenization..

A clean solution would require ditching the use of regex, as it is inflexible in the definition of what is a word and what is not and difficult to tinker for that. Will non-regex comparisons make things much slower??

@papandreou
Copy link
Contributor

@matanster Hmm, what exactly is wrong with regexps in this regard?

A couple of these could come in handy btw.: https://github.com/One-com/unicoderegexp/blob/master/lib/unicodeRegExp.js#L16-L23

@matanox
Copy link

matanox commented Sep 29, 2014

@papandreou I have moved on since, but well regex is very opinionated about what is a word character, which simply doesn't cut it in some cases. It's also quick-and-dirty, my sentiment on that is similar to that expressed at https://github.com/sirthias/parboiled/wiki/RegEx-vs.-parboiled-vs.-Parser-Generators.

@papandreou
Copy link
Contributor

Yes, \w is opinionated in JavaScript regexps, but you can also spell out exactly which characters you want (see the link I gave).

@matanox
Copy link

matanox commented Sep 29, 2014

Yes, thanks for the link! but regex is only poorly composable - you would not engineer any other part of your software in the arcane messy way that regex are crafted, unless you write assembly code for fun. Yes that link is a good example of trying to be modular about regex.... in a sense likewise to https://github.com/VerbalExpressions/JSVerbalExpressions. Since Javascript is so nice with function passing, why not have jsdiff let you provide a matching function rather than a regex? it's a standard hallmark of some other javascript libraries (e.g. d3.js to name one) where flexibility has been held in high regard.

@papandreou
Copy link
Contributor

Fair enough to dislike regular expressions, but I'm not sure that I understand the alternative you're proposing? If the job at hand is to make a word tokenizer, and you need to distinguish between all (unicode) letters and non-letters, I don't see how you could do much better than to have something like unicodeRegExp.letter (https://github.com/One-com/unicoderegexp/blob/master/lib/unicodeRegExp.js#L17) in a var, then build a tokenizer regexp using unicodeRegExp.letter.source and new RegExp. That's why I pasted the link.

If your building blocks are functions, you cannot do that trick.

@papandreou
Copy link
Contributor

An alternative is to use xregexp with the unicode addon. Then you can just use \p{Letter} and \p{^Letter} inside your regexp. Incidentally, that's where those huge character classes in unicodeRegExp are lifted from :)

Hopefully these features will make it into JavaScript proper so that we'll have real Unicode support in that area as well.

@kpdecker
Copy link
Owner

kpdecker commented Aug 7, 2015

Released in 2.0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants