Unicode Default Word Boundary

Implements the Unicode UAX #29 §4.1 default word boundary specification, for finding word breaks in multilingual text.

Use this to split words in text! UAX #29 is a lot smarter than the \b word boundary in JavaScript's regular expressions! Note that, by default, \b, \w, and \d in JavaScript regular expressions only recognize ASCII characters.
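To see the limitation concretely, here is a small sketch using only built-in regular expressions (no package required). The sample strings are chosen for illustration.

```javascript
// \w matches only [A-Za-z0-9_], so Cyrillic letters are never "word" characters.
const russian = 'фальшивый экземпляр';
const asciiWords = russian.match(/\w+/g);
console.log(asciiWords); // null

// \b is defined in terms of \w, so it reports a boundary before the accented letter.
// The string is "café au lait", written with an escape for the precomposed é.
const pieces = 'caf\u00e9 au lait'.split(/\b/);
console.log(pieces); // [ 'caf', 'é ', 'au', ' ', 'lait' ]
```

The second result shows \b cutting "café" in two, which is the kind of break the UAX #29 rules are designed to avoid.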

Usage

Import the module and use the split() function:

const split = require('unicode-default-word-boundary').split;

console.log(split(`The quick (“brown”) fox can’t jump 32.3 feet, right?`));

Output:

[ 'The', 'quick', '(', '“', 'brown', '”', ')', 'fox', 'can’t', 'jump', '32.3', 'feet', ',', 'right', '?' ]

But that's not all! Try it with non-English text, like Russian:

split(`В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!`)

[ 'В', 'чащах', 'юга', 'жил', 'бы', 'цитрус', '?', 'Да', ',', 'но', 'фальшивый', 'экземпляр', '!' ]

...Hebrew:

split(`איך בלש תפס גמד רוצח עז קטנה?`);

[ 'איך', 'בלש', 'תפס', 'גמד', 'רוצח', 'עז', 'קטנה', '?' ]

...nêhiyawêwin:

split(`ᑕᐻ ᒥᔪ ᑭᓯᑲᐤ ᐊᓄᐦᐨ᙮`);

[ 'ᑕᐻ', 'ᒥᔪ ᑭᓯᑲᐤ', 'ᐊᓄᐦᐨ', '᙮' ]

...and many more!

More advanced use cases will want to use the findSpans() function.

What doesn't work

Languages that do not mark word breaks in writing, such as Chinese, Japanese, Thai, Lao, and Khmer, are not handled well. You'll need to use statistical or dictionary-based approaches to split words in these languages.
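As one possible alternative for such languages, the sketch below uses Intl.Segmenter, which is built into modern JavaScript engines and is not part of this package. It assumes a runtime that ships full locale data (recent Node.js releases do by default); the exact segmentation depends on that data, so no expected output is shown.

```javascript
// Intl.Segmenter is a standard ECMAScript Internationalization API,
// used here as a stand-in for a dictionary-based word splitter.
const segmenter = new Intl.Segmenter('th', { granularity: 'word' });

// Thai sentence written without spaces between words.
const thai = 'ภาษาไทยไม่มีช่องว่างระหว่างคำ';

const thaiWords = Array.from(segmenter.segment(thai))
  .filter((segment) => segment.isWordLike) // drop punctuation and whitespace
  .map((segment) => segment.segment);

console.log(thaiWords);
```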

API Documentation

There are two exported functions: split() and findSpans().

`split(text: string): string[]`

split() splits the text at word boundaries, returning an array of all "words" from the text that contain characters other than whitespace.

See above for examples.

`findSpans(text: string): Iterable<BasicSpan>`

findSpans() is a generator that yields successive basic spans from the text. A basic span is a chunk of text that is guaranteed to start at a word boundary and end at the next word boundary. In other words, basic spans are indivisible in that there are no word boundaries contained within a basic span.

A basic span has the following properties:

interface BasicSpan {
    /** Where the span starts, relative to the input text. */
    start: number;
    /** At what index does the **next** span begin. */
    end: number;
    /** How many UTF-16 code units are in this span (its string length). */
    length: number;
    /** The text contained within this span. */
    text: string;
}

Note that, unlike split(), findSpans() does yield spans that contain whitespace.
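If you want split()-like output but still need offsets, one option is to drop the whitespace-only spans yourself. The sketch below is an assumption about how that could be done, not a description of how split() is implemented. dropWhitespaceSpans is a hypothetical helper, and the sample array stands in for the result of findSpans() (using the values this README lists for "Hello, world🌎!") so that the snippet runs on its own.

```javascript
// Hypothetical helper, not part of the package API.
function* dropWhitespaceSpans(spans) {
  for (const span of spans) {
    // Keep a span only if it contains at least one non-whitespace character.
    if (/\S/.test(span.text)) {
      yield span;
    }
  }
}

// Stand-in for findSpans('Hello, world🌎!').
const sampleSpans = [
  { start: 0, end: 5, length: 5, text: 'Hello' },
  { start: 5, end: 6, length: 1, text: ',' },
  { start: 6, end: 7, length: 1, text: ' ' },
  { start: 7, end: 12, length: 5, text: 'world' },
  { start: 12, end: 14, length: 2, text: '🌎' },
  { start: 14, end: 15, length: 1, text: '!' },
];

const words = Array.from(dropWhitespaceSpans(sampleSpans), (span) => span.text);
console.log(words); // [ 'Hello', ',', 'world', '🌎', '!' ]
```

With the package installed, the same helper should accept the iterable returned by findSpans(text) directly, since it only reads the text property.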

Example

Array.from(findSpans("Hello, world🌎!"))

Will yield spans with the following properties:

[ { start: 0, end: 5, length: 5, text: 'Hello' },
  { start: 5, end: 6, length: 1, text: ',' },
  { start: 6, end: 7, length: 1, text: ' ' },
  { start: 7, end: 12, length: 5, text: 'world' },
  { start: 12, end: 14, length: 2, text: '🌎' },
  { start: 14, end: 15, length: 1, text: '!' } ]

N.B.: findSpans() may not yield plain JavaScript objects like those shown above. The objects it yields will always adhere to the BasicSpan interface, but they are not guaranteed to be simple object literals.
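Two practical consequences can be sketched as follows. First, if you need plain objects (for example, to serialize or deep-compare them), copy the four interface properties explicitly. Second, start and end are string indices in UTF-16 code units, which is why the emoji above has length 2. Both toPlainSpan and LazySpan are hypothetical names: LazySpan only imitates a non-plain span and is not the class the package uses.

```javascript
// Hypothetical helper: copy the four BasicSpan properties into a plain object.
function toPlainSpan(span) {
  const { start, end, length, text } = span;
  return { start, end, length, text };
}

// Stand-in for a non-plain span whose properties are getters on a class.
class LazySpan {
  constructor(source, start, end) {
    this._source = source;
    this._start = start;
    this._end = end;
  }
  get start() { return this._start; }
  get end() { return this._end; }
  get length() { return this._end - this._start; }
  get text() { return this._source.slice(this._start, this._end); }
}

const input = 'Hello, world🌎!';
const globe = toPlainSpan(new LazySpan(input, 12, 14));
console.log(globe); // { start: 12, end: 14, length: 2, text: '🌎' }

// start and end can be passed straight to String.prototype.slice.
console.log(input.slice(globe.start, globe.end) === globe.text); // true
```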

Contributing and Maintaining

When maintaining this package, you might notice something strange. index.ts depends on ./src/gen/WordBreakProperty.ts, but this file does not exist! It is a generated file, created by reading Unicode property data files, downloaded from Unicode's website. These data files have been compressed and committed to this repository in libexec/:

libexec/
├── WordBreakProperty-12.0.0.txt.gz
├── compile-word-break.js
└── emoji-data-12.0.0.txt.gz

Note that compile-word-break.js actually creates ./src/gen/WordBreakProperty.ts!
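For orientation, Unicode property data files are plain text: each data line holds a code point or a range, a semicolon, and a property value, with # starting a comment. The sketch below parses a few sample lines in that format. It is not the code in compile-word-break.js; parsePropertyFile is a hypothetical name, and the real script would also have to decompress the .gz files first (for example with Node's zlib module).

```javascript
// A few sample lines in the Unicode property data file format (not the full file).
const sampleData = `
# WordBreakProperty sample
000D          ; CR # Cc       <control-000D>
0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0030..0039    ; Numeric # Nd  [10] DIGIT ZERO..DIGIT NINE
`;

// Hypothetical parser: returns { start, end, property } with inclusive code point ranges.
function parsePropertyFile(contents) {
  const ranges = [];
  for (const rawLine of contents.split('\n')) {
    // Everything after '#' is a comment.
    const line = rawLine.split('#')[0].trim();
    if (line === '') continue;
    const [codePoints, property] = line.split(';').map((field) => field.trim());
    // A single code point has no '..', so the range end defaults to the start.
    const [first, last = first] = codePoints.split('..');
    ranges.push({
      start: parseInt(first, 16),
      end: parseInt(last, 16),
      property,
    });
  }
  return ranges;
}

const parsedRanges = parsePropertyFile(sampleData);
console.log(parsedRanges[1]); // { start: 65, end: 90, property: 'ALetter' }
```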

How to generate `./src/gen/WordBreakProperty.ts`

When you have just cloned the repository, this file will be generated when you run npm install:

npm install

If you want to regenerate it afterwards, you can run the build script:

npm run build

Measuring performance

Run npm run test-performance to measure the performance of the split function from the lib directory. You can also run npm run ava -- --config ava-performance.config.cjs to skip the tsc compilation step.

Unicode version

The Unicode version is specified at the top of compile-word-break.js. To update it, change the version in that file and download the corresponding Unicode data files from the Unicode website. The data files must match the version specified in compile-word-break.js.

License

The algorithm comes from UAX #29: Unicode Text Segmentation, an integral part of the Unicode Standard, version 12.0.

centre-for-humanities-computing/unicode-default-word-boundary
