A high-performance wrapper around Intl.Segmenter for efficient text segmentation. This class resolves memory handling issues seen with large strings and "maximum call stack exceeded" exceptions that occur when strings exceed 40-50k characters. Enhances performance by 50-500x. Only ~70 loc (with comments) and no dependencies.

jonschlinkert/intl-segmenter

intl-segmenter

A high-performance wrapper around Intl.Segmenter for efficient text segmentation. This class resolves memory handling issues seen with large strings and can enhance performance by 50-500x. Only ~60 loc and no dependencies.

Please consider following this project's author, Jon Schlinkert, and consider starring the project to show your ❤️ and support.

Install

Install with npm:

$ npm install --save intl-segmenter

Install with pnpm:

$ pnpm install intl-segmenter

Overview

If you do any text processing, parsing, or formatting, especially for the terminal, you know the challenges of handling special characters, emojis, and extended Unicode characters.

The Intl.Segmenter object was introduced to simplify text segmentation and correctly handle these special characters. However, it has notable limitations and potential risks:

  • Predictable "Maximum call stack exceeded" exceptions occur when strings exceed 40-50k characters.
  • Performance degrades geometrically as the number of non-ASCII/extended Unicode characters increases.

For context:

  • Blog posts average 2k-10k chars
  • Novels 450k-500k chars
  • This README ~7k chars

This library wraps Intl.Segmenter to address these issues:

  • Handles strings up to millions of characters in length (tested to ~24 million characters on an M2 MacBook Pro).
  • Improves performance by 50-500x compared to direct Intl.Segmenter usage.
  • Prevents the "Maximum call stack exceeded" exceptions that predictably occur with long strings.

Use this as a drop-in replacement for Intl.Segmenter when accurate text segmentation is needed, particularly for strings with non-ASCII/extended Unicode characters.

Usage

// Use Segmenter instead of Intl.Segmenter
import { Segmenter } from 'intl-segmenter';

const segmenter = new Segmenter('en', { granularity: 'grapheme' });
const segments = [];

// The segmenter.segment method is a generator
for (const segment of segmenter.segment('Your input string here.')) {
  segments.push(segment);
}

// You can also use Array.from, but read the "Heads up" section first
console.log(Array.from(segmenter.segment('Your input string here.')));

Heads up!

I recommend using manual iteration (traditional loops) if there's any chance the string will exceed a few hundred characters.

Note on Array.from's Iterator Handling

When Array.from processes an iterator/generator, it retains the entire iteration state in memory. Unlike a for...of loop, it can't process and discard items one by one. Instead, it:

  • Keeps the full generator state alive
  • Maintains the entire call stack for iteration
  • Holds all intermediate values
  • Builds up the final array

This creates a deeper call stack and increased memory usage compared to manual iteration, where each step can be completed and garbage collected.
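As a rough illustration of the manual-iteration pattern recommended above, the sketch below counts graphemes and collects only the first few segments without ever building a full array. It uses the built-in Intl.Segmenter so it runs without installing anything. The helper names countGraphemes and takeSegments are made up for this example, and the assumption is that the same loops work with Segmenter from this library, since its .segment() result is iterable too.

```javascript
// Built-in segmenter; swap in `new Segmenter(...)` from this library if installed (assumption).
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: counts graphemes one at a time, keeping none of them.
function countGraphemes(input) {
  let count = 0;
  for (const _segment of segmenter.segment(input)) {
    count++;
  }
  return count;
}

// Hypothetical helper: stops iterating as soon as `limit` segments are collected.
function takeSegments(input, limit) {
  const result = [];
  if (limit <= 0) return result;
  for (const { segment } of segmenter.segment(input)) {
    result.push(segment);
    if (result.length >= limit) break;
  }
  return result;
}

console.log(countGraphemes('He\u0301llo')); // 5 ("e" + combining accent is one grapheme)
console.log(takeSegments('Your input string here.', 4)); // [ 'Y', 'o', 'u', 'r' ]
```

Breaking out of the loop early means the rest of the string is never segmented, which is something Array.from cannot do.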

API

Segmenter

Params

Example

const segmenter = new Segmenter('en', { maxChunkLength: 100 });

.segment

Segments the provided string into an iterable sequence of Intl.Segment objects, optimized for performance and memory management.

Params

  • input {String}: The string to be segmented.

Returns

  • {Generator}: Yields Intl.Segment objects.

Example

const segmenter = new Segmenter('en', { localeMatcher: 'lookup' });

for (const segment of segmenter.segment('This is a test')) {
  console.log(segment);
}

.findSafeBreakPoint

Mostly an internal method, but documented here in case you need to use it directly, or override it in a subclass.

This method determines a safe position to break the string into chunks for efficient processing without splitting essential non-ASCII/extended Unicode character groups.

Params

  • input {String}: The string to analyze.

Returns

  • {Number}: Position index to use for safely breaking the string.

Example

const segmenter = new Segmenter();
const breakPoint = segmenter.findSafeBreakPoint('This is a test');
console.log(breakPoint); // e.g., 4
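The idea behind a safe break point can be sketched roughly as follows. This is not the library's actual implementation: findBreak and segmentInChunks are hypothetical names, and the rule used here (only cut right after a space when the next character is printable, non-space ASCII) is a deliberately conservative assumption that holds for grapheme granularity. The sketch uses the built-in Intl.Segmenter on each chunk and shifts each index back to its position in the full input.

```javascript
// Hypothetical helper (not the library's code): returns an absolute index > start
// at which `input` can be cut without splitting a grapheme cluster.
function findBreak(input, start, limit) {
  const target = start + limit;
  if (target >= input.length) return input.length;

  // Safe cut: previous char is a space, next char is printable non-space ASCII.
  const isSafe = i => {
    const next = input.charCodeAt(i);
    return input.charCodeAt(i - 1) === 0x20 && next > 0x20 && next < 0x7f;
  };

  // Prefer the closest safe cut at or before the target...
  for (let i = target; i > start; i--) {
    if (isSafe(i)) return i;
  }
  // ...otherwise accept a longer chunk rather than risk splitting a cluster.
  for (let i = target + 1; i < input.length; i++) {
    if (isSafe(i)) return i;
  }
  return input.length;
}

// Hypothetical generator: segments `input` chunk by chunk, yielding segments
// whose `index` refers to the original string.
function* segmentInChunks(input, limit = 1000) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  let offset = 0;

  while (offset < input.length) {
    const end = findBreak(input, offset, limit);
    const chunk = input.slice(offset, end);
    for (const { segment, index } of segmenter.segment(chunk)) {
      yield { segment, index: index + offset, input };
    }
    offset = end;
  }
}

for (const { segment, index } of segmentInChunks('This is a test', 5)) {
  console.log(index, JSON.stringify(segment));
}
```

Falling back to a longer chunk when no safe cut exists trades some speed for correctness. A real implementation may use different rules to keep chunks bounded.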

.getSegments

Returns all segments of the input string as an array, using the efficient generator from .segment().

Params

  • input {String}: The string to be segmented.

Returns

  • {Array}: An array of Intl.Segment objects.

Example

const segmenter = new Segmenter();
const segments = segmenter.getSegments('This is a test');
console.log(segments);
// Returns:
// [
//   { segment: 'T', index: 0, input: 'This is a test' },
//   { segment: 'h', index: 1, input: 'This is a test' },
//   { segment: 'i', index: 2, input: 'This is a test' },
//   { segment: 's', index: 3, input: 'This is a test' },
//   { segment: ' ', index: 4, input: 'This is a test' },
//   { segment: 'i', index: 5, input: 'This is a test' },
//   { segment: 's', index: 6, input: 'This is a test' },
//   { segment: ' ', index: 7, input: 'This is a test' },
//   { segment: 'a', index: 8, input: 'This is a test' },
//   { segment: ' ', index: 9, input: 'This is a test' },
//   { segment: 't', index: 10, input: 'This is a test' },
//   { segment: 'e', index: 11, input: 'This is a test' },
//   { segment: 's', index: 12, input: 'This is a test' },
//   { segment: 't', index: 13, input: 'This is a test' }
// ]

Segmenter.getSegments

Static method for segmenting a string. Creates a Segmenter instance and returns the segments as an array.

Params

  • input {String}: The string to be segmented.
  • language {String}: A BCP 47 language tag, or an Intl.Locale instance.
  • options {Object}: (optional) Intl.Segmenter options.

Returns

  • {Array}: An array of Intl.Segment objects.

Example

const segments = Segmenter.getSegments('This is a test', 'en');
console.log(segments);

FAQ

In a nutshell, this library prevents maximum call stack exceeded exceptions caused by memory management issues in Intl.Segmenter, and improves performance by 50-500x over using Intl.Segmenter directly.

What does this do?

This library wraps Intl.Segmenter and serves as a drop-in replacement that not only improves performance by 50-500x over Intl.Segmenter directly, but prevents maximum call stack exceeded exceptions that predictably occur with long strings.

Without this library, exceptions reliably occur with strings exceeding 20-50k in length, depending on the number of non-ASCII/extended Unicode characters. These characters significantly affect performance and trigger exceptions sooner.

Simply import the library and use Segmenter instead of Intl.Segmenter.

Why use this?

If you use Intl.Segmenter, your application is at risk of being terminated by maximum call stack exceeded exceptions. To prevent the exception, you need to either cap input strings at a certain length, say 10k characters, or wrap the segment method to iterate over longer strings in chunks.

However, this is not as trivial as it sounds. If you limit the length of the input string, users can in theory still break their input into chunks and loop over those chunks programmatically. But now a chunk boundary can fall in the middle of a non-ASCII/extended Unicode character sequence, which negates the entire point of using Intl.Segmenter in the first place.
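To make the chunking problem concrete, here is a small sketch using only the built-in Intl.Segmenter. The helper names graphemes and naiveChunks are invented for this example. It cuts a string at a fixed size that happens to land inside a family emoji (a sequence of four emoji joined by zero-width joiners) and compares the result with segmenting the whole string at once.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Family emoji written with escapes: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
const text = 'Hi ' + family;

// Hypothetical helper: collects the grapheme strings of a (short) input.
function graphemes(str) {
  const out = [];
  for (const { segment } of segmenter.segment(str)) out.push(segment);
  return out;
}

// Hypothetical helper: fixed-size chunking with no regard for cluster boundaries.
function naiveChunks(str, size) {
  const chunks = [];
  for (let i = 0; i < str.length; i += size) chunks.push(str.slice(i, i + size));
  return chunks;
}

const whole = graphemes(text);
const pieces = naiveChunks(text, 8).flatMap(graphemes);

console.log(whole.length);  // 4: "H", "i", " ", and the family emoji as one grapheme
console.log(pieces.length); // more than 4: the emoji was cut apart at the chunk boundary
```

The concatenated pieces still equal the original string, so the damage is easy to miss: only the segment boundaries are wrong.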

Alternatively, you can use this library, since it solves those problems for you and ensures that Intl.Segmenter handles all characters correctly. This library not only improves performance by 50-500x over Intl.Segmenter directly, but it prevents maximum call stack exceeded exceptions that consistently occur when long strings are passed.

What is Intl.Segmenter?

The newly introduced (2024) built-in Intl.Segmenter object enables locale-sensitive text segmentation, letting you extract meaningful items (graphemes, words, or sentences) from a string.

What causes the exception?

As of Nov. 17, 2024, a maximum call stack exceeded exception occurs when using Intl.Segmenter on strings that exceed 40-60k characters. The avg. blog post is around 2,500-10,000 characters, so you'll only encounter the call stack error when working with longer strings. However, even on shorter strings you might notice performance issues.

Notably, performance in Intl.Segmenter degrades geometrically as the number of non-ASCII/extended Unicode characters in the string increases, and the exception is triggered at shorter string lengths as well.

(On a related note, stack traces from exceptions indicate that the issue is related to the way Node.js interacts with V8 and how memory management is occurring at the application level via Node.js. We're still looking into this.)

Will the exception be fixed?

I'm not sure yet if this is a bug, or a limitation in Intl.Segmenter. But there has been an open issue about this for almost a year, and it doesn't seem to be a priority.

Please create an issue on this library if you have information or updates related to this issue.

Comparison to Intl.Segmenter

In this example, we compare the performance of Segmenter to Intl.Segmenter when processing an emoji-heavy string built by repeating a short phrase 1,200 times.

import { Segmenter } from 'intl-segmenter';

const text = ' 👨‍👩‍👧‍👦 🌍✨He👨‍👩‍👧‍👦llo 👨‍👩‍👧‍👦 world! 🌍✨'.repeat(1200);

// With Intl.Segmenter
const intlSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme', localeMatcher: 'best fit' });
console.time('total time');
Array.from(intlSegmenter.segment(text));
console.timeEnd('total time');
// total time: 3.040s

// With Segmenter
const segmenter = new Segmenter('en', { granularity: 'grapheme', localeMatcher: 'best fit' });
console.time('total time');
Array.from(segmenter.segment(text));
console.timeEnd('total time');
// total time: 18.102ms

Segmenter is ~167x faster than Intl.Segmenter. The performance difference would be even more pronounced with longer strings, but the call stack exceeded exception would prevent you from testing that.
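If you want to run a comparison like this on your own input, a small timing helper along these lines may be handy. It is only a sketch: timeSegmentation is a hypothetical name, it iterates with for...of rather than Array.from (in line with the "Heads up" advice), and it is shown with the built-in Intl.Segmenter. Passing a Segmenter instance from this library instead is assumed to work the same way, since both expose .segment().

```javascript
// Hypothetical helper: times one full pass over the segments of `input`.
// `segmenter` is anything with a .segment(input) method returning an iterable.
function timeSegmentation(segmenter, input) {
  const start = performance.now();
  let count = 0;
  for (const _segment of segmenter.segment(input)) {
    count++;
  }
  return { count, ms: performance.now() - start };
}

// Kept short on purpose so the built-in segmenter handles it comfortably.
const sample = 'Hello world! \u{1F30D}\u2728 '.repeat(100);
const result = timeSegmentation(new Intl.Segmenter('en', { granularity: 'grapheme' }), sample);

console.log(`${result.count} segments in ${result.ms.toFixed(2)}ms`);
```

Timings vary by machine and runtime version, so treat any single run as a rough indication rather than a benchmark.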

About

Contributing

Pull requests and stars are always welcome. For bugs and feature requests, please create an issue.

Running Tests

Running and reviewing unit tests is a great way to get familiarized with a library and its API. You can install dependencies and run tests with the following command:

$ npm install && npm test

Building docs

(This project's readme.md is generated by verb, so please don't edit the readme directly. Any changes to the readme must be made in the .verb.md readme template.)

To generate the readme, run the following command:

$ npm install -g verbose/verb#dev verb-generate-readme && verb

Author

Jon Schlinkert

License

Copyright © 2025, Jon Schlinkert. Released under the MIT License.


This file was generated by verb-generate-readme, v0.8.0, on January 26, 2025.
