Porting librime to WebAssembly

Dependency Libraries

The RIME engine is built on five key libraries: Boost, yaml-cpp, marisa, OpenCC, and glog. Emscripten's toolchain lets us cross-compile them through its emmake and emcmake wrappers. Once a library is compiled, running "make install" places it in the sysroot that Emscripten provides, where later compilations can find it.

Downloading and compiling the dependencies is laborious: each library has its own build procedure and options, and each needs some troubleshooting. It took me two days to sort out all the dependencies when I first ported the RIME engine to WASM. To streamline the installation, we've wrapped the whole process in a single script:

*wasm-builder/dep/build-all*

Running this script compiles and installs all the required libraries. The process includes two minor patches to the Boost and OpenCC libraries, which the script applies automatically.

File System

RIME depends heavily on the file system for its configuration and lexicons. The WebAssembly standard does not include a file system, so we rely on Emscripten's file system support. Of the file system options Emscripten offers, we chose the newer WasmFS. At its core, WasmFS defines a standard interface that different backends can implement: programs in the layer above see one unified file system API, while each backend is free to store data in its own way.

Startup speed is critical to the user experience of an input method, so the file system has to be fast. For best performance we wrote our own backend, Fast IndexedDb Backend, rather than using an existing one. The backend has two parts, one in C++ and one in TypeScript. The C++ code, in wasm_fast_indexeddb.cpp under the tools/custom-backend directory of the librime repository, is mainly a wrapper around the TypeScript part. The TypeScript part, in the fs.ts file of the rime-extension repository, uses IndexedDB to implement the functionality.

Our file system creates two object stores in IndexedDB: "files" and "blobs". The "files" store holds file metadata: basic information such as the file name and modification time, plus a list of the blobs that hold the file's content, recorded as each blob's ID and size. The "blobs" store is a key-value store, with blob IDs as keys and blob data as values.
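To make the layout concrete, here is a minimal sketch of what records in the two stores could look like. The field names and the example path are assumptions for illustration and are not taken from fs.ts; a plain Map stands in for the "blobs" object store.

```typescript
// Hypothetical record shapes; field names are illustrative, not copied from fs.ts.
interface BlobRef {
  id: number;   // key of the entry in the "blobs" store
  size: number; // byte length of that blob
}

interface FileRecord {
  name: string;     // file name (assumed here to be the full path)
  mtime: number;    // modification time in milliseconds since the epoch
  blobs: BlobRef[]; // ordered list of the blobs that make up the content
}

// The file size can be derived from metadata alone, without reading any blob.
function fileSize(record: FileRecord): number {
  return record.blobs.reduce((total, blob) => total + blob.size, 0);
}

// One entry of the "files" store (hypothetical path).
const exampleFile: FileRecord = {
  name: "/data/example.table.bin",
  mtime: 1700000000000,
  blobs: [
    { id: 1, size: 4 },
    { id: 2, size: 2 },
  ],
};

// Stand-in for the "blobs" store: blob ID -> blob data.
const exampleBlobs = new Map<number, Uint8Array>([
  [1, new Uint8Array([10, 11, 12, 13])],
  [2, new Uint8Array([14, 15])],
]);
```

Because the sizes live in the metadata, a call such as fileSize(exampleFile) never has to touch the "blobs" store.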

Operations such as listing a directory or reading file information need only the metadata, which is relatively small. File content is read only when it is actually requested, and then we fetch just the blobs that cover the request, determined from the blob sizes in the blob list, the requested file position, and the length.
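The blob selection can be sketched as a walk over the blob list that accumulates sizes until the requested range is covered. This is an illustrative sketch, not the code in fs.ts; blobsForRead and the BlobSlice shape are hypothetical names.

```typescript
interface BlobRef {
  id: number;
  size: number;
}

// A byte range [start, end) inside one blob.
interface BlobSlice {
  id: number;
  start: number;
  end: number;
}

// Given the blob list from the file metadata, find which blobs (and which
// part of each) must be fetched to serve a read of `length` bytes at `position`.
function blobsForRead(blobs: BlobRef[], position: number, length: number): BlobSlice[] {
  const slices: BlobSlice[] = [];
  if (length <= 0) return slices;
  const readEnd = position + length;
  let blobStart = 0; // file offset at which the current blob begins
  for (const blob of blobs) {
    const blobEnd = blobStart + blob.size;
    if (blobEnd > position && blobStart < readEnd) {
      slices.push({
        id: blob.id,
        start: Math.max(position, blobStart) - blobStart,
        end: Math.min(readEnd, blobEnd) - blobStart,
      });
    }
    if (blobEnd >= readEnd) break; // everything requested is covered
    blobStart = blobEnd;
  }
  return slices;
}
```

For a file made of blobs of 100, 50, and 200 bytes, reading 60 bytes at position 120 touches only the second and third blobs; the first is never fetched.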

Writing file content means working out which regions of the file need updating, because IndexedDB can only replace a stored value as a whole; it cannot modify part of one. Our code either replaces the blobs that cover those regions or appends new blobs to the end of the file's blob list. This design has a drawback: repeated writes to a file can increase the number of blobs and cause fragmentation. However, this issue doesn't arise with the RIME engine.
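One way to picture the write path is as a planning step that splits a write into existing blobs to be replaced and bytes to be appended. The sketch below is an assumption about how such a plan could be computed, not the actual fs.ts logic; planWrite and WritePlan are hypothetical names, and a write that starts past the end of the file is assumed to fold the gap into the appended blob.

```typescript
interface BlobRef {
  id: number;
  size: number;
}

interface WritePlan {
  // Existing blobs that overlap the write, with the range [from, to) inside
  // each blob that receives new bytes. Each is read, patched, and stored whole.
  rewrite: { id: number; from: number; to: number }[];
  // Bytes that land past the current end of the file, stored as one new blob.
  appendSize: number;
}

function planWrite(blobs: BlobRef[], position: number, length: number): WritePlan {
  const plan: WritePlan = { rewrite: [], appendSize: 0 };
  if (length <= 0) return plan;
  const writeEnd = position + length;
  let offset = 0; // file offset at which the current blob begins
  for (const blob of blobs) {
    const blobEnd = offset + blob.size;
    if (blobEnd > position && offset < writeEnd) {
      plan.rewrite.push({
        id: blob.id,
        from: Math.max(position, offset) - offset,
        to: Math.min(writeEnd, blobEnd) - offset,
      });
    }
    offset = blobEnd;
  }
  // `offset` now equals the current file size.
  if (writeEnd > offset) plan.appendSize = writeEnd - offset;
  return plan;
}
```

With blobs of 100 and 50 bytes, a 60-byte write at position 120 rewrites the second blob and appends 30 bytes as a new blob. Each appended blob lengthens the blob list, which is where the fragmentation mentioned above comes from.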

Every IndexedDB call carries latency, roughly 3ms for a single query to return its result, so it pays to limit database round trips. We cache frequently accessed small blobs so that subsequent reads are served from the cache without touching IndexedDB. The cache is cleared when the file that owns the blob is closed. To improve performance further, the file system also skips some correctness checks that would add latency. For example, it does not verify that a file's parent directory exists when the file is created.
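A per-file cache of this kind might look like the sketch below. The class name, the size threshold, and the keying by file path are assumptions for illustration; the sketch also caches every small blob it is given rather than tracking how often each one is accessed.

```typescript
class SmallBlobCache {
  // file path -> (blob ID -> cached bytes)
  private readonly byFile = new Map<string, Map<number, Uint8Array>>();

  // `maxBlobSize` is an assumed threshold for what counts as a "small" blob.
  constructor(private readonly maxBlobSize: number) {}

  // Returns the cached bytes, or undefined if the blob must come from IndexedDB.
  get(path: string, blobId: number): Uint8Array | undefined {
    return this.byFile.get(path)?.get(blobId);
  }

  // Caches a blob if it is small enough; returns whether it was cached.
  put(path: string, blobId: number, data: Uint8Array): boolean {
    if (data.byteLength > this.maxBlobSize) return false;
    let blobs = this.byFile.get(path);
    if (!blobs) {
      blobs = new Map<number, Uint8Array>();
      this.byFile.set(path, blobs);
    }
    blobs.set(blobId, data);
    return true;
  }

  // Drops everything cached for a file; called when the file is closed.
  closeFile(path: string): void {
    this.byFile.delete(path);
  }
}
```

A read would consult get first and fall back to an IndexedDB query only on a miss, so a repeated read of the same small blob costs no database round trip.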

Writing or deleting files can leave unused blobs behind. The file system provides a "collectGarbage" function that walks every file in the file system and deletes the blobs that are no longer needed. The front-end interface currently has no garbage collection button. If support for custom schemas is added in the future, such a button will be necessary.
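The core of such a collection pass is a set difference between the blob IDs that exist and the blob IDs that some file still references. The following sketch shows only that step, on plain arrays; findUnreferencedBlobs is a hypothetical helper, and the real "collectGarbage" function may work differently.

```typescript
interface FileEntry {
  blobs: { id: number }[];
}

// Returns the IDs of stored blobs that no file's blob list refers to.
function findUnreferencedBlobs(files: FileEntry[], storedBlobIds: number[]): number[] {
  const referenced = new Set<number>();
  for (const file of files) {
    for (const blob of file.blobs) {
      referenced.add(blob.id);
    }
  }
  return storedBlobIds.filter((id) => !referenced.has(id));
}
```

The returned IDs are the ones that can be deleted from the "blobs" store.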

Schema Dictionary (mmap)

The RIME engine stores its schema dictionary (table) and lookup table (prism) as binary data structures. When an input schema is used, its word list, in TXT or YAML format, is compiled into a trie and saved as a .bin file, which makes lookups fast. On most operating systems, RIME uses mmap or a similar mechanism to map the file directly into the address space, so the file can be accessed without being loaded into memory in full. This is implemented in mapped_file.cc, in the src/rime/dict directory of the librime repository.

WebAssembly, however, has no virtual memory support and therefore nothing like mmap. We reimplemented MappedFile for our RIME port: opening a file reads its entire content into memory at once, and closing a file that was opened in read-write mode writes the entire content back to the file system. In our tests, reading a 60MB file takes about 500ms or less, which we consider acceptable. The one drawback is higher memory consumption, which is currently unavoidable.

User Dictionary (LevelDB)

The RIME user dictionary is stored in a LevelDB database. Unlike the compiled table and prism files, the user dictionary is small but has to be read and written dynamically at runtime. We first compiled LevelDB for the WASM platform, but gaps in WasmFS, such as missing file locking support, left LevelDB inoperable.

We then dropped the LevelDB dependency and tried backing the user dictionary code directly with the browser's IndexedDB (interestingly, Chrome uses LevelDB as the underlying implementation of IndexedDB). This worked, but performance was poor. When the user types 'r', for instance, RIME issues many queries in sequence, looking up the user dictionary entries for each pronunciation that starts with 'r': 'ri', 'rang', 'rui', and so on. At about 3ms per query, typing felt choppy and sluggish.

In the end we settled on a hybrid solution. The data still lives in IndexedDB, but on each startup it is loaded in full into a C++ std::map (an ordered structure, similar to LevelDB's). Reads go straight to the std::map and are fast. Updates are written to both IndexedDB and the std::map, which keeps the two consistent.
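The hybrid design can be illustrated with a small in-memory store that serves all reads itself and forwards every change to a persistence hook. The real in-memory structure is a C++ std::map; this TypeScript sketch only mirrors the idea with a sorted key array, and HybridUserDb, PersistFn, and the method names are hypothetical.

```typescript
// Hypothetical hook that persists one change; in the extension this role
// would be played by a write to IndexedDB. A null value means "deleted".
type PersistFn = (key: string, value: string | null) => void;

class HybridUserDb {
  private keys: string[] = []; // kept sorted, mirroring std::map's key order
  private readonly values = new Map<string, string>();

  // `initial` stands for the records loaded from IndexedDB at startup.
  constructor(initial: [string, string][], private readonly persist: PersistFn) {
    for (const [key, value] of initial) this.values.set(key, value);
    this.keys = Array.from(this.values.keys()).sort();
  }

  // Index of the first key that is >= target (binary search).
  private lowerBound(target: string): number {
    let lo = 0;
    let hi = this.keys.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.keys[mid] < target) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  get(key: string): string | undefined {
    return this.values.get(key);
  }

  // All entries whose key starts with `prefix`, in key order, served from memory.
  scan(prefix: string): [string, string][] {
    const out: [string, string][] = [];
    for (let i = this.lowerBound(prefix); i < this.keys.length; i++) {
      const key = this.keys[i];
      if (!key.startsWith(prefix)) break;
      out.push([key, this.values.get(key) as string]);
    }
    return out;
  }

  // Writes go to memory and to the persistence hook.
  put(key: string, value: string): void {
    if (!this.values.has(key)) this.keys.splice(this.lowerBound(key), 0, key);
    this.values.set(key, value);
    this.persist(key, value);
  }

  erase(key: string): void {
    if (!this.values.delete(key)) return;
    this.keys.splice(this.lowerBound(key), 1);
    this.persist(key, null);
  }
}
```

With this shape, a burst of prefix lookups after a keystroke costs only in-memory searches, and the per-query database latency is paid on writes alone.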

JavaScript Interface (embind)

The RIME library exposes its interface to JavaScript through Embind. Embind wraps C++ functions as JavaScript functions and C++ classes as JavaScript objects. It converts parameter types automatically, for example between C++ std::string and JavaScript strings, and manages the lifecycle of class instances. All the interfaces the project needs are defined in tools/rime_emscripten.cpp; the EMSCRIPTEN_BINDINGS block at the bottom of that file lists the C++ components exposed to JavaScript. In the extension, background/engine.ts wraps these interfaces again, giving them strongly typed TypeScript signatures and protecting them with an asynchronous lock.
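Asynchronous locking of this kind is commonly built as a promise chain, where each call waits for the previous one to settle before it starts. The sketch below shows that general pattern; it is an assumption about the approach, not the code in engine.ts, and AsyncLock is a hypothetical name.

```typescript
class AsyncLock {
  // Settles when the most recently queued task has finished.
  private tail: Promise<void> = Promise.resolve();

  // Runs `task` after every previously queued task has settled.
  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Swallow the outcome here so that a failed task does not block later ones;
    // the caller still sees the failure through `result`.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}
```

A wrapper method would then route each engine call through lock.run, so that two calls into the engine never interleave even when their callers do not wait for each other.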
