This document was written with the goal of explaining the inner workings of Sonic, as well as the whys of the design choices that were made while building Sonic.
Anyone reading this document should quickly become familiar with how such a search index can be built, to the point that they could start building their own Sonic from scratch.
If you feel something is missing from this document, or if it did not help you understand a concept Sonic implements, please open an issue and explain precisely which part you did not get and why you think you did not get it.
A search index is nothing more than a specialized database. It should expose primitives such as: query the index, push text into the index, pop text from the index, flush parts of the index.
The search index server is responsible for organizing the index data in a way that makes writes and reads efficient. It makes use of specialized data structures for some very specific operations, such as typo corrections. The overall goals of such a search index system are: speed, lightness and data compactness (ie. it should minimize the resulting output database size given a text input size).
In order to provide flexibility in organizing indexed data, the search index is organized into collections that contain buckets. Buckets contain indexed objects. This means that you can organize your search index within a depth of 2 layers. Objects are actual search results; you could push an object `result_1` to collection `messages` within bucket `user_1`. This would index `messages` for `user_1` with result `result_1`. Later on, one could search for `messages` matching a given query for `user_1`. If the Sonic user's use case does not require buckets, the bucket value can still be set to a generic value, eg. `default`.
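As an illustration, pushing such an object and later querying it could look like this over the Sonic Channel protocol (the message text and the `isgsHQYu` response marker are made up for this example; the exact handshake and response format are described in the protocol section further below):

```
PUSH messages user_1 result_1 "Hey, are you free for lunch tomorrow?"
OK
QUERY messages user_1 "free for lunch"
PENDING isgsHQYu
EVENT QUERY isgsHQYu result_1
```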
Sonic, unlike many other search index systems, does not serve actual documents as search results. A strategic choice was made to store only identifiers referring to primary keys in an external database, which makes the data stored on-disk as compact as it can be. Users can still refer to their external database to fetch actual search result documents, using identifiers provided by Sonic.
It is worth noting that any project initiated as of 2019 should make use of modern server hardware, which is mostly all about multi-core CPUs and SSDs. Also, Sonic should be careful to minimize its resource requirements — from a cold start to running under high load — as a lot of developers nowadays expect to run software on cheap VPS servers with limited CPU time, small disk space and little RAM. Those VPS are nonetheless powered by modern SSDs with fast random I/O. Last but not least, it would definitely be a plus if we could make software a bit greener.
In order to address the above, Sonic is capable of running queries over multiple CPUs in parallel. It leverages the fast random I/O of SSDs by using RocksDB as its main key-value store. It also avoids eating all available RAM by storing most data on-disk (via memory mapping), which is not an issue anymore as of 2019, as SSDs have low I/O latency and can sustain a virtually unlimited number of reads over their lifetimes. Though, as writes are the Achilles' heel of SSDs, Sonic aims at minimizing writes: it buffers a lot of those writes in RAM and commits them to disk at periodic intervals. This should maximize the lifespan of the SSD under heavy index write load. Unfortunately, the side-effect of doing this is that in case of server power loss, non-committed writes will vanish.
Sonic stores result objects in a key-value database (abbreviated KV), powered by RocksDB.
When a text is pushed to Sonic, this text gets normalized, cleaned up and split into separate words. Each word is then associated with the pushed object result, and committed to the KV database as `word <-> object` pairs.
Upon cleaning the text, overhead is avoided. For instance, in the text `the lazy dog`, there would be no point in indexing the word `the`, which is what is called a stopword. Sonic does not push stopwords to the index (read more on stopwords).
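To make the push flow more concrete, here is a minimal sketch of it in Rust, using a plain in-memory map in place of the RocksDB-backed `word <-> object` mapping, and a tiny hard-coded stopword list (both are simplifications for the example, not Sonic's actual code):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stopword list; Sonic ships full per-language lists.
const STOPWORDS: &[&str] = &["the", "a", "an", "is"];

// Simplified in-memory stand-in for the `word <-> object` mapping
// that Sonic actually persists in RocksDB.
fn push(index: &mut HashMap<String, HashSet<String>>, object: &str, text: &str) {
    for word in text
        .to_lowercase()
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()))
        .filter(|w| !w.is_empty() && !STOPWORDS.contains(w))
    {
        index.entry(word.to_string()).or_default().insert(object.to_string());
    }
}

fn main() {
    let mut index = HashMap::new();
    push(&mut index, "result_1", "The lazy dog");

    // "the" was dropped as a stopword; "lazy" and "dog" now map to "result_1".
    assert!(index.get("lazy").unwrap().contains("result_1"));
    assert!(!index.contains_key("the"));
}
```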
When objects are pushed to the search index for a given bucket in a given collection, for instance object `session_77f2e05e-5a81-49f0-89e3-177e9e1d1f32`, Sonic converts this object to a compact 32-bit format, for instance `10292198`.
. We call the user-provided object identifier the OID, while the compact internal identifier is named the IID. The IID is mapped internally to indexed words, and is much more compact in terms of storage than the OID. You can think of OIDs and IIDs as basically the same thing, except that the IID is the compact version of an OID. OIDs are only used for user-facing input and output objects, while IIDs are only used for internal storage of those objects. On very long indexed texts, this helps save a lot of disk space.
The KV store has a simple schema, where a binary key is associated with binary data. The following types of keys exist:
- Meta-To-Value: state data for the bucket, eg. stores the count increment of indexed objects (data is in arbitrary format) (code: StoreKeyerIdx::MetaToValue);
- Term-To-IIDs: maps a word (ie. term) to internal identifiers (ie. IIDs), which is essentially a word-to-results mapping (data is an array of 32-bit numbers encoded to binary as little-endian) (code: StoreKeyerIdx::TermToIIDs);
- OID-To-IID: maps an object identifier (ie. OID) to an internal identifier (ie. IID), which converts a user-provided object to a compact internal object (data is a 32-bit number encoded to binary as little-endian) (code: StoreKeyerIdx::OIDToIID);
- IID-To-OID: the reverse mapping of OID-To-IID, which lets Sonic convert an IID back to an OID (data is a variable-length UTF-8 string encoded in binary) (code: StoreKeyerIdx::IIDToOID);
- IID-To-Terms: lists all words (ie. terms) associated with an internal identifier (ie. IID) (data is an array of 32-bit numbers encoded to binary as little-endian) (code: StoreKeyerIdx::IIDToTerms).
A key is formatted as such, in binary: `[idx<1B> | bucket<4B> | route<4B>]` (code: StoreKeyerBuilder::build_key), which makes it 9 bytes long. The index stands for the type of key, eg. Term-To-IIDs. The bucket and what we call the route are hashed as 32-bit numbers, and appended to the key in little-endian binary format.
Both IIDs and terms are stored as 32-bit numbers in binary format. 64-bit numbers could have been used instead, increasing the total number of objects that can be indexed per-bucket. Though, storing 64-bit numbers instead of 32-bit numbers would double the required storage space. As they make up most of the stored space, it was important to keep them as small as possible. Those 32-bit numbers are generated using a fast and low-collision hash family called XxHash: from the OID in the case of the IID, and from the word in the case of the term hash (code: StoreTermHash).
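Below is a sketch of how such a 9-byte key could be assembled, using the `twox-hash` crate as an illustrative XxHash implementation; the helper names and the `idx` value chosen for the key type are assumptions for the example, while the real logic lives in StoreKeyerBuilder::build_key and StoreTermHash:

```rust
use std::hash::Hasher;
use twox_hash::XxHash32;

// Illustrative XxHash32 helper producing the 32-bit hashes used for
// buckets, routes, IIDs and term hashes.
fn hash32(input: &str) -> u32 {
    let mut hasher = XxHash32::with_seed(0);
    hasher.write(input.as_bytes());
    hasher.finish() as u32
}

// Build a 9-byte key: [idx<1B> | bucket<4B> | route<4B>],
// with the bucket and route hashes appended as little-endian.
fn build_key(idx: u8, bucket: &str, route: &str) -> [u8; 9] {
    let mut key = [0u8; 9];
    key[0] = idx;
    key[1..5].copy_from_slice(&hash32(bucket).to_le_bytes());
    key[5..9].copy_from_slice(&hash32(route).to_le_bytes());
    key
}

fn main() {
    // Eg. a Term-To-IIDs key (assuming idx 1 stands for that key type)
    // for the term "english" in bucket "user_1".
    let key = build_key(1, "user_1", "english");
    assert_eq!(key.len(), 9);
}
```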
When most users input text to a computer system using an actual keyboard, they make typos and mistakes. A nice property of a good search system should be that those typos can be forgiven and accurate search results still come up for the bogus user query. Sonic implements a data structure that lets it correct typos or autocomplete incomplete words.
For instance, if our index has the word `english` but the user, for some reason, inputs `englich`, Sonic would still return results for `english`. Similarly, if the user inputs an incomplete word, eg. `eng`, Sonic would expand this word to `english`, if there were no or not enough exact matches for `eng`.
The store system responsible for such a feat is the FST (Finite-State Transducer). It can be grossly compared to a graph of characters, where edges are labeled with characters and paths through the graph spell out words.
Sonic stores a single FST file per bucket. This FST file is memory-mapped, and read directly from the disk when Sonic needs to read it. The fst crate is used to implement the FST data structure.
One downside of the FST implementation that Sonic uses is that, once built, an FST is immutable. This means that in order to add a new word to the search index (for a given bucket), Sonic needs to re-build the entire FST (ie. iterate word-by-word over the existing FST and stream those words, plus the added word, to a new on-disk FST file). In order to do that in an efficient manner, Sonic implements an FST consolidation tasker, which stores FST changes in-memory and consolidates them to disk at periodic intervals (this interval can be configured) (code: StoreFSTPool::consolidate).
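As a simplified, in-memory illustration of both lookups described above (prefix completion and typo correction), here is what they could look like using the fst crate directly; the word list is made up, and the Levenshtein automaton requires the crate's levenshtein feature:

```rust
use fst::{automaton::{Levenshtein, Str}, Automaton, IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build an FST set from a (lexicographically sorted) list of indexed words.
    let set = Set::from_iter(vec!["arabic", "english", "french"])?;

    // Autocompletion: every word starting with "eng".
    let completions = set
        .search(Str::new("eng").starts_with())
        .into_stream()
        .into_strs()?;
    assert_eq!(completions, vec!["english"]);

    // Typo correction: every word within 1 edit of "englich".
    let corrections = set
        .search(Levenshtein::new("englich", 1)?)
        .into_stream()
        .into_strs()?;
    assert_eq!(corrections, vec!["english"]);

    Ok(())
}
```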
Any text that gets pushed to Sonic needs to be normalized (eg. lower-cased) and cleaned up (eg. remove stopwords) before it can be added to the index. This task is handled by the lexer system, also called tokenizer.
Sonic's tokenizer is built around an iterator pattern (code: Iterator->TokenLexer), and yields lexed words one-by-one. Iteration can be stopped before the end of the text is reached, for instance if we did not get enough search results for the first words of the query. This ensures no extraneous lexing work is done.
Given that stopwords depend on the text language, Sonic first needs to detect the language of the text that is being cleaned up. This is done using a hybrid method: either counting the number of stopwords that appear in the text for long-enough texts (which is fast) (code: TokenLexerBuilder::detect_lang_fast), or performing an n-gram pass on the text for smaller texts (which is an order of magnitude slower) (code: TokenLexerBuilder::detect_lang_slow).
As the n-gram method is better at guessing the language of small texts than the stopwords method is, we prefer it there, although it is much slower in comparison. For long-enough texts, the stopwords method becomes reliable enough, so we can use it. In either case, if the result of the first chosen guessing method is judged as non-reliable, Sonic falls back on the other method (code: !detector.is_reliable()).
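Here is a rough sketch of that hybrid decision in Rust, using the whatlang crate as an example n-gram detector; the length threshold, the single-language stopword list and the hit count below are placeholders for the example rather than Sonic's actual heuristics:

```rust
use whatlang::{detect, Lang};

// Hypothetical English stopword list, used by the fast path.
const STOPWORDS_ENG: &[&str] = &["the", "a", "an", "is", "are", "of"];

// Simplified hybrid language detection: count stopwords on long-enough
// texts, fall back to an n-gram detector otherwise, or when the n-gram
// guess is not judged reliable.
fn detect_lang(text: &str) -> Option<Lang> {
    if text.len() >= 40 {
        let hits = text
            .to_lowercase()
            .split_whitespace()
            .filter(|word| STOPWORDS_ENG.contains(word))
            .count();

        if hits >= 2 {
            return Some(Lang::Eng);
        }
    }

    // Slow path: n-gram detection, only trusted if judged reliable.
    detect(text).filter(|info| info.is_reliable()).map(|info| info.lang())
}

fn main() {
    assert_eq!(
        detect_lang("The lazy dog is sleeping in the sun, as most of the dogs are."),
        Some(Lang::Eng)
    );
}
```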
By the way, Sonic builds up its own list of stopwords for all supported languages, which can be found here (languages are referred to via their ISO 639-3 codes). People are welcome to improve those lists of stopwords by submitting a Pull Request.
Looking at the source code of Sonic, you will find a module named `tasker` (see here). This module performs background tasks, and is triggered periodically.
The tasker performs the following actions:
- Janitor: it closes cached collection and bucket stores that were not used recently, freeing up memory (code: Tasker::tick);
- Consolidate: it writes in-memory FST changes to the on-disk FST data structure (code: Tasker::tick).
As in all databases, a lot of locking is involved while the tasker is performing heavy-duty work on a KV or FST store. Thus, when the tasker system kicks in, stores may experience higher than expected latency for all consumers attempting to read or write to them. The tasker system has been optimized to minimize thread contention caused by locks, so the impact of those locks on Sonic consumers should be minimal.
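For illustration purposes, a periodic tasker loop can be as simple as the following sketch; the interval value and task body are made up, and the real scheduling and locking happen in Tasker::tick:

```rust
use std::{thread, time::Duration};

// Illustrative tick interval; Sonic's intervals are configurable.
const TICK_INTERVAL: Duration = Duration::from_secs(10);

fn tick() {
    // In Sonic, this is where the janitor and the FST consolidation run,
    // taking the relevant store locks while they work.
    println!("tasker: janitor + consolidate");
}

fn main() {
    // Background tasker thread, triggered periodically.
    let tasker = thread::spawn(|| loop {
        thread::sleep(TICK_INTERVAL);
        tick();
    });

    // The main server threads would keep serving Sonic Channel
    // connections here; this sketch just waits on the tasker.
    tasker.join().unwrap();
}
```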
In order for a client to communicate with the search index system, one needs a protocol. Sonic uses the Sonic Channel protocol, which defines a way for clients to send commands (ie. requests) to a Sonic server over the network (via a raw TCP socket), and get responses from the Sonic server. For instance, a client may send a search query command such as `QUERY collection bucket "search query"` and get a response with search results such as `EVENT QUERY isgsHQYu result_1 result_2`.
On that Sonic Channel protocol, technical choices that may seem to go against common sense were made:
- Sonic does not expose any HTTP API interface, as it adds a network and processing overhead cost we do not want to bear;
- Sonic only exposes a raw TCP socket with which clients interact via the Sonic Channel protocol, which was designed to be simple, lightweight and extensible;
- Most Sonic Channel commands are synchronous, for simplicity's sake (Redis does the same). You can still run multiple Sonic Channel connections in parallel, and enjoy increased parallelism, but on a given Sonic Channel connection, you must wait for the previous command to return before issuing the next one;
- Some Sonic Channel commands are asynchronous, when a lot of commands may be issued in a short period of time, in a burst pattern. This is typical of read operations such as search queries, which should be submitted as jobs to a dedicated thread pool that can be upsized and downsized at will. To handle this, a special eventing protocol format should be used.
The Sonic Channel protocol is specified in a separate document, which you can read here.
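As an illustration of this synchronous-command-plus-asynchronous-event pattern, here is what a burst of two search queries could look like on a single Sonic Channel connection (the `PENDING` markers are made up; each `EVENT` carries the marker of the query it answers):

```
QUERY messages user_1 "lunch"
PENDING Bt2m2gYa
QUERY messages user_1 "rocket launch"
PENDING y57kaB2d
EVENT QUERY Bt2m2gYa result_1 result_23
EVENT QUERY y57kaB2d result_7
```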
As always, examples are the way to go to explain any complex system. This section traces the journey of a search query in Sonic, from receiving the search query command over Sonic Channel, to serving results to the Sonic Channel consumer.
Given a collection `messages` and a bucket `acme_corp` (ie. indexed messages for Acme Corp), John Doe wants to find messages that match the query text `"The robber has stolen our corporate car"`.

First off, John Doe would connect to Sonic over a Sonic Channel client, for instance node-sonic-channel. Using this client, he would issue the following query: `QUERY messages acme_corp "The robber has stolen our corporate car"` to find conversations that contain messages about a recent robbery at Acme Corp.
After receiving the raw command above, the Sonic server would, in order:
- Read the raw command from the Sonic Channel TCP stream buffer (code: Self::on_message);
- Route the unpacked command message to the proper command handler, which would be `ChannelCommandSearch::dispatch_query` (code: ChannelMessage::on);
- Commit the search query for processing (code: ChannelCommandBase::commit_pending_operation);
- Dispatch the search query to its executor (code: StoreOperationDispatch::dispatch);
- Run the search executor (code: ExecutorSearch::search);
- Open both the KV and FST stores for the collection `messages` and bucket `acme_corp` (code: StoreKVPool::acquire + StoreFSTPool::acquire);
- Perform search query text lexing, and search word-by-word, which would yield in order: `robber`, `stolen`, `corporate`, `car` (code: lexer.next);
- If not enough search results are found, try to suggest other words, eg. typo corrections (code: fst_action.suggest_words);
- Perform paging on found IIDs from the KV store to limit results (code: found_iids.iter);
- Return found OIDs from the executor (code: result_oids);
- Write back the final results to the TCP stream (code: response_args_groups).
This is it! John Doe would receive the following response from Sonic Channel: `EVENT QUERY isgsHQYu conversation_3459 conversation_29398`, which indicates that there are 2 conversations containing messages that match the search text `"The robber has stolen our corporate car"`.
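For completeness, the wire-level exchange John Doe's client performs could look like this (the `PENDING` marker is generated by the server; the one shown here is simply the marker reused in the `EVENT` line above):

```
QUERY messages acme_corp "The robber has stolen our corporate car"
PENDING isgsHQYu
EVENT QUERY isgsHQYu conversation_3459 conversation_29398
```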