
Optimization: Zero‐copy insert‐batch

Hieu Pham edited this page Jan 4, 2025 · 5 revisions

Background

MuopDB has an insert API that looks like this:

service IndexServer {
  rpc Insert(InsertRequest) returns (InsertResponse) {}
}

message InsertRequest {
  string collection_name = 1;

  repeated uint64 ids = 2;

  // flattened vector
  repeated float vectors = 3;
}

The vectors field is a flattened list of floats. For example, if your collection has a dimension of 2 and vectors contains 10 floats, you are inserting 5 vectors, so the ids list must also contain 5 ids.
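To make the layout concrete, here is a minimal sketch of how a flattened buffer maps back to (id, vector) pairs. The helper name split_vectors and its error type are assumptions made for illustration; they are not part of MuopDB.

```rust
/// Hypothetical helper (not MuopDB code): pairs each id with its slice of the
/// flattened buffer. Vector `i` occupies `flat[i * dimension..(i + 1) * dimension]`.
fn split_vectors<'a>(
    ids: &[u64],
    flat: &'a [f32],
    dimension: usize,
) -> Result<Vec<(u64, &'a [f32])>, String> {
    // The buffer must hold a whole number of vectors.
    if dimension == 0 || flat.len() % dimension != 0 {
        return Err(format!(
            "{} floats is not a multiple of dimension {}",
            flat.len(),
            dimension
        ));
    }
    // There must be exactly one id per vector.
    let num_vectors = flat.len() / dimension;
    if num_vectors != ids.len() {
        return Err(format!("{} vectors but {} ids", num_vectors, ids.len()));
    }
    // chunks_exact yields borrowed sub-slices, so no floats are copied here.
    Ok(ids.iter().copied().zip(flat.chunks_exact(dimension)).collect())
}
```

With dimension 2, 10 floats, and 5 ids, this returns 5 pairs; passing 4 ids for the same buffer returns an error.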

Problem

When inserting 100,000 vectors, the RPC takes 1s from the moment we send the request until we receive the acknowledgement from the server. That's pretty long.

Looking at the profile, we can see this:

Screenshot_20250104_062559

As you can see, protobuf deserializes the f32 field (in this case, the vectors field) with get_f32_le (read: get float32 little-endian).

Optimization

It turns out that the code doesn't need an actual Vec<f32>. All it needs is a contiguous buffer of f32 numbers (which Vec<f32> provides, but we don't necessarily need a Vec<f32>). This is what the insert function in Collection looks like:

pub fn insert(&self, doc_id: u64, data: &[f32]) -> Result<()> {
    self.mutable_segment.write().unwrap().insert(doc_id, data)
}

As you can see, it just needs a contiguous buffer of f32. That means we can skip the protobuf decoding if we pack the f32's on the client side into a byte buffer and pass that buffer along. So I introduced a new API:

message InsertBinaryRequest {
  string collection_name = 1;
  bytes ids = 2;
  bytes vectors = 3;  // <-- this is the manually packed f32 buffer
}
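The client-side packing for this request can be sketched as follows. This is a minimal illustration, not MuopDB's client code: the function names pack_f32_le and pack_u64_le are hypothetical, and little-endian byte order is an assumption (the page does not state the byte order of the packed fields).

```rust
/// Hypothetical helper: packs f32 values into a little-endian byte buffer
/// suitable for a protobuf `bytes` field.
fn pack_f32_le(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        // Each f32 contributes exactly 4 bytes.
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Hypothetical helper: packs u64 ids the same way, 8 bytes per id.
fn pack_u64_le(ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(ids.len() * 8);
    for id in ids {
        out.extend_from_slice(&id.to_le_bytes());
    }
    out
}
```

The resulting buffers would go into the vectors and ids fields of InsertBinaryRequest. Since protobuf treats a bytes field as opaque, the server receives them without any per-element decoding.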

The two code paths differ as follows:

Old:

* Client
  * Protobuf packs the Vec<f32> into a byte buffer
  * Send the RPC
* Server
  * Protobuf unpacks the byte buffer into a Vec<f32> in a loop, float by float <--- This is VERY expensive!
  * Invokes collection.insert

New:

* Client
  * We pack the Vec<f32> into a byte buffer ourselves
  * Send the RPC
* Server
  * Protobuf does NOT unpack the byte buffer in a loop; it just copies that buffer into a byte array
  * Invokes collection.insert

On the server, the code just casts the sequence of u8 (bytes) into a sequence of f32:

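// A protobuf `bytes` field arrives as raw bytes (Vec<u8> by default with prost), not a UTF-8 String.
let vectors_buffer: Vec<u8> = req.vectors;
let vectors: &[f32] = transmute_u8_to_slice(&vectors_buffer);  // <-- This is just a cast!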

This avoids deserialization in a loop.
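For illustration, a cast of this kind can be sketched as below. This is not MuopDB's transmute_u8_to_slice, whose implementation is not shown on this page; the name bytes_as_f32 is hypothetical. The sketch assumes the client packed the floats in little-endian order, and it falls back to a copying decode when the buffer cannot be reinterpreted in place (misaligned start, or a big-endian host).

```rust
use std::borrow::Cow;

/// Hypothetical helper: views a little-endian byte buffer as f32 values.
/// Returns None if the length is not a multiple of 4.
fn bytes_as_f32(bytes: &[u8]) -> Option<Cow<'_, [f32]>> {
    const WIDTH: usize = std::mem::size_of::<f32>();
    if bytes.len() % WIDTH != 0 {
        return None;
    }
    // An in-place reinterpretation is only correct on a little-endian host.
    if cfg!(target_endian = "little") {
        // SAFETY: every bit pattern is a valid f32, and align_to only puts
        // correctly aligned elements into the middle slice.
        let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
        if prefix.is_empty() && suffix.is_empty() {
            // Fast path: no per-float work, just a reinterpreted borrow.
            return Some(Cow::Borrowed(floats));
        }
    }
    // Slow path: decode float by float into a new Vec.
    let decoded: Vec<f32> = bytes
        .chunks_exact(WIDTH)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Some(Cow::Owned(decoded))
}
```

Returning a Cow keeps the common case free of copies while staying correct when the allocation happens to be misaligned, since a Vec<u8> only guarantees 1-byte alignment.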

Result

Before

[2025-01-01T18:10:55.167Z INFO  insert] Inserted all documents in 16.947639482s

After

[2025-01-01T18:46:04.902Z INFO  insert] Inserted all documents in 8.948953746s

We cut the runtime roughly in half!
