Skip to content

Optimization: Reindexing

Hieu Pham edited this page Dec 17, 2024 · 18 revisions

This doc describes an optimization MuopDB uses in various places: reindexing of ids inside an index.

Background

Currently, each MuopDB index (be it HNSW or IVF) is immutable. We plan to add mutable index types in the future. Having immutable index makes it easier to apply certain optimizations in the index building phase.

Right now, the index usually comprises of 2 files

  • vector_storage
    • This file contains the embeddings of all documents inside the index. It's just a flat file.
  • index
    • This file is the on-disk representation of the index itself, be it HNSW graph or IVF.

Also, in this document, embedding and vector will be used interchangeably. Currently we only support dense vector, so it's just a vector of float32.

Intuition

The usecase for our vector index is to find k nearest neighbors. That means, as we traverse the index, we will need to access the vector storage in order to compute the distance between the query vector and each embedding. Consequently, if we reorder the embeddings in the vector storage so that it's much more likely to request the next embedding in the same page, we will hit the page cache and avoid going to disk.

HSNW

Read more about HNSW here

During query, starting from the top level, it will traverse the graph (in an BFS fashion) to find the nearest nodes, then moving on to the next level, starting from the nodes we found in the previous level.

Naturally, to optimize for locality, for each level, we will do the BFS in that level, and if we have not seen that node, we will assign that node the current value of counter, and then increment counter. Why? Because we will likely assign all the siblings of the same parent node to have consecutive ids. Since during HNSW traversal, we will need to fetch embeddings for all the siblings to get the closest nodes to the query vector, this naturally helps with the locality when fetching embeddings from vector storage.

For concrete example, take a look at the graph below. Note that this is just a portion of this graph. In this example, we try to remap the ids so that connected nodes are closer in terms of ids (so it's likely to be in the same page in the vector_storage)

Screenshot 2024-11-30 at 3 55 28 PM

IVF

Read more about IVF here

IVF consists of multiple posting lists. Each posting list is a list of documents who are in the same clusters (aka list of "similar" documents). Note that a document can exist in multiple posting lists, and in each posting list, documents are sorted in increasing order of their ids (this is useful when we want to intersect 2 posting lists for example). Each posting list will be represented by its centroid.

During query, we will find a few closest centroids to the query vector, and scan the posting lists of those centroids, and for each document inside those posting lists, we compute the distance between that document and the query vector, and we pick the k nearest ones.

For example: we have 2 centroids: c1, c2. Each centroid will have 5 documents in its posting list, and we have document 9 being in both centroids' posting lists.

c1 -> [1, 3, 5, 7, 9]
c2 -> [2, 4, 6, 8, 9]

That means, if a page can only contain 2 vectors, then scanning c1 will require 5 pages read. However, if we re-index the document ids by mapping:

1 -> 0
3 -> 1
5 -> 2
7 -> 3
9 -> 8
2 -> 4
4 -> 5
6 -> 6
8 -> 7

then the posting lists will look like

c1 -> [0,1,2,3,8]
c2 -> [4,5,6,7,8]

Then if we scan c1, we will just need 3 pages instead of 5 pages.

Note that, we require posting lists to be sorted before or after remapping, and even after remapping, scanning the posting list and translating them to original doc_id, we still need them to be sorted.

The way we assign ids is:

counter = 0
For each posting list p_i:
  For each d_j in posting list p_i:
    If d_j not in multiple posting list and not assigned: assign new id to be counter. Increment counter.
    Else if d_j in multiple posting list and not assigned:
      Assume there are x documents before d_j, assign d_j to be x
      Go and assign ids to all unassigned document before d_j in the posting lists that d_j is in. 
      Make sure finish assigning a posting list before moving on to the next posting list 
      to optimize for locality.