Add tantivy index #150

Closed
wants to merge 10 commits into from

Dalvany
Contributor

@Dalvany Dalvany commented May 14, 2023

This pull request aims to address issue #19 by adding a Tantivy index.
There is still room for improvement, but it could serve as a first step.

Configuration

Add a [search] section with a directory field that gives the path where Tantivy should store its files.

[search]
directory = "/tmp/alexandrie/tantivy"

Search

Because it uses Tantivy's QueryParser, the full Tantivy query language is available.

By default, search uses all fields except name.prefix, and the suggester searches among name, name.full and name.prefix (see below).
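As an illustration of the default field selection described above, here is a minimal sketch using only the standard library. The field names come from this description; the `Mode` enum and `default_fields` function are hypothetical and are not taken from the actual code in this pull request.

```rust
/// Field names as listed in the "Indices" section of this description.
const ALL_FIELDS: [&str; 7] = [
    "name",
    "name.full",
    "name.prefix",
    "categories",
    "keywords",
    "description",
    "readme",
];

/// Hypothetical: the kind of request being served.
#[derive(Clone, Copy)]
enum Mode {
    Search,
    Suggest,
}

/// Hypothetical helper: fields a query is matched against
/// when the user does not name a field explicitly.
fn default_fields(mode: Mode) -> Vec<&'static str> {
    match mode {
        // Search: every field except the prefix field.
        Mode::Search => ALL_FIELDS
            .iter()
            .copied()
            .filter(|f| *f != "name.prefix")
            .collect(),
        // Suggest: only the three variants of the crate name.
        Mode::Suggest => ALL_FIELDS
            .iter()
            .copied()
            .filter(|f| f.starts_with("name"))
            .collect(),
    }
}
```

A query that names a field explicitly would bypass these defaults.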

Implementation

fts module

This module contains all full text search related structures.

The Tantivy structure handles all the boilerplate needed to set up an index, search and suggest. It also delegates the methods that index and commit documents.

TantivyDocument is a structure that represents a crate and can be converted into a Tantivy Document.

Indices

A crate's name is indexed multiple times to improve the relevance of both the suggester and search results.

  • name: a simple tokenized version of the crate's name:
    • tokenize on non-alphanumeric characters using SimpleTokenizer
    • apply English stop words
    • apply lower-casing to make search case-insensitive
  • name.full: not tokenized, only lower-cased. Its main purpose is to increase relevance when the searched text exactly matches a crate name
  • name.prefix: indexes word prefixes to support the suggester.
    • tokenize on non-alphanumeric characters
    • lower-case
    • apply a custom filter, edge ngram, to index word prefixes.
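The three name pipelines above can be sketched with the standard library alone. This is not the Tantivy code from this pull request: the function names, the tiny stop-word list and the `min`/`max` gram sizes are assumptions made for illustration, and lower-casing is applied before stop-word removal so the stop-word match is case-insensitive.

```rust
/// Tiny illustrative stop-word list (assumption: the real English list is longer).
const STOP_WORDS: [&str; 6] = ["a", "an", "and", "of", "the", "to"];

/// Split on every non-alphanumeric character and drop empty pieces.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_string())
        .collect()
}

/// Prefixes of `token` whose length in characters is between `min` and `max`.
/// Assumes `min >= 1`.
fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let upper = max.min(chars.len());
    (min..=upper).map(|n| chars[..n].iter().collect()).collect()
}

/// name: tokenize, lower-case, remove stop words.
fn name_tokens(text: &str) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}

/// name.full: a single lower-cased token, no tokenization.
fn name_full_token(text: &str) -> String {
    text.to_lowercase()
}

/// name.prefix: tokenize, lower-case, then emit the prefixes of each token.
fn name_prefix_tokens(text: &str, min: usize, max: usize) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| t.to_lowercase())
        .flat_map(|t| edge_ngrams(&t, min, max))
        .collect()
}
```

With these helpers, a name such as "Tokio-Util" yields the tokens tokio and util for name, the single token tokio-util for name.full, and one token per prefix of each word for name.prefix.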

Other fields that are indexed:

  • categories are indexed using the same pipeline as name.full, since they should come from a fixed list
  • keywords are indexed using the same pipeline as name, since they are free text
  • description and readme use the same pipeline as name.

Note that at search time, the edge ngram filter should not be applied, in order to reduce noise.
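To show why the edge ngram filter is left out at search time, here is a small standard-library sketch. The in-memory prefix index and the helper names are hypothetical and only mimic the behaviour of the name.prefix field; they are not the Tantivy implementation.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Prefixes of `token` whose length in characters is between `min` and `max`.
fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let upper = max.min(chars.len());
    (min..=upper).map(|n| chars[..n].iter().collect()).collect()
}

/// Hypothetical in-memory index: prefix -> crate names containing it.
fn build_prefix_index(names: &[&str]) -> BTreeMap<String, BTreeSet<String>> {
    let mut index = BTreeMap::new();
    for name in names {
        for gram in edge_ngrams(&name.to_lowercase(), 1, 10) {
            index
                .entry(gram)
                .or_insert_with(BTreeSet::new)
                .insert(name.to_string());
        }
    }
    index
}

/// Crates matching at least one of the query terms.
fn lookup(
    index: &BTreeMap<String, BTreeSet<String>>,
    terms: &[String],
) -> BTreeSet<String> {
    terms
        .iter()
        .filter_map(|t| index.get(t))
        .flatten()
        .cloned()
        .collect()
}
```

With serde and syn indexed, looking up the raw term "se" matches only serde. If the query were also expanded into edge ngrams ("s", "se"), the term "s" would match syn as well, which is the noise the note above refers to.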

How to index

When Alexandrie starts, it indexes everything.

Things that still need work

  • Currently, running the indexer endpoint causes an HTTP 500 error when trying to access the UI. It comes from a lock on the database, since I browse all crates for indexing in a single transaction. Use the run method and index at startup instead of in an endpoint.
  • The API search endpoint needs to change, as I only changed the frontend search
  • New crates aren't indexed yet
  • Though the field exists in Tantivy, readmes aren't indexed

Dalvany added 10 commits May 14, 2023 22:44
Instead, the RwLock is in the index writer to allow interior mutability
for commit operation. Though IndexWriter should be thread safe, I don't
know how to achieve mutability without the RwLock.

In Tantivy's example, it is stated that for a search server you typically
want one reader for the lifetime of the application

Indexing at startup is faster than indexing in an endpoint. 114k crates
from crates.io get indexed within less than 10s.
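The RwLock approach mentioned in the first commit message can be sketched as follows, using only the standard library. `MockWriter` and `SharedIndex` are hypothetical stand-ins, not Tantivy types; the only assumption carried over is that committing needs mutable access to the writer while the shared handle is only available by shared reference.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for an index writer whose methods need `&mut self`.
struct MockWriter {
    pending: Vec<String>,
    committed: usize,
}

impl MockWriter {
    fn add_document(&mut self, doc: String) {
        self.pending.push(doc);
    }

    /// Flushes pending documents and returns the total committed so far.
    fn commit(&mut self) -> usize {
        self.committed += self.pending.len();
        self.pending.clear();
        self.committed
    }
}

/// Shared handle: every method takes `&self`; the lock provides the mutability.
struct SharedIndex {
    writer: RwLock<MockWriter>,
}

impl SharedIndex {
    fn new() -> Self {
        SharedIndex {
            writer: RwLock::new(MockWriter { pending: Vec::new(), committed: 0 }),
        }
    }

    fn index(&self, doc: &str) {
        self.writer
            .write()
            .expect("lock poisoned")
            .add_document(doc.to_string());
    }

    fn commit(&self) -> usize {
        self.writer.write().expect("lock poisoned").commit()
    }
}

/// Index documents from several threads, then commit once.
fn index_from_threads(docs_per_thread: usize, threads: usize) -> usize {
    let shared = Arc::new(SharedIndex::new());
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for i in 0..docs_per_thread {
                    shared.index(&format!("crate-{}-{}", t, i));
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("thread panicked");
    }
    shared.commit()
}
```

A Mutex would work equally well in this sketch, since every access here takes the write lock; an RwLock only pays off if some operations can run under a read lock.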