Add tantivy index #150

Dalvany · 2023-05-14T21:38:00Z

This pull request aims to address issue #19 by adding a Tantivy index.
While there is still room for improvement, it might be a first step.

Configuration

Add a [search] section with a directory field that sets where Tantivy should store its files.

[search]
directory = "/tmp/alexandrie/tantivy"

Search

Because queries go through Tantivy's QueryParser, you can use the full Tantivy query language.

By default, search uses all fields except name.prefix (see below), while the suggester searches name, name.full and name.prefix (see below).

Implementation

`fts` module

This module contains all full text search related structures.

The Tantivy structure handles all the boilerplate needed to set up an index, search and suggest. It also provides delegate methods to index and commit documents.

TantivyDocument is a structure that represents a crate and can be converted into a Tantivy Document.

Indices

A crate's name is indexed multiple times to improve the relevance of both suggester and search results.

name: a simple tokenized version of the crate's name:
- tokenize on non-alphanumeric characters using SimpleTokenizer
- apply English stop words
- apply lower-casing to make search case-insensitive
name.full: not tokenized, only lower-cased. Its main purpose is to increase relevance when the searched text exactly matches a crate name.
name.prefix: indexes word prefixes to support the suggester:
- tokenize on non-alphanumeric characters
- lower-case
- apply a custom edge n-gram filter to index word prefixes
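The name.prefix steps above can be sketched with the standard library alone. This is only an illustration of the edge n-gram idea, not the filter from this PR: the function names and the prefix length bounds (1 to 10 characters) are assumptions, and the real implementation would live in a Tantivy token filter.

```rust
/// Split on non-alphanumeric characters and lower-case each token.
fn tokenize_lower(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Emit every leading prefix of `token` whose length (in characters)
/// lies between `min` and `max`. `min` is expected to be at least 1.
fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let len = token.chars().count();
    (min..=max.min(len))
        .map(|n| token.chars().take(n).collect())
        .collect()
}

/// Hypothetical index-time pipeline for `name.prefix`:
/// tokenize, lower-case, then expand each token into its prefixes.
/// The bounds 1 and 10 are assumed values for illustration.
fn name_prefix_terms(name: &str) -> Vec<String> {
    tokenize_lower(name)
        .iter()
        .flat_map(|t| edge_ngrams(t, 1, 10))
        .collect()
}

fn main() {
    for term in name_prefix_terms("Tokio-Util") {
        println!("{term}");
    }
}
```

With these terms in the index, a suggester query such as "tok" can match a crate with a plain term lookup instead of a prefix scan.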

Other fields that are indexed:

categories are indexed using the same pipeline as name.full, as they should come from a precise list
keywords are indexed using the same pipeline as `name`, as they are free text
description and readme use the same pipeline as name.
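To make the difference between the two shared pipelines concrete, here is a minimal standard-library sketch. The function names and the tiny stop-word list are assumptions for illustration, not the list Tantivy or this PR uses. The sketch lower-cases before the stop-word check so that capitalized stop words are also dropped.

```rust
/// A tiny illustrative stop-word list (an assumption, not the real list).
const STOP_WORDS: &[&str] = &["a", "an", "and", "for", "in", "of", "the", "to"];

/// `name`-style analysis (also used for keywords, description and readme):
/// split on non-alphanumeric characters, lower-case, drop stop words.
fn analyze_name(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}

/// `name.full`-style analysis (also used for categories):
/// the whole input is kept as one lower-cased term.
fn analyze_full(text: &str) -> Vec<String> {
    vec![text.to_lowercase()]
}

fn main() {
    println!("{:?}", analyze_name("A parser for the TOML format"));
    println!("{:?}", analyze_full("Command-Line-Utilities"));
}
```

Because the name.full style never splits its input, a category only matches when the whole value is given, which suits values drawn from a fixed list.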

Note that at search time, the edge n-gram filter should not be applied, to reduce noise.
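The noise can be shown with a toy inverted index. This is a sketch under stated assumptions: the index is a plain map rather than Tantivy, and the crate names and helper functions are chosen only for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// All leading prefixes of `token`, from 1 character up to its full length.
fn edge_ngrams(token: &str) -> Vec<String> {
    (1..=token.chars().count())
        .map(|n| token.chars().take(n).collect())
        .collect()
}

/// Toy index: each prefix term maps to the crate names that produced it.
fn build_index(names: &[&str]) -> BTreeMap<String, BTreeSet<String>> {
    let mut index: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for name in names {
        for term in edge_ngrams(&name.to_lowercase()) {
            index.entry(term).or_default().insert(name.to_string());
        }
    }
    index
}

/// Union of the crates matched by any of the query terms.
fn lookup(index: &BTreeMap<String, BTreeSet<String>>, terms: &[String]) -> BTreeSet<String> {
    terms
        .iter()
        .filter_map(|t| index.get(t))
        .flatten()
        .cloned()
        .collect()
}

fn main() {
    let index = build_index(&["serde", "syn", "sha2"]);
    let query = "ser";

    // Search-time analysis without edge n-grams: a single term, "ser".
    let plain = lookup(&index, &[query.to_string()]);

    // Edge n-grams on the query add "s" and "se", so every crate
    // starting with "s" matches as well.
    let noisy = lookup(&index, &edge_ngrams(query));

    println!("plain: {plain:?}");
    println!("noisy: {noisy:?}");
}
```

The prefixes are already stored at index time, so the query only needs tokenizing and lower-casing.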

How to index

When Alexandrie starts, it indexes everything.

Things that still need work

~~Actually running indexer endpoint causes 500 HTTP error when trying to access UI. It comes from a lock on the database since I browse all crates for indexing in a single transaction.~~ Use the run method and index at startup instead of in an endpoint.
The API search endpoint still needs to change, as I only changed the frontend search
New crates aren't yet indexed
Though the field exists in Tantivy, readmes aren't indexed


Dalvany added 10 commits May 14, 2023 22:44

Add tantivy index

22227a2

Run ft

e5bb5b0

Fix clippy warning

217fc68

Reduce logs

6706797

Remove RwLock on Tantivy

b2987f5

Instead, the RwLock is in the index writer to allow interior mutability for commit operation. Though IndexWriter should be thread safe, I don't know how to achieve mutability without the RwLock.
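A minimal sketch of the pattern this commit message describes, using a stand-in writer rather than Tantivy's IndexWriter. The types `StandInWriter` and `SearchIndex` are hypothetical. The stand-in mirrors the shape that, as far as I know, Tantivy's writer has: adding a document works through `&self`, while committing needs `&mut self`. The RwLock lets the owning structure expose both through `&self`.

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Stand-in for an index writer (hypothetical, not Tantivy's type).
struct StandInWriter {
    pending: Mutex<Vec<String>>,
    committed: usize,
}

impl StandInWriter {
    /// Adding only needs shared access.
    fn add_document(&self, doc: String) {
        self.pending.lock().unwrap().push(doc);
    }

    /// Committing needs exclusive access; returns the total committed so far.
    fn commit(&mut self) -> usize {
        let pending = self.pending.get_mut().unwrap();
        self.committed += pending.len();
        pending.clear();
        self.committed
    }
}

/// Shared search structure: the lock sits on the writer only.
struct SearchIndex {
    writer: RwLock<StandInWriter>,
}

impl SearchIndex {
    fn new() -> Self {
        SearchIndex {
            writer: RwLock::new(StandInWriter {
                pending: Mutex::new(Vec::new()),
                committed: 0,
            }),
        }
    }

    /// A read lock is enough to add documents, so adds can run concurrently.
    fn index(&self, doc: &str) {
        self.writer.read().unwrap().add_document(doc.to_string());
    }

    /// The write lock provides the `&mut` access that commit requires.
    fn commit(&self) -> usize {
        self.writer.write().unwrap().commit()
    }
}

fn main() {
    let index = Arc::new(SearchIndex::new());
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let index = Arc::clone(&index);
            thread::spawn(move || index.index(&format!("crate-{i}")))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("committed {} documents", index.commit());
}
```

A Mutex around the writer would also work. The RwLock has the advantage that indexing threads only block each other while a commit holds the write lock.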

Remove Index and add IndexReader

6313c4e

Tantivy's example states that for a search server you typically want one reader for the lifetime of the application.

Index at startup and remove endpoint

efbcb7c

Indexing at startup is faster than indexing in an endpoint. 114k crates from crates.io get indexed within less than 10s.

Run fmt

2759c53

Improve merging crates with keywords and categories

3aad767

Fix and handle publish

3054040

Dalvany force-pushed the master branch from faa2aad to 3054040 Compare May 22, 2023 20:11

Dalvany closed this Jun 2, 2023
