Skip to content

cbren/dovecot-fts-flatcurve

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FTS Flatcurve plugin for Dovecot

Why This Fork?

EL9 repos provide Dovecot 2.3.16 built with stemmer and icu as Flatcurve requires, along with the added benefit of security backports and long-term support. So far the only breaking change from version 0.2 and 0.3+ is the switch Dovecot 2.3.17 to file based locking instead of dot locking. By reverting that change, ideally FTS-Flatcurve will be usable with EL9 distros without having to rebuild Dovecot until such time as they adopt Dovecot 2.4+ once it's released.

fts-flatcurve will become the default Dovecot Community Edition (CE) FTS driver in v2.4 (merged into Dovecot core in April 2022: https://github.com/dovecot/core/commit/137572e77fdf79b2e8d607021667741ed3f19da1). fts-flatcurve will continue to be maintained in this repository for backwards support with Dovecot CE v2.3.x. However, it is possible that configuration and features may differ between this v2.3 code and core v2.4 code.

What?

This is a Dovecot FTS plugin to enable message indexing using the Xapian Open Source Search Engine Library.

The plugin relies on Dovecot to do the necessary stemming. It is intended to act as a simple interface to the Xapian storage/search query functionality.

This driver supports match scoring and substring matches, which means it is RFC 3501 (IMAP4rev1) compliant (although substring searches are off by default). This driver does not support fuzzy searches, as there is no built-in support in Xapian for it.

The driver passes all of the ImapTest search tests.

Why Flatcurve?

This plugin was originally written during the initial stages of the 2020 Coronavirus pandemic.

Get it?

For details on design philosophy, see slusarz#4 (comment).

Requirements

  • Dovecot CE v2.3.17+
  • Xapian 1.2.x+ (tested on Xapian 1.2.22, 1.4.11, 1.4.18, 1.4.19)
    • 1.4+ is required for automatic optimization support
      • 1.2.x versions require manual optimization (this is a limitation of the Xapian library)

Compilation

If you downloaded this package using Git, you will first need to run autogen.sh to generate the configure script and some other files:

./autogen.sh

The following compilation software/packages must be installed:

  • autoconf
  • automake
  • libtool
  • GNU make

After this script is executed successfully, configure needs to be executed with the following parameters:

  • --with-dovecot=<path>

    Path to the dovecot-config file. This can either be a compiled dovecot source tree or point to the location where the dovecot-config file is installed on your system (typically in the $prefix/lib/dovecot directory).

When these parameters are omitted, the configure script will try to find the local Dovecot installation implicitly.

For example, when compiling against compiled Dovecot sources:

./configure --with-dovecot=../dovecot-src

Or when compiling against a Dovecot installation:

./configure --with-dovecot=/path/to/dovecot

To compile and install, execute the following:

make
sudo make install

Configuration

See https://doc.dovecot.org/configuration_manual/fts/ for configuration information regarding general FTS plugin options.

Note: flatcurve REQUIRES the core Dovecot FTS stemming feature.

FTS-Flatcurve Plugin Settings

The default parameters should be fine for most people.

fts_flatcurve_commit_limit

  • Default: 500
  • Value: integer, set to 0 to use the Xapian default

Commit database changes after this many documents are updated. Higher commit limits will result in faster indexing for large transactions (i.e. indexing a large mailbox) at the expense of high memory usage. The default value should be sufficient to allow indexing in a 256 MB maximum size process.

fts_flatcurve_max_term_size

  • Default: 30
  • Value: integer, maximum 200

The maximum number of characters in a term to index.

fts_flatcurve_min_term_size

  • Default: 2
  • Value: integer

The minimum number of characters in a term to index.

fts_flatcurve_optimize_limit

  • Default: 10
  • Value: integer, set to 0 to disable

Once the database reaches this number of shards, automatically optimize the DB at shutdown.

fts_flatcurve_rotate_size

  • Default: 5000
  • Value: integer, set to 0 to disable rotation

When the "current" fts database reaches this number of messages, it is rotated to a read-only database and replaced by a new write DB. Most people should not change this setting.

fts_flatcurve_rotate_time

  • Default: 5000
  • Value: integer, set to 0 to disable rotation

When the "current" fts database exceeds this length of time (in msecs) to commit changes, it is rotated to a read-only database and replaced by a new write DB. Most people should not change this setting.

fts_flatcurve_substring_search

  • Default: no
  • Value: boolean (yes or no)

If enabled, allows substring searches (RFC 3501 compliant). However, this requires significant additional storage space. Most users today expect "Google-like" behavior, which is prefix searching, so substring searching is arguably not the "modern, expected" behavior. Therefore, even though it is not strictly RFC compliant, prefix (non-substring) searching is enabled by default.

FTS-Flatcurve Plugin Settings Example

mail_plugins = $mail_plugins fts fts_flatcurve

plugin {
  fts = flatcurve

  # Recommended default FTS core configuration
  fts_filters = normalizer-icu snowball stopwords
  fts_filters_en = lowercase snowball english-possessive stopwords

  # All of these are optional, and indicate the default values.
  # They are listed here for documentation purposes; most people should
  # not need to define/override in their config.
  fts_flatcurve_commit_limit = 500
  fts_flatcurve_max_term_size = 30
  fts_flatcurve_min_term_size = 2
  fts_flatcurve_optimize_limit = 10
  fts_flatcurve_rotate_size = 5000
  fts_flatcurve_rotate_time = 5000
  fts_flatcurve_substring_search = no
}

Data Storage

Xapian search data is stored separately for each mailbox.

The data is stored under a 'fts-flatcurve' directory in the Dovecot index file location for the mailbox. The Xapian library is responsible for all data stored in that directory - no Dovecot code directly writes to any file.

Logging/Events

This plugin emits events with the category fts-flatcurve (a child of the category fts).

Named Events

The following named events are emitted:

fts_flatcurve_expunge

Emitted when a message is expunged from a mailbox.

Field Description
mailbox The mailbox name
uid The UID that was expunged from FTS index

fts_flatcurve_index

Emitted when a message is indexed.

Field Description
mailbox The mailbox name
uid The UID that was added to the FTS index

fts_flatcurve_last_uid

Emitted when the system queries for the last UID indexed.

Field Description
mailbox The mailbox name
uid The last UID contained in the FTS index

fts_flatcurve_optimize

Emitted when a mailbox is optimized.

Field Description
mailbox The mailbox name

fts_flatcurve_query

Emitted when a query is completed.

Field Description
count The number of messages matched
mailbox The mailbox name
maybe Are the results uncertain? [yes | no]
query The query text sent to Xapian
uids The list of UIDs returned by the query

fts_flatcurve_rescan

Emitted when a rescan is completed.

Field Description
expunged The list of UIDs that were expunged during rescan
mailbox The mailbox name
status Status of rescan [expunge_msgs | missing_msgs | ok]
uids The list of UIDs that triggered a non-ok status response

fts_flatcurve_rotate

Emitted when a mailbox has its underlying Xapian DB rotated.

Field Description
mailbox The mailbox name

Debugging

Flatcurve outputs copious debug information. To view, add this to dovecot.conf:

# This requires Dovecot v2.3.13+
log_debug = category=fts-flatcurve

doveadm Commands

This plugin implements several fts-flatcurve specific doveadm commands.

doveadm fts-flatcurve check <mailbox mask>

Run a simple check on Dovecot Xapian databases, and attempt to fix basic errors (it is the same checking done by the xapian-check command with the F option).

<mailbox mask> is the list of mailboxes to process. It is possible to use wildcards (* and ?) in this value.

For each mailbox that has FTS data, it outputs the following key/value fields:

Key Value
mailbox The human-readable mailbox name. (key is hidden)
guid The GUID of the mailbox.
errors The number of errors reported by the Xapian library.
shards The number of index shards processed.

doveadm fts-flatcurve dump [-h] <mailbox mask>

Dump the headers or terms of the Xapian databases.

If -h command line option is given, a list of headers and the number of times that header was indexed is output. Without that option, the list of search terms are output with the number of times it appears in the databse.

<mailbox mask> is the list of mailboxes to process. It is possible to use wildcards (* and ?) in this value.

All mailboxes are processed together and a single value for all headers/terms is given.

The following key/value fields are output:

Key Value
count The number of times the header/term appears in the DB
header The header (if -h is given)
term Term (if -h is NOT given)

doveadm fts-flatcurve remove <mailbox mask>

Removes all FTS data for a mailbox.

<mailbox mask> is the list of mailboxes to process. It is possible to use wildcards (* and ?) in this value.

For each mailbox removed, it outputs the following key/value fields:

Key Value
mailbox The human-readable mailbox name. (key is hidden)
guid The GUID of the mailbox.

doveadm fts-flatcurve rotate <mailbox mask>

Triggers an index rotation for a mailbox.

<mailbox mask> is the list of mailboxes to process. It is possible to use wildcards (* and ?) in this value.

For each mailbox rotated, it outputs the following key/value fields:

Key Value
mailbox The human-readable mailbox name. (key is hidden)
guid The GUID of the mailbox.

doveadm fts-flatcurve stats <mailbox mask>

Returns FTS data for a mailbox.

<mailbox mask> is the list of mailboxes to process. It is possible to use wildcards (* and ?) in this value.

For each mailbox that has FTS data, it outputs the following key/value fields:

Key Value
mailbox The human-readable mailbox name. (key is hidden)
guid The GUID of the mailbox.
last_uid The last UID indexed in the mailbox.
messages The number of messages indexed in the mailbox.
shards The number of index shards.
version The (Dovecot internal) version of the FTS data.

Acknowledgements

Thanks to:

  • Joan Moreau jom@grosjo.net; fts-xapian was the inspiration to use Xapian as the FTS library, although fts-flatcurve is not based or derived from that code
  • Aki Tuomi aki.tuomi@open-xchange.com and Jeff Sipek jeff.sipek@open-xchange.com; conversations with them directly convinced me to pursue this project
  • Marco Bettini, who did the heavy lifting necessary to merge this code into Dovecot core; most backported fixes from 2.4 is due to his work.
  • Timo Siriainen for helping Marco with code review and cleaning up rough edges in the design.

Benchmarking

Indexing benchmark with substring matching ENABLED

Linux 5.14.18-300.fc35.x86_64 (Fedora 35)
Dovecot 2.3.17; Xapian 1.4.18
Host CPU: AMD RYZEN 7 1700 8-Core 3.0 GHz (3.7 GHz Turbo)
Using fts_flatcurve as of 20 November 2021

-- Indexing Trash Mailbox w/25867 messages
-- (e.g. this is "legitimate" mail; it does not include Spam)
-- FTS index deleted before run (Dovecot caches NOT deleted)
-- Dovecot plugin configuration: "fts_flatcurve ="
-- Limit process to 256 MB
$ ulimit -v 256000 && /usr/bin/time -v doveadm index -u foo Trash
	User time (seconds): 200.83
	System time (seconds): 2.79
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:24.66
	Maximum resident set size (kbytes): 104972
	Minor (reclaiming a frame) page faults: 26176
	Voluntary context switches: 39
	Involuntary context switches: 1569
	File system outputs: 2410928

Median throughput: ~125 msgs/second

$ doveadm fts-flatcurve stats -u foo Trash
Trash guid=72dfe40cb7f4996156000000da7fd742 last_uid=25867 messages=25867 shards=6 version=1

-- Compacting mailbox
$ du -s fts-flatcurve/
753448 fts-flatcurve/
$ /usr/bin/time -v doveadm fts optimize -u foo
	User time (seconds): 5.87
	System time (seconds): 0.48
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.39
        Maximum resident set size (kbytes): 13024
        Minor (reclaiming a frame) page faults: 1202
        Voluntary context switches: 7
        Involuntary context switches: 109
        File system outputs: 1240504
$ du -s fts-flatcurve/
399476 fts-flatcurve/

-- Comparing to size of Trash mailbox
$ doveadm mailbox status -u foo vsize Trash
Trash vsize=1162552360
$ echo "scale=3; (512348 * 1024) / 1162426786" | bc
.351  [Index = ~35% the size of the total mailbox data size]

Indexing benchmark with substring matching DISABLED (DEFAULT configuration)

Linux 5.14.18-300.fc35.x86_64 (Fedora 35)
Dovecot 2.3.17; Xapian 1.4.18
Host CPU: AMD RYZEN 7 1700 8-Core 3.0 GHz (3.7 GHz Turbo)
Using fts_flatcurve as of 20 November 2021

-- Indexing Trash Mailbox w/25867 messages
-- (e.g. this is "legitimate" mail; it does not include Spam)
-- FTS index deleted before run (Dovecot caches NOT deleted)
-- Dovecot plugin configuration: "fts_flatcurve = substring_search=no"
-- Limit process to 256 MB
$ ulimit -v 256000 && /usr/bin/time -v doveadm index -u foo Trash
	User time (seconds): 93.90
	System time (seconds): 1.18
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35.52
        Maximum resident set size (kbytes): 46316
        Minor (reclaiming a frame) page faults: 10224
        Voluntary context switches: 40
        Involuntary context switches: 460
        File system outputs: 3479522

Median throughput: ~270 msgs/second

$ doveadm fts-flatcurve stats -u foo Trash
Trash guid=126e7a0269fc99615c0000006d6fda7a last_uid=25867 messages=25867 shards=6 version=1

-- Compacting mailbox
$ du -s fts-flatcurve/
147400 fts-flatcurve/
$ /usr/bin/time -v doveadm fts optimize -u foo
	User time (seconds): 0.82
	System time (seconds): 0.09
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.93
        Maximum resident set size (kbytes): 13104
        Minor (reclaiming a frame) page faults: 1162
        Voluntary context switches: 7
        Involuntary context switches: 7
        File system outputs: 242472
$ du -s fts-flatcurve/
84812 fts-flatcurve/

-- Comparing to size of Trash mailbox
$ doveadm mailbox status -u foo vsize Trash
Trash vsize=1162552360
$ echo "scale=3; (84812 * 1024) / 1162552360" | bc
.074  [Index = ~7.4% the size of the total mailbox data size]

FTS Plugin configuration for the tests

plugin {
  fts = flatcurve
  fts_autoindex = no
  fts_enforced = yes
  fts_filters = normalizer-icu snowball stopwords
  fts_filters_en = lowercase snowball english-possessive stopwords
  fts_flatcurve_substring_search = [yes|no]
  fts_index_timeout = 60s
  fts_languages = en es de
  fts_tokenizer_generic = algorithm=simple
  fts_tokenizers = generic email-address
}

Technical Information

Database Design

See https://github.com/slusarz/dovecot-fts-flatcurve/blob/master/src/fts-backend-flatcurve-xapian.cpp#L25

Licensing

LGPL v2.1 (see COPYING)

(c) Michael Slusarz

About

Dovecot FTS Flatcurve plugin (Xapian)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 43.0%
  • C 35.3%
  • M4 19.3%
  • Shell 1.6%
  • Makefile 0.8%