Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove fieldnorm #1134

Closed
fulmicoton opened this issue Feb 11, 2022 · 5 comments · Fixed by #1215
Closed

Remove fieldnorm #1134

fulmicoton opened this issue Feb 11, 2022 · 5 comments · Fixed by #1215
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@fulmicoton
Copy link
Contributor

fulmicoton commented Feb 11, 2022

Tantivy now makes it possible to not index fieldnorm. This is perfectly useless for logs so we might as well disable it in quickwit. This should make all index a few percent smaller.

@fulmicoton fulmicoton added the enhancement New feature or request label Feb 11, 2022
@fulmicoton fulmicoton changed the title Remove fieldnorm by default Remove fieldnorm Feb 11, 2022
@fulmicoton fulmicoton added good first issue Good for newcomers help wanted Extra attention is needed labels Mar 4, 2022
@fulmicoton
Copy link
Contributor Author

Context:
Fieldnorm is a datastructure that encodes, for a given field, an approximation of the number of tokens in th field per document.

It's purpose is to compute the BM25 similarity metric (you can consider it a refined version of TfIdf).

Until recently, producing fieldnorms was not optional in tantivy.
An option was recently added in the schema to skip the creation of this datastructure.
(It takes exactly one byte per document and per indexed field)
See quickwit-oss/tantivy#1001

In Quickwit, we do not support BM25 at the moment so this datastructure is perfectly useless.
It is however produced because the code of quickwit pre-dates the change in tantivy that made it optional.

@infiniteregrets
Copy link
Contributor

Since the options are set using https://docs.rs/tantivy/latest/tantivy/schema/struct.TextOptions.html#method.set_indexing_options which takes in the TextFieldIndexing configuration. So I believe the only change to not use fieldnorms should be like so:
indexing_options = indexing_options.set_index_option(index_option).set_fieldnorms(false); in default_doc_mapper/field_mapping_entry.rs

In case I am missing something important let me know, Ill take another look otherwise ill open a PR to fix this

@fmassot
Copy link
Contributor

fmassot commented Mar 23, 2022

In case I am missing something important let me know, Ill take another look otherwise ill open a PR to fix this

What would be nice is to be able to choose if we want a fieldnorm or not. By default, no fieldnorm. Can you add a parameter to configure this new option?

@infiniteregrets
Copy link
Contributor

@fmassot
That's a great idea! This is the diff of what I could come up with

diff --git a/quickwit-doc-mapper/src/default_doc_mapper/field_mapping_entry.rs b/quickwit-doc-mapper/src/default_doc_mapper/field_mapping_entry.rs
index 7c6897d..49942fc 100644
--- a/quickwit-doc-mapper/src/default_doc_mapper/field_mapping_entry.rs
+++ b/quickwit-doc-mapper/src/default_doc_mapper/field_mapping_entry.rs
@@ -486,6 +486,8 @@ struct FieldMappingEntryForSerialization {
     #[serde(skip_serializing_if = "Option::is_none")]
     indexed: Option<bool>,
     #[serde(skip_serializing_if = "Option::is_none")]
+    fieldnorms: Option<bool>,
+    #[serde(skip_serializing_if = "Option::is_none")]
     tokenizer: Option<String>,
     #[serde(skip_serializing_if = "Option::is_none")]
     record: Option<IndexRecordOption>,
@@ -527,6 +529,7 @@ impl From<FieldMappingEntry> for FieldMappingEntryForSerialization {
         let type_with_cardinality = value.mapping_type.type_with_cardinality();
         let mut fast = false;
         let mut indexed = None;
+        let mut fieldnorms = None;
         let mut record = None;
         let mut stored = false;
         let mut tokenizer: Option<String> = None;
@@ -538,6 +541,7 @@ impl From<FieldMappingEntry> for FieldMappingEntryForSerialization {
                     record = Some(indexing_options.index_option());
                 } else {

...skipping 1 line
+                    fieldnorms = Some(false);
                 }
             }
             FieldMappingType::I64(options, _)
@@ -561,6 +565,7 @@ impl From<FieldMappingEntry> for FieldMappingEntryForSerialization {
             type_with_cardinality,
             fast,
             indexed,
+            fieldnorms,
             record,
             stored,
             tokenizer,
@@ -602,7 +607,7 @@ impl FieldMappingEntryForSerialization {
         if self.indexed.unwrap_or(true) {
             let mut indexing_options = TextFieldIndexing::default();
             if let Some(index_option) = self.record {
-                indexing_options = indexing_options.set_index_option(index_option);
+                indexing_options = indexing_options.set_index_option(index_option).set_fieldnorms(self.fieldnorms.unwrap_or(false));
             }
             if let Some(tokenizer) = &self.tokenizer {
                 indexing_options = indexing_options.set_tokenizer(tokenizer);
@@ -796,6 +801,30 @@ mod tests {
         Ok(())
     }

...skipping 1 line
+    #[test]
+    fn test_deserialize_valid_fieldnorms() -> anyhow::Result<()> {
+        let result = serde_json::from_str::<FieldMappingEntry>(
+        r#"
+        {
+            "name": "my_field_name",
+            "type": "text",
+            "stored": true,
+            "indexed": true,
+            "fieldnorms": true,
+            "record": "basic",
+            "tokenizer": "english"
+        }"#);
+         match result.unwrap().mapping_type {
+            FieldMappingType::Text(options, _) => {
+                assert_eq!(options.is_stored(), true);
+                let index_options = options.get_indexing_options().unwrap();
+                assert_eq!(index_options.fieldnorms(), true);
+        }
+            _   => panic!("wrong property type"),
+      }
+         Ok(())
+    }
+

...skipping 1 line
     fn test_error_on_text_with_invalid_options() -> anyhow::Result<()> {
         let result = serde_json::from_str::<FieldMappingEntry>(
@@ -938,6 +967,7 @@ mod tests {
         match result.mapping_type {
             FieldMappingType::I64(options, cardinality) => {
                 assert_eq!(options.is_indexed(), true); // default
+                assert_eq!(options.fieldnorms(), false); // default
                 assert_eq!(options.is_fast(), false); // default
                 assert_eq!(options.is_stored(), true); // default
                 assert_eq!(cardinality, Cardinality::MultiValues);

It seems to be passing all the tests and the custom test I wrote, however I am not sure. If it helps to make it more readable i will open a pull request but I am not sure if I am supposed to fork first or directly push by creating a separate branch. Thank you!

@fmassot
Copy link
Contributor

fmassot commented Mar 23, 2022

I think you have the rights to create a branch, push it and open a PR.

fulmicoton pushed a commit that referenced this issue Apr 8, 2022
* Add option for fieldnorm
* Updated backward compatibility tests.

Closes #1134
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants