[DOCS] Reformat common grams token filter (elastic#48426)
jrodewig committed Oct 30, 2019
1 parent 4ce5c2d commit 569fb14
Showing 1 changed file with 157 additions and 99 deletions.
docs/reference/analysis/tokenfilters/common-grams-tokenfilter.asciidoc

[[analysis-common-grams-tokenfilter]]
=== Common grams token filter
++++
<titleabbrev>Common grams</titleabbrev>
++++

Generates https://en.wikipedia.org/wiki/Bigram[bigrams] for a specified set of
common words.

For example, you can specify `is` and `the` as common words. This filter then
converts the tokens `[the, quick, fox, is, brown]` to `[the, the_quick, quick,
fox, fox_is, is, is_brown, brown]`.

You can use the `common_grams` filter in place of the
<<analysis-stop-tokenfilter,stop token filter>> when you don't want to
completely ignore common words.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html[CommonGramsFilter].

[[analysis-common-grams-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request creates bigrams for `is`
and `the`:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    "common_grams", {
      "type": "common_grams",
      "common_words": ["is", "the"]
    }
  ],
  "text" : "the quick fox is brown"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "the_quick",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "fox",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox_is",
      "start_offset" : 10,
      "end_offset" : 16,
      "type" : "gram",
      "position" : 2,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "is_brown",
      "start_offset" : 14,
      "end_offset" : 22,
      "type" : "gram",
      "position" : 3,
      "positionLength" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-common-grams-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`common_grams` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>:

[source,console]
--------------------------------------------------
PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams"]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": ["a", "is", "the"]
        }
      }
    }
  }
}
--------------------------------------------------
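
Once the index exists, you can check the analyzer's output with the
<<indices-analyze,analyze API>>. A minimal usage sketch against the index
created above:

[source,console]
--------------------------------------------------
GET /common_grams_example/_analyze
{
  "analyzer": "index_grams",
  "text": "the quick fox is brown"
}
--------------------------------------------------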

[[analysis-common-grams-tokenfilter-configure-parms]]
==== Configurable parameters

`common_words`::
+
--
(Required+++*+++, array of strings)
A list of tokens. The filter generates bigrams for these tokens.

Either this or the `common_words_path` parameter is required.
--

`common_words_path`::
+
--
(Required+++*+++, string)
Path to a file containing a list of tokens. The filter generates bigrams for
these tokens.

This path must be absolute or relative to the `config` location. The file must
be UTF-8 encoded. Each token in the file must be separated by a line break.

Either this or the `common_words` parameter is required.
--
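
For example, the following sketch defines a filter that reads its common words
from a file. The index name, filter name, and the path
`analysis/example_common_words.txt` are hypothetical; substitute a UTF-8 file
under your `config` directory:

[source,console]
--------------------------------------------------
PUT /common_grams_file_example
{
  "settings": {
    "analysis": {
      "filter": {
        "common_grams_from_file": {
          "type": "common_grams",
          "common_words_path": "analysis/example_common_words.txt"
        }
      }
    }
  }
}
--------------------------------------------------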

`ignore_case`::
(Optional, boolean)
If `true`, matches for common words are case-insensitive.
Defaults to `false`.

`query_mode`::
+
--
(Optional, boolean)
If `true`, the filter excludes the following tokens from the output:

* Unigrams for common words
* Unigrams for terms followed by common words

Defaults to `false`. We recommend enabling this parameter for
<<search-analyzer,search analyzers>>.

For example, you can enable this parameter and specify `is` and `the` as
common words. This filter converts the tokens `[the, quick, fox, is, brown]` to
`[the_quick, quick, fox_is, is_brown]`.
--
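
To see the effect, you can rerun the earlier analyze API request with
`query_mode` enabled. A sketch, assuming the same common words and text:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    {
      "type": "common_grams",
      "common_words": ["is", "the"],
      "query_mode": true
    }
  ],
  "text" : "the quick fox is brown"
}
--------------------------------------------------

This request returns the tokens `[the_quick, quick, fox_is, is_brown]`.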

[[analysis-common-grams-tokenfilter-customize]]
==== Customize

To customize the `common_grams` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following request creates a custom `common_grams` filter with
`ignore_case` and `query_mode` set to `true`:

[source,console]
--------------------------------------------------
PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams_query"]
        }
      },
      "filter": {
        "common_grams_query": {
          "type": "common_grams",
          "common_words": ["a", "is", "the"],
          "ignore_case": true,
          "query_mode": true
        }
      }
    }
  }
}
--------------------------------------------------
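
Because `query_mode` is recommended for <<search-analyzer,search analyzers>>,
one common pattern pairs a plain `common_grams` analyzer at index time with a
`query_mode` analyzer at search time. A sketch; the index, analyzer, and field
names are illustrative:

[source,console]
--------------------------------------------------
PUT /common_grams_mapping_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams"]
        },
        "search_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams_query"]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": ["a", "is", "the"]
        },
        "common_grams_query": {
          "type": "common_grams",
          "common_words": ["a", "is", "the"],
          "query_mode": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body_text": {
        "type": "text",
        "analyzer": "index_grams",
        "search_analyzer": "search_grams"
      }
    }
  }
}
--------------------------------------------------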
