Docs: Improved tokenizer docs #18356

Merged
merged 8 commits into elastic:master from analysis_docs on May 19, 2016

Conversation

clintongormley

Added descriptions and runnable examples to all tokenizers

@clintongormley added the >docs, review, and :Search Relevance/Analysis labels May 14, 2016
used to build <<analysis-custom-analyzer,custom analyzers>>.
The tokenizer is also responsible for recording the order or _position_ of
each term (used for phrase and word proximity queries) and the start and end
_character offsets_ of the original word which the term represents (used foor
nik9000 (Member):

s/foor/for/

nik9000 (Member) commented May 15, 2016

Left some points about wording. I'd love to be able to test those results arrays. We could probably build something to make that work, given the number of _analyze results that we have floating around in the docs. It is nice to have such a short response format in the docs.
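
(For context, the "results arrays" here are the token lists returned by the _analyze API; a response has roughly this shape. This is a sketch with illustrative offsets and positions, abridged to a single token:)

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
----------------------------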

nik9000 (Member) commented May 15, 2016

I think the "it'd be nice to test" part can come later. This already adds more tests by adding // CONSOLE snippets. I'd be happy to make a quick change to get the responses tested after this is merged.
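
(For reference, a // CONSOLE snippet in the docs is just a runnable request like the following sketch; the standard tokenizer and the sample text are only illustrative:)

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": "standard",
  "text": "The quick brown fox."
}
----------------------------
// CONSOLE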

clintongormley (Author):

Thanks for the review @nik9000. I addressed most of your comments in a9b3db65625b52664825ba2db18057f45c12a9c6

In e79d5a85afc6fde119c7ab58f04a1076a1456391 I added TESTRESPONSEs for all of the tokenizer examples, wrapping them in comment blocks so that they don't render in the docs, eg:

/////////////////////

[source,js]
----------------------------
{  response body }
----------------------------
// TESTRESPONSE

/////////////////////
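
(Filled in, such a hidden block might look something like this sketch, here for a keyword tokenizer example; the token values are illustrative:)

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "New York",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////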

There's just one test in the edge ngrams docs which I'm not sure how to fix, as the took value will be different on every test run:

}
-----------------------------------
// CONSOLE
// TEST[skip:The took value will change on every request]
clintongormley (Author):

This GET request will return a different took value every time. I tried just removing it with TEST[s/"took".*"//] but that didn't work.

nik9000 (Member):

s/"took": 123/"took": "$body.took"/ is the way I've done it. Might be the
best way but it works.

(The snippet under discussion, in docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc:)

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo", <2>
        "operator": "and"
      }
    }
  }
}
-----------------------------------
// CONSOLE
// TEST[skip:The took value will change on every request]

clintongormley (Author):

I've added TESTRESPONSEs for analyzers in 38b89f9, and docs and TESTRESPONSEs for character filters in 89bbc7f.

The pattern_replace tests aren't passing because // TEST[continued] isn't working as expected.
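
(For reference, // TEST[continued] is meant to chain a snippet onto the test started by the preceding // CONSOLE snippet, roughly like this sketch; the index name and requests are hypothetical:)

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "number_of_shards": 1
  }
}
----------------------------
// CONSOLE

[source,js]
----------------------------
GET my_index/_settings
----------------------------
// CONSOLE
// TEST[continued]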

In docs/reference/analysis/charfilters/pattern-replace-charfilter.asciidoc:

----------------------------
<1> Note the incorrect highlight.
// TESTRESPONSE
clintongormley (Author):

I get the following exception when running the docs check:

TESTRESPONSE not paired with a snippet at .../charfilters/pattern-replace-charfilter.asciidoc:296

nik9000 (Member):

That means it didn't match the // TESTRESPONSE with the snippet above it. It doesn't like it if there is a newline between the snippet and the marker. Or something like that.

nik9000 (Member):

Switch the <1> and the comment. Would you like to propose a more useful error message?
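
(That is, the // TESTRESPONSE marker goes directly under the closing delimiter and the callout moves below it, along these lines:)

----------------------------
// TESTRESPONSE
<1> Note the incorrect highlight.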

clintongormley (Author):

Thanks @nik9000 - fixed that. There are two issues which remain:

  • The replacement of "$1" is interpreted as a stash variable and so fails. This would fail in the REST tests too.
  • The "took" value is always different. I tried `TESTRESPONSE[s/"took".*//]` but it seems to remove this from the provided response snippet, rather than from the received response.

nik9000 (Member) commented May 16, 2016

The replacement of "$1" is interpreted as a stash variable and so fails. This would fail in the REST tests too

\\$1 I think will do it, but I'm not sure. If not then maybe disable that test and merge and I'll hack on the rest infrastructure to make it work?

TESTRESPONSE[s/"took".*//]

`// TESTRESPONSE[s/"took".*/"took": "$body.took"/]`

s// just changes the generated tests. "$body.took" gets it from the stash.

clintongormley (Author):

`// TESTRESPONSE[s/"took".*/"took": "$body.took"/]`

this worked, thanks

$1 I think will do it,

That won't work because it changes what is displayed/sent to ES during the test. It's skipped for now.


The output from the above is:

[source,js]
----------------------------
{
  "timed_out": false,
  "took": $body.took,
nik9000 (Member):

I usually put a number here and use the replacement below to put the "$body.took" bit in. I usually use however long it took on my laptop so a particularly savvy reader gets a sense of how long it might take.
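
(Concretely, that pattern would look something like the following sketch; the 3 is just whatever the request took locally, and the rest of the response body is elided here:)

[source,js]
----------------------------
{
  "took": 3,
  "timed_out": false,
  ...
}
----------------------------
// TESTRESPONSE[s/"took": 3/"took": "$body.took"/]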

nik9000 (Member) commented May 17, 2016

That won't work because it changes what is displayed/sent to ES during the test. It's skipped for now.

LGTM. If you merge I'll have a look at the escape thing.

Added descriptions and runnable examples
@clintongormley clintongormley merged commit 5da9e5d into elastic:master May 19, 2016
@clintongormley clintongormley deleted the analysis_docs branch May 19, 2016 17:42