Merged from upstream #1

pdegeus · 2013-03-20T08:19:13Z

This pull request merges all changes from upstream (elasticsearch). Fixes sorting problem!

The `term` suggester provides a very convenient API to access word alternatives on a per-token basis within a certain string distance. The API allows accessing each token in the stream individually, while suggest-selection is left to the API consumer. Yet often, already ranked / selected suggestions are required in order to present them to the end-user. Inside ElasticSearch we have the ability to access far more statistics and information quickly to make better decisions about which token alternative to pick, or whether to pick an alternative at all.

This `phrase` suggester adds some logic on top of the `term` suggester to select entire corrected phrases instead of individual tokens, weighted based on *ngram-language models*. In practice it will be able to make better decisions about which tokens to pick based on co-occurrence and frequencies. The current implementation is kept quite general and leaves room for future improvements.

# API Example

The `phrase` request is defined alongside the query part in the json request:

```json
curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        } ]
      }
    }
  }
}'
```

The response contains suggestions sorted by the most likely spell correction first. In this case we got the expected correction `xorr the god jewel` first, while the second correction is less conservative where only one of the errors is corrected. Note, the request is executed with `max_errors` set to `0.5` so 50% of the terms can contain misspellings (see parameter descriptions below).
```json
{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2938,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "simple_phrase" : [ {
      "text" : "Xor the Got-Jewel",
      "offset" : 0,
      "length" : 17,
      "options" : [ {
        "text" : "xorr the god jewel",
        "score" : 0.17877324
      }, {
        "text" : "xor the god jewel",
        "score" : 0.14231323
      } ]
    } ]
  }
}
```

# Phrase suggest API

## Basic parameters

* `field` - the name of the field used to do n-gram lookups for the language model. The suggester will use this field to gain statistics to score corrections.
* `gram_size` - sets the max size of the n-grams (shingles) in the `field`. If the field doesn't contain n-grams (shingles) this should be omitted or set to `1`.
* `real_word_error_likelihood` - the likelihood of a term being misspelled even if the term exists in the dictionary. The default is `0.95`, corresponding to 5% of the real words being misspelled.
* `confidence` - the confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of `1.0` will only return suggestions that score higher than the input phrase. If set to `0.0` the top N candidates are returned. The default is `1.0`.
* `max_errors` - the maximum percentage of the terms that are at most considered to be misspellings in order to form a correction. This method accepts a float value in the range `[0..1)` as a fraction of the actual query terms, or a number `>=1` as an absolute number of query terms. The default is set to `1.0`, which means that only corrections with at most 1 misspelled term are returned.
* `separator` - the separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.
* `size` - the number of candidates that are generated for each individual query term. Low numbers like `3` or `5` typically produce good results. Raising this can bring up terms with higher edit distances. The default is `5`.
* `analyzer` - sets the analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field passed via `field`.
* `shard_size` - sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned based on the `size` option. Defaults to `5`.
* `text` - sets the text / query to provide suggestions for.

## Smoothing Models

The `phrase` suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (grams that appear at least once in the index).

* `laplace` - the default model that uses an additive smoothing model where a constant (typically `1.0` or smaller) is added to all counts to balance weights. The default `alpha` is `0.5`.
* `stupid_backoff` - a simple backoff model that backs off to lower order n-gram models if the higher order count is `0` and discounts the lower order n-gram model by a constant factor. The default `discount` is `0.4`.
* `linear_interpolation` - a smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`) must be supplied.

## Candidate Generators

The `phrase` suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a `term` suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.
Currently only one type of candidate generator is supported, the `direct_generator`. The Phrase suggest API accepts a list of generators under the key `direct_generator`; each of the generators in the list is called per term in the original text.

## Direct Generators

The direct generators support the following parameters:

* `field` - The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
* `analyzer` - The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
* `size` - The maximum number of corrections to be returned per suggest text token.
* `suggest_mode` - The suggest mode controls which suggestions are included, or for which suggest text terms suggestions should be made. Three possible values can be specified:
  * `missing` - Only suggest terms in the suggest text that aren't in the index. This is the default.
  * `popular` - Only suggest suggestions that occur in more docs than the original suggest text term.
  * `always` - Suggest any matching suggestions based on terms in the suggest text.
* `max_edits` - The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
* `min_prefix` - The minimum number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur at the beginning of terms.
* `min_query_length` - The minimum length a suggest text term must have in order to be included. Defaults to 4.
* `max_inspections` - A factor that is multiplied with the `shard_size` in order to inspect more candidate spell corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
* `threshold_frequency` - The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option.
* `max_query_frequency` - The maximum threshold in number of documents a suggest text token can exist in, in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified then it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked. High frequency terms are usually spelled correctly; on top of this, excluding them also improves the spellcheck performance. The shard level document frequencies are used for this option.
* `pre_filter` - a filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated. (optional)
* `post_filter` - a filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer. (optional)

The following example shows a `phrase` suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a `reverse` filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators, which require a constant prefix to provide high-performance suggestions. The `pre_filter` and `post_filter` options accept ordinary analyzer names.
```json
curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        }, {
          "field" : "reverse",
          "suggest_mode" : "always",
          "min_word_len" : 1,
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}'
```

`pre_filter` and `post_filter` can also be used to inject synonyms after candidates are generated. For instance, for the query `captain usq` we might generate a candidate `usa` for the term `usq`, which is a synonym for `america`. This allows us to present `captain america` to the user if this phrase scores high enough.

Closes elastic#2709
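To make the `laplace` and `stupid_backoff` smoothing models described above more concrete, here is a minimal Python sketch of their textbook forms over a toy bigram index. It is an illustration only and is not taken from the Elasticsearch scorer: the function names, the tiny corpus, and the way counts are collected are all assumptions, and the real implementation may normalise differently.

```python
from collections import Counter


def build_counts(sentences):
    """Count unigrams and bigrams in whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def laplace(prev, word, unigrams, bigrams, alpha=0.5):
    """Additive smoothing: add `alpha` to every bigram count."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)


def stupid_backoff(prev, word, unigrams, bigrams, discount=0.4):
    """Use the bigram estimate if the bigram was seen, otherwise
    back off to the unigram estimate scaled by `discount`."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return discount * unigrams[word] / total


# Hypothetical toy corpus standing in for the indexed `bigram` field.
corpus = ["xorr the god jewel", "the god jewel", "xorr the destroyer"]
unigrams, bigrams = build_counts(corpus)

seen = stupid_backoff("the", "god", unigrams, bigrams)      # bigram exists
unseen = stupid_backoff("the", "jewel", unigrams, bigrams)  # backs off to unigram
```

In this sketch a bigram that occurs in the corpus keeps its relative frequency, while an unseen bigram still receives a small non-zero score, which is the property both models are meant to provide.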

Closes elastic#2710

…re flipped creating non-existing bigram

(yet, validate early and exit when relevant)

…array or an object. Closes elastic#2275

…notified. Closes elastic#2692

…cts. Before this commit, either sort mode `min` or `max` (depending on sort order) was used when sort modes `avg` and `sum` were picked. Closes elastic#2701
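As a rough illustration of what the four sort modes mean, the following Python sketch collapses the values found in a document's nested objects into a single sort key. The document layout and the `prices` field are hypothetical, and this is not the Elasticsearch implementation.

```python
from statistics import mean

# Each sort mode reduces a list of nested values to one sort key.
SORT_MODES = {"min": min, "max": max, "sum": sum, "avg": mean}


def sort_value(nested_values, mode):
    """Collapse the nested values of one document into a single key."""
    return SORT_MODES[mode](nested_values)


def sort_docs(docs, mode, reverse=False):
    """Order documents by the reduced value of their nested `prices`."""
    return sorted(docs, key=lambda d: sort_value(d["prices"], mode), reverse=reverse)


# Hypothetical documents, each with values from several nested objects.
docs = [
    {"id": "a", "prices": [1, 9]},
    {"id": "b", "prices": [4, 5]},
    {"id": "c", "prices": [2, 2]},
]

by_min = [d["id"] for d in sort_docs(docs, "min")]
by_avg = [d["id"] for d in sort_docs(docs, "avg")]
```

With these values, `min` and `avg` give different orderings, which is why silently substituting `min` or `max` for `avg` or `sum` changes the results.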

…e shard response

Closes elastic#2656

fixes elastic#2694

fixes elastic#2624

The order in which routing and parent parameters are set is important. The routing parameter must be set first or it will overwrite the parent routing value.
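The ordering issue can be pictured with a small Python model. This is a hypothetical stand-in, not the Elasticsearch request class: it assumes that setting the routing always overwrites the stored value (even with an absent value), and that setting the parent only fills in the routing when none is set yet.

```python
class GetRequestModel:
    """Hypothetical model of a get request's routing handling."""

    def __init__(self):
        self.effective_routing = None

    def routing(self, value):
        # Assumption: an explicit routing call always overwrites,
        # even when the request parameter was absent (None).
        self.effective_routing = value
        return self

    def parent(self, value):
        # Assumption: the parent id is used as routing only when
        # no explicit routing value is present.
        if self.effective_routing is None:
            self.effective_routing = value
        return self


# Parent first, then an absent routing parameter: the parent routing is lost.
wrong_order = GetRequestModel().parent("p1").routing(None)

# Routing first, then parent: the parent id fills in the missing routing.
right_order = GetRequestModel().routing(None).parent("p1")

# An explicit routing value still wins over the parent id.
explicit = GetRequestModel().routing("r1").parent("p1")
```

Under these assumptions only the routing-then-parent order preserves the parent routing when no routing parameter is supplied.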

closes elastic#2718

… run it correctly

…ssed help with tests that run on slow machines

…f it can't be parsed. Closes elastic#2547

* Exposed the spatial strategy to be configurable as part of the geo_shape mappings
* Exposed the spatial strategy to be customizable at query time (will be used to generate the geo_shape filter/query)
* Removed XTermQueryPrefixTreeStrategy and reverted to use the lucene TermQueryPrefixTreeStrategy instead
* Made the RecursivePrefixTreeStrategy the default strategy to be used
* Removed support for all spatial operations except "intersects"
* Updated both the GeoShapeQueryBuilder and GeoShapeFilterBuilder with all the changes (removed the option of specifying the operation type, as only intersects is supported, and added the option of setting the filter/query spatial strategy)

Closes elastic#2720

Closes elastic#2683

This reverts commit 98f06c9.

Closes elastic#2772

Closes elastic#2773

…n if 'crs' field is included. Fixes elastic#2763

The REST Suggester API binds the 'Suggest API' to the REST layer directly. Hence there is no need to touch the query layer for requesting suggestions. This API extracts the Phrase Suggester API and makes 'suggestion request' top-level objects in suggestion requests. The complete API can be found in the underlying ["Suggest Feature API"](http://www.elasticsearch.org/guide/reference/api/search/suggest.html).

# API Example

The following examples show how Suggest Actions work on the REST layer. A simple request and its response are shown.

## Suggestion Request

```json
curl -XPOST 'localhost:9200/_suggest?pretty=true' -d '{
  "text" : "Xor the Got-Jewel",
  "simple_phrase" : {
    "phrase" : {
      "analyzer" : "bigram",
      "field" : "bigram",
      "size" : 1,
      "real_word_error_likelihood" : 0.95,
      "max_errors" : 0.5,
      "gram_size" : 2
    }
  }
}'
```

This example shows how to query a suggestion for the global text 'Xor the Got-Jewel'. A 'simple phrase' suggestion is requested and a 'direct generator' is configured to generate the candidates.

## Suggestion Response

On success the request above will reply with a response like the following:

```json
{
  "simple_phrase" : [ {
    "text" : "Xor the Got-Jewel",
    "offset" : 0,
    "length" : 17,
    "options" : [ {
      "text" : "xorr the the got got jewel",
      "score" : 3.5283546E-4
    } ]
  } ]
}
```

The 'suggest' response contains a single 'simple phrase' entry, which in turn contains an 'option'. This option represents a suggestion for the queried text. It contains the corrected text and a score indicating how likely this option is to be the intended text.

Closes elastic#2774
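For client code, the request and response shapes shown above can be handled with plain dictionaries. The following Python sketch builds the top-level `_suggest` body and reads the options back from a response; the helper names `build_phrase_suggest` and `best_options` are hypothetical, and no HTTP call is made here.

```python
import json


def build_phrase_suggest(text, field, name="simple_phrase", **options):
    """Build a top-level suggest body with one named `phrase` suggestion."""
    phrase = {"field": field}
    phrase.update(options)
    return {"text": text, name: {"phrase": phrase}}


def best_options(response, name="simple_phrase"):
    """Return (text, score) pairs for the named suggestion, best score first."""
    results = []
    for entry in response.get(name, []):
        ranked = sorted(entry.get("options", []), key=lambda o: o["score"], reverse=True)
        results.extend((o["text"], o["score"]) for o in ranked)
    return results


body = build_phrase_suggest(
    "Xor the Got-Jewel",
    "bigram",
    analyzer="bigram",
    size=1,
    real_word_error_likelihood=0.95,
    max_errors=0.5,
    gram_size=2,
)

# The response from the example above, parsed as a client would receive it.
sample_response = json.loads("""
{
  "simple_phrase" : [ {
    "text" : "Xor the Got-Jewel",
    "offset" : 0,
    "length" : 17,
    "options" : [ { "text" : "xorr the the got got jewel", "score" : 3.5283546E-4 } ]
  } ]
}
""")

suggestions = best_options(sample_response)
```

The string produced by `json.dumps(body)` is what would be sent as the request payload to `_suggest`.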

- make sure we close the parser
- fail when no content is provided in the rest request
- reuse the suggest parse element

Closes elastic#2780

… sort order in a Sort object. Closes elastic#2767

Closes elastic#2781

…StoreStats Closes elastic#2785

fixes elastic#2789

also add the list of current indices

- that isAnnotationPresent bug is known, and probably will be fixed in later versions, but it costs us nothing to not use it now
- some tests fail, mainly due to consistent ordering expected from Map (within versions) which does not seem to be preserved, need to fix those tests to be agnostic to it

…ically closes elastic#2795

…ways

this happens for example because we list assigned shards, and they might not have been allocated on the relevant node yet, no need to list those as actual failures in some APIs

also, throttle on socket failures, so it won't spin out of control... relates to elastic#2783

# By Shay Banon (43) and others
# Via Shay Banon
* master-upstream: (97 commits)
  better comment...
  if multicast socket closes, try and restart it also, throttle on socket failures, so it won't spin out of control... relates to elastic#2783
  multicastSocket should be volatile as well...
  broadcast API to by default ignore missing / illegal shard state this happens for example because we list assigned shards, and they might not have been allocated on the relevant node yet, no need to list those as actual failures in some APIs
  upgrade to guava 14.0.1
  tar.gz distro by mistake include a windows lib
  fix javadoc
  Correct filter strategy opt: random_access_random to random_access_always
  Field Data: optimize long type to use narrowest possible type automatically closes elastic#2795
  make ES compile with java 8 - that isAnnotationPresent bug is known, and probably will be fixed in later versions, but it costs us nothing to not use it now - some tests fail, mainly due to consistent ordering expected from Map (within versions) which does not seem to be preserved, need to fix those tests to be agnostic to it
  use ImmutableList.Builder instead of ArrayList
  fix logging message to include the index also add the list of current indices
  Mapping: dynamic flag is explicitly returned even when not set fixes elastic#2789
  Fix bug in RateLimiter.SimpleRateLimiter causing numeric overflow in StoreStats
  improve TODO comment
  add CamelCase support to Suggester where missing
  Remove `sort_order` and `sort_mode` in favor of `order` and `mode`
  Add `sort_oder` and `sortOrder` as valid field names for defining the sort order in a Sort object.
  Make StupidBackoff the default smoothing model for phrase suggester
  minor cleanup suggest api - make sure we close the parser - fail when no content is provided in the rest request - reuse the suggest parse element
  ...

Merged from upstream

kimchy and others added 30 commits February 28, 2013 16:02

not bytes...

2bc6248

Expose _explain via POST

b4b3e35

Closes elastic#2710

fix bug in StupidBackoffScorer were previous word and current word we…

c90c5cb

…re flipped creating non-existing bigram

improve timing in test to wait for state with graceful timeouts

849a367

(yet, validate early and exit when relevant)

add info in test for actual search failures

30075bb

always use the max score across the shards in suggest response

9c38989

throw IAE if fieldname is null - Closes elastic#2711

b03f3fc

Fail in metadata parsing if the id path is not a value but rather an …

3c1f291

…array or an object. Closes elastic#2275

Short Curcuit response if no indices exits and make sure listener is …

39f3623

…notified. Closes elastic#2692

Supporting sort modes avg and sum when sorting inside nested obje…

d99b532

…cts. Before this commit, either sort mode `min` or `max` (depending on sort order) was used when sort modes `avg` and `sum` were picked. Closes elastic#2701

ensure that suggestion only added on reduce if they are present in th…

fced68c

…e shard response

Throw IAE if indices is null or contains a null value.

aaa3c48

Closes elastic#2656

Throw correct ClassNotFoundException to debug classloader issues

d16efbe

more strict check before trying to parse and detect a string as a date

9b68e98

fixes elastic#2694

Analyze API returns in YAML format if analyzed string begins with ---

2eea992

fixes elastic#2624

Correct order of routing and parent params for Get

dfd9226

The order in which routing and parent parameters are set is important. The routing parameter must be set first or it will overwrite the parent routing value.

Query DSL: Filtered query to make query optional (defaults to mach_all)

6687ecb

closes elastic#2718

lazy set the indices on the search request now that its validated

fe8b372

spin a bit to wait for condition in test, so slow machines will still…

361d6bf

… run it correctly

add proper testing for bool filter

ea097af

Make BoolFilterBuilder output proper json

9273d76

add ability for cluster health to wait for current events to be proce…

50d1213

…ssed help with tests that run on slow machines

proper reason for cluster state task

5dd18ac

fix local flag in cluster health

0be5a78

Check for null query on Percolator query loading and omit the query i…

b951351

…f it can't be parsed. Closes elastic#2547

Fix bug when searching concrete and routing aliased indices

09f20e3

Closes elastic#2683

simplify searchShard selection when routing is present

e9ba989

add evictions stats to field data

e01879a

drewr and others added 28 commits March 12, 2013 19:09

Add s3-publishing script.

98f06c9

Revert "Add s3-publishing script."

bea18d9

This reverts commit 98f06c9.

tieBreaker in MultiMatchQueryBuilder should be a float, not an integer

93ca6e2

Closes elastic#2772

Use numOrds rather than numDocs as upperbound for sorting

365cde8

Closes elastic#2773

GeoJSONShapeParser parses JSON correctly and extracts coordinates eve…

125b33d

…n if 'crs' field is included. Fixes elastic#2763

avoiding NPE in Sigar FS

a127f2d

minor cleanup suggest api

91c51ef

- make sure we close the parser
- fail when no content is provided in the rest request
- reuse the suggest parse element

Make StupidBackoff the default smoothing model for phrase suggester

5f20d81

Closes elastic#2780

Add sort_oder and sortOrder as valid field names for defining the…

33608c3

… sort order in a Sort object. Closes elastic#2767

Remove sort_order and sort_mode in favor of order and mode

e0eff7d

Closes elastic#2781

add CamelCase support to Suggester where missing

0e3b88b

improve TODO comment

d5da8f2

Fix bug in RateLimiter.SimpleRateLimiter causing numeric overflow in …

c25eb7d

…StoreStats Closes elastic#2785

Mapping: dynamic flag is explicitly returned even when not set

111a132

fixes elastic#2789

fix logging message to include the index

2ed6ea2

also add the list of current indices

use ImmutableList.Builder instead of ArrayList

e347a62

Field Data: optimize long type to use narrowest possible type automat…

7d9cef9

…ically closes elastic#2795

Correct filter strategy opt: random_access_random to random_access_al…

2123ab5

…ways

fix javadoc

566d1d1

tar.gz distro by mistake include a windows lib

aca713d

upgrade to guava 14.0.1

bea7bdd

broadcast API to by default ignore missing / illegal shard state

c92207f

this happens for example because we list assigned shards, and they might not have been allocated on the relevant node yet, no need to list those as actual failures in some APIs

multicastSocket should be volatile as well...

f4a2124

if multicast socket closes, try and restart it

d5beea4

also, throttle on socket failures, so it won't spin out of control... relates to elastic#2783

better comment...

54e7e30

msimons added a commit that referenced this pull request Mar 20, 2013

Merge pull request #1 from pdegeus/master

ee2c9f4

Merged from upstream

msimons merged commit ee2c9f4 into msimons:master Mar 20, 2013
