Added lenient flag for synonym token filter #31484
Conversation
@Override
public void add(CharsRef input, CharsRef output, boolean includeOrig) {
    if (!lenient || (input.length > 0 && output.length > 0)) {
@nik9000 I am not sure if this looks very clean but it avoids a fork of the SolrSynonymParser.
@nik I am not sure if this looks very clean but it avoids a fork of the SolrSynonymParser.
Wrong Nik, sadly.
Ah corrected it. But I still got the attention of the right one ;)
Pinging @elastic/es-search-aggs
Additional settings are:

* `expand` (defaults to `true`).
* `lenient` (defaults to `false`). If `true`, ignores exceptions while parsing the synonym configuration.

This filter tokenizes synonyms with whatever tokenizer and token filters
do we want to add a description of the `lenient` setting to `synonym-graph-tokenfilter.asciidoc` as well?
Yes. I missed that in the initial commit. Thanks!
public class ElasticSynonymParser extends SolrSynonymParser {

    private boolean lenient;
Do we want to make the `lenient` variable `final` as well?
@Override
public void add(CharsRef input, CharsRef output, boolean includeOrig) {
    if (!lenient || (input.length > 0 && output.length > 0)) {
`!lenient` we usually write as `lenient == false`
@sohaibiftikhar I am not sure I understand this condition (`!lenient || (input.length > 0 && output.length > 0)`), can you please explain what you are trying to achieve here.
So if `lenient` is not set we should proceed as before. For the case when lenient is set, if there are exceptions during `analyze` we return a zero-length `CharsRef`. When these synonym mappings are added using the `add` method we skip the ones that were left empty (`super.add` would throw exceptions otherwise). Does this make sense?

I believe if the condition was only `input.length > 0 && output.length > 0` it would still work, because with the non-lenient setting it never really reaches this part with a zero-length `input` or `output`, since it would throw an exception during `analyze`. But I just wanted to be sure that I only change the logic when `lenient` is set.
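The guard being discussed can be sketched standalone (a simplified, illustrative model with hypothetical names; plain `String`s stand in for Lucene's `CharsRef`, and the inner throw mirrors `super.add()` rejecting empty terms):

```java
import java.util.ArrayList;
import java.util.List;

class LenientAddSketch {
    private final boolean lenient;
    private final List<String[]> mappings = new ArrayList<>();

    LenientAddSketch(boolean lenient) {
        this.lenient = lenient;
    }

    void add(String input, String output) {
        // With lenient == false the original throwing behaviour is kept.
        // With lenient == true, rules whose input or output was reduced to
        // zero length by analysis are silently dropped.
        if (lenient == false || (input.length() > 0 && output.length() > 0)) {
            if (input.isEmpty() || output.isEmpty()) {
                throw new IllegalArgumentException("term was completely eliminated by analyzer");
            }
            mappings.add(new String[] {input, output});
        }
    }

    int mappingCount() {
        return mappings.size();
    }
}
```

With `lenient == true`, an empty side skips the pair quietly; with `lenient == false`, the same pair throws, which matches the behaviour described above.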
@sohaibiftikhar Thanks for the explanation, it does make sense and the logic is correct, but your condition by itself is not obvious. I would either provide a more extensive explanation before this condition, or I would instead override the `addInternal` method of `SolrSynonymParser`, where the logic can be obvious.
I think overriding `addInternal` may not be the best idea because the method is private and it would mean repetition of the logic (it calls some other private methods that we'd need to copy as well), unless we want to go for a fork of `SolrSynonymParser`.

I will add the explanation above the condition as you suggested.
Overall, the PR is well implemented, and addresses the problem. Just some minor changes are necessary.
@Override
public void add(CharsRef input, CharsRef output, boolean includeOrig) {
    if (!lenient || (input.length > 0 && output.length > 0)) {
Ooops! I had some comment that I forgot to click "submit" on.
import java.io.IOException;

public class ElasticSynonymParser extends SolrSynonymParser {
`ElasticsearchSynonymParser` is probably better.
public ElasticSynonymParser(boolean dedup, boolean expand, boolean lenient, Analyzer analyzer) {
    super(dedup, expand, analyzer);
    this.lenient = lenient;
    logger = Loggers.getLogger(getClass(), "ElasticSynonymParser");
A static logger should be fine here. The point of these methods is to make loggers look good when referring to a particular shard.
@jpountz, would you like to take a look at this one too because we'd talked about it a while back?
@sohaibiftikhar Thanks! LGTM
This looks good in general. Can we make the documentation more specific about the fact that only illegal synonyms are going to be ignored, not the entire rule? For instance, if you have a rule that is `foo, bar => baz` and `bar` is a stop word, then Elasticsearch will handle it as if there were a `foo => baz` rule. And add tests for this case?

@jpountz You mean in the asciidoc?

Yes.
@nik9000 @mayya-sharipova @jpountz Looking at this just now I realised with this PR we only support the […]

@sohaibiftikhar yes, I think we should add […]
public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
    try {
        return super.analyze(text, reuse);
    } catch (IllegalArgumentException ex) {
I understand that there is code duplication in this class. But I don't know how to avoid it in this situation.
"synonym_graph" : {
    "type" : "synonym_graph",
    "lenient": true,
    "synonyms" : ["foo, bar => baz"]
However, if it was `"foo, baz => bar"` or `"bar, foo, baz"` no mapping would be added. Is this acceptable?
@sohaibiftikhar Yes, I think it is acceptable, as long as we state it in the documentation.
In the case of `foo, baz => bar` I think this is expected, but the other case is a bit unexpected to me, I would expect it to behave like `foo, baz`?
In that case, the parser maps everything to the first word. Similarly for `WordnetParser`, if you have something like

s(100000001,1,'bar',v,1,0).
s(100000001,2,'foo',v,1,0).
s(100000001,3,'baz',v,1,0).

everything gets mapped to `bar`. But since that is a stop word, no mappings get added.
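The contraction behaviour described here can be sketched roughly (an illustrative model, not the actual Lucene parser): with `expand == false` every surviving term maps to the group's first term, and if that first term is eliminated (e.g. as a stop word) the whole group contributes nothing.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ContractSketch {
    // Map each remaining term of a synonym group to the group's first term.
    // If the first term was itself removed (stopWord), no mappings are added.
    static Map<String, String> contract(List<String> group, String stopWord) {
        Map<String, String> mappings = new LinkedHashMap<>();
        String target = group.get(0);
        if (target.equals(stopWord)) {
            return mappings; // target eliminated: whole group is dropped
        }
        for (String term : group) {
            if (!term.equals(stopWord) && !term.equals(target)) {
                mappings.put(term, target);
            }
        }
        return mappings;
    }
}
```

Under this model, `bar, foo, baz` with `bar` as a stop word yields nothing, while `foo, bar, baz` with the same stop word still yields `baz => foo`.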
> In that case, the parser maps everything to the first word.

This is only true if `expand` is explicitly set to `false`. By default, all combinations are generated, see this recreation for instance:
PUT test
{
"settings": {
"analysis": {
"filter": {
"my_syns": {
"type": "synonym_graph",
"synonyms": [ "foo, bar, baz" ]
}
},
"analyzer": {
"my_syn_analyzer": {
"tokenizer": "whitespace",
"filter": [ "my_syns" ]
}
}
}
}
}
GET test/_analyze
{
"analyzer": "my_syn_analyzer",
"text": "a bar"
}
// Returns:
{
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "foo",
"start_offset": 2,
"end_offset": 5,
"type": "SYNONYM",
"position": 1
},
{
"token": "baz",
"start_offset": 2,
"end_offset": 5,
"type": "SYNONYM",
"position": 1
},
{
"token": "bar",
"start_offset": 2,
"end_offset": 5,
"type": "word",
"position": 1
}
]
}
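The default `expand == true` behaviour shown in that response can be sketched roughly (an illustrative model, not Lucene's graph filter): a token that belongs to a synonym group is rewritten to every term of the group at the same position, while tokens outside the group pass through unchanged.

```java
import java.util.Collections;
import java.util.List;

class ExpandSketch {
    // Return all tokens emitted at one position: the full synonym group if
    // the input token belongs to it, otherwise just the token itself.
    static List<String> tokensAtPosition(String token, List<String> synonymGroup) {
        return synonymGroup.contains(token)
            ? synonymGroup
            : Collections.singletonList(token);
    }
}
```

This is why, in the `_analyze` output above, `bar` yields `foo`, `baz`, and `bar` at position 1, while `a` stays a single token at position 0.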
Sorry for the late reply as I somehow missed this. Yes, I forgot to mention the `expand` part in the documentation. Should I add that or remove that example altogether?
// CONSOLE

With the above request the word `bar` gets skipped but a mapping `foo => baz` is still added. However, if the mapping being added was "foo, baz => bar" or "bar, foo, baz" nothing would get added to the synonym list. This is because the target word for the mapping is itself eliminated, as it was a stop word.
@sohaibiftikhar I think you can have your script example as it is. It is fine. Just in the explanation after the script, you can elaborate more on the `expand` option, e.g. with `"bar, foo, baz"` nothing will be added when `expand=false`, and `foo, baz => foo, baz` would be added if `expand=true`.
Done.
-- Also made lenient final
-- Changed from !lenient to lenient == false
-- Renamed to ElasticsearchSynonymParser
-- Added explanation for ElasticsearchSynonymParser::add method
-- Changed ElasticsearchSynonymParser::logger instance to static
-- Also added more documentation

e6e547c to 198ebed
@sohaibiftikhar Thanks, +1 from me. @jpountz Do you have any other comments for this PR, or can we merge it?

@sohaibiftikhar We would like to backport these changes also to 6.x. Would you mind creating another PR for 6.x (mostly a copy of the same PR)?

@mayya-sharipova Okay. I will create a patch.
* master: merge of latest master into the branch (long list of unrelated commits, including Added lenient flag for synonym token filter (#31484))
* Added lenient flag for synonym-tokenfilter. Relates to #30968
* Added docs for synonym-graph-tokenfilter -- also made lenient final -- changed from !lenient to lenient == false
* Changes after review (1) -- renamed to ElasticsearchSynonymParser -- added explanation for ElasticsearchSynonymParser::add method -- changed ElasticsearchSynonymParser::logger instance to static
* Added lenient option for WordnetSynonymParser -- also added more documentation
* Added additional documentation
* Improved documentation

(cherry picked from commit 88c270d)
* 6.x: backport branch update (includes Added lenient flag for synonym token filter (#31484) (#31970) among other unrelated commits)
Relates to #30968