Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Uppercase Regex in String queries search #3578

Closed
yyyogev opened this issue Jun 14, 2022 · 24 comments · Fixed by #3967
Closed

[BUG] Uppercase Regex in String queries search #3578

yyyogev opened this issue Jun 14, 2022 · 24 comments · Fixed by #3967
Labels
bug Something isn't working Indexing & Search

Comments

@yyyogev
Copy link
Contributor

yyyogev commented Jun 14, 2022

Regex search on analyzed fields doesn’t work with capital letters now as if the analyzer saves them lower cased, or doesn't use the standard analyzer for some reason.

To Reproduce
Steps to reproduce the behavior:

  1. Create an index and add a doc containing uppercase letters to it (e.g message: this is a TLS handshake)
  2. Perform this search request:
    {"query": {"bool": {"must": [{"query_string": {"query": "message:/TLS/"}}]}}}

Expected behavior
We should find the doc we added above since it's an exact match. However, we got 0 results.

Plugins
None

Host/Environment (please complete the following information):
I used OpenSearch container with a single node (version 1.2.3), running on iOS 11.6.4

Additional context
If we specify the standard analyzer to the search request, we get the expected results.
{"query": {"bool": {"must": [{"query_string": {"query": "message:/TLS/", "analyzer":"standard"}}]}}}

@yyyogev yyyogev added bug Something isn't working untriaged labels Jun 14, 2022
@reta
Copy link
Collaborator

reta commented Jun 14, 2022

@yyyogev could you please share the mapping for message field?

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 14, 2022

I didn't not specify it, so whatever the default is

@reta
Copy link
Collaborator

reta commented Jun 14, 2022

@yyyogev OK, in this the assumption is that you use standard analyzer [1] for the message field as well, which includes Lower Case Token Filter. As such, the behavior your are observing is expected. There are three options to align search with the ingestion:

  • use "analyzer":"standard" (as you did)
  • use lowercased search queries {"query": "message:/tls/"}
  • use regexp query and case_insensitive search [2]:
 {"query": {"bool": {"must": [{"regexp": {"message": { "value": "TLS", "case_insensitive": true }}}]}}}

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-standard-analyzer.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl-regexp-query.html

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 14, 2022

This behavior was different at ES 6.8, is there documentation for this change of behavior anywhere?

@reta
Copy link
Collaborator

reta commented Jun 14, 2022

I am not aware on any changes to be fair on search side, perhaps there are differences on ingestion side (mappings).

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 15, 2022

Doesn't seem so, this is the mapping for the doc in ES 6.8:
{ "_doc": { "properties": { "message": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } }

and this is at OS 1.2.3:
{ "properties": { "message": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } }

@reta
Copy link
Collaborator

reta commented Jun 15, 2022

@yyyogev I was really curious to find out why, here is the thing: 6.8.x used default search analyzer (== standard) whereas it has changed by [1] to only use analyzer if it is specified at search (query) time.

[1] https://github.com/elastic/elasticsearch/pull/61013/files

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 15, 2022

And if not then it doesn't use any analyzers at all?
The thing is, it's very weird behavior, that the actual string won't be matched and the lowercase of it will..

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 20, 2022

@reta is there a way to use some analyzer (standard or a custom one) without specifying it in each and every request?

@reta
Copy link
Collaborator

reta commented Jun 20, 2022

And if not then it doesn't use any analyzers at all?

For regex specifically - yes

The thing is, it's very weird behavior, that the actual string won't be matched and the lowercase of it will..

You could use the case_insensitive in your query to overcome that

@reta is there a way to use some analyzer (standard or a custom one) without specifying it in each and every request?

To be honest, I was not able to find the other way around by just looking into code, could you please try the other options listed here [1]?

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/specify-analyzer.html#specify-search-analyzer

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 21, 2022

@reta case_insensitive is not one of query_string params

@reta
Copy link
Collaborator

reta commented Jun 21, 2022

@reta case_insensitive is not one of query_string params

Correct, if using regexp as specialized query is not an option - disregard it please, at worst - you could try to lowercase query before sending it to Opensearch.

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 21, 2022

This creates another problem, if we have lowercasing the query, it breaks the search for terms of type keyword since these might be saved with uppercase letters

@reta
Copy link
Collaborator

reta commented Jun 21, 2022

@yyyogev I see, we have limited number of choices than (at this moment), specifying analyzer seems to be the simplest one

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 21, 2022

@reta we tried this already and it doesn't work as well.. I specified the search analyzer for the whole index with these settings:

  "dodo": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "dodo",
        "creation_date": "1655830029877",
        "analysis": {
          "analyzer": {
            "default_search": {
              "type": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "fFU7oGbEQwSJVHn7DFjoXQ",
        "version": {
          "created": "136217827"
        }
      }
    }
  }
}

@reta
Copy link
Collaborator

reta commented Jun 21, 2022

@yyyogev I meant the approach you mentioned in the issue, analyzer at query time, sorry for confusion:

If we specify the standard analyzer to the search request, we get the expected results.
{"query": {"bool": {"must": [{"query_string": {"query": "message:/TLS/", "analyzer":"standard"}}]}}}

@yyyogev
Copy link
Contributor Author

yyyogev commented Jun 21, 2022

@reta this solution has the same problem.. it would lowercase keywords won't find keywords with uppercase letters

@reta
Copy link
Collaborator

reta commented Jun 21, 2022

@yyyogev sadly I do not have universal solution for you right now, there is Apache Lucene issue / pull request to support case-insensitivity for regexes in query_string (exactly your case) [1], fe /TLS/i but it seems to be abandoned. I assume you use whatever query string is provided, if you really have to preserve regex behavior, you would need to preprocess the query to lower-case the regexes (and keep everything else intact).

On the Opensearch side, what we could do, is to consult search_analyzer for regex queries, basically reverting back this change since Opensearch does not support wildcard field type:

setAnalyzer(forceAnalyzer == null ? queryBuilder.context.getSearchAnalyzer(currentFieldType) : forceAnalyzer);

@nknize what do you think?

[1] apache/lucene-solr#1708

@nknize
Copy link
Collaborator

nknize commented Jun 27, 2022

On the OpenSearch side we should add case_insensitive support like we did for RegexpIntervalsSourceProvider.

In the meantime I think in this situation you'll want to use the PatternAnalyzer at query time. In the pattern analyzer the Lowercase TokenFilter is used to auto lowercase your query string so it matches the default lowercased tokens created at index time by the Standard analyzer. If you want to preserve case for exact matches you could use an index template (like you did above) to define a custom analyzer to use at index time that removes the LowerCase token filter. Then use PatternAnalyzer at query time w/ lowercase set to false. The affect would be nothing is lowercased, so query strings would have to be exact match at query time (sounds like this is not what you want).

@reta
Copy link
Collaborator

reta commented Jun 27, 2022

@nknize sorry, just to reiterate one thing, for regex queries (specifically query_string scenario), the per-field query time analyzer is ignored, it could be forced but for all fields.

@alexgnatyuk
Copy link

alexgnatyuk commented Jul 4, 2022

@reta @nknize we believe that reverting the change (aligning the behavior with the way it worked before the fix) should be the way to go in this case because:

  1. the initial fix was needed only for fields with the type of wildcard, which OpenSearch doesn't support. There is no real reason for this code to exist in the OpenSearch.
  2. the fix changes make the regex search case sensitive in a very weird way since we don't automatically apply type-specific analyzers at a search time for regex anymore, although we do it on indexing time by default.
  3. the fix introduces an inability to apply type-specific analyzers during the query string regex searches (even if set on an index or type mapping level), which reduces the opportunities to affect the query parsing and search flow in general.

please let me know what you think about it.

and we of course happy to contribute to it if that's the way to go here.

@AmiStrn FYI

@AmiStrn
Copy link
Contributor

AmiStrn commented Jul 5, 2022

@reta I agree with @alexgnatyuk here. this seems like a really easy fix if the code introduced in Elasticsearch 7.9 is reverted since it is an x-pack feature that is using it.
It is a bug, really, that has gone unnoticed for a while.
users should not have to be explicit about this for the regex's in their query since many times they will be using OpenSearch-dashboards and not have that level of knowledge that they need to also specify the analyzer type somehow (that would be really bad UX).

the discussion is about the proposed solution - what do you think? should we make this change as proposed?

@reta
Copy link
Collaborator

reta commented Jul 5, 2022

I am totally with you @AmiStrn @alexgnatyuk (#3578 (comment)), need @nknize confirmation this is away to go

@AmiStrn
Copy link
Contributor

AmiStrn commented Jul 10, 2022

@nknize can you please update here? green light for us to start working on these changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Indexing & Search
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants