[BUG] Escaped wildcard character in wildcard query not handled correctly #15555

Closed
HUSTERGS opened this issue Sep 1, 2024 · 2 comments · Fixed by #15737
Labels: bug (Something isn't working), Search (Search query, autocomplete ...etc)

Comments

@HUSTERGS
Contributor

HUSTERGS commented Sep 1, 2024

Describe the bug

When a wildcard query is run against a wildcard field, an escaped literal * (and possibly an escaped literal ?) is not handled correctly, while the same wildcard query works on a keyword field. The escape logic is possibly not implemented in opensearch/index/mapper/WildcardFieldMapper.java.

Related component

Search

To Reproduce

  1. Create a simple index with a field mapped as both wildcard and keyword
    PUT escape_index
    {
      "mappings": {
        "properties": {
          "wild": {
            "type": "wildcard",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
    
  2. Insert a document whose value contains a literal *
    POST escape_index/_doc
    {
      "wild": "* test *"
    }
    
  3. Search both fields with the same wildcard query
    Search on the keyword field:
    GET escape_index/_search
    {
      "query": {
        "wildcard": {
          "wild.keyword": {
            "value": "\\**"
          }
        }
      }
    }
    
    Result:
    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 2,
        "successful": 2,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 1,
          "relation": "eq"
        },
        "max_score": 1,
        "hits": [
          {
            "_index": "escape_index",
            "_id": "xMi6opEB80UGmJnvgJSB",
            "_score": 1,
            "_source": {
              "wild": "* test *"
            }
          }
        ]
      }
    }
    
    Search on the wildcard field:
    GET escape_index/_search
    {
      "query": {
        "wildcard": {
          "wild": {
            "value": "\\**"
          }
        }
      }
    }
    
    Result:
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 2,
        "successful": 2,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 0,
          "relation": "eq"
        },
        "max_score": null,
        "hits": []
      }
    }
    

Expected behavior

The wildcard query should return the correct results when it contains escaped literal * and ? characters.

Additional Details

No response

HUSTERGS added the bug and untriaged labels Sep 1, 2024
github-actions bot added the Search label Sep 1, 2024
mch2 removed the untriaged label Sep 4, 2024
@msfroh
Collaborator

msfroh commented Sep 4, 2024

That sounds like a bug!

In particular, the methods findNonWildcardSequence and getNonWildcardSequence just scan through characters, splitting on ? or *. They don't check if those characters are escaped.
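To illustrate what an escape-aware scan could look like, here is a minimal, self-contained sketch. The class EscapeAwareSplit and the method literalSequences are hypothetical names invented for this example; they are not the methods in WildcardFieldMapper, and the handling of a trailing backslash (kept as a literal) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeAwareSplit {
    // Hypothetical helper: returns the literal (non-wildcard) runs of a
    // wildcard pattern, treating \* and \? as literal characters.
    static List<String> literalSequences(String pattern) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (escaped) {
                // The previous character was a backslash: keep c as a literal.
                current.append(c);
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '*' || c == '?') {
                // An unescaped wildcard ends the current literal run.
                if (current.length() > 0) {
                    sequences.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (escaped) {
            // Assumption: a trailing backslash is kept as a literal backslash.
            current.append('\\');
        }
        if (current.length() > 0) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        System.out.println(literalSequences("\\**"));           // pattern \**
        System.out.println(literalSequences("foo*b\\?r?baz"));  // pattern foo*b\?r?baz
    }
}
```

With this kind of scan, the pattern \** from the reproduction yields the single literal run "*" rather than no literal runs at all.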

Also, I suspect that the logic in the else case of

    if (value.contains("?")) {
        Automaton automaton = WildcardQuery.toAutomaton(new Term(name(), finalValue));
        CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton);
        matchPredicate = s -> {
            if (caseInsensitive) {
                s = s.toLowerCase(Locale.ROOT);
            }
            BytesRef valueBytes = BytesRefs.toBytesRef(s);
            return compiledAutomaton.runAutomaton.run(valueBytes.bytes, valueBytes.offset, valueBytes.length);
        };
    } else {
        matchPredicate = s -> {
            if (caseInsensitive) {
                s = s.toLowerCase(Locale.ROOT);
            }
            return Regex.simpleMatch(finalValue, s);
        };
    }

which uses Regex.simpleMatch, might not handle escapes properly either. Maybe we should always go with the WildcardQuery.toAutomaton approach, since it does handle escapes properly.
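As a rough illustration of what "handling escapes" means for the match predicate, the sketch below translates a wildcard pattern into a java.util.regex.Pattern. This is a stand-in chosen to keep the example dependency-free; it is not Lucene's automaton and not the code in the fix. The class name WildcardToRegex and its methods are hypothetical.

```java
import java.util.regex.Pattern;

public class WildcardToRegex {
    // Hypothetical helper: compiles a wildcard pattern where an unescaped *
    // matches any run of characters, an unescaped ? matches one character,
    // and a backslash makes the following character literal.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '\\' && i + 1 < wildcard.length()) {
                // Escaped character: consume it and match it literally.
                i++;
                regex.append(Pattern.quote(String.valueOf(wildcard.charAt(i))));
            } else if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Ordinary character (or a trailing backslash): literal match.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // DOTALL so that wildcards also match line terminators.
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(String wildcard, String value) {
        return compile(wildcard).matcher(value).matches();
    }

    public static void main(String[] args) {
        // Pattern \** from the reproduction against the indexed value.
        System.out.println(matches("\\**", "* test *"));
        System.out.println(matches("\\**", "test"));
    }
}
```

Under these assumptions, the pattern \** matches only values that start with a literal *, which is the behavior the keyword field shows in the reproduction.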

@HUSTERGS, are you willing to fix those methods? If not, I can take care of it (but probably not in time for the 2.17 release, which we expect to freeze today/tomorrow).

@HUSTERGS
Contributor Author

HUSTERGS commented Sep 5, 2024

Of course, @msfroh!
The bugfix is in #15737. The original implementation also cannot handle an empty wildcard query string, which causes a "String index out of range" error at

terms.add(new String(new char[] { 0, currentSequence.charAt(0), currentSequence.charAt(1) }));

So I added a simple shortcut at the beginning of getRequiredNGrams to fix both issues together.
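The kind of guard described here can be sketched as follows. This is not the code from #15737: the class AnchoredPrefixGram and the method prefixTerms are hypothetical, and the one-character branch is an assumption added only so the sketch is defined for every input length.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class AnchoredPrefixGram {
    // Hypothetical helper: builds the anchored prefix term for a literal
    // sequence, using char 0 as the start-of-value anchor.
    static Set<String> prefixTerms(String currentSequence) {
        Set<String> terms = new LinkedHashSet<>();
        if (currentSequence.isEmpty()) {
            // Shortcut: nothing to index on, and charAt(0) would throw.
            return terms;
        }
        if (currentSequence.length() == 1) {
            // Assumption: a single character yields a two-character term.
            terms.add(new String(new char[] { 0, currentSequence.charAt(0) }));
        } else {
            terms.add(new String(new char[] { 0, currentSequence.charAt(0), currentSequence.charAt(1) }));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(prefixTerms("").size());   // empty input: no terms, no exception
        System.out.println(prefixTerms("ab").size());
    }
}
```

Returning early for an empty sequence avoids the charAt calls that would otherwise throw StringIndexOutOfBoundsException.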

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Sep 10, 2024