Wildcard field - added case insensitive search option #53814

Closed
wants to merge 3 commits into from

@markharwood markharwood commented Mar 19, 2020

In Elasticsearch, whether a search is case sensitive is normally determined by which field the user searches. In keeping with that practice, this change automatically creates a multi-field for each wildcard field, so that a wildcard field called foo also has a foo._case_insensitive variant.
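As a rough illustration of what this means for callers, the sketch below builds a standard wildcard query body and picks the field variant from a flag. The helper name `wildcard_query` is hypothetical, and the `._case_insensitive` suffix is the one proposed in this PR; whether any released version exposes that name is an assumption.

```python
def wildcard_query(field: str, pattern: str, case_insensitive: bool = False) -> dict:
    """Build a wildcard query body, choosing the field variant by case sensitivity."""
    # "._case_insensitive" is the suffix proposed in this PR (assumption:
    # the multi-field is created automatically, so no extra mapping is needed).
    target = f"{field}._case_insensitive" if case_insensitive else field
    return {"query": {"wildcard": {target: {"value": pattern}}}}


if __name__ == "__main__":
    # Case-sensitive search against the base field.
    print(wildcard_query("foo", "*shell_exec*"))
    # Case-insensitive search against the proposed multi-field.
    print(wildcard_query("foo", "*shell_exec*", case_insensitive=True))
```

The point of the design is that the only thing a caller changes is the field name; the query type and pattern stay the same.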

However, unlike multi-fields on other field types, this comes at no extra storage cost. The ._case_insensitive field type uses the same data structures as the case-sensitive one and only relaxes the matching logic in the verification phase. The ngram index used for the approximation query is lower-cased, so it may produce more false positives for case-sensitive searches, but these are expected to be few (how many docs differ only by case?).
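The two-phase idea can be sketched with a toy in-memory model. This is not the Elasticsearch implementation: the class `ToyWildcardIndex`, the 3-character ngram length, and the regex-based verification are all assumptions made for illustration. It only shows how one lower-cased ngram index can serve both variants, with case handled at verification time.

```python
import re

N = 3  # ngram length; an assumption for this sketch


def ngrams(text):
    """Return the set of N-character substrings of text."""
    return {text[i:i + N] for i in range(len(text) - N + 1)}


def to_regex(pattern, case_insensitive):
    """Translate a wildcard pattern (* and ?) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)


class ToyWildcardIndex:
    def __init__(self):
        self.docs = []      # original values, standing in for binary doc values
        self.postings = {}  # lower-cased ngram -> set of doc ids

    def add(self, value):
        doc_id = len(self.docs)
        self.docs.append(value)
        for gram in ngrams(value.lower()):  # the ngram index is lower-cased
            self.postings.setdefault(gram, set()).add(doc_id)

    def candidates(self, pattern):
        """Approximation phase: docs containing every ngram of the literal runs."""
        required = set()
        for literal in re.split(r"[*?]", pattern.lower()):
            required |= ngrams(literal)
        if not required:
            return set(range(len(self.docs)))
        ids = None
        for gram in required:
            hits = self.postings.get(gram, set())
            ids = hits if ids is None else ids & hits
        return ids

    def search(self, pattern, case_insensitive=False):
        """Verification phase: check each candidate against its stored value."""
        regex = to_regex(pattern, case_insensitive)
        return [self.docs[i] for i in sorted(self.candidates(pattern))
                if regex.fullmatch(self.docs[i])]


if __name__ == "__main__":
    index = ToyWildcardIndex()
    for value in ["GET /Admin/login.php", "GET /admin/login.php", "POST /upload"]:
        index.add(value)
    print(index.candidates("*admin*"))                     # both admin docs are candidates
    print(index.search("*admin*"))                         # case sensitive
    print(index.search("*admin*", case_insensitive=True))  # case insensitive
```

In this model the document that differs only by case is a false positive for the case-sensitive search: it survives the approximation phase and is rejected during verification.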

…multi field with “._case_insensitive” suffix is automatically made available at no extra storage cost which performs case insensitive searches
@markharwood added the >enhancement, :Search Foundations/Mapping, v8.0.0 and v7.7.0 labels on Mar 19, 2020
@elasticmachine

Pinging @elastic/es-search (:Search/Mapping)

markharwood commented Mar 19, 2020

Some quick benchmarks based on an index where case-sensitive and case-insensitive searches matched equal numbers of docs.

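| Query | Num hits | Case sensitive (ms) | Case insensitive (ms) |
|---|---|---|---|
| `*shell_exec*` | 0 | 7 | 1 |
| `*shellinvoker*` | 0 | 3 | 1 |
| `*jexws2*` | 0 | 2 | 2 |
| `* 404 *` | 223539 | 270 | 728 |
| `* 200 *` | 1121055 | 1028 | 3388 |
| `* 200 4263 *` | 88052 | 123 | 280 |
| `*/administrator/*` | 640105 | 476 | 1476 |
| `*plugin*` | 4498 | 21 | 26 |
| `*upload*` | 6274 | 17 | 26 |
| `*login*` | 69986 | 102 | 262 |
| `*www.baidu.com*` | 4036 | 29 | 31 |
| `*www.hivemindmap.com*` | 0 | 4 | 2 |
| `*JsonGraphServlet*` | 0 | 3 | 1 |
| `80.40.134.103*` | 0 | 5 | 2 |
| `202.137.141.11*` | 0 | 5 | 1 |
| `94.197.*` | 5 | 3 | 1 |
| `*.pdf*` | 12 | 2 | 1 |
| `*.dll*` | 30 | 2 | 1 |
| `*.exe*` | 156 | 4 | 2 |
| `*cgi*` | 279 | 3 | 1 |
| `*/admin/*` | 688 | 10 | 10 |
| `*Mozilla/5.0 (Windows NT 5.1) AppleWebKit*` | 103752 | 120 | 328 |
| `*python-requests/1.2.3 CPython/2.7.5*` | 1733 | 8 | 8 |
| `*"Python-urllib/2.7"*` | 1722 | 8 | 6 |
| `* "-"` | 0 | 2 | 1 |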
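To make the cost on high-hit queries easier to read off, the snippet below computes the case-insensitive to case-sensitive time ratio for the rows in the table above with more than 50,000 hits. The numbers are copied from the table; the 50,000 cutoff and the helper name `slowdown` are arbitrary choices for this sketch. For these rows the ratio falls between roughly 2x and 3.5x.

```python
# (query, hits, case-sensitive ms, case-insensitive ms), copied from the table above
ROWS = [
    ("* 404 *", 223539, 270, 728),
    ("* 200 *", 1121055, 1028, 3388),
    ("* 200 4263 *", 88052, 123, 280),
    ("*/administrator/*", 640105, 476, 1476),
    ("*login*", 69986, 102, 262),
    ("*Mozilla/5.0 (Windows NT 5.1) AppleWebKit*", 103752, 120, 328),
]


def slowdown(case_sensitive_ms, case_insensitive_ms):
    """Ratio of case-insensitive to case-sensitive query time."""
    return case_insensitive_ms / case_sensitive_ms


if __name__ == "__main__":
    for query, hits, cs_ms, ci_ms in ROWS:
        print(f"{query:45} hits={hits:>8} slowdown={slowdown(cs_ms, ci_ms):.2f}x")
```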

The new case-insensitive search option does more work because it has to normalize the binary doc values read from disk on the fly. However, this overhead might be worth paying, as it halves the disk space that would otherwise be required to support both case-sensitive and case-insensitive search (something you might want if your logging system captures both Windows and Unix events).

@markharwood

Closing in favour of #53851

Labels
>enhancement, :Search Foundations/Mapping, v7.7.0, v8.0.0-alpha1