Users fuzzy search enhancements #541

Petitoto · 2024-08-25T14:17:45Z

Description

This PR improves users fuzzy search results.

The search algorithm is refactored into a simpler form:

for each user:
- attribute a score based on the query, the user attributes and a similarity algorithm
- insert the score and its corresponding user in a sorted list
- remove the lowest score from the list if its size exceeds the limit parameter (so the list stays sorted and only ever holds the N=limit best results)
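The loop above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `top_matches` and its `score` callback are hypothetical names, and `bisect.insort` stands in for whatever sorted-list structure the PR actually uses.

```python
import bisect
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def top_matches(items: Iterable[T], score: Callable[[T], float], limit: int) -> List[T]:
    """Return the `limit` best-scoring items, best first (hypothetical helper)."""
    best: List[Tuple[float, int, T]] = []  # kept sorted in ascending score order
    for index, item in enumerate(items):
        # -index is a unique tie-breaker: items themselves are never compared,
        # and on equal scores the item seen first ranks higher.
        bisect.insort(best, (score(item), -index, item))
        if len(best) > limit:
            best.pop(0)  # drop the current lowest score
    return [item for _, _, item in reversed(best)]
```

For example, `top_matches(["aa", "b", "cccc", "ddd"], score=len, limit=2)` keeps only the two longest strings. The list never grows beyond `limit + 1` entries, so memory use does not depend on the number of users.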

The score of each user corresponds to the highest Jaro-Winkler similarity between the query and:

firstname
name
firstname + name
name + firstname
nickname (if exists)
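As an illustration of this scoring rule, here is a self-contained sketch with a textbook Jaro-Winkler implementation. It is not the PR's code: `user_score` is a hypothetical name, joining firstname and name with a space is an assumption, and library implementations may return slightly different values (some only apply the prefix bonus above a similarity threshold).

```python
from typing import Optional


def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        # Look for an unused equal character within the matching window.
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1 - sim)


def user_score(query: str, firstname: str, name: str,
               nickname: Optional[str] = None) -> float:
    """Highest similarity between the query and the candidate strings."""
    candidates = [firstname, name, f"{firstname} {name}", f"{name} {firstname}"]
    if nickname:
        candidates.append(nickname)
    return max(jaro_winkler(query, candidate) for candidate in candidates)
```

Taking the maximum means a query only has to match one of the candidate strings well, which is why the order in which firstname and name are typed does not matter.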

This method aims to fit real searches: queries are often related to one of these 5 strings, but we don't know which one. The higher the similarity is between the query and one of them, the more likely the query is related to it. For well-constructed queries, false positives will always come after the good results.

Before running the Jaro-Winkler algorithm, all strings are stripped of accents ("unaccented") so that the similarity algorithm is insensitive to accents. Moreover, queries from the /users/search endpoint are "capworded" (the first letter of each word is capitalized). We assume that queries from this endpoint are often the beginning of a name, firstname, or nickname. This way, a query like max matches Maxou better than Kmax, which likely corresponds better to what the end user is searching for.
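A minimal sketch of this normalization using only the standard library. The helper names `unaccent` and `normalize_query` are hypothetical, and the PR may implement these steps differently.

```python
import string
import unicodedata


def unaccent(text: str) -> str:
    """Remove diacritics: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query(query: str) -> str:
    """Unaccent a search query, then capitalize the first letter of each word."""
    return string.capwords(unaccent(query))


print(normalize_query("élodie noël"))  # Elodie Noel
```

Note that `string.capwords` splits on whitespace only and lowercases the rest of each word, so a hyphenated query is treated as a single word.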

This method has proven to give better results on limited subsets of users (~30) and queries (~20), while remaining among the fastest of the methods tested. Other methods tested include:

previous sort_user() function
SequenceMatcher of the standard DiffLib library
Jaro-Winkler algorithm from RapidFuzz library (however, RapidFuzz claims to be faster than Jellyfish on longer strings)
Indel algorithm from RapidFuzz library
Damerau-Levenshtein algorithm from RapidFuzz library
partial_ratio() from RapidFuzz library
token_ratio() from RapidFuzz library
partial_token_ratio() from RapidFuzz library

On the tested data with a limit parameter of 10, the new function introduced by this PR takes 50% more time than the old one.
This is due to more similarities being computed (when computing 4 instead of 5, both functions take approximately the same time).
However, it still seems reasonable for production use, and it is better than the other tested algorithms, especially as the list size or the number of users in the database increases.

Checklist

Created tests which fail without the change (if possible)
All tests passing
Extended the documentation, if necessary

codecov · 2024-08-25T14:22:23Z

Codecov Report

Attention: Patch coverage is 94.11765% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.83%. Comparing base (91302a3) to head (57f861a).
Report is 10 commits behind head on main.

Files with missing lines	Patch %	Lines
app/utils/tools.py	93.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #541      +/-   ##
==========================================
+ Coverage   81.77%   81.83%   +0.06%     
==========================================
  Files         125      125              
  Lines        9485     9504      +19     
==========================================
+ Hits         7756     7778      +22     
+ Misses       1729     1726       -3


armanddidierjean

This is brilliant!

Petitoto added 2 commits August 25, 2024 15:44

improve sort_user()

8f4d145

capitalize words from "/users/search" queries to improve results

a7c2a48

unaccent searches & use sorted list

57f861a

armanddidierjean approved these changes Sep 22, 2024
