Fuzzy Matching V1 #2099

Merged Sep 22, 2023 (4 commits)
mikiher (Contributor) commented Sep 14, 2023

Problem:
Audiobookshelf requires a fairly strict folder structure. However, users often have many books in existing folders that follow different standards (or none), and they may be reluctant to restructure their directories. As a result, book titles and authors are read incorrectly, and matching usually returns no results or wrong ones, so users must fix the title and author manually before matching.

The option to prefer audio metadata over folder names somewhat improves the situation, but does not fix it, and is also not enabled by default.

Proposal:
As a first step, I'd like to suggest heuristic fuzzy matching that kicks in if the initial title and author search returns no results. A rudimentary version of this already exists in the code (it may send one additional search request with a "clean" version of the title and author), and the new proposal subsumes it. The proposed flow:

  • If the initial search returns no results, we first further clean the title, and then heuristically split it into hyphen-separated parts.
  • We then create a Set of title candidates, and add each part to the set. We also try to generate additional title candidates by applying various heuristics on each part, and add those candidates to the set as well.
  • The resulting list of unique candidates is then heuristically sorted to minimize the number of additional search requests while still keeping the request as specific as possible.
  • Additional search requests are then sent until one returns results, or until maxFuzzySearches (the maximum number of allowed additional search requests) has been reached.
  • If no results were found, the search requests are repeated without the author (again, until maxFuzzySearches has been reached).
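
The steps above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `titleCandidates`, `fuzzySearch`, and the `SearchFn` type are hypothetical names, the single suffix-stripping heuristic and the longest-first ordering are stand-ins for the real heuristics, the search is synchronous for brevity, and the sketch assumes that the with-author and without-author passes share one `maxFuzzySearches` budget.

```typescript
// Hypothetical sketch; only maxFuzzySearches is a name taken from the proposal.
type SearchFn = (title: string, author: string) => string[];

// Split a title into hyphen-separated parts and collect unique candidates.
function titleCandidates(title: string): string[] {
  const candidates = new Set<string>();
  for (const part of title.split(" - ")) {
    const trimmed = part.trim();
    if (trimmed === "") continue;
    candidates.add(trimmed);
    // Example heuristic (an assumption): drop a trailing "(...)" or "[...]" suffix.
    const stripped = trimmed.replace(/\s*[(\[][^)\]]*[)\]]\s*$/, "").trim();
    if (stripped !== "") candidates.add(stripped);
  }
  // Example ordering (an assumption): longer, more specific candidates first.
  return Array.from(candidates).sort((a, b) => b.length - a.length);
}

function fuzzySearch(
  search: SearchFn,
  title: string,
  author: string,
  maxFuzzySearches: number
): string[] {
  // The initial search does not count against the fuzzy budget.
  let results = search(title, author);
  if (results.length > 0) return results;

  const candidates = titleCandidates(title);
  // First pass keeps the author; second pass repeats the candidates without it.
  const authorPasses = author === "" ? [""] : [author, ""];
  let attempts = 0; // assumed to be shared by both passes
  for (const passAuthor of authorPasses) {
    for (const candidate of candidates) {
      if (attempts >= maxFuzzySearches) return [];
      attempts++;
      results = search(candidate, passAuthor);
      if (results.length > 0) return results;
    }
  }
  return [];
}
```

With a folder name like `Frank Herbert - Dune (Unabridged)`, the sketch would try `Dune (Unabridged)`, then `Frank Herbert`, then `Dune`, stopping at the first request that returns results or when the budget runs out.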

This proposal is implemented here.
I've evaluated it on 50 books that have audible.com metadata, taken from my unmodified audiobook torrents directory, which has no standard folder structure. The existing matching finds the correct result for only 24% of the books. Fuzzy Matching V1 finds the correct result for 96% of the books, and returns it as the top result (@1) for 92%. I have not calculated the average number of additional search requests, but it usually appears to be between 0 and 3.

@mikiher mikiher marked this pull request as ready for review September 14, 2023 23:10
mikiher (Contributor, Author) commented Sep 15, 2023

If we want to make quick-match more conservative than manual match, we can set maxFuzzySearches to a lower value than the default one, by setting the appropriate option in the quick-match call to BookFinder.search().

The second commit demonstrates that.
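
As a rough illustration of the idea (not the code in the second commit), a per-call option could override a default limit along these lines. The `SearchOptions` shape, the `resolveMaxFuzzySearches` helper, and both numeric values are assumptions made for the example; only `maxFuzzySearches` is named in this discussion.

```typescript
// Hypothetical option shape for a search call.
interface SearchOptions {
  maxFuzzySearches?: number;
}

// Assumed default value, for illustration only.
const DEFAULT_MAX_FUZZY_SEARCHES = 5;

// Use the caller's limit when one is given, otherwise fall back to the default.
function resolveMaxFuzzySearches(options: SearchOptions = {}): number {
  const value = options.maxFuzzySearches;
  return typeof value === "number" && !isNaN(value)
    ? value
    : DEFAULT_MAX_FUZZY_SEARCHES;
}

// Manual match keeps the default; quick match passes a lower, more conservative limit.
const manualMatchLimit = resolveMaxFuzzySearches();
const quickMatchLimit = resolveMaxFuzzySearches({ maxFuzzySearches: 2 });
```

Keeping the limit as an option on the call, rather than a global setting, lets each caller choose how many extra requests it is willing to spend.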

advplyr (Owner) commented Sep 22, 2023

I ran some tests on this and it matched well. We'll see if it gets too many false positives and adjust from there. Thanks!

@advplyr advplyr merged commit a11fc21 into advplyr:master Sep 22, 2023
1 check passed
@mikiher mikiher mentioned this pull request Oct 5, 2023
@mikiher mikiher deleted the Fuzzy-Matching branch July 12, 2024 18:32