Fuzzy matching continued #2186
By the way, I can share my eval sets and hard cases privately, if you'd like to see them.
@mikiher I tested this PR with my sample libraries that have random meta tags and filenames and it worked really well. It would be nice to have a better test set to test with in the future.
I think the AuthorCandidates class can be useful for the AuthorFinder as well.
add(title, position = 0) {
  // if title contains the author, remove it
  if (this.cleanAuthor) {
const authorRe = new RegExp(`(^| | by |)${this.cleanAuthor}(?= |$)`, "g") |
For future reference if you are working on this, an edge case came up here with invalid regex. #2265
Fixed by adding a util function to escape the string.
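For readers following along, here is a minimal sketch of what such an escape helper could look like. The names `escapeRegExp` and `removeAuthorFromTitle` are hypothetical and are not taken from the project; only the regex pattern itself comes from the snippet quoted above. The sample author string is made up to show one way an unescaped value can produce an invalid regex.

```javascript
// Hypothetical helper (name assumed): escape characters that have a
// special meaning inside a regular expression.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical wrapper around the pattern quoted above, with the
// author string escaped before it is interpolated.
function removeAuthorFromTitle(title, cleanAuthor) {
  if (!cleanAuthor) return title
  const authorRe = new RegExp(`(^| | by |)${escapeRegExp(cleanAuthor)}(?= |$)`, 'g')
  return title.replace(authorRe, '').trim()
}

// Made-up input: an author string with an unbalanced parenthesis.
// Without escaping, new RegExp(...) would throw a SyntaxError here.
const cleaned = removeAuthorFromTitle('some book by j. smith (', 'j. smith (')
console.log(cleaned)
```

Escaping also keeps characters such as `.` from matching more than intended, so the author is removed only where it appears literally.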
Nice lesson in defensive coding, thanks!
I got sidetracked by this external docker-on-windows abs watcher project (which I hope to release sometime this week), but I'm definitely planning to go back to the matching code. This code requires some serious unit testing, but I didn't get a clear response to my question on Discord regarding which unit testing framework was your favorite (jest, mocha, cypress, something else?). I think you just need to make a decision and we'll all stick with it, but I believe you should drive this.
On Mon, Oct 30, 2023 at 11:37 PM advplyr wrote:
In server/finders/BookFinder.js:
> - if (candidate)
- candidates.add(candidate)
+ static TitleCandidates = class {
+
+ constructor(bookFinder, cleanAuthor) {
+ this.bookFinder = bookFinder
+ this.candidates = new Set()
+ this.cleanAuthor = cleanAuthor
+ this.priorities = {}
+ this.positions = {}
+ }
+
+ add(title, position = 0) {
+ // if title contains the author, remove it
+ if (this.cleanAuthor) {
+ const authorRe = new RegExp(`(^| | by |)${this.cleanAuthor}(?= |$)`, "g")
I don't have a preferred framework. I've used mocha and jest a little bit but not enough to have a preference. Do you have a preference?
I'm quite new to Node.js and JavaScript (my background is C++ and Python).
This is a continuation of Fuzzy Matching V1.
This includes some cleanups and refactoring, a few improvements, and one major enhancement.
Cleanups, refactoring, and small fixes:
Enhancements & Improvements:
The code is now more robust, and handles various hard corner cases it didn't handle before.
It fixes one case in the previous eval set, and keeps a 98% found rate and 96% found@1 on a new eval set of 50 title/author pairs.
In the new eval set, I also measured the average number of fuzzy searches: 1.18. Note that the eval sets are picked from an unstructured torrents folder, where the initial search with the original title and field almost always fails, so 1.18 means that in most cases only one fuzzy search request is needed.
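As a rough illustration of how numbers like these could be computed, here is a small sketch. It assumes that "found rate" is the share of eval pairs whose expected book appears anywhere in the results, that "found@1" is the share where it is the first result, and that the average counts fuzzy search requests over all pairs. Those definitions, the record shape, and the `summarize` name are assumptions, and the sample records are made up, not the actual eval data.

```javascript
// Assumed record shape (hypothetical):
//   rank: 1-based position of the expected book in the results, or null if not found
//   fuzzySearches: number of fuzzy search requests issued for that pair
function summarize(results) {
  const total = results.length
  const found = results.filter((r) => r.rank !== null).length
  const foundAt1 = results.filter((r) => r.rank === 1).length
  const searches = results.reduce((sum, r) => sum + r.fuzzySearches, 0)
  return {
    foundRate: found / total,
    foundAt1: foundAt1 / total,
    avgFuzzySearches: searches / total
  }
}

// Made-up toy records, for illustration only.
const toyResults = [
  { rank: 1, fuzzySearches: 1 },
  { rank: 1, fuzzySearches: 1 },
  { rank: 2, fuzzySearches: 2 },
  { rank: null, fuzzySearches: 3 }
]

const summary = summarize(toyResults)
console.log(summary)
```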
The additional parallel author validation requests (usually between 2-4) to audnexus seem to run very quickly, and most of the network time seems to be spent in the search provider requests.
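A minimal sketch of running several author validations in parallel is shown below. It is not the PR's code: `filterValidAuthors` and the `validateAuthor` callback are hypothetical, and the stand-in validator checks a local set instead of calling audnexus.

```javascript
// validateAuthor is a hypothetical async function that resolves to true
// when a candidate name is a known author (for example, by wrapping a
// lookup against an author metadata service).
async function filterValidAuthors(candidates, validateAuthor) {
  // Start all checks at once and wait for every one of them to finish.
  const checks = await Promise.all(candidates.map((name) => validateAuthor(name)))
  return candidates.filter((_, i) => checks[i])
}

// Stand-in validator for illustration only; no network request is made.
const knownAuthors = new Set(['jane doe'])
const fakeValidate = async (name) => knownAuthors.has(name)

filterValidAuthors(['jane doe', 'unabridged', 'audiobook'], fakeValidate)
  .then((valid) => console.log(valid))
```

Because the requests are started together, the total wait is roughly that of the slowest single request rather than the sum of all of them.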