Add fuzzy search #281

joaomsa · 2015-02-23T15:22:53Z

Adds a SQLite extension with a ZRELEVANCY function that takes an item and a query and produces a similarity measure. It only allows matches where the query is a subsequence of the item. I'm getting great results in terms of relevancy and speed by scoring on the spread and density of the matched subsequence, like the suggestion in #100:

This also involved tweaking the result tokenization/normalization code, as I ran into a bunch of edge cases now that it's also applied to the query for consistency.
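A minimal Python sketch of the subsequence scoring described in this comment. The PR itself is a SQLite extension for a C++ application; the scoring formula below (density of the matched window times the query-to-item length ratio) and the helper names are assumptions for illustration, not the PR's actual code. Python's `sqlite3.Connection.create_function` stands in for the loadable extension.

```python
import sqlite3


def subsequence_positions(query, item):
    """Greedy leftmost match: one index in item per query char, or None."""
    positions = []
    start = 0
    for ch in query:
        idx = item.find(ch, start)
        if idx == -1:
            return None  # query is not a subsequence of item
        positions.append(idx)
        start = idx + 1
    return positions


def relevancy(item, query):
    """Hypothetical score in (0, 1]; 0.0 means no subsequence match."""
    item, query = item.lower(), query.lower()
    if not query:
        return 0.0
    positions = subsequence_positions(query, item)
    if positions is None:
        return 0.0
    spread = positions[-1] - positions[0] + 1  # width of the matched window
    density = len(query) / spread              # 1.0 when the match is contiguous
    coverage = len(query) / len(item)          # favors shorter items
    return density * coverage


def rank_with_sqlite(names, query):
    """Register the scorer as a SQL function and rank names with it."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("ZRELEVANCY", 2, relevancy)
    conn.execute("CREATE TABLE items (name TEXT)")
    conn.executemany("INSERT INTO items VALUES (?)", [(n,) for n in names])
    rows = conn.execute(
        "SELECT name FROM items WHERE ZRELEVANCY(name, ?) > 0 "
        "ORDER BY ZRELEVANCY(name, ?) DESC",
        (query, query),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Under these assumptions, an exact match scores 1.0, and a non-subsequence such as `strong` for the query `string` scores 0.0 and is filtered out. The greedy leftmost match is not necessarily the tightest one, so a real scorer may want to look at more than one candidate match.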

trollixx · 2015-02-23T16:16:19Z

Hi,

I am really not sure about the benefits of fuzzy search at the moment. To me it feels really confusing, as I expect index search to do exact matching.

I have tried your PR on my Linux machine. Our Windows build fails as you can see above.

My first thought was that search is completely broken. It takes 7-15 seconds to display any results. That's really long.

Another problem is highlighting. For example, if I search for string, I see among results: as_string and SVGString.

Perhaps there's more, but the current performance makes any further testing too slow.

Any ideas what causes such a slowdown?

joaomsa · 2015-02-23T17:50:58Z

@trollixx I'm getting queries in 1s or less with 7 docsets. Which ones are you using? I'm going to do some profiling to see how to improve it; I'm sure I could use better data structures and C++11 move semantics to reduce allocations.

With the longest-common-subsequence approach the match isn't unique, and enumerating all of them to find the one with the highest relevancy isn't polynomial time. However, a quick check to see whether it can highlight the exact word wouldn't add a noticeable time penalty and would take care of cases like SVGString.
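A rough Python sketch of that quick check, assuming ASCII identifiers and case-insensitive matching: try a contiguous occurrence of the query first, and only fall back to a greedy subsequence match if there is none. The function names are hypothetical and this is not the code from the PR.

```python
def highlight_positions(item, query):
    """Indices in item to highlight, or None if query is not a subsequence.

    Assumes ASCII text, so lowercasing never changes string length.
    """
    haystack, needle = item.lower(), query.lower()
    if not needle:
        return []
    # Cheap first pass: a contiguous occurrence is the tightest possible match.
    start = haystack.find(needle)
    if start != -1:
        return list(range(start, start + len(needle)))
    # Fallback: greedy leftmost subsequence match.
    positions, pos = [], 0
    for ch in needle:
        idx = haystack.find(ch, pos)
        if idx == -1:
            return None
        positions.append(idx)
        pos = idx + 1
    return positions


def mark(item, positions):
    """Wrap each highlighted character in brackets for plain-text display."""
    chosen = set(positions)
    return "".join(f"[{c}]" if i in chosen else c for i, c in enumerate(item))
```

For the query `string`, this highlights the trailing word in both `SVGString` and `as_string` instead of starting at the first `S`/`s`; a query like `qsl` against `QStringList` still falls back to scattered positions.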

I played around with using a virtual table as a token trigram index (similar to something like pg_trgm + GIN indexing for PostgreSQL), and while search was really fast, it didn't give good relevancy for something as specific as code search. I don't think dealing with transpositions and deletions makes sense when you're searching for method names.
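To illustrate the trade-off, here is a small Python sketch of trigram similarity. The padding (two leading spaces, one trailing) is loosely modeled on pg_trgm, and the Jaccard ratio is an assumption for illustration; it is not the virtual-table index mentioned above.

```python
def trigrams(text):
    """Set of 3-character windows over the padded, lowercased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

A transposition such as `getItme` still shares most trigrams with `getItem`, while an abbreviation such as `gi` shares almost none, which is the opposite of what is usually wanted when typing a few letters of a method name.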

trollixx · 2015-02-23T18:09:59Z

Here's my test list of docsets:

joaomsa · 2015-02-24T13:32:43Z

Took a slightly different approach using LIKE as a prefilter and got a big speedup while removing the dependency on the SQLite headers. Also improved the presentation of cases like SVGString.
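One way a LIKE prefilter can work, sketched in Python under stated assumptions: the query `qsl` becomes the pattern `%q%s%l%`, which SQLite matches only against rows containing those characters in order, and the surviving rows are then ranked in application code. The table name `items`, the column `name`, and the shortest-first ranking are placeholders, not the PR's schema or scoring.

```python
import sqlite3


def like_pattern(query):
    """Turn 'qsl' into '%q%s%l%', escaping LIKE wildcards found in the query."""
    escaped = []
    for ch in query:
        if ch in ("%", "_", "\\"):
            ch = "\\" + ch  # treat the character literally under ESCAPE '\'
        escaped.append(ch)
    return "%" + "%".join(escaped) + "%"


def prefiltered_search(conn, query):
    """Let SQLite discard non-matching rows, then rank the rest in Python."""
    rows = conn.execute(
        "SELECT name FROM items WHERE name LIKE ? ESCAPE '\\'",
        (like_pattern(query),),
    ).fetchall()
    candidates = [row[0] for row in rows]
    # Placeholder ranking: shortest name first. A real scorer would go here.
    return sorted(candidates, key=len)
```

SQLite's LIKE is case-insensitive for ASCII by default, so the pattern does not need separate case handling for typical identifiers.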

What do you think?

johntyree · 2015-03-23T19:41:34Z

@joaomsa this looks pretty great. Needs master merge :(

Also what did you use to make your little gif screencast?

joaomsa · 2015-03-23T23:39:17Z

@johntyree
If there is interest in merging, I can update it with all the changes in master to prepare for the merge. I made the screencast with GNOME/byzanz.

trollixx · 2015-03-24T01:19:13Z

I like the idea, but I'd prefer to postpone merging for a little while. I am going to do a major refactoring after the next release (coming this week or next), which would allow extending Zeal to handle more documentation formats. This change is kind of intrusive, so it would be better to merge once the code related to Dash docsets is encapsulated in one place.

johntyree · 2015-03-25T14:59:25Z

@trollixx I'd say that's a reason to merge if you like the feature and are OK with the code, not postpone. You're going to force @joaomsa to figure out how your changes affect his patch and then rewrite it. Maybe he's OK with that, but it seems like much more work than refactoring it all together.

trollixx · 2015-03-25T21:41:47Z

@johntyree I think that after the coming refactoring this patch could be reduced and simplified.

If @joaomsa doesn't feel like updating his code, no problem, I am fine with doing it.

Sorry for the possible inconvenience, I am just trying to take one step at a time as part of bringing Zeal's quality to a higher level.

joaomsa · 2015-03-25T22:37:26Z

@trollixx No problem, I'll wait for the refactoring.

Another thing I've also been playing around with is doing search in the background and updating the results list incrementally. Seems to really improve responsiveness in cases where you quickly want results for smaller docsets while big ones like Android may still not have returned.

I'll put that as another PR in the future for your consideration since this one is already pretty big and I haven't figured out a satisfying way to handle cursor jumping in the results list as more results pour in.
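A minimal Python sketch of the background-search idea, assuming one task per docset and a callback that receives each docset's hits as soon as that docset finishes. Zeal is a Qt application, so a real implementation would use its own threading and signal mechanisms; every name here is hypothetical, and the sketch does not address the cursor-jumping problem.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_docset(name, entries, query):
    """Stand-in for a per-docset query: case-insensitive substring filter."""
    q = query.lower()
    return name, [entry for entry in entries if q in entry.lower()]


def incremental_search(docsets, query, on_results):
    """Search every docset in the background; report each one as it completes."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(search_docset, name, entries, query)
            for name, entries in docsets.items()
        ]
        # as_completed yields futures in finish order, so small docsets can
        # show their results while large ones are still running.
        for future in as_completed(futures):
            name, hits = future.result()
            on_results(name, hits)
```

The callback runs on the calling thread, so results arrive in completion order rather than docset order.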

trollixx · 2015-03-25T23:13:53Z

Sounds interesting. I have also started moving all database-related stuff into per-docset threads, but decided to hold off until I have a proper docset format abstraction.

johntyree · 2015-03-27T19:40:58Z

@trollixx sure. It's not my patch to maintain :)

As long as @joaomsa's hard work isn't lost, any path that gets us there is a good path 👍

joaomsa · 2015-12-09T02:31:49Z

I've rebased this patchset against the latest master and improved it a bit more. Is there still interest in this feature?

johntyree · 2015-12-09T06:24:28Z

I'm interested!

agauniyal · 2016-01-03T09:31:32Z

Me too!

paweljw · 2016-01-05T15:14:32Z

I can confirm that this version compiles and works perfectly. There are also no performance issues (as in, no noticeable slowdown in searching). This should definitely be considered for inclusion in Zeal.

brunoro · 2016-01-06T13:09:43Z

Looks great! Tried it out here and the performance was good, even though I was on a VM.

joaomsa · 2016-01-15T15:41:06Z

Once #460 lands, I'll rebase this as a DocsetSearchStrategy.

mrhota · 2016-07-14T03:07:44Z

@joaomsa I have rebased #460 in another PR: #559. FYI. Hopefully that one will be pulled and this can go in, too!

agauniyal · 2016-09-10T13:28:29Z

What's the status on this?

Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.

trollixx · 2016-10-29T03:40:57Z

Closing in favour of #614. Regardless, thanks for the contribution!

joaomsa changed the title ~~Add fuzzy search through sqlite extension~~ Add fuzzy search Feb 24, 2015

joaomsa added 3 commits December 7, 2015 01:00

Add fuzzy search through sqlite extension

29a0ca9

Speed up fuzzy search with prefilter

fc0dcef

Add FullMatch as matchType "strmid" -> "qString::mid"

e3c255a

joaomsa force-pushed the add-fuzzy-search branch from d663db0 to e3c255a Compare December 9, 2015 02:27

registry: Fixed ZDASH anchor results

7e5b34c

trollixx mentioned this pull request Jan 29, 2016

Fuzzy Finder #483

Closed

jkozera mentioned this pull request Oct 7, 2016

Better sorting of search results (fixes #100 and #603) #614

Merged

jkozera added a commit that referenced this pull request Oct 8, 2016

Better sorting of search results (#613, #100)

df47804

Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR #281.
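For reference, the two-pointer subsequence test that bevacqua/fuzzysearch is known for can be sketched in Python as follows. This is an illustrative port under my reading of that library's approach, not the code merged in #614, and it only answers whether a match exists without ranking it.

```python
def fuzzysearch(needle, haystack):
    """True if needle's characters appear in haystack in order.

    Each index only moves forward, so the work is O(m + n) for
    len(needle) == m and len(haystack) == n.
    """
    n, h = len(needle), len(haystack)
    if n > h:
        return False
    if n == h:
        return needle == haystack
    j = 0
    for ch in needle:
        # Advance through the haystack until this needle character is found.
        while j < h and haystack[j] != ch:
            j += 1
        if j == h:
            return False  # ran out of haystack before matching ch
        j += 1  # consume the matched character
    return True
```

The comparison is case-sensitive as written; callers would lowercase both strings first for case-insensitive search.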

trollixx closed this Oct 29, 2016
