Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fuzzy search #281

Closed
wants to merge 4 commits into from
Closed

Add fuzzy search #281

wants to merge 4 commits into from

Conversation

joaomsa
Copy link
Contributor

@joaomsa joaomsa commented Feb 23, 2015

Adds a sqlite extension with a ZRELEVANCY function that takes an item and query, and produces a similarity measure. it only allows matches where the query is a subsequence of the item. I'm getting great results in terms of relevancy and speed based on spread and density of the subsequence matched. like the suggestion on #100:

pmsg demo

Also involved tweaking the result tokenization/normalization code as i ran into a bunch of edge cases now that it's also applied to the query for consistency.

@trollixx
Copy link
Member

Hi,

I am really not sure about benefits of the fuzzy search at the moment. For me it feels really confusing, as I expect index search to do exact matching.

I have tried your PR on my Linux machine. Our Windows build fails as you can see above.

My first thought was, that search is completely broken. It takes 7-15 seconds to display any results. That's really long.

Another problem is highlighting. For example, if I search for string, I see among results: as_string and SVGString.

Perhaps, there's something more, but the current performance makes any further testing too slow.

Any ideas what causes such slow down?

@joaomsa
Copy link
Contributor Author

joaomsa commented Feb 23, 2015

@trollixx I'm getting queries in 1s or less with 7 docsets. Which ones are you using? I'm going to do some profiling to see how to improve it, I'm sure I could use some better data structures and C++11x move semantics to reduce allocations.

The approach using longest common subsequences isn't unique, and enumerating them all to find the one with highest relevancy isn't polynomial time. However doing a quick check to see if it can highlight the exact word wouldn't be a noticeable time penalty and take care of cases like SVGString

I played around with using a virtual table as a token trigram index (similar to something like pg_trgm + GIN indexing for postgres), and while search was really fast it didn't really give good relevancy for something as specific as code search. I don't think dealing with transpositions and deletions make sense when you're searching for method names.

@trollixx
Copy link
Member

Here's my test list of docsets:

1424714870

@joaomsa
Copy link
Contributor Author

joaomsa commented Feb 24, 2015

Took a slightly different approach using LIKE as prefilter and got a big speedup while removing the dependency on the sqlite headers. Also improved presentation of cases like SVGString.

What do you think?

@joaomsa joaomsa changed the title Add fuzzy search through sqlite extension Add fuzzy search Feb 24, 2015
@johntyree
Copy link

@joaomsa this looks pretty great. Needs master merge :(

Also what did you use to make your little gif screencast?

@joaomsa
Copy link
Contributor Author

joaomsa commented Mar 23, 2015

@johntyree
If there is interest in merging I can update it with all the changes in master to prepare to merge. The screencast I made with GNOME/byzanz

@trollixx
Copy link
Member

I like the idea, but I'd prefer to postpone merging for a little while. I am going to do a major refactoring after the next release (coming this or the next week), that would allow to extend Zeal to handle more different documentation formats. This change is kinda intrusive, so it would be better to merge, once the code related to Dash docsets is encapsulated in one place.

@johntyree
Copy link

@trollixx I'd say that's a reason to merge if you like the feature and are OK with the code, not postpone. You're going to force @joaomsa to figure out how your changes affect his patch and then rewrite it. Maybe he's OK with that, but it seems like much more work than refactoring it all together.

@trollixx
Copy link
Member

@johntyree I think that after the coming refactoring this patch could be reduced and simplified.

If @joaomsa will not feel like updating his code, no problem, I am fine with doing it.

Sorry for possible inconvenience, I am just trying to make one step a time as part of bringing Zeal quality to higher levels.

@joaomsa
Copy link
Contributor Author

joaomsa commented Mar 25, 2015

@trollixx No problem I'll wait for the refactoring.

Another thing I've also been playing around with is doing search in the background and updating the results list incrementally. Seems to really improve responsiveness in cases where you quickly want results for smaller docsets while big ones like Android may still not have returned.

I'll put that as another PR in the future for your consideration since this one is already pretty big and I haven't figured out a satisfying way to handle cursor jumping in the results list as more results pour in.

@trollixx
Copy link
Member

Sounds interesting. I have also started moving all database related stuff in per docset threads, but decided to hold it until I have a proper docset format abstraction.

@johntyree
Copy link

@trollixx sure. It's not my patch to maintain :)

As long as @joaomsa's hard work isn't lost any path that gets us there is a good path 👍

@joaomsa
Copy link
Contributor Author

joaomsa commented Dec 9, 2015

I've rebased this patchset against the latest master and improved it a bit more, is there still interest in this feature?

@johntyree
Copy link

I'm interested!

@agauniyal
Copy link

Me too!

@paweljw
Copy link

paweljw commented Jan 5, 2016

I can confirm that this version compiles and works perfectly. There are also no performance issues (as in, no noticeable slowdown in searching). This should definitely be considered for inclusion in Zeal.

@brunoro
Copy link

brunoro commented Jan 6, 2016

Looks great! Tried it out here and the performance was good, even though I was on a VM.

@joaomsa
Copy link
Contributor Author

joaomsa commented Jan 15, 2016

Once #460 lands, I'll rebase this as a DocsetSearchStrategy.

@trollixx trollixx mentioned this pull request Jan 29, 2016
@mrhota
Copy link

mrhota commented Jul 14, 2016

@joaomsa I have rebased #460 in another PR: #559. FYI. Hopefully that one will be pulled and this can go in, too!

@agauniyal
Copy link

what's the status on this?

jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 7, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit that referenced this pull request Oct 8, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR #281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 8, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 8, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 8, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 8, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 8, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 9, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 9, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 9, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 9, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 9, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 9, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 9, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 21, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 21, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
jkozera added a commit to jkozera/zeal that referenced this pull request Oct 27, 2016
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch
- should be faster than the one initially proposed in PR zealdocs#281.
@trollixx
Copy link
Member

Closing in favour of #614. Regardless thanks for the contribution!

@trollixx trollixx closed this Oct 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

8 participants