Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xapian-based catalog search #460

Merged
merged 10 commits into from
Mar 17, 2021
Merged

Xapian-based catalog search #460

merged 10 commits into from
Mar 17, 2021

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented Mar 7, 2021

Fixes #106

Depends on kiwix/kiwix-build#477. Must be merged after #459.

Changes in functionality:

  • Previously partial matches were implicitly supported. Currently this must be achieved by the usage of an explicit wildcard (as demonstrated by the change in the test/library.cpp unit test).

Notes:

  • No stemming is performed on the query, since the language assumed for the query is not known for sure. Stemming can be applied if the lang query parameter is used.
  • Translation of ISO 639-3 language codes to languages supported by Xapian is done only for a subset of languages. This is trivial to fix (though conceptually it's less trivial than it may seem, due to such thing as macrolanguages).

@veloman-yunkan
Copy link
Collaborator Author

CI fails because of the said dependence on kiwix/kiwix-build#477

@codecov
Copy link

codecov bot commented Mar 7, 2021

Codecov Report

Merging #460 (20b487d) into master (f7c867f) will increase coverage by 0.35%.
The diff coverage is 91.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #460      +/-   ##
==========================================
+ Coverage   62.65%   63.00%   +0.35%     
==========================================
  Files          50       50              
  Lines        3433     3490      +57     
  Branches     1736     1773      +37     
==========================================
+ Hits         2151     2199      +48     
- Misses       1280     1289       +9     
  Partials        2        2              
Impacted Files Coverage Δ
src/library.cpp 67.49% <91.04%> (+3.31%) ⬆️
include/library.h 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7c867f...20b487d. Read the comment docs.

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using xapian to do the full query (size, language, creator, name, local, remote, ...) ?

As kiwix-lib now use xapian we must also update meson.

On libzim we use icu (https://github.com/openzim/libzim/blob/master/src/writer/xapianIndexer.cpp#L39-L40) to translate between different ISO language code.

src/library.cpp Outdated
const auto flags = Xapian::QueryParser::FLAG_PHRASE
| Xapian::QueryParser::FLAG_BOOLEAN
| Xapian::QueryParser::FLAG_LOVEHATE
| Xapian::QueryParser::FLAG_WILDCARD;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using FLAG_PARTIAL ?
It would allow the user to not enter the end wildcard.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only makes sense if the search endpoint is going to be used for suggestions. @kelson42 What's your opinion on that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end point is for now not used for completion/suggestions but as a "normal" search. Not sure if we have a lot to loose to add FLAG_PARTIAL but the default choice seems for me, for the moment, to keep it aside. @mgautierfr Maybe you have arguments to use it here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in kiwix-desktop to search in the library (local or remote). Search updated "in live" (no need to validate the search).
It seems legit that if the user search for "wiki" he gets result for all ("wikipedia", "wikispecies", ...)
If we don't, he will have no result until he enters a full word.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reluctant to use the partial search mode in the backend since there is no indication of where the editing/cursor location in the text is. Though typically text is typed left to right (at least in the languages I am concerned with), it is also possible that the cursor is moved to another location and the text is edited in the middle. It is the front end (which is responsible for the editing of the text search string) that can add a hint about the cursor location in the form of a wildcard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reluctant to use the partial search mode in the backend
It is the front end [...] than can add a hint [...] in form of a wildcard

I globally agree with that.

However, as the documentation of FLAG_PARTIAL said, the wildcard should be added only in some situation (not a phrase, no a binary search, not ...). It could be pretty difficult to detect in the frontend (and we don't want to have all frontend implement it).

Maybe we could add a "boolean flag" in the api telling if we want a partial search or not and let xapian add it (if it is needed).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add a "boolean flag" in the api telling if we want a partial search or not and let xapian add it (if it is needed).

Done this way.

@veloman-yunkan
Copy link
Collaborator Author

Why not using xapian to do the full query (size, language, creator, name, local, remote, ...) ?

That was my intention too, but I chose to stick to what #106 asks for. I would like to keep this PR small and enhance the Xapian based catalog search in a follow-up PR. We will have to define how to handle tags.

@kelson42
Copy link
Collaborator

kelson42 commented Mar 9, 2021

Why not using xapian to do the full query (size, language, creator, name, local, remote, ...) ?

That was my intention too, but I chose to stick to what #106 asks for. I would like to keep this PR small and enhance the Xapian based catalog search in a follow-up PR. We will have to define how to handle tags.

Seems legit

@kelson42 kelson42 closed this Mar 9, 2021
@kelson42
Copy link
Collaborator

kelson42 commented Mar 9, 2021

As kiwix-lib now use xapian we must also update meson.

I believe this is rollbacking a commit made a few months ago ?!

@mgautierfr mgautierfr reopened this Mar 9, 2021
@veloman-yunkan
Copy link
Collaborator Author

Reminding that this PR is based on #459. It would be better to review and merge that one first.

@veloman-yunkan
Copy link
Collaborator Author

As kiwix-lib now use xapian we must also update meson.

Can't we rely on the required indirect xapian dependency brought by libzim?

https://github.com/kiwix/kiwix-lib/blob/f7608c378ee2d1250ce225c0aaefe2452299d481/meson.build#L46-L49

@mgautierfr
Copy link
Member

Can't we rely on the required indirect xapian dependency brought by libzim?

No.
It works in our CI as we install the "dev" part of all ours dependencies (including xapian) in our dependency archive
But in another build system (rpm/dev packages), if we don't declare kiwix-lib depending of xapian, only the libzim-dev and xapian (among other packages) will be present. We will not have xapian-dev installed (and so, no header nor .so) (And if we build statically, xapian will be a private dependency of libzim. So we will not have the xapian static archive to link with).

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small (style) remark on the new commits. I will not block for that.

I still "Request Changes" as we need to update our dependencies to add xapian.

src/library.cpp Show resolved Hide resolved
meson.build Show resolved Hide resolved
@veloman-yunkan veloman-yunkan force-pushed the opds_category_support branch 2 times, most recently from 39842b2 to b7b0bdb Compare March 17, 2021 10:20
Base automatically changed from opds_category_support to master March 17, 2021 11:37
@kelson42
Copy link
Collaborator

@mgautierfr @veloman-yunkan Probably time now to rebase ad complete the review process of this PR.

@veloman-yunkan
Copy link
Collaborator Author

@mgautierfr @veloman-yunkan Probably time now to rebase ad complete the review process of this PR.

I am done with my part

1. Get the subset of books matching the q (title/description) parameter
   of the search

2. Filter out books not matching the other parameters of the search.

Stage 1. currently works in the old way, but will be replaced by Xapian
based search in subsequent commits.
The search text in the catalog query is interpreted as partial by
default, but partial query mode can be disabled in C++. The latter
possibility is not exposed via the /catalog/search kiwix-serve endpoint,
though.
Language code is converted from ISO 639-3 to ISO 639 (which is
understood by Xapian) via ICU. The previous approach via an explicit
map had its advantages since Xapian has more than one stemmer
implementations for some languages (selectable via Xapian-specific
identifiers). This commit relies on the defaults associated with the
ISO 639 language codes.
@mgautierfr mgautierfr merged commit baed447 into master Mar 17, 2021
@mgautierfr mgautierfr deleted the xapian_based_catalog_search branch March 17, 2021 13:45
@veloman-yunkan
Copy link
Collaborator Author

Didn't kiwix/kiwix-build#479 have to be merged before this one?

@kelson42
Copy link
Collaborator

@veloman-yunkan I have merged it

veloman-yunkan added a commit that referenced this pull request Apr 11, 2021
This should have been done back in PR #460
veloman-yunkan added a commit that referenced this pull request Apr 11, 2021
This should have been done back in PR #460
veloman-yunkan added a commit that referenced this pull request Apr 13, 2021
This should have been done back in PR #460
veloman-yunkan added a commit that referenced this pull request Apr 16, 2021
This should have been done back in PR #460
veloman-yunkan added a commit that referenced this pull request Apr 27, 2021
This should have been done back in PR #460
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[library] Use Xapian for library (fulltext) search in title/description
3 participants