Introduce Search/Searcher Caching to Internal Server #620

maneeshpm · 2021-10-12T10:38:25Z

Fixes #509
This PR consists of three commits which are as follows:

Add a general-purpose LRU cache template

This general-purpose LRU cache template can be used to implement caching for various classes with various cache size limits.
NOTE: One must think about thread safety and race condition while using the template.
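
As an illustration of what such a template can look like, here is a minimal sketch; it is not the code from this PR. The class name `LruCache` and its methods are hypothetical. It pairs a linked list, which keeps entries in recency order, with a hash map for lookup, and it takes no locks, which is why the note above about thread safety applies.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical minimal LRU cache (illustration only, not thread-safe by itself).
template <typename Key, typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Insert or overwrite an entry; it becomes the most recently used one.
    void put(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            items_.erase(it->second);
            index_.erase(it);
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
        if (index_.size() > capacity_) {
            // Evict the least recently used entry (the back of the list).
            index_.erase(items_.back().first);
            items_.pop_back();
        }
    }

    // Return a pointer to the cached value and mark it as recently used,
    // or nullptr on a cache miss.
    Value* get(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);
        return &it->second->second;
    }

    bool contains(const Key& key) const { return index_.count(key) != 0; }
    std::size_t size() const { return index_.size(); }

private:
    using Item = std::pair<Key, Value>;
    std::size_t capacity_;
    std::list<Item> items_;  // front = most recently used
    std::unordered_map<Key, typename std::list<Item>::iterator> index_;
};
```

A caller that shares one instance between request threads would have to wrap every `put` and `get` in its own lock.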

Implement caching on Searcher and Search

We use the new cache template to implement two kinds of cache.
1: The Searcher cache is more general in terms of its usage. A Searcher can be used for multiple searches without much change to itself. We try to retrieve the Searcher from the cache and perform searches with it whenever possible; if it is not there, we create a Searcher and put it into the cache. Users can specify a custom cache size through the environment variable SEARCHER_CACHE_SIZE. Its default value is 10% of the number of books available.
2: The Search cache is much more restricted in terms of usage. Its main purpose is to avoid re-running the search on the Searcher when the page changes and a SearchResultSet for a different range has to be generated. Users can specify a custom cache size through the environment variable SEARCH_CACHE_SIZE, which defaults to 2.

Implement caching on SuggestionSearcher

We create a cache for SuggestionSearcher very similar to that of the FT searcher. Users can specify a custom cache size using the environment variable SUGGESTION_SEARCHER_CACHE_SIZE. It has a default value of 10% of the number of books in the library.

codecov · 2021-10-12T10:39:49Z

Codecov Report

Merging #620 (e8cdcf0) into master (833bbc8) will increase coverage by 8.00%.
The diff coverage is 77.27%.

❗ Current head e8cdcf0 differs from pull request most recent head 6523d9f. Consider uploading reports for the commit 6523d9f to get more accurate results

@@            Coverage Diff             @@
##           master     #620      +/-   ##
==========================================
+ Coverage   58.03%   66.04%   +8.00%     
==========================================
  Files          54       55       +1     
  Lines        3584     4102     +518     
  Branches     2019     2088      +69     
==========================================
+ Hits         2080     2709     +629     
+ Misses       1503     1392     -111     
  Partials        1        1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/server/internalServer.h | `33.33% <ø> (ø)` | |
| src/tools/cache.hpp | `71.69% <71.69%> (ø)` | |
| src/server/internalServer.cpp | `82.54% <85.71%> (+1.48%)` | ⬆️ |
| src/search_renderer.cpp | `68.75% <0.00%> (-20.59%)` | ⬇️ |
| src/library.cpp | `78.26% <0.00%> (-4.53%)` | ⬇️ |
| src/server/response.cpp | `85.77% <0.00%> (-0.52%)` | ⬇️ |
| src/aria2.cpp | `0.00% <0.00%> (ø)` | |
| include/book.h | `96.15% <0.00%> (ø)` | |
| src/version.cpp | `0.00% <0.00%> (ø)` | |
| include/server.h | `100.00% <0.00%> (ø)` | |
| ... and 28 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 833bbc8...6523d9f.

mgautierfr

A few things to change, but the overall structure is good.

src/tools/cache.cpp

src/server/internalServer.cpp

src/meson.build

src/server/internalServer.cpp

This change, though seemingly insignificant until now, proved to be an important aspect of kiwix/libkiwix#620. When the searcher is retrieved from the cache, it should start up in OP_AND instead of OP_OR.

mgautierfr

There are still a few changes to make, but we are mostly good.

src/server/internalServer.cpp

mgautierfr

See my comments on the code.

But I have one comment about something you wrote in the first commit message:

NOTE: One must think about thread safety and race condition while using the template.

src/server/internalServer.cpp

mgautierfr · 2021-12-01T16:27:17Z

I don't understand your comment about my comment on thread safety?

In the commit where you introduce the cache system, you write:

NOTE: One must think about thread safety and race condition while using the template.

I agree with you.
But then, in the other commits, you don't protect the calls against race conditions.
The server handles requests in different threads, so you must "think about thread safety and race condition" (and protect/modify the code accordingly).

kelson42 · 2021-12-07T03:11:04Z

@maneeshpm Any news on this PR? It is one of the last ones before releasing 10.0.0. This PR would also benefit from a proper automated test to ensure that the cache works as intended and does not introduce regressions.

maneeshpm · 2021-12-07T13:02:00Z

@mgautierfr Ah, OK, so we are protecting the suggestions module with a mutex lock in libzim. You mean we need to do a similar treatment for FT search as well?

@kelson42 Sure, will do something about it.

mgautierfr · 2021-12-07T13:53:37Z

You mean we need to do a similar treatment for FT search as well?

I was thinking about protecting the cache system (you explicitly say that in your commit message) but yes, we also need to protect the FT search.

mgautierfr · 2021-12-15T09:42:36Z

I wonder if we should move the searcher/suggestionSearcher cache into the library itself (as for the readers/archives).
It would simplify the code a bit, with a Library method to get the searcher from a bookId (or even better, from a book name if we also move the namemapper into the library; I'm not sure we need to do this in this PR).

The mutex protection you've added is not enough. It technically protects the internal structures, but a race condition is still possible:

Two threads try to get a searcher for book foo.
Neither thread finds it in the cache.
They both create a searcher.
They both add their searcher to the cache.
Only one is kept in the cache, but we use two different searchers. And when we create a search and put it in the cache, the two threads will use two different searchers.

What we also need is to protect (block) the cache while we are creating the searcher, to avoid creating two searchers.
It is not easy to do (if we want to do it efficiently). But we already have an implementation in libzim (https://github.com/openzim/libzim/blob/master/src/concurrent_cache.h) that we may reuse.
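
One way to get this "create at most once per key" behaviour is sketched below. This is a hypothetical, simplified illustration and not the libzim code: the class name `OnceCache` and the method `getOrPut` are assumptions, and there is no eviction. The idea is to publish a `std::shared_future` in the map while holding the lock, then build the value outside the lock; a second thread asking for the same key finds the future and waits on it instead of creating its own object.

```cpp
#include <exception>
#include <future>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical sketch: getOrPut() runs the factory at most once per key,
// even if several threads request the same key at the same time.
// Value must be copyable (for example a std::shared_ptr to a searcher).
template <typename Key, typename Value>
class OnceCache {
public:
    template <typename Factory>
    Value getOrPut(const Key& key, Factory create) {
        std::promise<Value> promise;
        std::shared_future<Value> future;
        bool creator = false;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = entries_.find(key);
            if (it == entries_.end()) {
                // First thread for this key: publish the future before creating.
                future = promise.get_future().share();
                entries_.emplace(key, future);
                creator = true;
            } else {
                future = it->second;
            }
        }
        if (creator) {
            try {
                // The (possibly slow) creation happens outside the lock.
                promise.set_value(create());
            } catch (...) {
                // Do not keep a failed entry; waiters receive the exception.
                {
                    std::lock_guard<std::mutex> lock(mutex_);
                    entries_.erase(key);
                }
                promise.set_exception(std::current_exception());
            }
        }
        // Creator and waiters all read the same shared result.
        return future.get();
    }

private:
    std::mutex mutex_;
    std::map<Key, std::shared_future<Value>> entries_;
};
```

With something like this, the two threads in the scenario above would both call `getOrPut` with the id of book foo and a factory that builds the searcher; only one factory call runs and both threads end up with the same searcher.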

maneeshpm · 2021-12-15T16:35:37Z

I wonder if we should move the searcher/suggestionSearcher cache in the library itself

@mgautierfr I agree, we are pulling in the archive from the library anyway, so it is more natural to get the searchers (and to search, via the cache or otherwise) from the library itself rather than keeping them associated with the internal server. I propose we do it in a separate ticket after this.

Thanks for the suggestion, I'll get to see a proper implementation of an LRU cache rather than a simple one 😅 I guess concurrent_cache.h in libzim is well written and tested!

maneeshpm · 2022-01-08T11:10:37Z

@kelson42 I have traced the error to the newly added function getCacheLength(), which is supposed to read an environment variable and return its value (if found) or a default value provided by the caller. Returning a hardcoded number such as 1 from the function makes the error go away. I am not really sure why that function fails on Mac.

kelson42 · 2022-01-08T11:12:08Z

I wonder why we deal with an ENV variable here in the first place. IMO, either there is a value given by the user or we use a default value. I don't like the idea that the ENV plays a role here.

kelson42 · 2022-01-08T11:16:40Z

@maneeshpm But we already have this kind of behaviour in libzim and it works fine on macOS. Maybe you could have a look at how this is done there?

maneeshpm · 2022-01-09T12:51:11Z

Apparently, this problem was caused by a simple nullptr returned by getenv(). Handling that case separately rather than relying on the try catch block fixes the problem.

I am not sure of a technical explanation for why this is not a problem on linux but on mac, maybe @mgautierfr can help with one 😅
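
For illustration, a minimal sketch of such a helper with the null case handled explicitly. The signature is an assumption (the thread only names `getCacheLength()`), and this is not the code that was merged.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: read an unsigned cache size from the environment,
// falling back to defaultValue when the variable is unset or not a number.
unsigned int getCacheLength(const char* name, unsigned int defaultValue) {
    const char* raw = std::getenv(name);
    if (raw == nullptr) {
        // Unset variable: never build a std::string from a null pointer.
        return defaultValue;
    }
    try {
        return static_cast<unsigned int>(std::stoul(raw));
    } catch (const std::exception&) {
        // std::stoul throws std::invalid_argument or std::out_of_range.
        return defaultValue;
    }
}
```

A caller could then pass, for example, the name SEARCHER_CACHE_SIZE together with a default computed from the number of books.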

kelson42 · 2022-01-09T13:01:27Z

@mgautierfr I see this ticket is ready for final review pass!

mgautierfr

Sorry for this late review @maneeshpm

It seems you have copied the lru_cache and ConcurrentCache from libzim.
There is no problem with this, but please change the commit message accordingly.
(And include the commit id of the version of the code you copied from. It will make it simpler to see what has changed in the future.)

Otherwise it seems we are good.

I am not sure of a technical explanation for why this is not a problem on linux but on mac, maybe @mgautierfr can help with one 😅

The doc of the std::string constructor (https://www.cplusplus.com/reference/string/string/string/) says that it is undefined behavior if the pointer is null.
Maybe gcc (or its std library) creates an empty string if we pass a null pointer, while on Mac an exception is raised.

kelson42 · 2022-02-24T16:00:00Z

@maneeshpm Any feedback? We are really not far from merging this PR!

kelson42 · 2022-02-28T15:32:03Z

@mgautierfr Considering that there is really not much left to do and that @maneeshpm does not seem to be available at the moment, please feel free to just fix and merge.

mgautierfr · 2022-03-08T17:12:27Z

I've rebased on master and reworked the first commit a bit; it is a copy of files from libzim.
I haven't changed the other commits from @maneeshpm.

maneeshpm force-pushed the search_caching branch from e642d41 to 9bf3e89 Compare October 12, 2021 10:46

maneeshpm requested a review from mgautierfr October 12, 2021 10:48

mgautierfr requested changes Oct 12, 2021

maneeshpm force-pushed the search_caching branch from 9bf3e89 to 18e35dc Compare October 18, 2021 12:40

maneeshpm commented Oct 18, 2021

This was referenced Nov 1, 2021

Switch default_op for Suggestion queryParser to OP_AND openzim/libzim#644

Closed

Switch default_op for query parser to OP_AND openzim/libzim#645

Merged

maneeshpm force-pushed the search_caching branch from 18e35dc to b3a8fd2 Compare November 1, 2021 09:17

maneeshpm marked this pull request as ready for review November 1, 2021 09:17

mgautierfr requested changes Nov 10, 2021

maneeshpm force-pushed the search_caching branch from b3a8fd2 to 008db4e Compare November 27, 2021 23:26

maneeshpm requested a review from mgautierfr November 27, 2021 23:54

maneeshpm force-pushed the search_caching branch from 008db4e to cf07d57 Compare November 28, 2021 00:09

mgautierfr requested changes Nov 29, 2021

mgautierfr added this to the 10.1.0 milestone Dec 1, 2021

kelson42 mentioned this pull request Dec 11, 2021

kiwix-serve ZIM fd needs to be smarter kiwix/kiwix-tools#142

Open

maneeshpm force-pushed the search_caching branch 5 times, most recently from 54f44ff to 2a819b1 Compare December 14, 2021 16:54

maneeshpm force-pushed the search_caching branch from 2a819b1 to 2f74770 Compare December 16, 2021 20:16

maneeshpm force-pushed the search_caching branch from cbd97e3 to 6280236 Compare January 8, 2022 10:32

kelson42 requested a review from mgautierfr January 8, 2022 10:40

maneeshpm force-pushed the search_caching branch 2 times, most recently from 303bda9 to e2998b0 Compare January 8, 2022 11:03

maneeshpm force-pushed the search_caching branch 2 times, most recently from e8cdcf0 to c5ed9c5 Compare January 9, 2022 12:42

kelson42 modified the milestones: 10.1.0, 10.0.1 Jan 9, 2022

kelson42 modified the milestones: 10.0.1, 10.1.0 Feb 3, 2022

mgautierfr requested changes Feb 16, 2022

maneeshpm added 3 commits March 8, 2022 17:34

Introduce a LRU Cache and concurrent cache

a51f8d6

The cache is copied from the libzim project: https://github.com/openzim/libzim. The exact file has been copied from commit 27f5e70.

mgautierfr force-pushed the search_caching branch from c5ed9c5 to 6523d9f Compare March 8, 2022 16:35

mgautierfr approved these changes Mar 8, 2022

mgautierfr merged commit e48b550 into master Mar 8, 2022

mgautierfr deleted the search_caching branch March 8, 2022 17:12

This was referenced Apr 4, 2022

[REGRESSION] Fulltext search partly broken (no result) with master branch #722

Closed

[REGRESSION] Seach ranking broken on as of kiwix/kiwix-serve:3.2.0-2 #742

Closed

holta mentioned this pull request Apr 6, 2022

Upgrade to kiwix-tools "3.2.0-3" when that's released in coming days (fix for full-text search of ZIM files, and articleCount too?) iiab/iiab#3171

Closed

veloman-yunkan mentioned this pull request Oct 13, 2022

docker kiwix-serve crashing kiwix/kiwix-tools#579

Closed
