
Performance issue with new search #1993

Closed
AEgit opened this issue Sep 16, 2016 · 49 comments
Labels
bug · search

Comments

@AEgit

AEgit commented Sep 16, 2016

JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0
windows 10 10.0 amd64
Java 1.8.0_101

With the current master branch (master--629c3f0) there is a significant decrease in performance when using the "search" feature of JabRef, compared to a previous master branch (master--4913665).

Before the update, entering a search term produced nearly instantaneous results. Now I have to wait up to 3 seconds every time I enter or modify a search term. As I use the search feature quite extensively, this is quite annoying.

The database is quite large (>8000 entries), but this shouldn't be the problem, as it worked before the update.

Unfortunately, I forgot to save the previous installer. Could you provide me with the older master--4913665 build?

@AEgit
Author

AEgit commented Sep 16, 2016

OK, my current workaround is to use the latest release version (JabRef 3.6), which does not exhibit these performance problems.

@lenhard
Member

lenhard commented Sep 16, 2016

I cannot confirm this on my ~2200-entry database; there, search returns instant results. However, my database is considerably smaller than yours, and I am not up to date with the recent changes in search. @koppor?

Also, @JabRef/developers: what is the largest test file we have for performance testing?

@koppor
Member

koppor commented Sep 16, 2016 via email

@AEgit
Author

AEgit commented Sep 16, 2016

@lenhard
If needed, I could send you an older (and smaller) copy of my database (still >6000 entries). It exhibits the same issues as my current database, although JabRef is a bit faster with the smaller file. If I open this database with JabRef 3.6 and use the search feature, I get instant results. If I open the same database with JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0, I have to wait a couple of seconds for search results (and you can hear the CPU fan spinning up, indicating that more processing power is needed; this is not the case with JabRef 3.6).

If I send you the database, however, I would ask you not to share it or make it publicly available, as it may contain sensitive information (preferably, you would delete the file once it is no longer needed).

@lenhard
Member

lenhard commented Sep 16, 2016

@AEgit ok, thanks! You can send it to joerg.lenhard@kau.se. Then I will see how search performs on my machine and delete your file afterwards.

It would be cool if one day we had a huge JabRef file for testing purposes. Maybe we could go for auto-generated entries, but then again, randomized data will never be representative of real data.

@tobiasdiez
Member

I can confirm the performance problems with a DB of 6000 entries.
@lenhard there is one big file of mine in the admin repo.
@AEgit You don't need to send your file. But thanks for the offer.

@tobiasdiez added the bug and search labels Sep 16, 2016
@lenhard
Member

lenhard commented Sep 16, 2016

Thanks! I played around with the file a little bit and also noticed a delay, albeit less than a second, which would not annoy me. But that is very person-dependent and, of course, machine-dependent.

If performance was better before, we should try to at least keep it at that level.

For the record, according to @AEgit's observation, the regression was introduced some time between August 31st and today (the first comment mentions that everything was still fine at commit 4913665).

@tobiasdiez
Member

tobiasdiez commented Sep 16, 2016

The performance benchmarks confirm the problem:
27.06.2016 (no idea which build):
Benchmark                             Mode  Cnt         Score        Error  Units
Benchmarks.htmlToLatexConversion     thrpt   20      3675.747 ±     93.344  ops/s
Benchmarks.inferBibDatabaseMode      thrpt   20     16084.549 ±    715.908  ops/s
Benchmarks.keywordGroupContains      thrpt   20      5970.994 ±    478.213  ops/s
Benchmarks.keywordGroupContainsWord  thrpt   20  12199628.518 ± 584096.935  ops/s
Benchmarks.latexToHTMLConversion     thrpt   20     88616.053 ±   1520.290  ops/s
Benchmarks.latexToUnicodeConversion  thrpt   20     88126.629 ±   1262.394  ops/s
Benchmarks.parse                     thrpt   20        35.918 ±      0.877  ops/s
Benchmarks.search                    thrpt   20       192.584 ±      5.798  ops/s
Benchmarks.write                     thrpt   20        71.246 ±      2.905  ops/s

current master:
Benchmark                             Mode  Cnt         Score        Error  Units
Benchmarks.htmlToLatexConversion     thrpt   20      4672.927 ±     75.992  ops/s
Benchmarks.inferBibDatabaseMode      thrpt   20     16192.387 ±    420.196  ops/s
Benchmarks.keywordGroupContains      thrpt   20      8927.462 ±    359.622  ops/s
Benchmarks.keywordGroupContainsWord  thrpt   20  12317497.180 ± 321137.238  ops/s
Benchmarks.latexToHTMLConversion     thrpt   20     90283.510 ±   2727.009  ops/s
Benchmarks.latexToUnicodeConversion  thrpt   20     88338.103 ±   1459.373  ops/s
Benchmarks.parse                     thrpt   20        29.766 ±      0.337  ops/s
Benchmarks.search                    thrpt   20        30.106 ±      0.637  ops/s
Benchmarks.write                     thrpt   20        58.239 ±      0.563  ops/s

So the search score went down from nearly 200 to 30. (Note that the parse and write scores also decreased slightly... not good.)

@AEgit
Author

AEgit commented Sep 16, 2016

@Everyone:

Thank you for your quick replies. Yes, I would prefer to avoid sending the file. Once I have the time to go through the database and delete everything that shouldn't be public, I could make it available (but currently I don't have the time). As pointed out by @lenhard, the changes that led to the performance drop must have happened between the following two builds:

JabRef 3.7dev--snapshot--2016-09-16--master--4913665
JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0

@chriba
Contributor

chriba commented Sep 16, 2016

I made some quick tests: the entry search in the SearchWorker has been about 8 times slower since 543c176.
https://github.com/JabRef/jabref/blob/master/src/main/java/net/sf/jabref/gui/search/SearchWorker.java#L43
To be precise, it is the collect part of the lambda.
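
For reference, the pipeline in question looks roughly like this (a sketch with approximated names; BibEntry, BibDatabase and SearchQuery stand in for the real JabRef types, and isMatch for the actual match predicate):

import java.util.List;
import java.util.stream.Collectors;

// Sequential variant: filter all entries through the query and collect the matches.
static List<BibEntry> runSearch(BibDatabase database, SearchQuery query) {
    return database.getEntries().stream()
            .filter(query::isMatch)
            .collect(Collectors.toList());
}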

@chriba
Contributor

chriba commented Sep 16, 2016

Making the stream a parallelStream makes it significantly better, but it is still worse than before (around 2 times slower).
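
For comparison, the parallel variant is a one-word change to the sketch above (same assumed names), spreading the match work over the common fork-join pool:

// Parallel variant: identical pipeline, but entries are matched concurrently.
static List<BibEntry> runSearchParallel(BibDatabase database, SearchQuery query) {
    return database.getEntries().parallelStream()
            .filter(query::isMatch)
            .collect(Collectors.toList());
}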

@lenhard
Member

lenhard commented Sep 16, 2016

@bartsch-dev You beat me by 17 seconds in suggesting parallelization ;) Well, 2 times slower is better than 8 times slower.

Anyway, if lambda expressions cause the delay, we should probably think about re-implementing performance-critical operations in the old style. @bartsch-dev can you reimplement that fragment in an iterative style and see how it performs?
Something like this (pseudocode):

List<BibEntry> matches = new LinkedList<>();
for (BibEntry entry : database.getEntries()) {
    if (searchQuery.matches(entry)) {
        matches.add(entry);
    }
}

Take care to use a LinkedList, since its add operation runs in O(1) (constant time).

@chriba
Contributor

chriba commented Sep 16, 2016

It's not really the lambda that causes the problem.
If I insert the same lambda one commit earlier (note that I did not change it in any remarkable way), it performs much better.

@tobiasdiez
Member

Well, the search() benchmark uses the SearchQuery directly (without the SearchWorker), and even there the search is 8 times slower.

@oscargus
Contributor

Could it be related to the Unicode stuff?

@AEgit
Author

AEgit commented Sep 27, 2016

Just to keep everyone updated: I've tried the most recent snapshot (JabRef 3.7dev--snapshot--2016-09-27--master--b8d1c2e), but it is still clearly slower than previous versions. I will fall back again to the latest release version JabRef 3.6.

@chriba
Contributor

chriba commented Sep 27, 2016

I played around a little bit and got the following benchmark results (note the score of the parallel search, which is much higher even than the search score from before the performance regression):

Benchmark                             Mode  Cnt         Score        Error  Units
Benchmarks.htmlToLatexConversion     thrpt   20      5017.331 ±    173.848  ops/s
Benchmarks.inferBibDatabaseMode      thrpt   20     21474.468 ±    542.497  ops/s
Benchmarks.keywordGroupContains      thrpt   20     12074.752 ±   1167.932  ops/s
Benchmarks.keywordGroupContainsWord  thrpt   20  12975699.817 ± 295384.417  ops/s
Benchmarks.latexToHTMLConversion     thrpt   20    100085.982 ±  11105.766  ops/s
Benchmarks.latexToUnicodeConversion  thrpt   20    105671.959 ±   5261.769  ops/s
Benchmarks.parallelSearch            thrpt   20       525.198 ±     42.728  ops/s
Benchmarks.parse                     thrpt   20        34.981 ±      1.116  ops/s
Benchmarks.removeLatexCommands       thrpt   20    816118.612 ±  33977.863  ops/s
Benchmarks.search                    thrpt   20       320.303 ±      7.338  ops/s
Benchmarks.write                     thrpt   20        69.749 ±      1.551  ops/s

The problem is that in the PR where I changed the search bar, I also changed the ContainBasedSearchRule.
Before, the string to be searched was stripped of all LaTeX commands, so {\"a} was recognized not as ä but as a.
Removing all LaTeX commands is about 8 times quicker than converting them to Unicode (see the benchmark above).
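
To illustrate the difference between the two behaviours, here is a minimal, self-contained sketch; the two-entry conversion table is purely illustrative, the real LatexToUnicode conversion covers far more commands:

import java.util.HashMap;
import java.util.Map;

public class LatexNormalizationDemo {

    // Illustrative stand-in table for the real LaTeX-to-Unicode mapping.
    private static final Map<String, String> CONVERSIONS = new HashMap<>();
    static {
        CONVERSIONS.put("{\\\"a}", "ä");
        CONVERSIONS.put("{\\\"o}", "ö");
    }

    // Old, fast behaviour: drop braces and backslash commands entirely.
    static String stripLatex(String value) {
        return value.replaceAll("\\\\.|[{}]", "");
    }

    // New, correct behaviour: replace each known command with its Unicode equivalent.
    static String convertLatex(String value) {
        String result = value;
        for (Map.Entry<String, String> entry : CONVERSIONS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String field = "Schr{\\\"o}dinger";
        System.out.println(stripLatex(field));   // prints "Schrodinger": {\"o} degrades to o
        System.out.println(convertLatex(field)); // prints "Schrödinger": {\"o} becomes ö
    }
}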

Refs #518

@AEgit
Author

AEgit commented Sep 27, 2016

Hmmm, indeed, your benchmark results indicate that the parallel search is much faster than the initial one. I guess, however, that the change to the ContainBasedSearchRule has such a high impact on performance (you mention the old behaviour was 8 times quicker) that the speed-up from the parallel search cannot overcome the overall problem, i.e., the search is still much slower than it was previously.
As far as I understand, the search as currently implemented works as it SHOULD, since LaTeX commands are now interpreted correctly (while before they were not). This correct behaviour, however, causes a major performance issue for large databases, which contain many LaTeX commands that need to be converted to Unicode. (What about databases that don't contain much that needs converting?)

If the problem cannot be solved in another way: would it be possible to implement a switch that lets the user activate either a "fast search" (which does not convert LaTeX commands) or a "slow/correct search" (which does convert them, with the associated performance cost)? It is just a workaround, but it might be a solution.

@lenhard
Member

lenhard commented Sep 27, 2016

First of all, it is a good thing that we have found the component causing the trouble. Good job, @bartsch-dev!

It will probably make no difference whether the database contains LaTeX characters or not; I think it is the searching for LaTeX in LatexToUnicode that is causing the problem. There is some potential for performance optimization in the code of this class, for instance by using pre-compiled regular expressions instead of the replaceAll statements, but I doubt this will really solve the problem. I am also against providing a configuration option for turning this off (sorry @AEgit), since workarounds have a tendency to turn into permanent solutions. Either we do it right or we leave it, but no more hacks that make future feature implementation harder.
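
For illustration, the pre-compiled variant would look something like this (a sketch; the regex shown is only an example of a LaTeX-command pattern, not the formatter's actual expression):

import java.util.regex.Pattern;

class PrecompiledPatternExample {

    // Compiled once at class-load time instead of on every replaceAll() call.
    private static final Pattern LATEX_COMMAND = Pattern.compile("\\\\[A-Za-z]+");

    // Equivalent to value.replaceAll("\\\\[A-Za-z]+", ""), but reuses the compiled pattern.
    static String stripCommands(String value) {
        return LATEX_COMMAND.matcher(value).replaceAll("");
    }
}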

#518 is a possible long-term solution. However, if we compute a LaTeX-free version of every field on every change, database load times might go up significantly (which could be an acceptable compromise: normally you load once and search often). Other features might also be affected. On the other hand, we can of course circumvent increased load times by implementing a cache-based version (computing LaTeX-free field content on demand); then the first search would be slow, but subsequent ones fast. All in all, I think this solution, including the cache, is worthwhile. @JabRef/developers what do you think the other implications would be? Should we go for implementing #518?

@Siedlerchr
Member

Siedlerchr commented Sep 27, 2016

I think we should discuss this one day in a devcall, but in general it looks like this is the only long-term solution we have. I do not really understand, though, why people keep LaTeX-encoded entries in their databases, as all modern text-writing programs normally support UTF-8.
I mean, we already have a cleanup operation for removing LaTeX.

@chriba
Contributor

chriba commented Sep 27, 2016

IMHO there are two obvious solutions to speed things up:

  1. Keep an extra map with already LaTeX-free fields
  2. Convert everything to LaTeX-free strings

I just tried the first option; the performance gain is enormous (https://github.com/bartsch-dev/jabref/commit/7a9740a758146f3af5016a42af73b5ba93298cda):

Benchmark                   Mode  Cnt     Score     Error  Units
Benchmarks.parallelSearch  thrpt   20  1200.462 ± 125.105  ops/s
Benchmarks.search          thrpt   20   650.013 ±  53.531  ops/s

@tobiasdiez
Member

Option 1 goes in the direction of "store latex free version" #518.
What I don't understand is why it actually helps for the benchmark. I thought the benchmark works as follows:

  • Create a DB with a lot of entries
  • Search for one term
  • Start again at the beginning

Thus every entry should be asked for its fields exactly once, and caching shouldn't affect the benchmark.

Since I'm not a big fan of caching (it makes development harder in general, since you often have to change two things), I would propose improving the performance of the LaTeX-to-Unicode formatter instead. I'll have a look at it over the weekend if nobody wants to do it before then (@bartsch-dev: ?)

@chriba
Contributor

chriba commented Sep 27, 2016

The LaTeX-free field gets stored as soon as the field is set (thus the database needs about double the space it did before), not when it is accessed.
The benchmark only measures the time the search takes, not the time the database needs to initialize.

I'm away over the long weekend. You can try your luck (I opt for option 2: where do we need the strings with LaTeX commands anyway, except when saving the database to disk, which only matters if the user enters special characters in LaTeX style?).
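
A minimal sketch of that set-time caching, with names approximated from the linked commit (the real change would live in BibEntry):

import java.util.HashMap;
import java.util.Map;

// Sketch: the LaTeX-free variant is computed once when a field is written,
// so the search only performs cheap map lookups.
class CachedEntry {

    private final Map<String, String> fields = new HashMap<>();
    private final Map<String, String> latexFreeFields = new HashMap<>();

    void setField(String name, String value) {
        fields.put(name, value);
        // The LaTeX-free copy is computed here, at write time, not at search time.
        latexFreeFields.put(name, convertLatexToUnicode(value));
    }

    String getField(String name) {
        return fields.get(name);
    }

    String getLatexFreeField(String name) {
        return latexFreeFields.get(name);
    }

    // Illustrative stub; stands in for JabRef's LatexToUnicode conversion.
    private static String convertLatexToUnicode(String value) {
        return value.replace("{\\\"o}", "ö").replace("{\\\"a}", "ä");
    }
}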

@lenhard
Member

lenhard commented Sep 28, 2016

Regarding caching: there is no point in improving performance if JabRef dies on large databases due to memory usage. @AEgit what is the current memory footprint for your 8000+ entry database? Will it be a problem for your environment if JabRef takes up twice what it currently does?

@AEgit
Author

AEgit commented Sep 28, 2016

On Windows 10, JabRef currently uses ~1.4 GB of RAM for a database with nearly 9,000 entries (and several hundred groups). Doubling the memory footprint would be no issue for me, as I'm running this on a machine with 16 GB of RAM.

@AEgit
Author

AEgit commented Sep 28, 2016

Yes I have.

@oscargus
Contributor

I note that I pointed this out as a probable reason quite some time ago. ;-)

@Siedlerchr There are for sure still use cases for non-UTF-8 databases. IEEE is just one.

@koppor
Member

koppor commented Sep 28, 2016

Plain LaTeX is important for many users, for instance when publishing scientific papers. Nearly no publisher supports biblatex, and bibtex8 still seems to be a hack and is 20 years old now. Sure, switching to biblatex is recommended, but it is not always practical.

@koppor
Member

koppor commented Sep 28, 2016

I doubt that caching the UTF-8 string (the "latex-free version") takes twice the memory, because I assume that 80% of the Strings will be the same. The latex-free version can use String.intern() to reduce memory consumption drastically (see also http://stackoverflow.com/q/10578984/873282). We can also add -XX:+UseG1GC -XX:+UseStringDeduplication as JVM arguments if we don't want to use intern(). See also http://stackoverflow.com/a/27953779/873282.
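
As a sketch, interning could be wired into the caching approach above like this (convertLatexToUnicode is again a stand-in for the real converter); the JVM-flag alternative needs no code change at all:

// Identical LaTeX-free values across entries then share one String instance.
static String latexFreeOf(String rawValue) {
    return convertLatexToUnicode(rawValue).intern();
}

// Alternative without intern(): start the JVM with
//   -XX:+UseG1GC -XX:+UseStringDeduplication
// (available since Java 8u20) and let the garbage collector deduplicate
// the underlying character arrays.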

@lenhard
Member

lenhard commented Sep 29, 2016

@oscargus Indeed, your intuition paved the way :-)
@koppor Sure, it does not take twice the memory. That is just the worst-case upper bound, which we should take into account in a design decision anyway.

Summarizing the discussion, I think we have a consensus that something will be done. We currently have the following options:

  1. Performance optimization of LatexToUnicode
  2. Duplicate storage of LaTeX-free field values in BibEntry
  3. Solution 2 plus Caching in BibEntry
  4. Solution 2 plus hand-coded String internalization
  5. Solution 2 plus additional JVM arguments
  6. A combination of 3 plus 4, or 3 plus 5

Let us decide which way to go at the next devcall!

@oscargus
Contributor

I think rewriting the converters would make sense anyway. I have only briefly touched them, but there are at least three(?) places where the conversion map is checked, which seems quite redundant. I haven't really studied it in depth, but it may be possible to write a faster converter from scratch with a selectable conversion map (to handle LaTeX to {Unicode, HTML, XML, RTF}), although unification and speed may be contradictory goals...

A first step might be to profile the LatexToUnicode converter to see where the actual bottleneck is.
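
Alongside profiling, a focused JMH benchmark in the style of the existing Benchmarks class would make before/after comparisons of a rewrite easy (a sketch; LatexToUnicodeFormatter and its format() method are assumed names):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class LatexToUnicodeBenchmark {

    private LatexToUnicodeFormatter formatter;
    private String input;

    @Setup
    public void init() {
        formatter = new LatexToUnicodeFormatter();
        // A field value with a few accented-character commands to convert.
        input = "Schr{\\\"o}dinger, E. and M{\\\"u}ller, K.";
    }

    @Benchmark
    public String latexToUnicodeConversion() {
        // Returning the result prevents dead-code elimination;
        // JMH reports throughput (ops/s), matching the tables above.
        return formatter.format(input);
    }
}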

@AEgit
Author

AEgit commented Sep 29, 2016

It is not my place to say this, as I'm not an active developer, but would it be possible for you to go for the quickest of the solutions mentioned, so that I can switch to the new development version? Currently I can't, and because of that I also cannot use Google Scholar (see recent posts in #1886), which is quite essential to my work (much of it has no DOI but can be found via Google Scholar).

Again, this is just a suggestion, and apologies if I'm overstepping a boundary as a passive user who doesn't actively help in developing this wonderful piece of software.

@lenhard
Member

lenhard commented Sep 29, 2016

@AEgit if you desperately want a development version with quicker search, you can get one right now here: http://builds.jabref.org/latex-free-fields/

This should already be somewhat faster. I'll do a few more minor improvements tomorrow, and I have not yet tested how much faster this actually is or how much memory it consumes. But it would be interesting to hear from you how search performs anyway. The first search run should still be pretty slow, but subsequent ones should be faster.

@oscargus I am doing a few minor optimizations in LatexToUnicode, but a rewrite would probably be the most reasonable option. I have to say that I barely understand what that algorithm does...

@AEgit
Author

AEgit commented Sep 29, 2016

JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e
windows 10 10.0 amd64
Java 1.8.0_101

@lenhard: Thank you so much!!!
This is indeed much faster - it feels even faster than before the issue (maybe due to the parallel optimizations?). Google Scholar works as well: great!

The memory footprint seems to be nearly the same: I see an increase of at most 200 MB (JabRef is now using 1.6 GB instead of 1.4 GB), which is negligible on my machine. I think it also uses more CPU power; at least initially I heard my fans spinning up. Again, no issue for me, as I'm running this on a quite powerful Core i7 laptop.
So far I have not run into any (major) bugs.

Thank you very much!

@AEgit
Author

AEgit commented Sep 29, 2016

Ok, small bug spotted in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e:

It's no longer possible to assign an article to a group by drag and drop. If I try that, nothing happens. Instead, I have to manually enter the group's name in the "Groups" field; then it works.

(I didn't know whether I should open a new ticket, as you mentioned that you would do a few minor improvements tomorrow anyway.)

@lenhard
Member

lenhard commented Sep 29, 2016

@AEgit That is great to hear :-) I will do some more testing tomorrow, but it sounds like we are getting this issue solved. It is also good that the additional memory needed seems to be acceptable; if so, that is certainly due to the suggestions by @koppor.

The groups bug you mention should be unrelated, and I will only touch the LaTeX converter tomorrow, so please open a new issue.

@AEgit
Author

AEgit commented Sep 29, 2016

Another small bug in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e

The DOI resolver no longer appears in the web search interface.

@mlep
Contributor

mlep commented Sep 29, 2016

The DOI resolver is now in BibTeX -> New entry

@AEgit
Author

AEgit commented Sep 29, 2016

@mlep
Thank you! Hmmm, would it be possible to re-add the DOI fetcher to the web search? Now I have to make at least one additional click (or use one additional shortcut) to enter the DOI and get the BibTeX data; before, I could just enter the DOI in the web search pane that was already open.

@koppor
Member

koppor commented Sep 29, 2016 via email

@AEgit
Author

AEgit commented Sep 29, 2016

Hmmm, yes, I was already using CTRL+N, and focusing the input field would help - but it's not as convenient as it was before. At least I think so; maybe I just need to get used to it. But changing the focus to the input field would definitely help.

@koppor
Member

koppor commented Sep 29, 2016 via email

@AEgit
Author

AEgit commented Sep 29, 2016

Another small bug I came across in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e:

  1. Select a group that contains articles
  2. Perform a search
  3. One of the articles within that group is found by the search and is selected (indicated in blue)
  4. The entry preview of the selected article is displayed
  5. If I now select another article in the same group that is also found by the search, the entry preview is not updated - it still shows the previous article
  6. This does not apply to articles that are not part of the same group (indicated in grey)

I think a similar issue has been reported before.

@AEgit
Author

AEgit commented Sep 29, 2016

Indeed, similar (but not identical) problems to the one mentioned above appeared in:
#1760
#1903

(I can create a new bug report if this is not related to the search changes made so far.)

@oscargus
Contributor

oscargus commented Sep 30, 2016 via email

@AEgit
Author

AEgit commented Sep 30, 2016

As far as I can tell, the small bug mentioned for JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e (entry preview not correctly updated after performing a search - see the posts above) appears to be fixed in JabRef 3.7dev--snapshot--2016-09-30--latex-free-fields--0ee1d17.

UPDATE: No, I was too quick - the bug still appears (but it is a bit more difficult to reproduce; I will open a new ticket).
The new ticket can be found here:
#2105

@lenhard
Member

lenhard commented Oct 3, 2016

This is fixed in #2102. Discussion regarding related issues that have been discovered in this context should be carried on in separate threads.

@lenhard closed this as completed Oct 3, 2016
@AEgit
Author

AEgit commented Oct 3, 2016

Thanks to everyone for this quick fix!

Bugs found in this context can be found here:
#2098
#2105
