Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for dictionary synchronization issue #4

Merged

Conversation

adrianocrestani
Copy link
Collaborator

@adrianocrestani adrianocrestani commented Feb 28, 2017

There is one (or two) issue when dictionaries are lazily initialized. The null verification that checks whether a dictionary has been initialized or not is not synchronized. That means that a dictionary may be loaded more than once, because the reference to it is not synced.

The other big issue is that on at least two of those dictionaries, the reference to the new dictionary escapes before it has completely been built. That leads to several problems when a second instance of the same dictionary is initialized while another thread is reading the first instance of that same dictionary.

The fix uses Guava's memoize to make sure that a dict is initialized only once and safely published. I was able to reproduce the bug and verify the fix with the code below.

        ExecutorService executorService = Executors.newFixedThreadPool(20);
        List<String> lines = FileUtils.readLines(new File("korean_strings_random.txt"), "UTF-8");

        for (final String line : lines) {
            executorService.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        KoreanTokenizer koreanTokenizer = new KoreanTokenizer();
                        KoreanFilter koreanFilter = new KoreanFilter(koreanTokenizer);
                        koreanTokenizer.setReader(new StringReader(line));
                        koreanFilter.reset();

                        while (koreanFilter.incrementToken()) ;
                    } catch (Throwable e) {
                        e.printStackTrace();
                    }
                }
            });
        }

Note that I did a very minimal effort to make those dictionary immutable, which would be ideal to avoid any future sync issue. I considered it out of the scope of this fix, since it wasn't required to fix the current dict initialization issues.

Copy link

@jpdaigle jpdaigle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start, let's review Guava dependency and some style issues.

@@ -21,6 +21,8 @@
import java.util.Iterator;
import java.util.List;

import com.google.common.base.Supplier;
import com.google.common.base.Suppliers;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see Guava in the maven pom file, how is it in scope as a dependency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find it out either, but it is there as compile time dependency when I run maven :/

import org.apache.lucene.analysis.kr.morph.CompoundEntry;
import org.apache.lucene.analysis.kr.morph.MorphException;
import org.apache.lucene.analysis.kr.morph.WordEntry;

public class DictionaryUtil {

private static Trie<String,WordEntry> dictionary;
private static Supplier<Trie<String,WordEntry>> DICTIONARY = Suppliers.memoize(new Supplier<Trie<String, WordEntry>>() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand "keep the style as the surrounding file" but not having spaces between the type params in the generic declaration is crazy! ಠ_ಠ

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argh, my bad, I tried to fix that when I see them. I will fix it

public Map<String, String> get() {
try {
return readDict(KoreanEnv.FILE_PREFIX);
} catch (MorphException e) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like all the try/catch layout is repetitive and needs to be factored out into its own method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, good point, I will move that into readDict()

List<String> lines;

try {
lines = FileUtil.readLines(KoreanEnv.getInstance().getValue(KoreanEnv.FILE_UNCOMPOUNDS),"UTF-8");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting

Copy link
Collaborator Author

@adrianocrestani adrianocrestani Feb 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what formatting issue your seeing here?

throw new RuntimeException("Failed to initialize UNCOMPOUNDS dictionary.", e);
}

for(String compound: lines) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some commenting, formatting, and cleanup - this block is messy and why do we need a custom StringUtil?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved that block, so I tried to change as little as possible. If I start changing, I will probably change every line in that code. There is StringBuffer usage everywhere!!!

Copy link

@jpdaigle jpdaigle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news is: everything looks fine, and if you look at modern ES plugins (https://github.com/elastic/elasticsearch/blob/master/plugins/repository-hdfs/build.gradle) you'll see some of them actually use Guava, so we shouldn't get any pushback from the community for adding a guava dependency here.

The bad news: I think that before this merges, you need to align on guava versions, and fix the POM so that we understand how guava's getting pulled in. Remember that ES 2.3 will have a Guava version of "guava-18.0.jar". Somehow the ES5 hdfs plugin depends on guava 16 for compiling, so that doesn't seem right. I just reviewed the gradle build file for core ES, and it seems to just use Guava for tests (it's a test compile dependency only).

@@ -312,12 +296,12 @@ public static String combineAndEomiCheck(char s, String eomi) throws MorphExcept
* @param type 1: josa, 2: eomi
* @throws MorphException

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer true

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plugin depends on elasticsearch (2.3), which depends on guava 1.8. That is how guava jar is being added to the classpath at compile time. I don't see any change required here to fix the guava reference.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait are you replying to the "no longer true" because that's about the javadoc being outdated.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the guava issue: yep, 18.0 but how is ES core depending on Guava 18.0? It's listed in their gradle build file as a test dependency only

Copy link

@jpdaigle jpdaigle Mar 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • fix the bad javadoc
  • merge it shipit

@adrianocrestani adrianocrestani merged commit 2fdc7ea into tripadvisor:ES-2.3.5 Mar 1, 2017
@adrianocrestani adrianocrestani deleted the dict-sync-issue branch March 1, 2017 19:15
@adrianocrestani adrianocrestani mentioned this pull request Mar 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants