
1.4.2 turner #97

Closed
wants to merge 10 commits into from

Conversation

matthewvon
Contributor

  1. fix race condition in close of async recovery log
  2. address performance problems (write speed and level-0 compaction) with big objects

Note: the spin locks are taken from an earlier PR already approved for Riak 2.0, and hot_threads is taken from the eleveldb async threads of Riak 1.3.

@kellymclaughlin

I see a test failure running make test on eleveldb with 1.4.2-turner checked out for eleveldb and the leveldb dependency; however, I see the same failure on master so I am assuming it is unrelated to these changes.

cacheleak: cacheleak_test_ (module 'cacheleak')...RSS1: 304908
RSS1: 353980
*failed*
in function cacheleak:'-cacheleak_loop/3-fun-2-'/2 (test/cacheleak.erl, line 66)
in call from cacheleak:cacheleak_loop/3 (test/cacheleak.erl, line 66)
**error:{assertion_failed,[{module,cacheleak},
                   {line,66},
                   {expression,"MaxFinalRSS > RSS"},
                   {expected,true},
                   {value,false}]}


module 'basho_bench_driver_eldb'
=======================================================
  Failed: 1.  Skipped: 0.  Passed: 16.

@matthewvon
Contributor Author

cache leak test fails 3 out of 4 times ... minimum. Jon has previously said that we should likely remove it. But whenever I get ready to do that ... it seems to work again.

@kellymclaughlin

I did a Riak CS riak_test run using this branch for a functional sanity check and all tests passed.

@kellymclaughlin

👍 on the PosixMmapFile changes

for (size_t i = 0; i < filenames.size(); i++) {
KeepOrDelete(filenames[i], level, live);
KeepOrDelete(filenames[i], -1, live);
Contributor

What's up with the -1 here?

Contributor Author

The GitHub differencing logic is piecing together pieces from high and low in the file. The real change is that I inserted "if (RunningCompactionCount()<2)", then indented everything else in the function by one tab. Old (red) line 248 is now new (green) line 256. Move 7 lines down in red, then in green ... both have the same "KeepOrDelete(filenames[i], -1, live)". Move 19 lines down from the starting point and you will see "KeepOrDelete(filenames[i], level, live);" in both.
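
(For readers tracing the diff, a minimal sketch of the shape of the change described above. The struct, member declarations, and DeleteObsoleteFilesSketch name are illustrative stand-ins, not the actual leveldb source; only the RunningCompactionCount() guard is what the comment describes adding.)

    // Sketch only: everything here except the new guard is a hypothetical stand-in.
    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <string>
    #include <vector>

    struct DBImplSketch {
      std::vector<std::string> filenames;
      std::set<uint64_t> live;                    // placeholder for the live-file set

      int RunningCompactionCount() { return 1; }  // stub
      void KeepOrDelete(const std::string&, int /*level*/, const std::set<uint64_t>&) {}

      void DeleteObsoleteFilesSketch(int level) {
        // The described change: wrap the existing body in this guard and indent it
        // one level; the KeepOrDelete calls themselves are unchanged.
        if (RunningCompactionCount() < 2)
        {
          for (size_t i = 0; i < filenames.size(); i++)
            KeepOrDelete(filenames[i], level, live);

          // ... later in the same function, a second pass uses level == -1 ...
          for (size_t i = 0; i < filenames.size(); i++)
            KeepOrDelete(filenames[i], -1, live);
        }   // if
      }
    };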

@jtuple
Contributor

jtuple commented Oct 11, 2013

FYI, Jon has asked me to look over this PR as well. Just finished reading the description in the wiki, digging into the code now. Since Kelly/Andrew are already reviewing, I'm going to try to work through things quickly. Jon simply wanted someone who was familiar with the async eleveldb threading changes (e.g. me) to be in the loop.

@andrewjstone
Contributor

Thanks @jtuple.

After looking over this code I don't see any obvious errors. Given that my C++ is rusty and I don't quite understand all the logic here, I'm going to give this a 'not -1'. h/t @kellymclaughlin

++running_compactions_;
Log(options_.info_log, "Background compact start: %u", running_compactions_);

assert(IsCompactionScheduled());
Contributor

Minor note: this assertion should probably be changed to only check for bg_compaction_scheduled_, as the other condition that makes IsCompactionScheduled() true is a non-NULL imm, but imm compaction isn't handled by BackgroundCall anymore.

Contributor Author

no. Today, bg_compaction_scheduled_ is a flag for any compaction BUT imm compaction. This routine should only be called for a non-imm compaction AND it should only be called when bg_compaction_scheduled_ is set.

Contributor

I agree with you. This assertion isn't checking that case. It's checking IsCompactionScheduled(), which is true when bg_compaction_scheduled_ OR imm != NULL. If we somehow end up in this function when bg_compaction_scheduled_ = false but imm != NULL, the assertion should fire. Obviously, this doesn't affect things in production, but it seems helpful to have a proper assertion to help with future debugging in later code changes.

Contributor Author

Argh. I misread everything. The assertion needs to change.
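
(A minimal sketch of the change being agreed on here, using the member names from the quoted snippet; not a verbatim patch.)

    // Before: passes whenever bg_compaction_scheduled_ is set OR imm is non-NULL,
    // so it cannot catch the "imm only" case this routine should never see.
    assert(IsCompactionScheduled());

    // After: BackgroundCall no longer handles imm compactions, so assert only on
    // the flag this routine actually requires.
    assert(bg_compaction_scheduled_);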

@jtuple
Contributor

jtuple commented Oct 12, 2013

Compared atomics.h against the equivalent code in eleveldb. All looks good. I notice that there are two new functions in atomics.h that weren't in eleveldb: add_and_fetch and sub_and_fetch. Neither is called anywhere inside LevelDB currently, so I cannot claim to have tested them. But they look reasonable on visual inspection.
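
(For context, wrappers in this style are typically thin veneers over the GCC __sync builtins; below is a hedged sketch of what add_and_fetch / sub_and_fetch likely look like, not copied from atomics.h, which may differ in naming, signatures, and platform guards.)

    #include <stdint.h>

    // Sketch only: GCC/Clang __sync builtins; the real atomics.h may differ.
    template <typename ValueT>
    inline ValueT add_and_fetch(volatile ValueT * ptr, ValueT amount)
    {
        return __sync_add_and_fetch(ptr, amount);
    }

    template <typename ValueT>
    inline ValueT sub_and_fetch(volatile ValueT * ptr, ValueT amount)
    {
        return __sync_sub_and_fetch(ptr, amount);
    }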

@jtuple
Contributor

jtuple commented Oct 12, 2013

When comparing hot_threads.cc against eleveldb's thread pool, I noticed that we have a race condition in both that could lead to a deadlock. It's basically the issue we tried to address in basho/eleveldb#50

Consider the code in hot_threads.cc#L108-L114:

            // only wait if we are really sure no work pending
            if (0==m_Pool.m_WorkQueueAtomic)
            {
                // yes, thread going to wait. set available now.
                m_Available=1;
                m_Condition.Wait();
            }    // if

As well as the code in hot_threads.cc#L260-L268:

            // no waiting threads, put on backlog queue
            {
                SpinLock lock(&m_QueueLock);
                inc_and_fetch(&m_WorkQueueAtomic);
                m_WorkQueue.push_back(item);
            }

            // to address race condition, thread might be waiting now
            FindWaitingThread(NULL);

If the worker thread is preempted after performing the conditional check, but before setting m_Available to 1, you can end up with an interleaving such as:

/* thread 1: */   if (0==m_Pool.m_WorkQueueAtomic)
/* thread 2: */   inc_and_fetch(&m_WorkQueueAtomic);
/* thread 2: */   m_WorkQueue.push_back(item);
/* thread 2: */   FindWaitingThread(NULL);
/* thread 1: */   m_Available=1;
/* thread 1: */   m_Condition.Wait();

Thus, thread 2 queues work, tries to find an available thread, fails to find thread 1 and returns. Afterwards, thread 1 goes to sleep, never to be signaled.

I'm not sure how likely this is to ever occur. But I was able to reliably hit this condition by using a 1-thread pool and adding a small busy-wait before the m_Available=1 line. Relevant commit on the jdb-1.4.2-turner branch here.

To make things clear, I added additional logging in this commit, where thread events are logged to /tmp/thread.log.

A combined log of thread + LevelDB events for a deadlock scenario is here. As shown, we get into a state where we hit waiting 2 and stall forever, because the enqueued background imm task never runs. All LevelDB traffic stops, throughput goes to zero, etc.

All in all, I have no idea how likely this race is to matter in the real world, with >1 background thread and no artificial delay. But I think it is still a remote possibility. Thoughts?
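
(For discussion only: the textbook way to close a lost-wakeup window like this is to set the availability flag and re-check the queue under the same mutex the condition variable uses, so the producer's signal cannot land between the worker's check and its wait. A generic pthread-style sketch follows, reusing the names from the quoted snippets; this is not the hot_threads code and not necessarily the fix that ultimately lands.)

    // Worker side (sketch): publish availability and test the predicate while
    // holding the condition's mutex, then wait in a loop.
    pthread_mutex_lock(&m_CondMutex);
    m_Available = 1;
    while (0 == m_Pool.m_WorkQueueAtomic)
        pthread_cond_wait(&m_CondVar, &m_CondMutex);
    m_Available = 0;
    pthread_mutex_unlock(&m_CondMutex);

    // Producer side (sketch): enqueue, then signal under the same mutex, so a
    // worker caught between its check and its wait cannot miss the wakeup.
    pthread_mutex_lock(&m_CondMutex);
    inc_and_fetch(&m_WorkQueueAtomic);
    m_WorkQueue.push_back(item);
    pthread_cond_signal(&m_CondVar);
    pthread_mutex_unlock(&m_CondMutex);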

@matthewvon
Contributor Author

Joe comment #1: the two "new" functions were added as part of the mv-flexcache branch for Riak 2.0. I pulled the file from there. But you are correct that you might be the first person reviewing them.

Joe comment #2: I agree with your assessment. The chance of the stall in eleveldb today is very slim since there are 71 threads (a prime number larger than 64, which is a common vnode count). The thread counts will be much lower in leveldb's usage. This is a bug that needs to be addressed. I am comfortable with it going to this targeted, high-volume customer. But low-traffic sites are susceptible.

@jtuple
Contributor

jtuple commented Oct 12, 2013

Spent a lot of time pushing this branch both inside and outside of Riak. In my non-Riak testing, I used a micro-benchmark that tested having 10 to 128 concurrent open DBs with 10 to 400 concurrent Erlang processes per DB issuing reads/writes. This is much more concurrency than Riak normally can provide. Multiple runs over 2 hours along with different object sizes, read/write ratios, etc. all worked without a hitch. Wasn't testing performance -- was trying to hit any race conditions or other concurrency issues.

These runs alternated between starting with fresh DBs each time and reusing data created on a prior run. Likewise, I also ran 'kill -9' on the Erlang VM during several of these runs, such that the follow-up run was starting from data that needed recovery. Things appeared to work as intended. Looking at the LOGs shows recovery kicking in when needed on startup, shows both background imm writes and compactions occurring as necessary, etc.

I've carefully looked over the new hot_threads code, the new mmap ref counting logic in env_posix, and the atomics changes, and other than the race condition noted above, everything looks good. I also spent some time sketching out the old vs new approach to flushing imm tables, as well as manually traversing all the call paths, and it's clear to me that all paths that could previously flush imm tables cannot occur in the code as written today; only the new background imm flusher will do the work -- exactly as intended.

Once this is stabilized, I do think code cleanup is necessary, not only for "cleanup" purposes but also to prevent us from accidentally reintroducing code that may concurrently call CompactMemTable when it is no longer safe (e.g. accidentally calling PrioritizeWork in a future change).

Other than the noted race condition, there is nothing else I'm worried about in this code at this time. I agree that a >1 thread pool is unlikely to hit the case I noted. The hot_threads version of the deadlock is also easy to identify: a waiting 2 entry in the LOG that goes on forever, so it's easy to keep an eye out for. The related potential deadlock in the eleveldb thread pool is harder to identify, but we have been running that code in production for 9 months now, so there's no reason to hold things up in this case.

+1 for customer deployment with this branch. Agree with @matthewvon about fixing race condition for 2.0.

@jtuple jtuple closed this Oct 12, 2013
@jtuple jtuple reopened this Oct 12, 2013
@jtuple jtuple mentioned this pull request Oct 18, 2013
@matthewvon matthewvon closed this Dec 1, 2013
@matthewvon
Contributor Author

issue addressed in mv-hot-threads2
