
1.4.2 turner #97

Closed
wants to merge 10 commits into from

Conversation

matthewvon
Contributor

  1. fix race condition in close of async recovery log
  2. address performance problems (write speed and level-0 compaction) with big objects

Note: the spin locks are taken from an earlier PR already approved for Riak 2.0, and hot_threads is taken from the eleveldb async threads of Riak 1.3.

@kellymclaughlin

I see a test failure running make test on eleveldb with 1.4.2-turner checked out for eleveldb and the leveldb dependency; however, I see the same failure on master so I am assuming it is unrelated to these changes.

cacheleak: cacheleak_test_ (module 'cacheleak')...RSS1: 304908
RSS1: 353980
*failed*
in function cacheleak:'-cacheleak_loop/3-fun-2-'/2 (test/cacheleak.erl, line 66)
in call from cacheleak:cacheleak_loop/3 (test/cacheleak.erl, line 66)
**error:{assertion_failed,[{module,cacheleak},
                   {line,66},
                   {expression,"MaxFinalRSS > RSS"},
                   {expected,true},
                   {value,false}]}


module 'basho_bench_driver_eldb'
=======================================================
  Failed: 1.  Skipped: 0.  Passed: 16.

@matthewvon
Contributor Author

cache leak test fails 3 out of 4 times ... minimum. Jon has previously said that we should likely remove it. But whenever I get ready to do that ... it seems to work again.

@kellymclaughlin

I did a Riak CS riak_test run using this branch for a functional sanity check and all tests passed.

@kellymclaughlin

👍 on the PosixMmapFile changes

for (size_t i = 0; i < filenames.size(); i++) {
KeepOrDelete(filenames[i], level, live);
KeepOrDelete(filenames[i], -1, live);
Contributor

What's up with the -1 here?

Contributor Author

The GitHub differencing logic is piecing together pieces from high and low in the file. The real change is that I inserted "if (RunningCompactionCount()<2)", then indented everything else in the function by one tab. Old (red) line 248 is now new (green) line 256. Move 7 lines down in red, then in green ... both have the same "KeepOrDelete(filenames[i], -1, live)". Move 19 lines down from the starting point and you will see "KeepOrDelete(filenames[i], level, live);" in both.
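
(For readers tracing the diff, a minimal sketch of the shape of the change described above. The struct, member declarations, and DeleteObsoleteFilesSketch name are illustrative stand-ins, not the actual leveldb source; only the RunningCompactionCount() guard is what the comment describes adding.)

    // Sketch only: everything here except the new guard is a hypothetical stand-in.
    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <string>
    #include <vector>

    struct DBImplSketch {
      std::vector<std::string> filenames;
      std::set<uint64_t> live;                    // placeholder for the live-file set

      int RunningCompactionCount() { return 1; }  // stub
      void KeepOrDelete(const std::string&, int /*level*/, const std::set<uint64_t>&) {}

      void DeleteObsoleteFilesSketch(int level) {
        // The described change: wrap the existing body in this guard and indent it
        // one level; the KeepOrDelete calls themselves are unchanged.
        if (RunningCompactionCount() < 2)
        {
          for (size_t i = 0; i < filenames.size(); i++)
            KeepOrDelete(filenames[i], level, live);

          // ... later in the same function, a second pass uses level == -1 ...
          for (size_t i = 0; i < filenames.size(); i++)
            KeepOrDelete(filenames[i], -1, live);
        }   // if
      }
    };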

@jtuple
Contributor

jtuple commented Oct 11, 2013

FYI, Jon has asked me to look over this PR as well. Just finished reading the description in the wiki, digging into the code now. Since Kelly/Andrew are already reviewing, I'm going to try to work through things quickly. Jon simply wanted someone who was familiar with the async eleveldb threading changes (e.g. me) to be in the loop.

@andrewjstone
Contributor

Thanks @jtuple.

After looking over this code I don't see any obvious errors. Given that my C++ is rusty and I don't quite understand all the logic here, I'm going to give this a 'not -1'. h/t @kellymclaughlin

++running_compactions_;
Log(options_.info_log, "Background compact start: %u", running_compactions_);

assert(IsCompactionScheduled());
Contributor

Minor note: this assertion should probably be changed to only check for bg_compaction_scheduled_, as the other condition that makes IsCompactionScheduled() true is a non-NULL imm, but imm compaction isn't handled by BackgroundCall anymore.

Contributor Author

no. Today, bg_compaction_scheduled_ is a flag for any compaction BUT imm compaction. This routine should only be called for a non-imm compaction AND it should only be called when bg_compaction_scheduled_ is set.

Contributor

I agree with you. This assertion isn't checking that case. It's checking IsCompactionScheduled(), which is true when bg_compaction_scheduled_ OR imm != NULL. If we somehow end up in this function when bg_compaction_scheduled_ = false but imm != NULL, the assertion should fire. Obviously, this doesn't affect things in production, but it seems helpful to have a proper assertion to help with future debugging in later code changes.

Contributor Author

Argh. I misread everything. The assertion needs to change.
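
(A minimal sketch of the change being agreed on here, using the member names from the quoted snippet; not a verbatim patch.)

    // Before: passes whenever bg_compaction_scheduled_ is set OR imm is non-NULL,
    // so it cannot catch the "imm only" case this routine should never see.
    assert(IsCompactionScheduled());

    // After: BackgroundCall no longer handles imm compactions, so assert only on
    // the flag this routine actually requires.
    assert(bg_compaction_scheduled_);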

@jtuple
Contributor

jtuple commented Oct 12, 2013

Compared atomics.h against the equivalent code in eleveldb. All looks good. I notice that there are two new functions in atomics.h that weren't in eleveldb: add_and_fetch and sub_and_fetch. Neither is called anywhere inside LevelDB currently, so I cannot claim to have tested them. But they look reasonable on visual inspection.
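
(For context, wrappers in this style are typically thin veneers over the GCC __sync builtins; below is a hedged sketch of what add_and_fetch / sub_and_fetch likely look like, not copied from atomics.h, which may differ in naming, signatures, and platform guards.)

    #include <stdint.h>

    // Sketch only: GCC/Clang __sync builtins; the real atomics.h may differ.
    template <typename ValueT>
    inline ValueT add_and_fetch(volatile ValueT * ptr, ValueT amount)
    {
        return __sync_add_and_fetch(ptr, amount);
    }

    template <typename ValueT>
    inline ValueT sub_and_fetch(volatile ValueT * ptr, ValueT amount)
    {
        return __sync_sub_and_fetch(ptr, amount);
    }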

@jtuple
Contributor

jtuple commented Oct 12, 2013

When comparing hot_threads.cc against eleveldb's thread pool, I noticed that we have a race condition in both that could lead to a deadlock. It's basically the issue we tried to address in basho/eleveldb#50

Consider the code in hot_threads.cc#L108-L114:

            // only wait if we are really sure no work pending
            if (0==m_Pool.m_WorkQueueAtomic)
            {
                // yes, thread going to wait. set available now.
                m_Available=1;
                m_Condition.Wait();
            }    // if

As well as the code in hot_threads.cc#L260-L268:

            // no waiting threads, put on backlog queue
            {
                SpinLock lock(&m_QueueLock);
                inc_and_fetch(&m_WorkQueueAtomic);
                m_WorkQueue.push_back(item);
            }

            // to address race condition, thread might be waiting now
            FindWaitingThread(NULL);

If the worker thread is preempted after performing the conditional check, but before setting m_Available to 1, you can end up with an interleaving such as:

/* thread 1: */   if (0==m_Pool.m_WorkQueueAtomic)
/* thread 2: */   inc_and_fetch(&m_WorkQueueAtomic);
/* thread 2: */   m_WorkQueue.push_back(item);
/* thread 2: */   FindWaitingThread(NULL);
/* thread 1: */   m_Available=1;
/* thread 1: */   m_Condition.Wait();

Thus, thread 2 queues work, tries to find an available thread, fails to find thread 1 and returns. Afterwards, thread 1 goes to sleep, never to be signaled.

I'm not sure how likely this is to ever occur. But I was able to reliably hit this condition by using a 1-thread pool and adding a small busy-wait before the m_Available=1 line. Relevant commit on the jdb-1.4.2-turner branch here.

To make things clear, I added additional logging in this commit, where thread events are logged to /tmp/thread.log.

A combined log of thread + LevelDB events for a deadlock scenario is here. As shown, we get into a state where we hit waiting 2 and stall forever, because the enqueued background imm task never runs. All LevelDB traffic stops, throughput goes to zero, etc.

All in all, I have no idea how likely this race is to matter in the real world, with >1 background thread and no artificial delay. But I think it is still a remote possibility. Thoughts?
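
(For discussion only: the textbook way to close a lost-wakeup window like this is to set the availability flag and re-check the queue under the same mutex the condition variable uses, so the producer's signal cannot land between the worker's check and its wait. A generic pthread-style sketch follows, reusing the names from the quoted snippets; this is not the hot_threads code and not necessarily the fix that ultimately lands.)

    // Worker side (sketch): publish availability and test the predicate while
    // holding the condition's mutex, then wait in a loop.
    pthread_mutex_lock(&m_CondMutex);
    m_Available = 1;
    while (0 == m_Pool.m_WorkQueueAtomic)
        pthread_cond_wait(&m_CondVar, &m_CondMutex);
    m_Available = 0;
    pthread_mutex_unlock(&m_CondMutex);

    // Producer side (sketch): enqueue, then signal under the same mutex, so a
    // worker caught between its check and its wait cannot miss the wakeup.
    pthread_mutex_lock(&m_CondMutex);
    inc_and_fetch(&m_WorkQueueAtomic);
    m_WorkQueue.push_back(item);
    pthread_cond_signal(&m_CondVar);
    pthread_mutex_unlock(&m_CondMutex);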

@matthewvon
Contributor Author

Joe comment #1: the two "new" functions were added as part of the mv-flexcache branch for Riak 2.0. I pulled the file from there. But you are correct that you might be the first person reviewing them.

Joe comment #2: I agree with your assessment. The chance of the stall in eleveldb today is very slim since there are 71 threads (a prime number larger than 64, which is a common vnode count). The thread counts will be much lower in leveldb's usage. This is a bug that needs to be addressed. I am comfortable with it going to this targeted, high-volume customer. But low-traffic sites are susceptible.

@jtuple
Contributor

jtuple commented Oct 12, 2013

Spent a lot of time pushing this branch both inside and outside of Riak. In my non-Riak testing, I used a micro-benchmark that tested having 10 to 128 concurrent open DBs with 10 to 400 concurrent Erlang processes per DB issuing reads/writes. This is much more concurrency than Riak normally can provide. Multiple runs over 2 hours along with different object sizes, read/write ratios, etc. all worked without a hitch. Wasn't testing performance -- was trying to hit any race conditions or other concurrency issues.

These runs alternated between starting with fresh DBs each time and reusing data created on a prior run. Likewise, I also ran 'kill -9' on the Erlang VM during several of these runs, such that the follow-up run was starting from data that needed recovery. Things appeared to work as intended. Looking at the LOGs shows recovery kicking in when needed on startup, shows both background imm writes and compactions occurring as necessary, etc.

I've carefully looked over the new hot_threads code, the new mmap ref counting logic in env_posix, and the atomics changes, and other than the race condition noted above, everything looks good. I also spent some time sketching out the old vs new approach to flushing imm tables, as well as manually traversing all the call paths, and it's clear to me that all paths that could previously flush imm tables cannot occur in the code as written today; only the new background imm flusher will do the work -- exactly as intended.

Once this is stabilized, I do think code cleanup is necessary, not only for "cleanup" purposes but also to prevent us from accidentally reintroducing code that may concurrently call CompactMemTable when it is no longer safe (e.g. accidentally calling PrioritizeWork in a future change).

Other than the noted race condition, there is nothing else I'm worried about in this code at this time. I agree that a >1 thread pool is unlikely to hit the case I noted. The hot_threads version of the deadlock is also easy to identify: a waiting 2 entry in the LOG that goes on forever, so it's easy to keep an eye out for. The related potential deadlock in the eleveldb thread pool is harder to identify, but we have been running that code in production for 9 months now, so there's no reason to hold things up in this case.

+1 for customer deployment with this branch. Agree with @matthewvon about fixing race condition for 2.0.

@jtuple jtuple closed this Oct 12, 2013
@jtuple jtuple reopened this Oct 12, 2013
@jtuple jtuple mentioned this pull request Oct 18, 2013
@matthewvon matthewvon closed this Dec 1, 2013
@matthewvon
Contributor Author

issue addressed in mv-hot-threads2
