Made many rpc:call/4,5 calls safer if rex is down. #586
Conversation
Created a couple of utility functions to handle the wrapping. Most rpc:call/4,5 calls that were not wrapped in a try/catch have been changed to use the utility. Those that weren't changed appeared to want a crash due to context (such as explicitly matching only the success value).
I'm working on fixing the failing eunit tests, but you'll need to fix the dialyzer bits.
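For readers without the diff handy, the wrapper described above amounts to something like the following. This is a sketch, not the exact patch: the name safe_rpc is illustrative, and the catch clause mirrors the fragment quoted later in this review.

```erlang
%% Sketch of the wrapper utility (the name safe_rpc is illustrative).
%% rpc:call/4 is ultimately a gen_server call to the local rex process;
%% if rex is not running, that call exits with a noproc reason. The
%% wrapper converts that exit into a {badrpc, _} tuple, the shape
%% callers already check for.
safe_rpc(Node, Module, Function, Args) ->
    try rpc:call(Node, Module, Function, Args) of
        Result ->
            Result
    catch
        exit:{noproc, _NoProcDetails} ->
            {badrpc, rpc_process_down}
    end.
```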
So, this is a drive-by review, but are we sure we want to change all rpc calls? Idiomatic Erlang is against defensive programming, and maybe not all these places need to be defended from a crash. My main concern is overhead, and whether any of these changed calls sit in the hot path. Are we adding a few percentage points of overhead for something that is a rare event?
For the most part, I only added the safer version where there was already a check for {badrpc, _Error} and no try/catch in place, so I'm putting these in places where people didn't expect the call to fail. Personally, I do think it will be a rare event that the rex process isn't running. It's more likely that a gen process the function was passed to via rpc will be down (and even then, it seems very unlikely). The changes here simply make what happens in the first case the same as what happens in the second case.

I tried to find how expensive a try/catch is compared to a plain case statement, and the closest I could find was http://erlang.org/pipermail/erlang-questions/2013-November/075928.html. The author implies that try/catch is as performant as case (or close enough that it didn't much matter), and that the real killer was getting a stack trace.

So, to loop around to your question, @rzezeski: I don't think it's going to be a few percentage points, but fractions of a percent. Still, are those fractions of a percent a reasonable trade-off for code to work as expected in the circumstances where someone thought to check for {badrpc, _} but didn't expect different behavior if rex itself is down? Dunno at the moment.
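Concretely, the call sites being changed already looked something like the following, and the {badrpc, _} check itself stays the same; the only behavioral change is that a down rex now produces a badrpc tuple instead of an exit. Mod/Fun/Args and safe_rpc are placeholder names for illustration.

```erlang
%% A typical call site after the change (placeholder names).
case safe_rpc(Node, Mod, Fun, Args) of
    {badrpc, Reason} ->
        %% Previously reached only for remote-side failures; now
        %% also reached when the local rex process is down.
        {error, Reason};
    Result ->
        Result
end
```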
@lordnull I just wanted to make sure we put some thought into this, and you certainly have. With that I will stop my back-seat code reviewing.
Grabbing review.
How is this looking for merging? Replication has a dependency on this PR.
        Result
    catch
        'EXIT':{noproc, _NoProcDetails} ->
            {badrpc, rpc_process_down}
Is it worth considering a lager warning here? This is a really weird situation (rex being down), and maybe one we should be alerted to.
Sorry, backseat reviewer again. There should be a crash report if rex goes down. Given that the point of this function is to convert an exception into an error tuple, shouldn't it be up to the caller to decide whether a log message is needed? The caller may not care that rex is down and may choose to ignore this situation.
Pretty much my thought as well; if the caller cares whether rex is down, the function returns enough information to act on it.
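A hypothetical caller that does want to be alerted can match the specific reason and log it itself, rather than having the wrapper log unconditionally. This sketch assumes the lager logging library and uses the same placeholder names as above.

```erlang
%% Hypothetical caller that wants a log when rex itself is down.
case safe_rpc(Node, Mod, Fun, Args) of
    {badrpc, rpc_process_down} ->
        %% rex is not running on this node; worth shouting about here.
        lager:warning("rex not running on ~p", [Node]),
        {error, rpc_process_down};
    Other ->
        Other
end
```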
OK, fair enough. That's fine, but I'll just note here for posterity that the one time we've seen rex crash, we never saw a log, because the FS on that node went read-only. But maybe enough other things will go wrong to alert the operator to that fact.
Just a few minor comments, but this is looking good to me otherwise.
@lordnull Once you address the comments, we should be good to merge.
Sounds like the consensus is to skip the lager messages. I think we're good to go here.
Of course eunit failed. Let me see if I can recreate...
Forced a rebuild and eunit fails on a different test. Passes on my machine (both raw branch and rebased). Merge anyway?
@lordnull Yeah, we'll likely merge anyway. Can you point me at the failure, though, so I can make sure an issue is filed about it?
Previous failure: http://buildbot.bos1/builders/test-riak_core/builds/990 After a forced rebuild: http://buildbot.bos1/builders/test-riak_core/builds/1010
Made many rpc:call/4,5 calls safer if rex is down.
Created a couple of utility functions to handle the wrapping. Most
rpc:call/4,5 calls that were not wrapped in a try/catch have been changed
to use the utility. Those that weren't changed appeared to want a crash
due to context (such as explicitly matching only the success value).
There are some dialyzer issues that were introduced; however, I'm not able to see how they relate to the changes.