-
Notifications
You must be signed in to change notification settings - Fork 378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
restbed unhandled exception #125
Comments
This issue can be reproduced with the following test program. void post_method_handler( const shared_ptr< Session > session )
{
const auto request = session->get_request( );
size_t content_length = request->get_header( "Content-Length", 0 );
session->fetch( content_length, [ request ]( const shared_ptr< Session > session, const Bytes & body )
{
fprintf( stdout, "%.*s\n", ( int ) body.size( ), body.data( ) );
session->sleep_for( std::chrono::milliseconds(5), []( const shared_ptr< Session > session )
{
session->yield( OK, "Hello, World!", { { "Content-Length", "13" }, { "Connection", "close" } } );
}
);
} );
}
int main( const int, const char** )
{
auto resource = make_shared< Resource >( );
resource->set_path( "/resource" );
resource->set_method_handler( "POST", post_method_handler );
auto settings = make_shared< Settings >( );
settings->set_port( 8010 );
settings->set_worker_limit( 10 );
Service service;
service.publish( resource );
service.start( settings );
return EXIT_SUCCESS;
}
Sending a load of requests into the test program causes it to coredump very quickly. Note that sometimes I see other crashes besides an unhandled exception. It may be a multi-threading issue since I don't see it if I set worker_limit to 1. |
Actually, this does not appear to be related to the number of worker threads. Even with worker_limit set to 1, the same issue occurs. It just takes longer. |
I'll look at wrapping the connection_timeout_handler in a try-catch block and logging the error when it occurs. This will fix the issue. Give me a couple of hours to confirm the fix. |
How are you interacting with the above example? curl, wget, Restbed::Http? |
From a precursory review of the stack-trace it appears the socket is timing out, calling close and ASIO is throwing an exception on a bad socket. void SocketImpl::connection_timeout_handler( const error_code& error )
{
if ( error or m_timer->expires_at( ) > steady_clock::now( ) )
{
return;
}
close( ); //this method calls m_socket->close.
} My hunch is the socket is killed by the client, resulting in 'bad descriptor'. The session times out and calls close on an invalid descriptor resulting in the system_error. |
FYI. I also ran with valgrind and I see the following error before crashing. There are other similar errors that follow.
|
In terms of generating load into the example. I use the following simple node script. E.g. var http = require('http');
var postData = "testing 123";
var requestCount = 0;
if(process.argv.length < 4)
{
console.log('ERROR: Missing hostname and TPS!');
process.exit(1);
}
var iterations = 1;
var tps = process.argv[3];
if(process.argv.length > 4)
{
iterations = process.argv[4];
}
console.log(`connecting to ${process.argv[2]} at ${tps} TPS with ${iterations} iterations`);
var options = {
host: process.argv[2],
path: '/resource',
port: '8010',
method: 'POST',
headers: {
'Content-Type' : 'text/plain',
'Content-Length' : Buffer.byteLength(postData)
}
};
callback = function(response) {
if(response.statusCode != 200)
console.log(`STATUS: ${response.statusCode}`);
}
setInterval(function() {
for(var i = 0; i < iterations; i++)
{
var req = http.request(options, callback);
req.write(postData);
req.end();
requestCount++;
}
}, 1000 / tps);
setInterval(function() {
console.log(`TPS: ${requestCount}`);
requestCount = 0;
}, 1000); |
Thanks. |
for(var i = 0; i < iterations; i++)
{
var req = http.request(options, callback);
req.write(postData);
req.end();
requestCount++;
} As stated previously. I believe each request (client) has a restricted scope and is being reclaimed after the end of the for-loop/req.end. This causes the client to close the connection and once the timeout is fired it creates the bad socket operation on the server. Curious as to why ASIO would throw an error on socket->close and not just report success. Either way poor handling on our end. |
I pushed fix for this issue, further validation required. |
@callum-kirby please provide a regression test. |
@greenday9 We've managed to replicate this issue and resolve it via @callum-kirby's fix. Please can you confirm on your end. |
It runs longer now. But I've hit a coredump. I'm running the same example program with the same load injection script.
valgrind also reports a memory issue as well during runs. It appears in some cases SocketImpl is deleted (via Session::yield) and then the connect_timeout_handler callback fires on the the deleted SocketImpl.
|
Attempting to replicate. A few notes: a) Please make sure you do indeed close the session on your end. session->sleep_for( std::chrono::milliseconds(5), []( const shared_ptr< Session > session )
{
//session->yield( OK, "Hello, World!", { { "Content-Length", "13" }, { "Connection", "close" } } );
session->close( OK, "Hello, World!", { { "Content-Length", "13" }, { "Connection", "close" } } );
} ); b) Do you in fact have 10 cores available where the machine is running? Here is the suggested approach to setting the worker limit. settings->set_worker_limit( thread::hardware_concurrency( ) ); |
Yep confirmed on Ubuntu 10x threads on 4 core system; see what I can do. |
a) How does close and yield differ? My assumption is that yield() leaves the connection open which is what we want in the case the client sends follow on requests. It is artificial in the test program (particularly since we return the "Connection: close" header). b) No, we don't have 10 cores. This should be optimized for the number of cores/etc. However, the test program is simply replicating the issue. Thanks for the suggestion on using thread::hardware_concurrency(). |
@greenday9 You are correct. The yield with 'Connection: close' caught my eye and decided to mention it. Attempting to resolve timer issue now. |
Ok moving the SocketImpl::connection_timeout_handler to a static method resolves this issue. Still attempting to locate how this situation arises in the first place. |
@greenday9 Not seeing a service crash anymore, please rerun your stress suite. |
@ben-crowhurst The situation arises as follows, given presence of worker threads:
I've confirmed the above scenario through painful testing with lots of debug messages with thread IDs :) Just catching and silencing the exception that you pushed is definitely not a proper fix for this. This also explains why making connection_timeout_handler() static "fixed" the issue -- it was no longer operating on an already-destroyed object. I tried fixing it by deriving SocketImpl from std::enable_shared_from_this and then in connection_timeout_handler() doing the following: auto socket = shared_from_this(); That took care of the socket not getting destroyed (because the shared_ptr had its refcount upped) asynchronously. However, that is still not a proper fix because a certain race condition remains: what if there are multiple worker threads and data becomes available for asio::async_read() at the very same precise moment that connection_timeout_handler() is called? The result will be that connection_timeout_handler() will close the socket that the asio::async_read() completion handler has already begun getting data from which is a nasty race condition. Therefore the proper fix is to never allow async_read() / async_write() completion handlers run concurrently with connection_timeout_handler() by using an asio::io_service::strand. I'll do a PR momentarily with that fix implemented. |
My apologies – I am still quite new to Git. Will do a proper PR shortly. |
It is definitely much more stable, but there is still an issue. If I kill the load injection tool while the test program is heavily loaded, I see the unhandled exception or coredump. This is easy to reproduce especially when the program is running with valgrind. See the following valgrind trace:
|
OK so now it's the other way around. I've done more debugging and the reason for this crash is this:
I have been able to reproduce case 3B quite a few times (not 100% reliably though due to its async nature) and it resulted in an identical crash every time. The fix is to bring back your static connection_timeout_handler() and shared_from_this magic. I tried it and it does work fine, no more crashes after substantial abuse. Note that the strand fix is still needed because the original race scenario is still valid. I'll submit a PR shortly. |
Looks good with the latest change. I haven't been able to reproduce the unhandled exception or coredump. |
Yay! :) |
Great news. I'm looking into a regression test so as to avoid this being reintroduced during the next refactor. |
Hi @ben-crowhurst + others on this thread, I believe I hit this issue earlier today using restbed 4.5 when returning a c. 9MB file to a web browser. I was running the service single threaded, and it appeared that a second request was coming in from the client while the data resulting from the first request was being transmitted. (Historically I've run on Linux and not seen this issue; today I was running on Windows. The error code the socket was closed following data transmission was "invalid file descriptor".) It seems the solution to this issue was to wrap the timeout handler in an asio strand? My question is why was this only done for the call on lines 290 + 613 and not the calls on lines 415 + 464? I added the calls into my codebase and it resolved my issue. Thanks |
We’ve run into an issue with restbed coredumping due to an unhandled exception.
This was encountered with an application compiled on AWS Linux with gcc 5.4.0
The following error message is generated to stderr when the unhandled exception is encountered.
This is the stack backtrace from the core file which is not too helpful.
Using gdb, I tracked down where std::system_error is being thrown from asio. See below:
The text was updated successfully, but these errors were encountered: