Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of select() in lib/http/HttpClient_Curl.hpp causes crashes when >1024 filehandles in use #1060

Open
peacenix opened this issue Sep 28, 2022 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@peacenix
Copy link

In lib/http/HttpClient_Curl.hpp, WaitOnSocket() uses select(), which is limited to monitoring file descriptors less than 1024. On a system that creates a large number of file handles (e.g. for client sockets), this limit will be exceeded and cause a crash in WaitOnSocket(). While I am using the library on FreeBSD, this problem would also apply to other platforms, namely Linux.

The Linux man page for select() describes this limitation and suggests using poll or epoll instead:

WARNING: select() can monitor only file descriptors numbers that
are less than FD_SETSIZE (1024)—an unreasonably low limit for
many modern applications—and this limitation will not change.
All modern applications should instead use poll(2) or epoll(7),
which do not suffer this limitation.

The crash is also reported for CentOS in the Open Telemetry C++ SDK which shares the same code-base for the curl http-client library with 1ds-cpp:
Issue: open-telemetry/opentelemetry-cpp#1220
PR for fix: open-telemetry/opentelemetry-cpp#1410

My colleague created the following simple patch to use poll() instead of select() which we apply in our FreeBSD port build to prevent the crash:

--- lib/http/HttpClient_Curl.hpp.orig	2022-09-28 09:41:36.000000000 -0400
+++ lib/http/HttpClient_Curl.hpp	2022-09-28 09:46:46.000000000 -0400
@@ -10,6 +10,7 @@
 #include <cstdio>
 #include <cstdlib>
 #include <cstdint>
+#include <poll.h>
 #include <string.h>
 #include <regex>

@@ -440,27 +441,20 @@
      */
     static int WaitOnSocket(curl_socket_t sockfd, int for_recv, long timeout_ms)
     {
-        struct timeval tv;
-        fd_set infd, outfd, errfd;
         int res;
+        struct pollfd pfd;

-        tv.tv_sec = timeout_ms / 1000;
-        tv.tv_usec = (timeout_ms % 1000) * 1000;
+        pfd.fd = sockfd;
+        pfd.revents = 0;

-        FD_ZERO(&infd);
-        FD_ZERO(&outfd);
-        FD_ZERO(&errfd);
-
-        FD_SET(sockfd, &errfd); /* always check for error */

         if(for_recv) {
-            FD_SET(sockfd, &infd);
+            pfd.events = POLLIN | POLLERR;
         } else {
-            FD_SET(sockfd, &outfd);
+            pfd.events = POLLOUT | POLLERR;
         }

-        /* select() returns the number of signalled sockets or -1 */
-        res = select((int)sockfd + 1, &infd, &outfd, &errfd, &tv);
+        res = poll(&pfd, 1, timeout_ms);
         return res;
     }
@peacenix peacenix added the bug Something isn't working label Sep 28, 2022
@maxgolov
Copy link
Contributor

@peacenix @lalitb

There was a limit on the total number of concurrent simultaneous HTTPS connections that SDK makes to a single HTTP server in production. This is configured here . I wonder if it's happening in TEST or in production, and if there's a bug in a higher-level abstraction that doesn't properly limit the number of concurrent connections.

While the patch here is valid, I think we need to double-check why the higher-level code allows for more than 4 pending requests. This code is supposed to limit it:

if (uploadCount() >= static_cast<uint32_t>(m_config[CFG_INT_MAX_PENDING_REQ]) )

@lalitb
Copy link
Contributor

lalitb commented Sep 28, 2022

This is the patch we applied in opentelemetry-cpp to fix the issue for both windows and Linux (poll is posix call, and not supported in windows).

https://github.com/open-telemetry/opentelemetry-cpp/pull/1410/files

But I was not aware of max concurrent connections configuration which @maxgolov mentioned. Need to debug this separately if that too is an issue.

@lalitb
Copy link
Contributor

lalitb commented Sep 28, 2022

Oh I noticed @peacenix has already shared the opentelemetry-cpp PR. Please ignore my comment on that :)

btw., opentelemetry-cpp now uses curl_multi_poll to poll for multiple handles simultaneously.

@maxgolov
Copy link
Contributor

maxgolov commented Sep 28, 2022

I think the fix makes sense in OpenTelemetry. But it's not exactly clear why we don't get the higher-level TPM limit to apply in 1DS. In 1DS C++ SDK the limit should've prevented the bug from happening. Maybe there's something in the HTTP curl client wrapper that doesn't clean and leaking the descriptors after the request is done? i.e. uploadCount() returns 0. But the actual number of fds remains high and is incremented over time.

@lalitb lalitb self-assigned this Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants