Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear on how to use s2n_cleanup from multiple threads #649

Closed
64 opened this issue Nov 18, 2017 · 5 comments
Closed

Unclear on how to use s2n_cleanup from multiple threads #649

64 opened this issue Nov 18, 2017 · 5 comments

Comments

@64
Copy link

64 commented Nov 18, 2017

The usage guide says the following:

s2n_cleanup cleans up any internal resources used by s2n. This function should be called from each thread or process that is created subsequent to calling s2n_init when that thread or process is done calling other s2n functions.

I'm running a multithreaded web server, which does something a bit like this:

s2n_init
s2n_config_new
create threads
on each new thread run:
    do some work
    s2n_cleanup
s2n_config_free
s2n_cleanup

According to the docs, I think this should be correct. However, s2n_cleanup fails on every invocation except the first. I suppose this is due to this code - since it is using a global (i.e non-thread-local) variable, only the first invocation from the process succeeds. If I am correct then the usage guide is wrong.

This may be an XY problem, since I was trying to debug a ~500 byte memory leak (per thread) caused by something in s2n_rand, which led me to this. I was definitely calling all the relevant cleanup functions except I was only calling s2n_cleanup on the main thread until I noticed what it said in the usage guide. It's entirely possible I'm just using the library incorrectly, but if you want to take a look at my code you can find it here.

Thanks for the help.

@raycoll
Copy link
Contributor

raycoll commented Nov 20, 2017

Hello @64,

You are correct. As is, s2n_init/s2n_cleanup should be called once per process. Do you mind opening a PR to update the USAGE-GUIDE?

This may be an XY problem, since I was trying to debug a ~500 byte memory leak (per thread) caused by something in s2n_rand, which led me to this.

This is also interesting. Do you still have the leak after calling s2n_init/s2n_cleanup once? There were issues in the past where a fixed amount of global Openssl libcrypto memory for error handling was leaked.

@64
Copy link
Author

64 commented Nov 20, 2017

Will send a PR soon. I still have the leak; here is the valgrind output:

==18255== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18255== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==18255== Command: ./build/hh
==18255== 
==18255== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
==18255==    at 0x55CA5E6: epoll_pwait (in /usr/lib/libc-2.26.so)
==18255==    by 0x111BF6: server_listen (hh.c:359)
==18255==    by 0x111F92: main (hh.c:431)
==18255==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==18255== 
==18255== Thread 2:
==18255== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
==18255==    at 0x55CA5E6: epoll_pwait (in /usr/lib/libc-2.26.so)
==18255==    by 0x1113ED: worker_event_loop (hh.c:217)
==18255==    by 0x52BD089: start_thread (in /usr/lib/libpthread-2.26.so)
==18255==    by 0x55CA47E: clone (in /usr/lib/libc-2.26.so)
==18255==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==18255== 
==18255== 
==18255== HEAP SUMMARY:
==18255==     in use at exit: 2,592 bytes in 12 blocks
==18255==   total heap usage: 3,403 allocs, 3,391 frees, 696,600 bytes allocated
==18255== 
==18255== Thread 1:
==18255== 264 bytes in 1 blocks are indirectly lost in loss record 1 of 5
==18255==    at 0x4C2CE5F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18255==    by 0x4F93639: CRYPTO_zalloc (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4F84F5A: EVP_CipherInit_ex (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x1217EB: s2n_drbg_update.isra.1 (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x121C71: s2n_drbg_generate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11CB8D: s2n_get_private_random_data (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11CC34: s2n_openssl_compat_rand (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x4EEAC08: ??? (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4EEA99F: ??? (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4EDCABD: BN_BLINDING_create_param (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4FBC260: RSA_setup_blinding (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4FBE73B: ??? (in /usr/lib/libcrypto.so.1.1)
==18255== 
==18255== 528 bytes in 2 blocks are indirectly lost in loss record 2 of 5
==18255==    at 0x4C2CE5F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18255==    by 0x4F93639: CRYPTO_zalloc (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4F84F5A: EVP_CipherInit_ex (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x1217EB: s2n_drbg_update.isra.1 (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x121C71: s2n_drbg_generate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11CB8D: s2n_get_private_random_data (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11CC34: s2n_openssl_compat_rand (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x4EEAC08: ??? (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4EEA99F: ??? (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4F26E78: ??? (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x121E95: s2n_ecc_generate_own_key.isra.0 (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x1220E3: s2n_ecc_generate_ephemeral_key (in /home/matt/code/c/hh/build/hh)
==18255== 
==18255== 792 bytes in 3 blocks are indirectly lost in loss record 3 of 5
==18255==    at 0x4C2CE5F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18255==    by 0x4F93639: CRYPTO_zalloc (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x4F84F5A: EVP_CipherInit_ex (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x1217EB: s2n_drbg_update.isra.1 (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x121C71: s2n_drbg_generate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11CA6D: s2n_get_public_random_data (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11B77E: s2n_server_hello_send (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11724E: s2n_negotiate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x1107A6: do_negotiate (client.c:52)
==18255==    by 0x1109C4: client_on_data_received (client.c:105)
==18255==    by 0x1115E2: worker_event_loop (hh.c:238)
==18255==    by 0x52BD089: start_thread (in /usr/lib/libpthread-2.26.so)
==18255== 
==18255== 1,296 (504 direct, 792 indirect) bytes in 3 blocks are definitely lost in loss record 4 of 5
==18255==    at 0x4C2CE5F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18255==    by 0x4F93639: CRYPTO_zalloc (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x121A46: s2n_drbg_instantiate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11CAA1: s2n_get_public_random_data (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11B77E: s2n_server_hello_send (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11724E: s2n_negotiate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x1107A6: do_negotiate (client.c:52)
==18255==    by 0x1109C4: client_on_data_received (client.c:105)
==18255==    by 0x1115E2: worker_event_loop (hh.c:238)
==18255==    by 0x52BD089: start_thread (in /usr/lib/libpthread-2.26.so)
==18255==    by 0x55CA47E: clone (in /usr/lib/libc-2.26.so)
==18255== 
==18255== 1,296 (504 direct, 792 indirect) bytes in 3 blocks are definitely lost in loss record 5 of 5
==18255==    at 0x4C2CE5F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18255==    by 0x4F93639: CRYPTO_zalloc (in /usr/lib/libcrypto.so.1.1)
==18255==    by 0x121A46: s2n_drbg_instantiate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11CAB6: s2n_get_public_random_data (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11B77E: s2n_server_hello_send (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x11724E: s2n_negotiate (in /home/matt/code/c/hh/build/hh)
==18255==    by 0x1107A6: do_negotiate (client.c:52)
==18255==    by 0x1109C4: client_on_data_received (client.c:105)
==18255==    by 0x1115E2: worker_event_loop (hh.c:238)
==18255==    by 0x52BD089: start_thread (in /usr/lib/libpthread-2.26.so)
==18255==    by 0x55CA47E: clone (in /usr/lib/libc-2.26.so)
==18255== 
==18255== LEAK SUMMARY:
==18255==    definitely lost: 1,008 bytes in 6 blocks
==18255==    indirectly lost: 1,584 bytes in 6 blocks
==18255==      possibly lost: 0 bytes in 0 blocks
==18255==    still reachable: 0 bytes in 0 blocks
==18255==         suppressed: 0 bytes in 0 blocks
==18255== 
==18255== For counts of detected and suppressed errors, rerun with: -v
==18255== ERROR SUMMARY: 21 errors from 4 contexts (suppressed: 0 from 0)

You can safely ignore the epoll stuff; it's because valgrind isn't clever enough to detect that I'm not using that parameter of the syscall.

This is the output after opening 3 sockets (one per worker thread) - each socket gets queued onto a different thread for handling (the only calls made to s2n on each worker thread are s2n_connection_* and s2n_negotiate - I don't have to call send/recv in order to trigger a leak). If I open another socket then no additional memory is lost (since the new socket gets scheduled onto the first thread again, which has already allocated).

BTW, I'm using a very slightly patched version of s2n - I suspect this is the cause of the issue. Here is the diff:

diff --git a/tls/s2n_cipher_preferences.c b/tls/s2n_cipher_preferences.c
index 14649e2..97d1f0d 100644
--- a/tls/s2n_cipher_preferences.c
+++ b/tls/s2n_cipher_preferences.c
@@ -21,6 +21,18 @@
 
 #include "error/s2n_errno.h"
 
+struct s2n_cipher_suite *cipher_suites_h2[] = {
+    &s2n_ecdhe_rsa_with_aes_128_gcm_sha256,
+    &s2n_ecdhe_rsa_with_aes_256_gcm_sha384,
+    &s2n_ecdhe_rsa_with_chacha20_poly1305_sha256
+};
+
+const struct s2n_cipher_preferences cipher_preferences_h2 = {
+    .count = sizeof(cipher_suites_h2) / sizeof(cipher_suites_h2[0]),
+    .suites = cipher_suites_h2,
+    .minimum_protocol_version = S2N_TLS12
+};
+
 /* s2n's list of cipher suites, in order of preference, as of 2014-06-01 */
 struct s2n_cipher_suite *cipher_suites_20140601[] = {
     &s2n_dhe_rsa_with_aes_128_cbc_sha256,
@@ -260,6 +272,7 @@ struct {
 } selection[] = {
     { "default", &cipher_preferences_20170210 },
     { "default_fips", &cipher_preferences_20170405},
+    { "h2", &cipher_preferences_h2 },
     { "20140601", &cipher_preferences_20140601 },
     { "20141001", &cipher_preferences_20141001 },
     { "20150202", &cipher_preferences_20150202 },
diff --git a/tls/s2n_cipher_preferences.h b/tls/s2n_cipher_preferences.h
index b66b797..4d6b80d 100644
--- a/tls/s2n_cipher_preferences.h
+++ b/tls/s2n_cipher_preferences.h
@@ -25,6 +25,7 @@ struct s2n_cipher_preferences {
     int minimum_protocol_version;
 };
 
+extern const struct s2n_cipher_preferences cipher_preferences_h2;
 extern const struct s2n_cipher_preferences cipher_preferences_20140601;
 extern const struct s2n_cipher_preferences cipher_preferences_20141001;
 extern const struct s2n_cipher_preferences cipher_preferences_20150202;

OpenSSL / libcrypto version:

OpenSSL 1.1.0g  2 Nov 2017

Hope this helps.

EDIT: Interesting to note that valgrind reports all the leaks as occuring on thread 1, despite the evidence that it's calling s2n functions from threads for the first time that causes the leaks to occur. That information might prove useful.

@raycoll
Copy link
Contributor

raycoll commented Nov 20, 2017

My apologies @64 ! My previous advice was incorrect. There is per-thread DRBG state: https://github.com/awslabs/s2n/blob/master/utils/s2n_random.c#L60 . This state indeed needs to be cleaned up by every thread that uses s2n(via s2n_cleanup). I think this explains what you're seeing in your Valgrind output.

Back to your original issue s2n_cleanup fails on every invocation except the first.I think this is a bug in s2n. We should allow multiple threads to call s2n_cleanup. At first glance we're probably failing when we try to close the global entropy_fd multiple times: https://github.com/awslabs/s2n/blob/master/utils/s2n_random.c#L279

@64
Copy link
Author

64 commented Nov 21, 2017

Yeah, that's the code I linked to in my original post. I ran it through before in GDB and that's where it seemed to be going wrong.

I'm not sure I can help out too much since I'm not familiar with the internals but would making that variable thread local be sufficient?

(side note, do you want me to keep this issue open? or should I open a new one for the bug, to be clear)

@raycoll
Copy link
Contributor

raycoll commented Nov 21, 2017

Makes sense. I will work on a patch and I think we can continue to use this issue for tracking. In the meantime, the leak should be benign because it happens once per thread but does not increase for the lifetime of the thread/process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants