
[1.3.1] Large Object Handler may consume much memory and eventually cause OOM #570

Closed
windkit opened this issue Jan 10, 2017 · 11 comments

@windkit
Contributor

windkit commented Jan 10, 2017

Description

To solve issue #506, we added the "work pool" approach to limit memory usage in leo_gateway.
While preliminary testing showed it can cap memory usage, longer tests reveal ever-growing memory usage.

Observation

Use GNU parallel to launch 64 instances of s3cmd, each uploading a 256MB file:

$ parallel -P 64 "./s3cmd put /dev/shm/testfile_256m s3://test/{} &>> log/put{}.log" ::: $(seq 1 1024)

[screenshot: grafana (memory usage)]

Memory usage slowly increases over time; with a long test (e.g. 10240 uploads), all memory and swap are used up and leo_gateway is eventually killed by the OOM killer.

@windkit
Contributor Author

windkit commented Jan 10, 2017

Investigation

Log the memory usage and stack trace of every process in leo_gateway:

[{erlang:process_info(P, current_stacktrace), erlang:process_info(P, binary)} || P <- erlang:processes()].

Log File

use of rpc:async_call + rpc:nb_yield

(https://github.com/leo-project/leo_gateway/blob/develop/src/leo_gateway_rpc_handler.erl#L209)

 {{current_stacktrace,
      [{gen,do_call,4,[{file,"gen.erl"},{line,215}]},
       {gen_server,call,3,[{file,"gen_server.erl"},{line,186}]},
       {rpc,do_call,3,[{file,"rpc.erl"},{line,335}]},
       {rpc,'-async_call/4-fun-0-',5,[{file,"rpc.erl"},{line,647}]}]},
  {binary,[{140029729704280,11,2},{140020579020840,5242880,2}]}}

rpc:async_call is implemented as a gen_server:call with an infinity timeout, so the binary references are held indefinitely.
(https://github.com/erlang/otp/blob/OTP-17.5/lib/kernel/src/rpc.erl#L344.)

Resolution

Use rpc:call with a timeout instead to avoid the memory leak; we need to check the performance impact.
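
A minimal sketch of that change, replacing the async_call/nb_yield pair with a bounded rpc:call/5; the module name and the 30000 ms timeout are placeholders, not the actual leo_gateway values:

%% Sketch only: rpc:call/5 returns {badrpc, Reason} on failure or timeout,
%% so the caller is unblocked and the held binaries can be garbage-collected.
-module(rpc_invoke_sketch).
-export([invoke/4]).

-define(DEF_TIMEOUT, 30000). %% placeholder timeout (ms)

invoke(Node, Mod, Fun, Args) ->
    case rpc:call(Node, Mod, Fun, Args, ?DEF_TIMEOUT) of
        {badrpc, Reason} -> {error, Reason};
        Ret              -> Ret
    end.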

@windkit
Contributor Author

windkit commented Jan 10, 2017

use of gen_server:call

(https://github.com/leo-project/leo_gateway/blob/develop/src/leo_large_object_put_handler.erl#L296)

 {{current_stacktrace,
      [{rpc,do_yield,2,[{file,"rpc.erl"},{line,680}]},
       {leo_gateway_rpc_handler,invoke,5,
           [{file,"src/leo_gateway_rpc_handler.erl"},{line,213}]},
       {leo_large_object_worker,handle_call,3,
           [{file,"src/leo_large_object_worker.erl"},{line,69}]},
       {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,607}]},
       {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,639}]},
       {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]},
  {binary,
      [{140031103866032,11,1},
       {140009060397096,5242880,1},
       {140029742560208,11,2},
       {140008767758376,5242880,2},
       {140031105443640,10,1},
       {140023715803176,5242880,1},
       {140031105973232,10,1},
       {140012841623592,5242880,1},
       {140031103862120,11,1},
       {140024261275688,5242880,1},
       {140031104387704,11,1},
       {140027408928808,5242880,1},
       {140026968350760,5242880,1},
       {140029724165888,11,1},
       {140031108588688,11,1},
       {140010686324776,5242880,1},
       {140031105960440,11,1},
       {140011326205992,5242880,1},
       {140029784521384,11,1},
       {140009521950760,5242880,1},
       {140031107034256,11,1},
       {140024103927848,5242880,1},
       {140031107026464,11,1},
       {140016141234216,5242880,1},
       {140031107535504,11,1},
       {140010570936360,5243903,1},
       {140029821251968,11,1},
       {140027482357800,5242880,1}]}}

As leo_large_object_worker is guarded by leo_pod:checkout and leo_pod:checkin, it should only handle one request at a time, but the log shows this was not the case.
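
For reference, the expected guard looks roughly like the sketch below (assuming leo_pod's checkout/1 and checkin/2 API; the surrounding names are illustrative):

%% Sketch: a worker is checked out, used for exactly one request, and only
%% then checked back in. Names other than the leo_pod calls are illustrative.
handle_one_request(PodName, Request) ->
    {ok, Worker} = leo_pod:checkout(PodName),
    try
        gen_server:call(Worker, Request)
    after
        leo_pod:checkin(PodName, Worker)
    end.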

{current_stacktrace,[{rpc,do_call,3,
                          [{file,"rpc.erl"},{line,347}]},
                     {leo_gateway_rpc_handler,invoke,5,
                                              [{file,"src/leo_gateway_rpc_handler.erl"},{line,210}]},
                     {leo_large_object_worker,handle_call,3,
                                              [{file,"src/leo_large_object_worker.erl"},{line,69}]},
                     {gen_server,try_handle_call,4,
                                 [{file,"gen_server.erl"},{line,607}]},
                     {gen_server,handle_msg,5,
                                 [{file,"gen_server.erl"},{line,639}]},
                     {proc_lib,init_p_do_apply,3,
                               [{file,"proc_lib.erl"},{line,237}]}]},
{messages,[{'$gen_call',{<0.17462.0>,#Ref<0.0.1.72690>},
                        {put,{put_req_params,<<"test/129\n51">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,51,0,0,undefined}}},
           {'$gen_call',{<0.17681.0>,#Ref<0.0.1.77196>},
                        {put,{put_req_params,<<"test/151\n18">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,18,0,0,undefined}}},
           {'$gen_call',{<0.17898.0>,#Ref<0.0.1.81785>},
                        {put,{put_req_params,<<"test/125\n42">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,42,0,0,undefined}}},
           {'$gen_call',{<0.18088.0>,#Ref<0.0.1.86230>},
                        {put,{put_req_params,<<"test/134\n32">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,32,0,0,undefined}}},
           {'$gen_call',{<0.18279.0>,#Ref<0.0.1.90749>},
                        {put,{put_req_params,<<"test/144\n30">>,
                                             <<0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,30,0,0,undefined}}},
           {'$gen_call',{<0.18497.0>,#Ref<0.0.1.95480>},
                        {put,{put_req_params,<<"test/144\n37">>,
                                             <<0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,37,0,0,undefined}}},
           {'$gen_call',{<0.18695.0>,#Ref<0.0.1.100099>},
                        {put,{put_req_params,<<"test/178\n13">>,
                                             <<0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,13,0,0,undefined}}},
           {'$gen_call',{<0.18908.0>,#Ref<0.0.1.104740>},
                        {put,{put_req_params,<<"test/148\n32">>,
                                             <<0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,32,0,0,...}}},
           {'$gen_call',{<0.19104.0>,#Ref<0.0.1.109355>},
                        {put,{put_req_params,<<"test/143\n27">>,
                                             <<0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,27,0,...}}},
           {'$gen_call',{<0.19311.0>,#Ref<0.0.1.113998>},
                        {put,{put_req_params,<<"test/159\n22">>,
                                             <<0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,22,...}}},
           {'$gen_call',{<0.19505.0>,#Ref<0.0.1.118648>},
                        {put,{put_req_params,<<"test/144\n43">>,
                                             <<0,0,0,0,...>>,
                                             <<>>,0,5242880,0,...}}},
           {'$gen_call',{<0.19749.0>,#Ref<0.0.1.123305>},
                        {put,{put_req_params,<<"test/164\n18">>,
                                             <<0,0,0,...>>,
                                             <<>>,0,5242880,...}}}]},
{message_queue_len,12},
{binary,[{140681802943336,10,1},
         {140662102286376,5242880,1},
         {140681801375384,11,2},
         {140661777100840,5242880,2},
         {140681801373472,11,2},
         {140665049935912,5242880,2},
         {140680430618000,11,2},
         {140660948402216,5242880,2},
         {140680428762928,11,2},
         {140666298228776,5242880,2},
         {140681804257872,11,2},
         {140664483483688,5242880,2},
         {140681803469200,11,2},
         {140664231727144,5242880,2},
         {140680422444920,11,2},
         {140665574428712,5242880,2},
         {140681801374136,11,2},
         {140667439210536,5242880,2},
         {140681803487712,11,...},
         {140669816815656,...},
         {...}|...]}},

A further check reveals pending gen_server:call requests in the worker's message queue.
While the leo_pod:checkout and leo_pod:checkin calls on the worker are balanced, there are still pending tasks.

This is because gen_server:call uses a default timeout of 5000 ms. When the call times out, the worker is returned to the worker pool even though it is still busy, so further requests keep arriving and the worker's message queue gets jammed.
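
The jam can be confirmed from the remote shell by listing processes with a backed-up mailbox, in the same spirit as the one-liner above (sketch):

%% Sketch: a jammed leo_large_object_worker shows up with many pending
%% '$gen_call' messages. Processes that exit in between are skipped because
%% process_info/2 then returns 'undefined'.
[{P, Len} || P <- erlang:processes(),
             {message_queue_len, Len} <- [erlang:process_info(P, message_queue_len)],
             Len > 0].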

@windkit
Contributor Author

windkit commented Jan 10, 2017

WIP Fix

By switching to rpc:call and calling gen_server:call with an infinity timeout, memory usage is now under control.
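
A minimal sketch of the gen_server side of this change; the module and function names are illustrative, the real call site is in leo_large_object_put_handler (see the link above):

%% Sketch only: wait for the worker instead of giving up after the default
%% 5000 ms, so a busy worker is never returned to the pool while requests
%% are still queued behind it.
-module(loh_put_sketch).
-export([put_chunk/2]).

put_chunk(Worker, ReqParams) ->
    gen_server:call(Worker, {put, ReqParams}, infinity).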

Result

[screenshot: fix (memory usage)]

TODO

  • Performance Check
  • Long Running Test

@windkit
Contributor Author

windkit commented Jan 10, 2017

No Multipart, No Traffic Shaping

[screenshot: parallel_64_s3cmd_256m]

@windkit
Contributor Author

windkit commented Jan 11, 2017

No Multipart, GW Upload Limit (100MBps)

[screenshot: tc]

@windkit
Contributor Author

windkit commented Jan 11, 2017

Multipart, GW Upload Limit (100MBps)

[screenshot: mp_tc]

@windkit
Contributor Author

windkit commented Jan 11, 2017

Multipart, No Traffic Shaping

[screenshot: mp]

A detailed benchmark will be done and the results reported at https://github.com/leo-project/notes

@windkit
Contributor Author

windkit commented Jan 12, 2017

No Multipart, GW Upload Limit (100MBps) - Long Run

[screenshot: long (memory usage over 13 hours)]

Memory usage stays stable at around 2GB over the 13-hour run.

@windkit
Contributor Author

windkit commented Jan 19, 2017

Memory Leak from leo_pod overflow

[screenshot: overflow (memory usage)]

During a long benchmark with large objects, another memory issue was spotted. Memory usage slowly increases throughout the test and does not drop back after the test completes.

Analysis with erlang:process_info/1 shows many leftover leo_large_object_worker processes (>800), each holding a reference to a 5MB object:

[{current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {dictionary,[{'$ancestors',[pod_loh_worker,
                             pod_loh_worker_sup,leo_gateway_sup,<0.979.0>]},
              {'$initial_call',{leo_large_object_worker,init,1}}]},
 {binary,[{140229022581968,11,1},
          {140218954448936,5243903,1}]},
...
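
The leftover workers can be counted from the remote shell by matching the same '$initial_call' entry shown above (sketch):

%% Sketch: count processes whose '$initial_call' is leo_large_object_worker:init/1.
%% Processes that exit in between are skipped because process_info/2 then
%% returns 'undefined'.
length([P || P <- erlang:processes(),
             {dictionary, D} <- [erlang:process_info(P, dictionary)],
             lists:member({'$initial_call', {leo_large_object_worker, init, 1}}, D)]).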

Code Analysis

In leo_pod_manager:check_out, a new worker is spawned to handle overflow requests, but the process is not stopped after leo_pod_manager:check_in. Those "overflow" workers are therefore left in the system forever.
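
One way to address this would be to terminate an overflow worker on check-in once the pool is back at its configured size. The sketch below only illustrates the idea; the #state{} record and its fields are hypothetical and this is not the actual leo_pod patch:

%% Sketch only, not the leo_pod patch; #state{} and its fields are hypothetical.
%% An overflow worker is terminated on check-in instead of being kept forever.
-record(state, {worker_pids     = [] :: [pid()],
                num_of_children = 0  :: non_neg_integer()}).

check_in(Worker, #state{worker_pids = Workers, num_of_children = N} = State)
  when length(Workers) >= N ->
    erlang:exit(Worker, shutdown),  %% assumes the worker does not trap exits
    State;
check_in(Worker, #state{worker_pids = Workers} = State) ->
    State#state{worker_pids = [Worker | Workers]}.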

Testing with 96 workers, 0 overflow

[screenshot: larger (memory usage with 96 workers, 0 overflow)]

Here the memory usage is more stable and drops back after the test completes.

Actions

We may have to change the default values for leo_gateway for the moment.
Ultimately we have to review the design of leo_pod. Personally, I don't understand the purpose of the "overflow" workers here, as they work essentially the same way as normal workers except that they are not tracked by the manager. A work queue approach would be a better fit.

yosukehara added a commit to leo-project/leo_pod that referenced this issue Jan 19, 2017
@windkit
Contributor Author

windkit commented Jan 20, 2017

I confirm the patch fixed the issue
[screenshot: load (memory usage with the patch)]

@yosukehara
Member

@windkit Thanks for testing it. I've confirmed that this issue is fixed.
