
[1.3.1] Large Object Handler may consume much memory and eventually cause OOM #570

Closed
windkit opened this issue Jan 10, 2017 · 11 comments

@windkit
Contributor

windkit commented Jan 10, 2017

Description

To solve issue #506, we added the "work pool" approach to limit memory usage in leo_gateway.
While preliminary testing showed it can cap memory usage, longer tests reveal ever-growing memory usage.

Observation

Use GNU parallel to launch 64 instances of s3cmd, each uploading a 256MB file:

$ parallel -P 64 "./s3cmd put /dev/shm/testfile_256m s3://test/{} &>> log/put{}.log" ::: $(seq 1 1024)

[screenshot: grafana (memory usage)]

Memory usage slowly increases over time; with a long test (e.g. 10240 uploads), all memory and swap are used up and leo_gateway is eventually killed by the OOM killer.

@windkit
Contributor Author

windkit commented Jan 10, 2017

Investigation

Log the memory usage and stack trace of every process in leo_gateway:

[{erlang:process_info(P, current_stacktrace), erlang:process_info(P, binary)} || P <- erlang:processes()].

Log File

use of rpc:async_call + rpc:nb_yield

(https://github.com/leo-project/leo_gateway/blob/develop/src/leo_gateway_rpc_handler.erl#L209)

 {{current_stacktrace,
      [{gen,do_call,4,[{file,"gen.erl"},{line,215}]},
       {gen_server,call,3,[{file,"gen_server.erl"},{line,186}]},
       {rpc,do_call,3,[{file,"rpc.erl"},{line,335}]},
       {rpc,'-async_call/4-fun-0-',5,[{file,"rpc.erl"},{line,647}]}]},
  {binary,[{140029729704280,11,2},{140020579020840,5242880,2}]}}

rpc:async_call is implemented as a gen_server:call with an infinity timeout, so the binary references are held indefinitely.
(https://github.com/erlang/otp/blob/OTP-17.5/lib/kernel/src/rpc.erl#L344.)

Resolution

Use rpc:call with a timeout instead to avoid the memory leak; we need to check the performance impact.
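
A minimal sketch of that change, replacing the async_call/nb_yield pair with a bounded rpc:call/5; the module name and the 30000 ms timeout are placeholders, not the actual leo_gateway values:

%% Sketch only: rpc:call/5 returns {badrpc, Reason} on failure or timeout,
%% so the caller is unblocked and the held binaries can be garbage-collected.
-module(rpc_invoke_sketch).
-export([invoke/4]).

-define(DEF_TIMEOUT, 30000). %% placeholder timeout (ms)

invoke(Node, Mod, Fun, Args) ->
    case rpc:call(Node, Mod, Fun, Args, ?DEF_TIMEOUT) of
        {badrpc, Reason} -> {error, Reason};
        Ret              -> Ret
    end.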

@windkit
Contributor Author

windkit commented Jan 10, 2017

use of gen_server:call

(https://github.com/leo-project/leo_gateway/blob/develop/src/leo_large_object_put_handler.erl#L296)

 {{current_stacktrace,
      [{rpc,do_yield,2,[{file,"rpc.erl"},{line,680}]},
       {leo_gateway_rpc_handler,invoke,5,
           [{file,"src/leo_gateway_rpc_handler.erl"},{line,213}]},
       {leo_large_object_worker,handle_call,3,
           [{file,"src/leo_large_object_worker.erl"},{line,69}]},
       {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,607}]},
       {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,639}]},
       {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]},
  {binary,
      [{140031103866032,11,1},
       {140009060397096,5242880,1},
       {140029742560208,11,2},
       {140008767758376,5242880,2},
       {140031105443640,10,1},
       {140023715803176,5242880,1},
       {140031105973232,10,1},
       {140012841623592,5242880,1},
       {140031103862120,11,1},
       {140024261275688,5242880,1},
       {140031104387704,11,1},
       {140027408928808,5242880,1},
       {140026968350760,5242880,1},
       {140029724165888,11,1},
       {140031108588688,11,1},
       {140010686324776,5242880,1},
       {140031105960440,11,1},
       {140011326205992,5242880,1},
       {140029784521384,11,1},
       {140009521950760,5242880,1},
       {140031107034256,11,1},
       {140024103927848,5242880,1},
       {140031107026464,11,1},
       {140016141234216,5242880,1},
       {140031107535504,11,1},
       {140010570936360,5243903,1},
       {140029821251968,11,1},
       {140027482357800,5242880,1}]}}

As leo_large_object_worker is guarded by leo_pod:checkout and leo_pod:checkin, it should only handle one request at a time, but the log shows this was not the case.
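
For reference, the expected guard looks roughly like the sketch below (assuming leo_pod's checkout/1 and checkin/2 API; the surrounding names are illustrative):

%% Sketch: a worker is checked out, used for exactly one request, and only
%% then checked back in. Names other than the leo_pod calls are illustrative.
handle_one_request(PodName, Request) ->
    {ok, Worker} = leo_pod:checkout(PodName),
    try
        gen_server:call(Worker, Request)
    after
        leo_pod:checkin(PodName, Worker)
    end.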

{current_stacktrace,[{rpc,do_call,3,
                          [{file,"rpc.erl"},{line,347}]},
                     {leo_gateway_rpc_handler,invoke,5,
                                              [{file,"src/leo_gateway_rpc_handler.erl"},{line,210}]},
                     {leo_large_object_worker,handle_call,3,
                                              [{file,"src/leo_large_object_worker.erl"},{line,69}]},
                     {gen_server,try_handle_call,4,
                                 [{file,"gen_server.erl"},{line,607}]},
                     {gen_server,handle_msg,5,
                                 [{file,"gen_server.erl"},{line,639}]},
                     {proc_lib,init_p_do_apply,3,
                               [{file,"proc_lib.erl"},{line,237}]}]},
{messages,[{'$gen_call',{<0.17462.0>,#Ref<0.0.1.72690>},
                        {put,{put_req_params,<<"test/129\n51">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,51,0,0,undefined}}},
           {'$gen_call',{<0.17681.0>,#Ref<0.0.1.77196>},
                        {put,{put_req_params,<<"test/151\n18">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,18,0,0,undefined}}},
           {'$gen_call',{<0.17898.0>,#Ref<0.0.1.81785>},
                        {put,{put_req_params,<<"test/125\n42">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,42,0,0,undefined}}},
           {'$gen_call',{<0.18088.0>,#Ref<0.0.1.86230>},
                        {put,{put_req_params,<<"test/134\n32">>,
                                             <<0,0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,32,0,0,undefined}}},
           {'$gen_call',{<0.18279.0>,#Ref<0.0.1.90749>},
                        {put,{put_req_params,<<"test/144\n30">>,
                                             <<0,0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,30,0,0,undefined}}},
           {'$gen_call',{<0.18497.0>,#Ref<0.0.1.95480>},
                        {put,{put_req_params,<<"test/144\n37">>,
                                             <<0,0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,37,0,0,undefined}}},
           {'$gen_call',{<0.18695.0>,#Ref<0.0.1.100099>},
                        {put,{put_req_params,<<"test/178\n13">>,
                                             <<0,0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,13,0,0,undefined}}},
           {'$gen_call',{<0.18908.0>,#Ref<0.0.1.104740>},
                        {put,{put_req_params,<<"test/148\n32">>,
                                             <<0,0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,32,0,0,...}}},
           {'$gen_call',{<0.19104.0>,#Ref<0.0.1.109355>},
                        {put,{put_req_params,<<"test/143\n27">>,
                                             <<0,0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,27,0,...}}},
           {'$gen_call',{<0.19311.0>,#Ref<0.0.1.113998>},
                        {put,{put_req_params,<<"test/159\n22">>,
                                             <<0,0,0,0,0,...>>,
                                             <<>>,0,5242880,0,22,...}}},
           {'$gen_call',{<0.19505.0>,#Ref<0.0.1.118648>},
                        {put,{put_req_params,<<"test/144\n43">>,
                                             <<0,0,0,0,...>>,
                                             <<>>,0,5242880,0,...}}},
           {'$gen_call',{<0.19749.0>,#Ref<0.0.1.123305>},
                        {put,{put_req_params,<<"test/164\n18">>,
                                             <<0,0,0,...>>,
                                             <<>>,0,5242880,...}}}]},
{message_queue_len,12},
{binary,[{140681802943336,10,1},
         {140662102286376,5242880,1},
         {140681801375384,11,2},
         {140661777100840,5242880,2},
         {140681801373472,11,2},
         {140665049935912,5242880,2},
         {140680430618000,11,2},
         {140660948402216,5242880,2},
         {140680428762928,11,2},
         {140666298228776,5242880,2},
         {140681804257872,11,2},
         {140664483483688,5242880,2},
         {140681803469200,11,2},
         {140664231727144,5242880,2},
         {140680422444920,11,2},
         {140665574428712,5242880,2},
         {140681801374136,11,2},
         {140667439210536,5242880,2},
         {140681803487712,11,...},
         {140669816815656,...},
         {...}|...]}},

A further check reveals pending gen_server:call requests in the worker's message queue.
While the leo_pod:checkout and leo_pod:checkin calls on the worker are balanced, there are still pending tasks.

This is because gen_server:call uses a default timeout of 5000 ms. When the call times out, the worker is returned to the worker pool even though it is still busy, so further requests keep arriving and the worker's message queue gets jammed.
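
The jam can be confirmed from the remote shell by listing processes with a backed-up mailbox, in the same spirit as the one-liner above (sketch):

%% Sketch: a jammed leo_large_object_worker shows up with many pending
%% '$gen_call' messages. Processes that exit in between are skipped because
%% process_info/2 then returns 'undefined'.
[{P, Len} || P <- erlang:processes(),
             {message_queue_len, Len} <- [erlang:process_info(P, message_queue_len)],
             Len > 0].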

@windkit
Contributor Author

windkit commented Jan 10, 2017

WIP Fix

By switching to rpc:call and calling gen_server:call with an infinity timeout, memory usage is now under control.
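
A minimal sketch of the gen_server side of this change; the module and function names are illustrative, the real call site is in leo_large_object_put_handler (see the link above):

%% Sketch only: wait for the worker instead of giving up after the default
%% 5000 ms, so a busy worker is never returned to the pool while requests
%% are still queued behind it.
-module(loh_put_sketch).
-export([put_chunk/2]).

put_chunk(Worker, ReqParams) ->
    gen_server:call(Worker, {put, ReqParams}, infinity).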

Result

[screenshot: fix (memory usage)]

TODO

  • Performance Check
  • Long Running Test

@windkit
Contributor Author

windkit commented Jan 10, 2017

No Multipart, No Traffic Shaping

[screenshot: parallel_64_s3cmd_256m]

@windkit
Contributor Author

windkit commented Jan 11, 2017

No Multipart, GW Upload Limit (100MBps)

[screenshot: tc]

@windkit
Contributor Author

windkit commented Jan 11, 2017

Multipart, GW Upload Limit (100MBps)

[screenshot: mp_tc]

@windkit
Contributor Author

windkit commented Jan 11, 2017

Multipart, No Traffic Shaping

[screenshot: mp]

A detailed benchmark will be done and the results reported at https://github.com/leo-project/notes

@windkit
Contributor Author

windkit commented Jan 12, 2017

No Multipart, GW Upload Limit (100MBps) - Long Run

[screenshot: long (memory usage over 13 hours)]

Memory usage stays stable at around 2GB over the 13-hour run.

@windkit
Contributor Author

windkit commented Jan 19, 2017

Memory Leak from leo_pod overflow

[screenshot: overflow (memory usage)]

During a long benchmark with large objects, another memory issue was spotted. Memory usage slowly increases throughout the test and does not drop back after the test completes.

Analysis with erlang:process_info/1 shows many leftover leo_large_object_worker processes (>800), each holding a reference to a 5MB object:

[{current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {dictionary,[{'$ancestors',[pod_loh_worker,
                             pod_loh_worker_sup,leo_gateway_sup,<0.979.0>]},
              {'$initial_call',{leo_large_object_worker,init,1}}]},
 {binary,[{140229022581968,11,1},
          {140218954448936,5243903,1}]},
...
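
The leftover workers can be counted from the remote shell by matching the same '$initial_call' entry shown above (sketch):

%% Sketch: count processes whose '$initial_call' is leo_large_object_worker:init/1.
%% Processes that exit in between are skipped because process_info/2 then
%% returns 'undefined'.
length([P || P <- erlang:processes(),
             {dictionary, D} <- [erlang:process_info(P, dictionary)],
             lists:member({'$initial_call', {leo_large_object_worker, init, 1}}, D)]).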

Code Analysis

In leo_pod_manager:check_out, a new worker is spawned to handle overflow requests, but the process is not stopped after leo_pod_manager:check_in. Those "overflow" workers are therefore left in the system forever.
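
One way to address this would be to terminate an overflow worker on check-in once the pool is back at its configured size. The sketch below only illustrates the idea; the #state{} record and its fields are hypothetical and this is not the actual leo_pod patch:

%% Sketch only, not the leo_pod patch; #state{} and its fields are hypothetical.
%% An overflow worker is terminated on check-in instead of being kept forever.
-record(state, {worker_pids     = [] :: [pid()],
                num_of_children = 0  :: non_neg_integer()}).

check_in(Worker, #state{worker_pids = Workers, num_of_children = N} = State)
  when length(Workers) >= N ->
    erlang:exit(Worker, shutdown),  %% assumes the worker does not trap exits
    State;
check_in(Worker, #state{worker_pids = Workers} = State) ->
    State#state{worker_pids = [Worker | Workers]}.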

Testing with 96 workers, 0 overflow

[screenshot: larger (memory usage with 96 workers, 0 overflow)]

Here the memory usage is more stable and drops back after the test completes.

Actions

We may have to change the default values for leo_gateway for the moment.
Ultimately we have to review the design of leo_pod. Personally, I don't understand the purpose of the "overflow" workers here, as they work essentially the same way as normal workers except that they are not tracked by the manager. A work queue approach would be a better fit.

yosukehara added a commit to leo-project/leo_pod that referenced this issue Jan 19, 2017
@windkit
Contributor Author

windkit commented Jan 20, 2017

I confirm the patch fixed the issue
[screenshot: load (memory usage with the patch)]

@yosukehara
Member

@windkit Thanks for testing it. I've confirmed that this issue is fixed.
