
Object counting during startup (without .state files) is slow #974

Closed
vstax opened this issue Jan 18, 2018 · 13 comments

Comments

@vstax
Contributor

vstax commented Jan 18, 2018

Originally mentioned as part of #966.

I experimented with startup after removing state files, and yes, that was the culprit of the long startup times. On our system, missing .state files add ~11 seconds to the opening of each container. The process appears to be CPU bound (by a single core, that is). Disk reads run around 11 MB/s (or 2,200-2,400 op/s), but the process isn't IO-starved (which figures, given that the .sst files aren't fragmented).

Per-thread top output during this process:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND    
26623 leofs     20   0 6749208 293588   4404 R 92,0  0,9   0:24.02 1_scheduler
26631 leofs     20   0 6749208 293588   4404 S 24,6  0,9   0:06.38 aux        
26668 leofs     20   0 6749208 293588   4404 S  6,6  0,9   0:02.11 Eleveldb   
26672 leofs     20   0 6749208 293588   4404 S  6,6  0,9   0:02.15 Eleveldb   
26703 leofs     20   0 6749208 293588   4404 S  6,3  0,9   0:01.49 Eleveldb   
26715 leofs     20   0 6749208 293588   4404 S  6,0  0,9   0:01.94 Eleveldb   
26698 leofs     20   0 6749208 293588   4404 S  5,6  0,9   0:01.74 Eleveldb   
26655 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.28 Eleveldb   
26675 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.58 Eleveldb   
26694 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.57 Eleveldb   
26711 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.36 Eleveldb   
26692 leofs     20   0 6749208 293588   4404 S  5,0  0,9   0:01.55 Eleveldb   
26717 leofs     20   0 6749208 293588   4404 S  4,3  0,9   0:01.26 Eleveldb   
26677 leofs     20   0 6749208 293588   4404 S  3,7  0,9   0:00.69 Eleveldb   
26684 leofs     20   0 6749208 293588   4404 S  3,7  0,9   0:00.94 Eleveldb   
26702 leofs     20   0 6749208 293588   4404 S  3,3  0,9   0:00.79 Eleveldb   
26718 leofs     20   0 6749208 293588   4404 S  3,3  0,9   0:00.71 Eleveldb   
26659 leofs     20   0 6749208 293588   4404 S  3,0  0,9   0:00.63 Eleveldb   
26680 leofs     20   0 6749208 293588   4404 S  2,7  0,9   0:00.62 Eleveldb   
26664 leofs     20   0 6749208 293588   4404 S  2,3  0,9   0:00.54 Eleveldb   
26721 leofs     20   0 6749208 293588   4404 S  2,3  0,9   0:00.49 Eleveldb   
26656 leofs     20   0 6749208 293588   4404 R  2,0  0,9   0:00.45 Eleveldb   

Unfortunately, parallel open doesn't help here either; object counting seems to process containers sequentially.

The total size of .sst files per container is ~132 MB, and each container holds ~700,000 objects. Whether 11 seconds to count 700,000 objects is a lot is hard to say. There are numbers at leo-project/leo_backend_db#10 (comment), but I feel they don't quite tell the full story: the current implementation, with sleeps removed, can open containers really fast when no compaction is required (around 25 containers per second here when they are in disk cache, around 2 containers per second with an empty cache after boot). But counting objects when state files are missing is quite slow. It's probably not a major problem; I just didn't realize that a node holding 178 M objects can spend ~40 minutes counting them.

So the questions are: are these numbers expected? Can they be improved (e.g. by some kind of parallelization)? Or should this simply be documented, i.e. that node startup can take a very long time when the node stores lots of objects and the previous shutdown wasn't successful?

@mocchira
Member

@vstax Thanks for reporting in detail.
Apparently the Erlang scheduler thread is the bottleneck, and the only Erlang code (not C/C++) running during object counting is https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_eleveldb.erl#L400-L411, a simple tail-recursive loop (the equivalent of a tight loop in a procedural language) that does nothing besides invoke eleveldb's C/C++ bridge function and receive the result via message passing from another OS thread managed by eleveldb. Taking those facts into account, the interaction between the Erlang scheduler thread and the threads managed by eleveldb could be the bottleneck, so if we can reduce the number of interactions, things might improve.
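For illustration, the shape of that loop can be sketched in Python (a toy stand-in; the real code is Erlang, and `iterator_move`/`make_fake_db` here are illustrative, not the actual eleveldb API):

```python
def fold_count(iterator_move, itr, acc=0):
    """Toy rendering of the tail-recursive counting loop: advance the
    iterator one record at a time, paying one cross-boundary round trip
    per record."""
    status, itr = iterator_move(itr)
    if status == "ok":
        return fold_count(iterator_move, itr, acc + 1)
    return acc  # iterator exhausted

def make_fake_db(n):
    """Fake iterator over n records, standing in for the NIF boundary."""
    def iterator_move(pos):
        return ("ok", pos + 1) if pos < n else ("invalid_iterator", pos)
    return iterator_move

print(fold_count(make_fake_db(500), 0))  # -> 500
```

The point is that the counting itself is trivial; the cost is the per-record hop across the scheduler/NIF boundary.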

AS IS

erlang scheduler thread -> native bridge -> NIF(eleveldb) -> leveldb function called on other OS thread
erlang scheduler thread <- native bridge <- NIF(eleveldb) send a result through message passing API provided by erlang runtime 

~~~ The round trip above for eleveldb:iterator_move is repeated once per object ~~~

TO BE

## Call a new API - a `count` function that executes the object counting on the other OS thread and just returns the result, so the number of round trips is reduced to 1.
erlang scheduler thread -> native bridge -> NIF(eleveldb) -> leveldb function called on other OS thread
erlang scheduler thread <- native bridge <- NIF(eleveldb) send a result through message passing API provided by erlang runtime 
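The saving can be illustrated with a toy model (Python; `iterator_move` and `count` here are illustrative stand-ins, not the real eleveldb API) that counts boundary crossings:

```python
class FakeNative:
    """Stand-in for the native (NIF) side; tracks how many times the
    Erlang-side code would cross the scheduler/NIF boundary."""
    def __init__(self, n_records):
        self.n_records = n_records
        self.round_trips = 0

    def iterator_move(self, pos):
        """AS IS: one round trip per call; True while a record exists."""
        self.round_trips += 1
        return pos < self.n_records

    def count(self):
        """TO BE: the native side does all the counting in one round trip."""
        self.round_trips += 1
        return self.n_records

# AS IS: one round trip per record (plus one to detect the end)
db = FakeNative(700)
n, pos = 0, 0
while db.iterator_move(pos):
    n += 1
    pos += 1
print(n, db.round_trips)   # 700 records cost 701 round trips

# TO BE: a single round trip regardless of record count
db2 = FakeNative(700)
print(db2.count(), db2.round_trips)  # 700 records, 1 round trip
```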

I will try this idea and share the results on the next working day.

@mocchira
Member

@vstax Just in case, can you do the same experiment and run leofs_doctor during that time? The result might point to other possible culprits.

@vstax
Contributor Author

vstax commented Jan 19, 2018

@mocchira I see, thanks for looking.
Unfortunately, leofs_doctor doesn't seem to work during node startup:

[general info]
- Date: 2018/01/19 17:45:40

[entop]
#1:

=INFO REPORT==== 19-Jan-2018::17:45:41 ===
    application: leofs_doctor
    exited: {bad_return,
                {{leofs_doctor,start,[normal,[]]},
                 {'EXIT',
                     {function_clause,
                         [{entop_view,draw_1,
                              [1,{error,badrpc}],
                              [{file,"src/entop_view.erl"},{line,51}]},
                          {leofs_doctor,start,1,
                              [{file,"src/leofs_doctor.erl"},{line,120}]},
                          {leofs_doctor,start,2,
                              [{file,"src/leofs_doctor.erl"},{line,97}]},
                          {application_master,start_it_old,4,
                              [{file,"application_master.erl"},
                               {line,273}]}]}}}}
    type: permanent
{"Kernel pid terminated",application_controller,"{application_start_failure,leofs_doctor,{bad_return,{{leofs_doctor,start,[normal,[]]},{'EXIT',{function_clause,[{entop_view,draw_1,[1,{error,badrpc}],[{file,\"src/entop_view.erl\"},{line,51}]},{leofs_doctor,start,1,[{file,\"src/leofs_doctor.erl\"},{line,120}]},{leofs_doctor,start,2,[{file,\"src/leofs_doctor.erl\"},{line,97}]},{application_master,start_it_old,4,[{file,\"application_master.erl\"},{line,273}]}]}}}}}"}

@mocchira
Member

@vstax I see. If possible, could you do the same experiment with the following patch applied to leo_storage? This makes leofs_doctor work even during node startup. (After the experiments, please roll back to the original without the patch.)

diff --git a/apps/leo_storage/src/leo_storage_app.erl b/apps/leo_storage/src/leo_storage_app.erl
index 161158e..290a179 100644
--- a/apps/leo_storage/src/leo_storage_app.erl
+++ b/apps/leo_storage/src/leo_storage_app.erl
@@ -148,6 +148,7 @@ after_proc(Error) ->
 %% @private
 after_proc_1(true, Pid, Managers) ->
     try
+        ensure_started(rex, rpc, start_link, worker, 2000),
         %% Launch servers
         %% launch_object_storage after checking if managers are alive
         %% to fix slow down startup problem. Details can be found on
@@ -171,7 +172,6 @@ after_proc_1(true, Pid, Managers) ->
                      [leo_storage_handler_del_directory]}),

         %% After processing
-        ensure_started(rex, rpc, start_link, worker, 2000),
         ok = leo_storage_api:register_in_monitor(first),

         %% Launch leo-rpc

@vstax
Contributor Author

vstax commented Jan 24, 2018

@mocchira
OK, here are results for slow startup when counting objects: https://pastebin.com/z9kBkRKY
and fast startup when just opening files: https://pastebin.com/1cv0xTvQ

No compactions in either case.

@mocchira
Member

@vstax Thanks! This looks exactly like what I expected (iterator_move consumes almost all of the CPU time spent in the Erlang runtime), so implementing count in C/C++ is likely to improve the boot time. I'll implement it and would like to ask you to test again once it's finished.

@mocchira
Member

mocchira commented Jan 25, 2018

TODO

OPTIONAL

  • PR the change to upstream (basho/eleveldb), though it doesn't look actively maintained these days...

@mocchira mocchira added this to the 1.4.0 milestone Jan 26, 2018
@mocchira
Member

I confirmed this fix actually improves the boot time of leo_storage without .state files. It will ship with 1.4.0-rc2 today, so please give it a try if you get a chance > @vstax

@mocchira
Member

@windkit Can you compare the boot time without .state files for metadata between 1.3.8 and 1.4.0-rc2 on RIT? (The number of records should be at least 10,000,000 on one node.)

@windkit
Contributor

windkit commented Jan 30, 2018

@mocchira noted, let me check it

@windkit
Contributor

windkit commented Jan 30, 2018

Test Case

8 AVS, SSD
Image f4m Load (n=3, w=2, r=1)

$ ./leofs-adm du S1@192.168.100.37
 active number of objects: 2352475
  total number of objects: 2352475
   active size of objects: 83106573709
    total size of objects: 83106573709
     ratio of active size: 100.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

Test 1.3.8

  1. Normal Boot (14s)
    $ ./bin/leo_storage stop;
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 10:39:36 JST 2018
    Tue Jan 30 10:39:50 JST 2018
    
  2. Delete State File (40s)
    $ ./bin/leo_storage stop;
    $ rm /ssd/avs/metadata/*.state
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 10:37:25 JST 2018
    Tue Jan 30 10:38:05 JST 2018
    

Test 1.4.0-rc2

  1. Normal Boot (14s)
    $ ./bin/leo_storage stop;
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 11:11:02 JST 2018
    Tue Jan 30 11:11:16 JST 2018
    
  2. Delete State File (15s)
    $ ./bin/leo_storage stop;
    $ rm /ssd/avs/metadata/*.state
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 11:08:19 JST 2018
    Tue Jan 30 11:08:34 JST 2018
    

@mocchira
Member

@windkit Thanks for testing. LGTM.

@vstax
Contributor Author

vstax commented Feb 5, 2018

@mocchira
This fix works very well, thank you. The time to open a container without a state file is cut from 11 sec to ~2.5 sec; the process is now IO bound (60-70 MB/s reads) and CPU usage is at 20%. The overall time to open 256 containers without state files is around 12 minutes (when no compaction is performed), effectively getting rid of issue #980, at least on this system.
