
Object counting during startup (without .state files) is slow #974

Closed
vstax opened this issue Jan 18, 2018 · 13 comments

Comments

@vstax
Contributor

vstax commented Jan 18, 2018

Originally mentioned as part of #966.

I experimented with startup after removing state files, and yes, that was the culprit of the long startup times. On our system, missing .state files add ~11 seconds to the opening of each container. The process appears to be CPU bound (by a single core, that is). Disk reads run around 11 MB/s (or 2,200-2,400 op/s), but the process isn't IO-starved (which figures, given that the .sst files aren't fragmented).

Per-thread top output during this process:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND    
26623 leofs     20   0 6749208 293588   4404 R 92,0  0,9   0:24.02 1_scheduler
26631 leofs     20   0 6749208 293588   4404 S 24,6  0,9   0:06.38 aux        
26668 leofs     20   0 6749208 293588   4404 S  6,6  0,9   0:02.11 Eleveldb   
26672 leofs     20   0 6749208 293588   4404 S  6,6  0,9   0:02.15 Eleveldb   
26703 leofs     20   0 6749208 293588   4404 S  6,3  0,9   0:01.49 Eleveldb   
26715 leofs     20   0 6749208 293588   4404 S  6,0  0,9   0:01.94 Eleveldb   
26698 leofs     20   0 6749208 293588   4404 S  5,6  0,9   0:01.74 Eleveldb   
26655 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.28 Eleveldb   
26675 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.58 Eleveldb   
26694 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.57 Eleveldb   
26711 leofs     20   0 6749208 293588   4404 S  5,3  0,9   0:01.36 Eleveldb   
26692 leofs     20   0 6749208 293588   4404 S  5,0  0,9   0:01.55 Eleveldb   
26717 leofs     20   0 6749208 293588   4404 S  4,3  0,9   0:01.26 Eleveldb   
26677 leofs     20   0 6749208 293588   4404 S  3,7  0,9   0:00.69 Eleveldb   
26684 leofs     20   0 6749208 293588   4404 S  3,7  0,9   0:00.94 Eleveldb   
26702 leofs     20   0 6749208 293588   4404 S  3,3  0,9   0:00.79 Eleveldb   
26718 leofs     20   0 6749208 293588   4404 S  3,3  0,9   0:00.71 Eleveldb   
26659 leofs     20   0 6749208 293588   4404 S  3,0  0,9   0:00.63 Eleveldb   
26680 leofs     20   0 6749208 293588   4404 S  2,7  0,9   0:00.62 Eleveldb   
26664 leofs     20   0 6749208 293588   4404 S  2,3  0,9   0:00.54 Eleveldb   
26721 leofs     20   0 6749208 293588   4404 S  2,3  0,9   0:00.49 Eleveldb   
26656 leofs     20   0 6749208 293588   4404 R  2,0  0,9   0:00.45 Eleveldb   

Unfortunately, parallel open doesn't help here either; object counting seems to process containers sequentially.

The total size of .sst files per container is ~132 MB, and each container holds ~700,000 objects. Whether 11 seconds to count 700,000 objects is a lot is hard to say. There are numbers at leo-project/leo_backend_db#10 (comment), but I feel they don't quite tell the full story: the current implementation, with sleeps removed, can open containers really fast when no compaction is required (around 25 containers per second here when they are in disk cache, around 2 containers per second with an empty cache after boot). But counting objects when state files are missing is quite slow. It's probably not a major problem; I just didn't realize that a node holding 178 M objects can spend ~40 minutes counting them.

So the questions are: are these numbers expected? Can they be improved (e.g. by some kind of parallelization)? Or should this simply be documented, i.e. that node startup can take a very long time when the node stores lots of objects and the previous shutdown wasn't successful?

@mocchira
Member

@vstax Thanks for reporting in detail.
Apparently the Erlang scheduler thread is the bottleneck, and the only Erlang code (not C/C++) running during object counting is https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_eleveldb.erl#L400-L411, a simple tail-recursive loop (the equivalent of a tight loop in a procedural language) that does nothing besides invoke eleveldb's C/C++ bridge function and receive the result via message passing from another OS thread managed by eleveldb. Taking those facts into account, the interaction between the Erlang scheduler thread and the threads managed by eleveldb could be the bottleneck, so if we can reduce the number of interactions, things might improve.
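For illustration, the shape of that loop can be sketched in Python (a toy stand-in; the real code is Erlang, and `iterator_move`/`make_fake_db` here are illustrative, not the actual eleveldb API):

```python
def fold_count(iterator_move, itr, acc=0):
    """Toy rendering of the tail-recursive counting loop: advance the
    iterator one record at a time, paying one cross-boundary round trip
    per record."""
    status, itr = iterator_move(itr)
    if status == "ok":
        return fold_count(iterator_move, itr, acc + 1)
    return acc  # iterator exhausted

def make_fake_db(n):
    """Fake iterator over n records, standing in for the NIF boundary."""
    def iterator_move(pos):
        return ("ok", pos + 1) if pos < n else ("invalid_iterator", pos)
    return iterator_move

print(fold_count(make_fake_db(500), 0))  # -> 500
```

The point is that the counting itself is trivial; the cost is the per-record hop across the scheduler/NIF boundary.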

AS IS

erlang scheduler thread -> native bridge -> NIF(eleveldb) -> leveldb function called on other OS thread
erlang scheduler thread <- native bridge <- NIF(eleveldb) send a result through message passing API provided by erlang runtime 

~~~ The round trip above for eleveldb:iterator_move is repeated once per object ~~~

TO BE

## Call a new API - a `count` function that executes the object counting on the other OS thread and just returns the result, so the number of round trips is reduced to 1.
erlang scheduler thread -> native bridge -> NIF(eleveldb) -> leveldb function called on other OS thread
erlang scheduler thread <- native bridge <- NIF(eleveldb) send a result through message passing API provided by erlang runtime 
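The saving can be illustrated with a toy model (Python; `iterator_move` and `count` here are illustrative stand-ins, not the real eleveldb API) that counts boundary crossings:

```python
class FakeNative:
    """Stand-in for the native (NIF) side; tracks how many times the
    Erlang-side code would cross the scheduler/NIF boundary."""
    def __init__(self, n_records):
        self.n_records = n_records
        self.round_trips = 0

    def iterator_move(self, pos):
        """AS IS: one round trip per call; True while a record exists."""
        self.round_trips += 1
        return pos < self.n_records

    def count(self):
        """TO BE: the native side does all the counting in one round trip."""
        self.round_trips += 1
        return self.n_records

# AS IS: one round trip per record (plus one to detect the end)
db = FakeNative(700)
n, pos = 0, 0
while db.iterator_move(pos):
    n += 1
    pos += 1
print(n, db.round_trips)   # 700 records cost 701 round trips

# TO BE: a single round trip regardless of record count
db2 = FakeNative(700)
print(db2.count(), db2.round_trips)  # 700 records, 1 round trip
```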

I will try this idea and share the results on the next working day.

@mocchira
Member

@vstax Just in case, can you do the same experiment and run leofs_doctor during that time? The result might point to other possible culprits.

@vstax
Contributor Author

vstax commented Jan 19, 2018

@mocchira I see, thanks for looking.
Unfortunately, leofs_doctor doesn't seem to work during node startup:

[general info]
- Date: 2018/01/19 17:45:40

[entop]
#1:

=INFO REPORT==== 19-Jan-2018::17:45:41 ===
    application: leofs_doctor
    exited: {bad_return,
                {{leofs_doctor,start,[normal,[]]},
                 {'EXIT',
                     {function_clause,
                         [{entop_view,draw_1,
                              [1,{error,badrpc}],
                              [{file,"src/entop_view.erl"},{line,51}]},
                          {leofs_doctor,start,1,
                              [{file,"src/leofs_doctor.erl"},{line,120}]},
                          {leofs_doctor,start,2,
                              [{file,"src/leofs_doctor.erl"},{line,97}]},
                          {application_master,start_it_old,4,
                              [{file,"application_master.erl"},
                               {line,273}]}]}}}}
    type: permanent
{"Kernel pid terminated",application_controller,"{application_start_failure,leofs_doctor,{bad_return,{{leofs_doctor,start,[normal,[]]},{'EXIT',{function_clause,[{entop_view,draw_1,[1,{error,badrpc}],[{file,\"src/entop_view.erl\"},{line,51}]},{leofs_doctor,start,1,[{file,\"src/leofs_doctor.erl\"},{line,120}]},{leofs_doctor,start,2,[{file,\"src/leofs_doctor.erl\"},{line,97}]},{application_master,start_it_old,4,[{file,\"application_master.erl\"},{line,273}]}]}}}}}"}

@mocchira
Member

@vstax I see. If possible, could you do the same experiment with the following patch applied to leo_storage? This makes leofs_doctor work even during node startup. (After the experiments, please roll back to the original without the patch.)

diff --git a/apps/leo_storage/src/leo_storage_app.erl b/apps/leo_storage/src/leo_storage_app.erl
index 161158e..290a179 100644
--- a/apps/leo_storage/src/leo_storage_app.erl
+++ b/apps/leo_storage/src/leo_storage_app.erl
@@ -148,6 +148,7 @@ after_proc(Error) ->
 %% @private
 after_proc_1(true, Pid, Managers) ->
     try
+        ensure_started(rex, rpc, start_link, worker, 2000),
         %% Launch servers
         %% launch_object_storage after checking if managers are alive
         %% to fix slow down startup problem. Details can be found on
@@ -171,7 +172,6 @@ after_proc_1(true, Pid, Managers) ->
                      [leo_storage_handler_del_directory]}),

         %% After processing
-        ensure_started(rex, rpc, start_link, worker, 2000),
         ok = leo_storage_api:register_in_monitor(first),

         %% Launch leo-rpc

@vstax
Contributor Author

vstax commented Jan 24, 2018

@mocchira
OK, here are results for slow startup when counting objects: https://pastebin.com/z9kBkRKY
and fast startup when just opening files: https://pastebin.com/1cv0xTvQ

No compactions in either case.

@mocchira
Member

@vstax Thanks! This looks exactly like what I expected (iterator_move consumes almost all of the CPU time spent in the Erlang runtime), so implementing count in C/C++ is likely to improve the boot time. I'll implement it and would like to ask you to test again once it's finished.

@mocchira
Member

mocchira commented Jan 25, 2018

TODO

OPTIONAL

  • PR the change to upstream (basho/eleveldb), though it doesn't look actively maintained these days...

@mocchira mocchira added this to the 1.4.0 milestone Jan 26, 2018
@mocchira
Member

I confirmed this fix actually improves the boot time of leo_storage without .state files. It will ship with 1.4.0-rc2 today, so please give it a try if you get a chance > @vstax

@mocchira
Member

@windkit Can you compare the boot time without .state files for metadata between 1.3.8 and 1.4.0-rc2 on RIT? (The number of records should be at least 10,000,000 on one node.)

@windkit
Contributor

windkit commented Jan 30, 2018

@mocchira noted, let me check it

@windkit
Contributor

windkit commented Jan 30, 2018

Test Case

8 AVS, SSD
Image f4m Load (n=3, w=2, r=1)

$ ./leofs-adm du S1@192.168.100.37
 active number of objects: 2352475
  total number of objects: 2352475
   active size of objects: 83106573709
    total size of objects: 83106573709
     ratio of active size: 100.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

Test 1.3.8

  1. Normal Boot (14s)
    $ ./bin/leo_storage stop;
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 10:39:36 JST 2018
    Tue Jan 30 10:39:50 JST 2018
    
  2. Delete State File (40s)
    $ ./bin/leo_storage stop;
    $ rm /ssd/avs/metadata/*.state
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 10:37:25 JST 2018
    Tue Jan 30 10:38:05 JST 2018
    

Test 1.4.0-rc2

  1. Normal Boot (14s)
    $ ./bin/leo_storage stop;
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 11:11:02 JST 2018
    Tue Jan 30 11:11:16 JST 2018
    
  2. Delete State File (15s)
    $ ./bin/leo_storage stop;
    $ rm /ssd/avs/metadata/*.state
    $ date; ./bin/leo_storage start; ./bin/leo_storage wait_ready; date
    Tue Jan 30 11:08:19 JST 2018
    Tue Jan 30 11:08:34 JST 2018
    

@mocchira
Member

@windkit Thanks for testing. LGTM.

@vstax
Contributor Author

vstax commented Feb 5, 2018

@mocchira
This fix works very well, thank you. The time to open a container without a state file is cut from 11 sec to ~2.5 sec; the process is now IO bound (60-70 MB/s reads) and CPU usage is at 20%. The overall time to open 256 containers without state files is around 12 minutes (when no compaction is performed), effectively getting rid of issue #980, at least on this system.
