Node entered "strange" state with internal crashes (timer problems?) #966

Closed
vstax opened this issue Jan 10, 2018 · 17 comments

@vstax
Contributor

vstax commented Jan 10, 2018

This is not 100% a LeoFS problem, as an external factor was present. So maybe there is no need to solve this directly, but I wanted to report it in case there actually is some error (or error handling) in the code that can be improved.

We were updating & rebooting a cluster of storage nodes (firmware and kernel updates against recent vulnerabilities). We have N=3, W=2, so at most a single node was touched at once. The procedure was as follows: the new kernel was installed, the node was "suspended", then the firmware update was installed via IPMI without interrupting the system (current Supermicro systems support this); when it was over the system was rebooted, booting with the new firmware and new kernel, and after the storage node was started it was "resumed" from the manager node.
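For reference, the per-node sequence looked roughly like this (host names are just examples; suspend/resume/status are the leofs-adm commands used elsewhere in this report):

# install the new kernel on the storage node (package steps omitted)
# from the manager node, take the storage node out of rotation:
leofs-adm suspend bodies@stor01.selectel.cloud.lan
# apply the firmware update through the IPMI web interface (no OS interruption),
# then reboot the node so it comes up with the new firmware and kernel.
# once leo_storage is running again, bring the node back:
leofs-adm resume bodies@stor01.selectel.cloud.lan
leofs-adm status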

During the firmware update itself the system was working fine, aside from possible timer issues; the PTP daemons were complaining in the system logs:

Jan 10 18:56:53 stor02 ptp4l: [7085841.907] clockcheck: clock jumped backward or running slower than expected!
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] reconfiguring after port state change
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] selecting CLOCK_REALTIME for synchronization
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] selecting eth1 as the master clock
Jan 10 18:56:53 stor02 ptp4l: [7085842.406] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] reconfiguring after port state change
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] master clock not ready, waiting...
Jan 10 18:56:54 stor02 ptp4l: [7085843.405] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Jan 10 18:56:55 stor02 ptp4l: [7085843.912] clockcheck: clock jumped forward or running faster than expected!
Jan 10 18:56:55 stor02 phc2sys: [7085844.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] reconfiguring after port state change
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] selecting CLOCK_REALTIME for synchronization
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] selecting eth1 as the master clock
Jan 10 18:56:55 stor02 ptp4l: [7085844.404] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] reconfiguring after port state change
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] master clock not ready, waiting...

There were seemingly no other problems; beam.smp of the storage node ("suspended" from the manager at that time, so no load, but obviously still running as a process) was rotating logs at the hour change and things like that. It also shut down properly after the firmware update was completed (during the system reboot). I'm writing this to make clear that an online firmware update through the IPMI web interface usually (at least for other systems) doesn't cause any service interruption, and here the storage nodes weren't affected either, as long as there was no external load. We updated nodes stor01-05 this way.

By mistake, on the last node (stor06) the firmware update was started without "suspending" the node first. The firmware update (and the timer errors) started at around 20:19:

Jan 10 20:19:35 stor06 ptp4l: [7090787.328] clockcheck: clock jumped backward or running slower than expected!
Jan 10 20:19:36 stor06 ptp4l: [7090787.987] clockcheck: clock jumped forward or running faster than expected!
Jan 10 20:19:36 stor06 ptp4l: [7090788.087] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 20:19:36 stor06 phc2sys: [7090788.386] port 0cc47a.fffe.e1d37b-1 changed state
Jan 10 20:19:36 stor06 phc2sys: [7090788.387] reconfiguring after port state change
Jan 10 20:19:36 stor06 phc2sys: [7090788.387] master clock not ready, waiting...
Jan 10 20:19:37 stor06 ptp4l: [7090789.086] clockcheck: clock jumped backward or running slower than expected!
Jan 10 20:19:38 stor06 ptp4l: [7090790.085] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Jan 10 20:19:38 stor06 phc2sys: [7090790.387] port 0cc47a.fffe.e1d37b-1 changed state

Four minutes later, at 20:23, the node started to produce very strange errors. In stdout/stderr (as captured by journalctl):

Jan 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs4/bodies/log/leo_object_storage_30023.20180110.20.1
Jan 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * closing log file is /mnt/avs4/bodies/log/leo_object_storage_30018.20180110.19.1
Jan 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs4/bodies/log/leo_object_storage_30018.20180110.20.1
Jan 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * closing log file is /mnt/avs2/bodies/log/leo_object_storage_10007.20180110.19.1
Jan 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs2/bodies/log/leo_object_storage_10007.20180110.20.1
Jan 10 20:23:45 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_10019, path:"/mnt/avs2/bodies/log/", filename:"leo_object_storage_10019"
Jan 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_14, path:"/mnt/avs1/bodies/log/", filename:"leo_object_storage_14"
Jan 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_10057, path:"/mnt/avs2/bodies/log/", filename:"leo_object_storage_10057"
Jan 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_48, path:"/mnt/avs1/bodies/log/", filename:"leo_object_storage_48"
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_sup:131 - leo_metadata_63 {error,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {{db_open,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "IO error: lock /mnt/avs1/bodies/metadata/63/LOCK: already held by process"},
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {child,undefined,leo_metadata_63,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {leo_backend_db_server,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: start_link,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_metadata_63,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_eleveldb,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "/mnt/avs1/bodies/metadata/63",
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: false]},
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: permanent,2000,worker,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_backend_db_server]}}}
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_sup:131 - leo_metadata_10063 {error,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {{db_open,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "IO error: lock /mnt/avs2/bodies/metadata/10063/LOCK: already held by process"},
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {child,undefined,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_metadata_10063,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {leo_backend_db_server,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: start_link,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_metadata_10063,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_eleveldb,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "/mnt/avs2/bodies/metadata/10063",
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: false]},
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: permanent,2000,worker,
Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_backend_db_server]}}}
Jan 10 20:23:59 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_20063, path:"/mnt/avs3/bodies/log/", filename:"leo_object_storage_20063"
Jan 10 20:23:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs3/bodies/log/leo_object_storage_20063.20180110.20.2
Jan 10 20:24:07 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_30063, path:"/mnt/avs4/bodies/log/", filename:"leo_object_storage_30063"

crash.log was created with tons of errors at that time as well, plus tons of errors in error.log and info.log; there are really too many to quote here, so I'm attaching these files.

Just two interesting notes. First, there were lots of "slow operation" errors like

[I]     bodies@stor06.selectel.cloud.lan        2018-01-10 20:23:48.291096 +0300        1515605028      null:null       0       {module,"leo_object_storage_event"},{function,"handle_event/2"},{line,54},{body,[{cause,"slow operation"},{method,head},{key,<<"body/8e/cc/80/8ecc80a03c2eb51fb5c90a07835dcfb321124310dbb6a27a14e4e6abc3db336d68a01a8ad7029b606fb3aa5ccbc9f13be756010000000000.xz">>},{processing_time,53244}]}
[I]     bodies@stor06.selectel.cloud.lan        2018-01-10 20:23:48.292162 +0300        1515605028      null:null       0       {module,"leo_object_storage_event"},{function,"handle_event/2"},{line,54},{body,[{cause,"slow operation"},{method,put},{key,<<"body/66/66/a0/6666a03a49997a3938c432fb46291fa76cd88518fa4912fc93b888c775b1f7c4634804c8fef8d65e7c2f38dc0c3f6a3b00e2060000000000.xz">>},{processing_time,52191}]}

and second, I was unable to execute the "mq-stats stor06" command from the manager node. "status stor06" worked fine, "version all" worked fine as well, but "mq-stats" produced no result (it exited immediately, without output or error). Also, on the other nodes, messages started to appear in the replication queue. At this point (after observing the situation long enough) the node was suspended, then rebooted.
"status" output for stor06:

--------------------------------------+--------------------------------------
                Item                  |                 Value                
--------------------------------------+--------------------------------------
 Config-1: basic
--------------------------------------+--------------------------------------
                              version | 1.3.8
                     number of vnodes | 168
                    object containers | - path:[/mnt/avs1/bodies], # of containers:64
                                                  | - path:[/mnt/avs2/bodies], # of containers:64
                                                  | - path:[/mnt/avs3/bodies], # of containers:64
                                                  | - path:[/mnt/avs4/bodies], # of containers:64
                        log directory | /var/log/leofs/leo_storage/erlang
                            log level | info
--------------------------------------+--------------------------------------
 Config-2: watchdog
--------------------------------------+--------------------------------------
 [rex(rpc-proc)]                      |
                    check interval(s) | 10
               threshold mem capacity | 33554432
--------------------------------------+--------------------------------------
 [cpu]                                |
                     enabled/disabled | disabled
                    check interval(s) | 10
               threshold cpu load avg | 5.0
                threshold cpu util(%) | 100
--------------------------------------+--------------------------------------
 [disk]                               |
                     enabled/disalbed | disabled
                    check interval(s) | 10
                threshold disk use(%) | 85
               threshold disk util(%) | 90
                    threshold rkb(kb) | 98304
                    threshold wkb(kb) | 98304
--------------------------------------+--------------------------------------
 Config-3: message-queue
--------------------------------------+--------------------------------------
                   number of procs/mq | 4
        number of batch-procs of msgs | max:3000, regular:1600
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Config-4: autonomic operation
--------------------------------------+--------------------------------------
 [auto-compaction]                    |
                     enabled/disabled | disabled
        warning active size ratio (%) | 70
      threshold active size ratio (%) | 60
             number of parallel procs | 1
                        exec interval | 3600
--------------------------------------+--------------------------------------
 Config-5: data-compaction
--------------------------------------+--------------------------------------
  limit of number of compaction procs | 4
        number of batch-procs of objs | max:1500, regular:1000
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Status-1: RING hash
--------------------------------------+--------------------------------------
                    current ring hash | 4bb5757b
                   previous ring hash | 4bb5757b
--------------------------------------+--------------------------------------
 Status-2: Erlang VM
--------------------------------------+--------------------------------------
                           vm version | 8.3.5.3
                      total mem usage | 89009224
                     system mem usage | 62449984
                      procs mem usage | 26667536
                        ets mem usage | 6122632
                                procs | 1023/1048576
                          kernel_poll | true
                     thread_pool_size | 128
--------------------------------------+--------------------------------------
 Status-3: Number of messages in MQ
--------------------------------------+--------------------------------------
                 replication messages | 206
                  vnode-sync messages | 0
                   rebalance messages | 0
--------------------------------------+--------------------------------------

Another (maybe unrelated) note is that startup of stor06 after that took a very long time (> 40 min to start the node). The symptoms were as usual (highly fragmented leveldb .log files, startup disk-IO bound), but it was even slower than before; maybe it didn't shut down correctly after all and was trying to start up without state files or something. The important point here is that it took up to 20 seconds per 3-4 MB log file (with the backend_db.eleveldb.write_buf_size = 5242880 setting), so in this configuration the 30-minute startup timeout wasn't enough and systemd kept killing the node until the startup timeout was increased. Since 30 minutes is the default startup timeout in the current repo, maybe it's best to increase it for everybody if there is no way to make the node actually start up faster.

Jan 10 21:44:09 stor06.selectel.cloud.lan leo_storage[17587]: * opening log file is /mnt/avs3/bodies/log/leo_object_storage_20031.20180110.21.2
Jan 10 21:44:29 stor06.selectel.cloud.lan leo_storage[17587]: id:leo_diagnosis_log_30031, path:"/mnt/avs4/bodies/log/", filename:"leo_object_storage_30031"
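For the timeout itself, a minimal sketch of a systemd drop-in that raises the limit; the unit name leo_storage.service is an assumption, so substitute whatever unit actually starts the storage node:

# /etc/systemd/system/leo_storage.service.d/startup-timeout.conf  (assumed unit name)
[Service]
# allow slow LevelDB WAL replay on startup; the repo default is about 30 minutes
TimeoutStartSec=5400

After adding the drop-in, run systemctl daemon-reload so the new timeout takes effect.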

Full logs when this problem happened: stor06-error-logs.tar.gz

@mocchira
Member

WIP.
At first glance,

Jan 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "IO error: lock /mnt/avs2/bodies/metadata/10063/LOCK: already held by process"},

This can only happen if env->LockFile gets invoked more than once without calling env->UnlockFile, according to https://github.com/basho/leveldb/blob/2.0.34/util/env_posix.cc#L720. Also, given that LockFile can be invoked from Repair/Recover/DestroyDB, it seems some race could happen in the leo_storage startup/shutdown phase.

I will take a closer look around https://github.com/basho/eleveldb/blob/develop/c_src/eleveldb.cc#L1264-L1275 as I have never read what leveldb::Env::Shutdown does.

@mocchira
Member

Updates.
After some testing with gdb, I still haven't managed to hit the breakpoint on LockFile/UnlockFile. It seems Repair or Recover probably has to be triggered somehow. I will try to make that happen by breaking a .log file intentionally.

@vstax
Contributor Author

vstax commented Jan 12, 2018

it seems some race could happen in the leo_storage startup/shutdown phase.

Nothing of the sort was done explicitly, though; the node was working just like before (also, the service log that I posted above is the full stdout/stderr output around the time of the problem). Lines with "id:leo_diagnosis_log" that appeared there are usually logged only during node startup; I have no idea what it means when they are printed on a node that has been up & running for a long time. I've checked the logs for the past months on a few nodes and it never happened before; I could see these lines only during the startup phase.

I would also like to note that I have no idea to what extent timer problems actually existed at that time (at the very least, not enough for the kernel and other applications to complain). If ptp4l says that the clock was running slower or faster than expected it must have been true, but since it tries to track time with ns precision, it is very sensitive to even the smallest fluctuations that other software doesn't notice. There could have been some other effects from the firmware update going on as well; then again, nothing that the kernel or other applications noticed.

@mocchira
Member

@vstax It turned out that calling leo_backend_db_eleveldb:open twice without calling its close in LeoFS causes the "IO error: lock ${file_name} already held by process" error, so there may be something wrong in the error handling on startup (the parallel open patch might be involved).
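For illustration, a minimal Erlang shell sketch of that failure mode using eleveldb directly (paths are examples; this is not the leo_backend_db_eleveldb wrapper itself):

%% in an Erlang shell with eleveldb on the code path
{ok, Db} = eleveldb:open("/tmp/lock_demo", [{create_if_missing, true}]).
%% a second open of the same directory, while Db is still open and this VM
%% still holds /tmp/lock_demo/LOCK, fails with the error seen in the logs above:
{error, {db_open, Reason}} = eleveldb:open("/tmp/lock_demo", [{create_if_missing, true}]).
%% Reason: "IO error: lock /tmp/lock_demo/LOCK: already held by process"
ok = eleveldb:close(Db).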

@vstax
Contributor Author

vstax commented Jan 16, 2018

@mocchira I see. Yes, this is the version with parallel open. I just don't understand why you mention "startup" - the node had been up and running for over two weeks when this happened?

@mocchira
Member

@vstax

I just don't understand why you mention "startup" - the node had been up and running for over two weeks when this happened?

Sorry, strictly speaking, it's not node startup but Erlang internal process startup, so it can happen while the node is up and running.

Anyway, we will consider rolling back the PR that fixed #940, as we cannot gain any performance improvement in your case. What do you think?

@vstax
Contributor Author

vstax commented Jan 17, 2018

@mocchira
Sure, rolling back sounds like a good idea if this problem was aggravated by the parallel open change.
It's a bit strange that this "internal process startup" happened at all in the middle of load, though. At least, I haven't seen it happen before (id:leo_diagnosis_log_* lines appeared only on actual node startup). Is there a known way to force this to happen, to make sure that the problem described here won't occur even when there is no parallel open?

Also, what do you think about increasing TimeoutStartSec - the problem described in the last paragraph of this issue, "startup of stor06 after that took a very long time (> 40 min to start the node)"? Should it be accepted that yes, storage nodes can take dozens of minutes to start up, and the time limit has to be increased? And are the only solutions either using SSDs for storing metadata or switching to RocksDB in the future? Using SSDs purely to solve the "long startup problem" - if HDDs provide enough performance for normal use cases - really seems like a waste. (E.g. if we were to use SSDs, that would significantly increase the cost, because keeping an SSD in one of the 4 hotswap bays of a 1U server would leave us with 3 HDDs and a requirement to use 33% more nodes to get the same amount of storage. Or larger HDDs, which is an equally bad idea, as I'm already unhappy with disks this large from a reliability perspective.)

In general: should TimeoutStartSec be increased in the main repo and this issue be documented for now?

Specifically for our case: I wonder if write_buf_size should be decreased even more for our system. It's at 5 MB per container right now, so it's 256*5 = 1280 MB in total; maybe half of that value (roughly returning to the original total write_buf_size of the 8-container configuration) would still work fine, considering that our load isn't write-intensive. The reason for my doubts is that the thought "long startup happens because of reading a large fragmented WAL, so decreasing its maximum size will make startup faster" feels kind of wrong when the total WAL size for all containers on the node is only twice what the config file suggests by default, yet there was already a case with a 40-minute startup! It's as if there is some other important factor that I'm missing here. Judging by the numbers from #940 (comment), the node should have taken just 5 minutes to start up with a 5 MB write_buf_size. It was reading from disk at around the same 7-9 MB/s during startup and was disk-IO bound; as I understand it, when starting without .state files, it needs to read the whole LevelDB database into memory - but the part in .sst files isn't fragmented and should be read quickly?
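For reference, the setting and the arithmetic behind the 1280 MB figure (values taken from this node's configuration above, not a recommendation):

# leo_storage.conf - per-container LevelDB write buffer (WAL) size
backend_db.eleveldb.write_buf_size = 5242880    # 5 MB per container
# 4 object-container paths x 64 containers = 256 containers on this node,
# so the worst-case WAL to replay on startup is 256 * 5 MB = 1280 MB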

Is it safe to, for example, stop a node, remove all .state files and start it up, to measure the time it needs to count all objects during startup? To check if that operation was the culprit here.

@mocchira
Member

@vstax WIP, however there has been a little progress.

  • The root cause is that many timeouts happened on requests to leo_backend_db_eleveldb (queue or metadata).
  • That said, some part of eleveldb/leveldb written in C/C++ might be vulnerable to system clock corrections (there is no vulnerable code in Erlang and its applications as long as the right runtime API is used). I'm now digging further into the C/C++ code.
  • The many crashes that followed the root cause (the many timeouts) were a secondary disaster - lots of slow processing, because the single gen_server leo_object_storage_event_notifier has to handle all slow-processing notifications coming from multiple gen_servers (leo_object_storage_server, so 256 in your case); as a result, many timeouts happened here as well. This can be fixed by getting rid of the leo_object_storage_event_notifier bottleneck and implementing it in some other, more scalable way (see the sketch after this list).
  • The second disaster, followed by many crashes of leo_object_storage_server, caused the third disaster: the supervisors (leo_mq_sup and leo_object_storage_sup) restarted as a result.
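To illustrate the bottleneck shape mentioned above (a hypothetical sketch, not the actual leo_object_storage_event_notifier code; module and function names are invented): a single gen_server answering synchronous calls from hundreds of object-storage workers serializes every slow-operation report, so once it falls behind, the callers start hitting their call timeouts and crash together.

%% Hypothetical sketch of the pattern, not LeoFS source.
-module(notifier_sketch).
-behaviour(gen_server).
-export([start_link/0, report_slow_op/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Each of the ~256 object-storage workers calls this synchronously,
%% so this single process is the serialization point.
report_slow_op(Key, ProcessingTime) ->
    gen_server:call(?MODULE, {slow_op, Key, ProcessingTime}, 5000).

init([]) ->
    {ok, #{}}.

handle_call({slow_op, Key, Time}, _From, State) ->
    %% any slow work here (logging, enqueueing, ...) delays every other
    %% caller; once the mailbox builds up, callers exit with timeout
    error_logger:info_msg("slow operation ~p took ~p~n", [Key, Time]),
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

One way to remove such a bottleneck is to make the report asynchronous (gen_server:cast) or to shard the notifier per container, so no single process has to absorb all 256 producers.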

I haven't seen it happen before (id:leo_diagnosis_log_* lines appeared only on actual node startup). Is there a known way to force this to happen, to make sure that the problem described here won't occur even when there is no parallel open?

The situation above happened while the node was up and running, and it seems this supervisor restart can cause the "IO error: lock already held by process" error; I will look into this further.

Long story short,

  • the root cause might be in the C/C++ code of eleveldb/leveldb
  • the subsequent catastrophic errors and crashes were caused by existing LeoFS code lacking sufficient error handling and by a non-scalable Erlang process tree structure.

I will file the latter as multiple separate issues later. As for the former, I will keep digging.

@mocchira
Member

@vstax

In general: should TimeoutStartSec be increased in the main repo and this issue be documented for now?

Agreed.

Is it safe to, for example, stop a node, remove all .state files and start it up, to measure the time it needs to count all objects during startup? To check if that operation was the culprit here.

It's safe while the node is suspended, so please go ahead and check it!

@vstax
Contributor Author

vstax commented Jan 18, 2018

@mocchira
Thanks for looking at this in depth! Very interesting. Unfortunately, I don't know how to reliably reproduce this problem (not that I want to, on production, anyway), so I don't know if I can provide any help in testing even if you were to fix the original issues.

There is one more thing, however. I didn't mention it before because I didn't think it was important (since that problem was gone), but after reading your explanation, I think there might be some relation. I first saw this problem during the last update - when one node is suspended, all other nodes start to consume tons of CPU resources. It continues until the node is resumed; once all nodes are in the cluster, CPU usage goes down (which is why I thought I shouldn't pay much attention to the problem). It doesn't matter which node is suspended; this happens with any of them.

I repeated this experiment just now. I suspend node stor06 (leofs-adm suspend bodies@stor06.selectel.cloud.lan). A short time later (a few minutes) CPU usage goes very high on all of the other five nodes. Normally it's 2-5% under load, but here it rises to 380-450% (counting each core as 100%, of course). Here is the per-thread top view on stor01 (but it's the same on all of them except stor06, of course):

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4151 leofs     20   0 9947924 1,864g 134264 R 86,7  6,0 156:17.30 2_scheduler
 4150 leofs     20   0 9947924 1,864g 134264 R 80,0  6,0 242:53.98 1_scheduler
 4152 leofs     20   0 9947924 1,864g 134264 R 80,0  6,0 161:14.34 3_scheduler
 4153 leofs     20   0 9947924 1,864g 134264 S 80,0  6,0 143:13.10 4_scheduler
 4154 leofs     20   0 9947924 1,864g 134264 S 80,0  6,0 118:22.35 5_scheduler
 4155 leofs     20   0 9947924 1,864g 134264 R 66,7  6,0  91:59.88 6_scheduler
 4035 leofs     20   0 9947924 1,864g 134264 S  6,7  6,0   0:08.17 async_27
 4114 leofs     20   0 9947924 1,864g 134264 S  6,7  6,0   0:06.91 async_105
 4129 leofs     20   0 9947924 1,864g 134264 S  6,7  6,0   0:07.58 async_119
 4159 leofs     20   0 9947924 1,864g 134264 S  6,7  6,0  18:18.14 aux
 1315 leofs     20   0 9947924 1,864g 134264 S  0,0  6,0   0:00.02 beam.smp
 4001 leofs     20   0 9947924 1,864g 134264 S  0,0  6,0   0:00.00 sys_sig_dispatc
 4005 leofs     20   0 9947924 1,864g 134264 S  0,0  6,0   0:00.00 sys_msg_dispatc
 4009 leofs     20   0 9947924 1,864g 134264 S  0,0  6,0   0:05.94 async_1
 4010 leofs     20   0 9947924 1,864g 134264 S  0,0  6,0   0:04.40 async_2
 4011 leofs     20   0 9947924 1,864g 134264 S  0,0  6,0   0:11.22 async_3

There are no messages in the info or error log files. There are some messages appearing in the replication queues, of course:

[vm@bodies-master ~]$ leofs-adm status
 [System Confiuration]
-----------------------------------+----------
 Item                              | Value
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 3
          number of successes of R | 1
          number of successes of W | 2
          number of successes of D | 2
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | 4bb5757b
                previous ring-hash | 4bb5757b
-----------------------------------+----------

 [State of Node(s)]
-------+---------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                 node                  |    state     |  current ring  |   prev ring    |          updated at
-------+---------------------------------------+--------------+----------------+----------------+----------------------------
  S    | bodies@stor01.selectel.cloud.lan      | running      | 4bb5757b       | 4bb5757b       | 2018-01-10 18:51:51 +0300
  S    | bodies@stor02.selectel.cloud.lan      | running      | 4bb5757b       | 4bb5757b       | 2018-01-10 19:16:26 +0300
  S    | bodies@stor03.selectel.cloud.lan      | running      | 4bb5757b       | 4bb5757b       | 2018-01-10 19:49:34 +0300
  S    | bodies@stor04.selectel.cloud.lan      | running      | 4bb5757b       | 4bb5757b       | 2018-01-10 20:12:14 +0300
  S    | bodies@stor05.selectel.cloud.lan      | running      | 4bb5757b       | 4bb5757b       | 2018-01-10 20:30:03 +0300
  S    | bodies@stor06.selectel.cloud.lan      | suspend      | 4bb5757b       | 4bb5757b       | 2018-01-10 22:14:22 +0300
  G    | gateway@k01.selectel.cloud.lan        | running      | 4bb5757b       | 4bb5757b       | 2017-12-15 21:38:12 +0300
  G    | gateway@k02.selectel.cloud.lan        | running      | 4bb5757b       | 4bb5757b       | 2017-12-15 21:53:38 +0300
  G    | gateway@k06.selectel.cloud.lan        | running      | 4bb5757b       | 4bb5757b       | 2017-12-15 21:54:29 +0300
-------+------------------------------------------------+--------------+----------------+----------------+----------------------------

[vm@bodies-master ~]$ leofs-adm mq-stats bodies@stor01.selectel.cloud.lan
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   running   | 2              | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   idling    | 0              | 1600           | 500            | deletion bucket #1
 leo_delete_dir_queue_2         |   idling    | 0              | 1600           | 500            | deletion bucket #2
 leo_delete_dir_queue_3         |   idling    | 0              | 1600           | 500            | deletion bucket #3
 leo_delete_dir_queue_4         |   idling    | 0              | 1600           | 500            | deletion bucket #4
 leo_delete_dir_queue_5         |   idling    | 0              | 1600           | 500            | deletion bucket #5
 leo_delete_dir_queue_6         |   idling    | 0              | 1600           | 500            | deletion bucket #6
 leo_delete_dir_queue_7         |   idling    | 0              | 1600           | 500            | deletion bucket #7
 leo_delete_dir_queue_8         |   idling    | 0              | 1600           | 500            | deletion bucket #8
 leo_per_object_queue           |   running   | 751            | 1600           | 500            | recover inconsistent objs
 leo_rebalance_queue            |   idling    | 0              | 1600           | 500            | rebalance objs
 leo_recovery_node_queue        |   idling    | 0              | 1600           | 500            | recovery objs of node
 leo_req_delete_dir_queue       |   idling    | 0              | 1600           | 500            | request removing directories
 leo_sync_by_vnode_id_queue     |   idling    | 0              | 1600           | 500            | sync objs by vnode-id
 leo_sync_obj_with_dc_queue     |   idling    | 0              | 1600           | 500            | sync objs w/remote-node

Leofs_doctor output for the node at this time: https://pastebin.com/d7EZ7CLy
There is no significant disk IO load.

I execute leofs-adm resume bodies@stor06.selectel.cloud.lan. CPU usage drops to 30-40% almost immediately; 20 seconds or so later the queues are empty, and CPU load goes back to near zero. Leofs_doctor output at this time with empty queues: https://pastebin.com/rSqhy1F8

Also, it seems that right after "suspend", CPU usage on the working nodes grows in proportion to the number of messages in leo_per_object_queue. E.g. with 11 messages CPU usage gets to around 10%, with 17 it gets to 15-17%, with 30 it's 30-45%, with 130 it's 110-120%, and so on, until it tops out at 450% or so (well, it fluctuates around that, really).

Anyhow, the reason I think it might be related to what you wrote about the problem is that stor05 was in the "suspended" state when the things I described in this issue happened. So stor06, just like stor01-04, was in this "high CPU usage" state at that time. Then the firmware update on stor06 was started (before stor05 was resumed), and some minutes later the issue happened (an online firmware update takes a really long time, 15 minutes or so). We waited until stor05 finished starting up after the reboot; only after that could we "resume" stor05 and then "suspend" and stop stor06.

It's safe while the node is suspended, so please go ahead and check it!

Yes, I did the tests, and this was the problem. I filed another issue (#974) because it's really unrelated to this one.

@mocchira
Member

mocchira commented Feb 2, 2018

@vstax Sorry for the late reply.

Thanks for looking at this in depth! Very interesting. Unfortunately, I don't know how to reliably reproduce this problem (not that I want to, on production, anyway), so I don't know if I can provide any help in testing even if you were to fix the original issues.

Don't worry; we can reproduce the catastrophic scenario by killing some Erlang processes intentionally within a certain period (you can do the same through remote_console, but never do it on production until the TODOs below get fixed), so the additional UT that will be added through #977 should cover the fix for this.

TODO:

There is one more thing, however. I didn't mention it before because I didn't think it was important (since that problem was gone), but after reading your explanation, I think there might be some relation. I first saw this problem during the last update - when one node is suspended, all other nodes start to consume tons of CPU resources. It continues until the node is resumed; once all nodes are in the cluster, CPU usage goes down (which is why I thought I shouldn't pay much attention to the problem). It doesn't matter which node is suspended; this happens with any of them.

As you may know, this is caused by #710 (1.3.3 fixed part of the problem, but it still remains when a node is stopped/suspended for a long time).

#710 (comment) will/should solve the problem. We'd like to prioritize the issue and fix it ASAP.

Anyhow, the reason I think it might be related to what you wrote about the problem is that stor05 was in the "suspended" state when the things I described in this issue happened. So stor06, just like stor01-04, was in this "high CPU usage" state at that time. Then the firmware update on stor06 was started (before stor05 was resumed), and some minutes later the issue happened (an online firmware update takes a really long time, 15 minutes or so). We waited until stor05 finished starting up after the reboot; only after that could we "resume" stor05 and then "suspend" and stop stor06.

Good catch. As I stated in #979 (comment), there can be something wrong in the leo_storage boot phase while it is receiving lots of requests from leo_gateway and leo_storage. I'm now digging in this direction and will file an issue and add it to the TODO above if I find something wrong. (Fixing #710 might be sufficient to make it never happen, as #710 can cause the crash.log you shared in the previous comment.)

Yes, I did the tests, and this was the problem. I filed another issue (#974) because it's really unrelated to this one.

As you may know, it's been fixed now so please give it a try if you have time.

@vstax
Contributor Author

vstax commented Feb 5, 2018

@mocchira

As you may know, this is caused by #710 (1.3.3 fixed part of the problem, but it still remains when a node is stopped/suspended for a long time).

I see! I had no idea, actually. I kind of assumed that the "very long time" from #710 meant hours/days and lots of messages in the queue, so I never thought it was possible to trigger it within just minutes and with only hundreds of messages. Well, I'm glad the cause is known.

As you may know, it's been fixed now so please give it a try if you have time.

Yes, it's much better with that fix, thanks. I closed that issue.

@mocchira
Member

mocchira commented Feb 6, 2018

@vstax

Also, it seems that right after "suspend", CPU usage on the working nodes grows in proportion to the number of messages in leo_per_object_queue. E.g. with 11 messages CPU usage gets to around 10%, with 17 it gets to 15-17%, with 30 it's 30-45%, with 130 it's 110-120%, and so on, until it tops out at 450% or so (well, it fluctuates around that, really).

Note: the upper bound of CPU usage is limited by the settings at https://github.com/leo-project/leofs/blob/1.4.0-rc2/apps/leo_storage/priv/leo_storage.conf#L237-L238 (default: 4, which means around 400% CPU can be consumed by leo_mq while there are suspended/downed nodes in the cluster). That being said, there is a workaround: setting this parameter to a lower value results in less CPU usage during that time. However, this value is currently used by all leo_mq instances, not just the mq for recovering inconsistent objects, so we may consider making those values configurable by type of mq.
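As a sketch of that workaround (the parameter name below is my reading of the linked config lines and is an assumption; verify it against your own leo_storage.conf before changing anything):

# leo_storage.conf - consumer processes per message queue
# (reported as "number of procs/mq | 4" in leofs-adm status)
mq.num_of_mq_procs = 2
# a lower value caps how much CPU leo_mq can use while a node is suspended,
# at the cost of slower consumption for every type of queue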

@windkit
Contributor

windkit commented Feb 14, 2018

@mocchira So if there is a suspended node, the working nodes keep trying to send repair messages to the suspended one; as the messages are not processed, they get re-enqueued and re-processed (looping over and over again)?

@mocchira
Member

@windkit See all comments on #710 for its background.

So if there is a suspended node, the working nodes keep trying to send repair messages to the suspended one; as the messages are not processed, they get re-enqueued and re-processed (looping over and over again)?

The current implementation is almost what you said, but there is also a certain (configurable) sleep per batch, so the CPU should not be hogged by design. IMO, #994 will solve the CPU-hogging problem.
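A rough, self-contained sketch of that consume / re-enqueue shape (illustrative only, not leo_mq internals; the batch size and interval echo the 1600 / 500 ms values visible in the mq-stats output above):

-module(mq_loop_sketch).
-export([consume_loop/3]).

%% Msgs is the current queue contents. Every batch, up to BatchSize messages
%% are taken; repairs aimed at a suspended node fail and are re-enqueued,
%% then the consumer sleeps IntervalMs before the next batch.
consume_loop(Msgs, BatchSize, IntervalMs) ->
    {Batch, Rest} = lists:split(min(BatchSize, length(Msgs)), Msgs),
    Failed = [M || M <- Batch, try_repair(M) =:= error],
    timer:sleep(IntervalMs),
    %% as long as the target stays suspended, the same messages loop forever
    consume_loop(Rest ++ Failed, BatchSize, IntervalMs).

%% Stub: pretend every replication/repair toward the suspended node fails.
try_repair(_Msg) ->
    error.

E.g. consume_loop(Queue, 1600, 500) never drains while the peer is suspended; the configurable sleep per batch is what keeps a single consumer from hogging a core by design.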

@mocchira
Member

#977 (WIP) will be finished before long, so once the fix lands, I'll close this issue.

@mocchira
Member

@vstax Now that all the issues which can cause this problem are gone, I'll close this; please feel free to reopen if you face the same issue again.
