Node entered "strange" state with internal crashes (timer problems?) #966
Comments
WIP.
This can happen only in that case. I will take a closer look around https://github.com/basho/eleveldb/blob/develop/c_src/eleveldb.cc#L1264-L1275, as I have never read what leveldb::Env::Shutdown does.
Updates.
Nothing of the sort was done explicitly, though; the node was working just like before (also, the service log that I posted above is the full stdout/stderr output around the time of the problem). Lines with "id:leo_diagnosis_log" that appeared there are usually logged only during node startup; I have no idea what it means when they are printed on a node that has been up and running for a long time. I've checked the logs for the past months on a few nodes and it never happened before; I could see these lines only during the startup phase.

I also would like to note that I have no idea to what extent timer problems actually existed at that time (at the very least, not enough for the kernel and other applications to complain). If ptp4l says that the clock was running slower or faster than expected, it must have been true, but since it tries to track time with ns precision, it is very sensitive to even the smallest fluctuations that other software doesn't notice. There could have been some other effects from the firmware update going on as well; then again, nothing that the kernel or other applications noticed.
@vstax It turned out that calling leo_backend_db_eleveldb:open twice without calling its close in LeoFS causes an "IO error: lock ${file_name} already held by process" error, so there may be something wrong in the error handling on startup (the parallel-open patch might be involved).
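For illustration, a minimal way to reproduce that failure mode with the stock eleveldb API (the path here is arbitrary and this is only a sketch, not the LeoFS code path itself):

%% In an Erlang shell with eleveldb on the code path:
Path = "/tmp/eleveldb_lock_sketch".
{ok, Ref1} = eleveldb:open(Path, [{create_if_missing, true}]).
%% A second open on the same path while Ref1 is still held typically fails
%% with {error, {db_open, "IO error: lock ... already held by process"}}:
{error, Reason} = eleveldb:open(Path, [{create_if_missing, true}]).
eleveldb:close(Ref1).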
@mocchira I see. Yes, this is the version with parallel open. I just don't understand why you mention "startup" - the node had been up and running for over two weeks when this happened?
Sorry, strictly speaking it's not node startup but Erlang internal process startup, so it can happen while the node is up and running. Anyway, we will consider rolling back the PR made to fix #940, as we cannot gain any performance improvement in your case. What do you think?
@mocchira Also, what do you think about increasing TimeoutStartSec - the problem described in the last paragraph of this issue, "startup of stor06 after that took very long time (> 40 min to start that node)"? Should it be accepted that yes, storage nodes can take dozens of minutes to start up, and the time limit has to be increased? And are the only solutions either using SSDs for storing metadata or switching to RocksDB in the future? Using SSDs purely to solve the "long startup problem" - if HDDs provide enough performance for normal use cases - really seems like a waste. (E.g. if we were to use SSDs, that would significantly increase the cost, because keeping an SSD in one of the 4 hotswap bays of a 1U server would leave us with 3 HDDs and the requirement to use 33% more nodes to get the same amount of storage. Or larger HDDs, which is an equally bad idea, as I'm already unhappy with disks this large from a reliability perspective.)

In general: should TimeoutStartSec be increased in the main repo and this issue be documented for now?

Specifically for our case: I wonder if write_buf_size should be decreased even more for our system. It's at 5 MB per container right now, so 256 * 5 = 1280 MB in total; maybe half of that value (roughly returning to the original total write_buf_size of the 8-container configuration) would still work fine, considering that our load isn't write-intensive. The reason for my doubts is that the thought "long startup happens because of reading a large fragmented WAL, so decreasing its maximum size will make startup faster" feels kind of wrong when you consider that the total WAL size for all containers on the node is only twice what the config file suggests by default, yet there was already a case with a 40-minute startup! It's as if there is some other important factor that I'm missing here. Judging by the numbers from #940 (comment), the node should have taken just 5 minutes to start up with a 5 MB write_buf_size. It was reading from disk at around the same 7-9 MB/s during startup and was disk-IO bound; as I understand it, when starting without .state files, it needs to read the whole LevelDB database into memory - but the part in .sst files isn't fragmented and should be read fast?

Is it safe to, for example, stop a node, remove all .state files and start it up, to measure the time it needs to count all objects during startup? To check whether that operation was the culprit here.
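For reference, the sizing in question, in the same leo_storage.conf terms quoted at the end of this issue (the halved value is only the candidate being considered here, not a recommendation):

## current: 5 MB write buffer per container; 256 containers * 5 MB = 1280 MB total
backend_db.eleveldb.write_buf_size = 5242880
## candidate halved value: ~2.5 MB per container, ~640 MB total
## backend_db.eleveldb.write_buf_size = 2621440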
@vstax Still WIP, however there has been a little bit of progress.
The situation above happened while the node was up and running, and it seems this supervisor restart can cause the "IO lock already held by process" error; I will look into this further. Long story short,
I will file the latter one as multiple other issues later. As for the former one, I will keep digging.
Agreed.
It's safe while the node is suspended, so please go ahead and check it!
@mocchira There is one more thing, however. I didn't mention it before because I didn't think it was important (since that problem was gone), but after reading your explanation, I think there might be some relation. I first saw this problem during the last update: when one node is suspended, all other nodes start to consume tons of CPU resources. It continues until the node is resumed; once all nodes are back in the cluster, CPU usage goes down (which is why I thought I shouldn't pay much attention to the problem). It doesn't matter which node is suspended; this happens with any of them. I repeated this experiment just now. I suspend node stor06 (
There are no messages in the info or error log files. There are some messages appearing in the replication queues, of course:
Leofs_doctor output for the node at this time: https://pastebin.com/d7EZ7CLy
I execute […]

Also, it seems that right after "suspend", CPU usage on the working nodes grows in proportion to the number of messages in leo_per_object_queue. E.g. with 11 messages CPU usage gets to around 10%, with 17 it gets to 15-17%, with 30 it's 30-45%, with 130 it's 110-120% and so on, until it tops out at 450% or so (well, it fluctuates around that, really).

Anyhow, the reason I think it might be related to what you wrote about the problem is that stor05 was in the "suspended" state when the things I described in this issue happened. So stor06, just like stor01-04, was in this "high CPU usage" state at that time. Then the firmware update on stor06 was started (before stor05 was resumed), and some minutes after that the issue happened (online firmware update takes a really long time, 15 minutes or so). We waited till stor05 finished startup after reboot; only after that could we "resume" stor05 and then "suspend" and stop stor06.
Yes, I did the tests, and this was the problem. I filed another issue (#974) because it's really unrelated to this one.
@vstax Sorry for the late reply.
Don't mind - we can reproduce the catastrophic scenario by killing some Erlang processes intentionally within a certain period (anyway, you can do the same through remote_console, however never do it in production until the below TODOs get fixed), so the additional UT that will be added through #977 should cover the fix for this. TODO:
As you may know, this is caused by #710 (1.3.3 has fixed part of the problem, but it still remains when the node is stopped/suspended for a long time). #710 (comment) will/should solve the problem. We'd like to prioritize the issue and fix it ASAP.
Good catch. As I stated in #979 (comment), there can be something wrong in the leo_storage boot phase while it is receiving lots of requests from leo_gateway and leo_storage. I'm now digging in this direction and will file an issue and add it to the above TODO if I find something wrong. (Fixing #710 might be sufficient to make it never happen, as #710 can cause the crash.log you shared in the previous comment.)
As you may know, it's been fixed now, so please give it a try if you have time.
I see! I had no idea, actually. I kind of assumed that "very long time" from #710 meant hours/days and lots of messages in the queue, so I never thought it was possible to trigger it within just minutes and with only hundreds of messages. Well, I'm glad that the cause is known.
Yes, it's much better with that fix, thanks. I closed that issue.
Note: the upper bound of CPU usage is limited by the https://github.com/leo-project/leofs/blob/1.4.0-rc2/apps/leo_storage/priv/leo_storage.conf#L237-L238 settings (default: 4, which means around 400% CPU could be consumed by leo_mq while there are suspended/downed nodes in the cluster). That being said, there is a workaround: setting this parameter to a lower value results in less CPU usage during that time. However, this value is currently used by all leo_mq instances, not only the MQ for recovering inconsistent objects, so we may consider making these values configurable per type of MQ.
@mocchira So if there is a suspended node, working nodes would keep trying to send repair messages to the suspended one; since the messages were not processed, they got re-enqueued and re-processed (and loop over and over again)?
@windkit See all comments on #710 for its background.
The current implementation is almost what you said, but there is also a certain (configurable) sleep per batch, so CPU should not be hogged by design. IMO, #994 will solve the CPU-hogging problem.
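To illustrate the batch-plus-sleep design in the abstract (a conceptual sketch only, not LeoFS's actual leo_mq code; the module and message names are invented for the example):

-module(mq_consumer_sketch).
-export([consume_loop/3]).

%% Consume up to BatchSize messages from the process mailbox, then sleep
%% for IntervalMs before the next batch, so a busy queue backs off instead
%% of spinning at full CPU.
consume_loop(BatchSize, IntervalMs, HandleFun) ->
    Batch = dequeue_batch(BatchSize, []),
    lists:foreach(HandleFun, Batch),
    timer:sleep(IntervalMs),
    consume_loop(BatchSize, IntervalMs, HandleFun).

%% Pull up to N messages without blocking; an empty mailbox ends the batch early.
dequeue_batch(0, Acc) ->
    lists:reverse(Acc);
dequeue_batch(N, Acc) ->
    receive
        {mq_message, Msg} -> dequeue_batch(N - 1, [Msg | Acc])
    after 0 ->
        lists:reverse(Acc)
    end.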
#977 (WIP) will be finished before long, so once the fix has landed, I'll close this issue.
@vstax Now that all issues which could cause this problem have been resolved, I'll close this; please feel free to reopen if you face the same issue again.
This is not 100% a LeoFS problem, as there was an external factor present. So maybe there is no need to solve this directly, but I wanted to report it in case there actually is some error (or error handling) in the code that can be improved.
We were updating & rebooting a cluster of storage nodes (firmware and kernel updates against recent vulnerabilities). We have N=3, W=2, so at most a single node was touched at once. The procedure was as follows: the new kernel was installed, each node was "suspended", then the firmware update was installed via IPMI without system interruption (current Supermicro systems support this); when it was over, the system was rebooted - booting with the new firmware and new kernel - and after the storage node started, it was "resumed" from the manager node.
During the firmware update itself the system was working fine, aside from possible timer issues; the PTP daemons were complaining in the system logs:
There were seemingly no other problems; the storage node's beam.smp ("suspended" at that time from the manager, so no load, but obviously working as a process) was rotating logs at the hour change and things like that. It also shut down properly afterwards, once the firmware update was completed (during the system reboot). I'm writing this to give a better understanding that usually (at least for other systems) an online firmware update through the IPMI web interface doesn't cause any service interruption, and here the storage nodes weren't affected either, as long as there was no external load. We updated nodes stor01-05 this way.
By mistake, on the last node (stor06) the firmware update was started without "suspending" the node first. The firmware update (and timer errors) started at around 20:19:
Four minutes later, at 20:23, the node started to produce very strange errors. In stdout/stderr (as captured by journalctl):
crash.log was created with tons of errors at that time as well, plus tons of errors in error.log and info.log; there are really too many to quote here, so I'm attaching these files.
Just two interesting notes: first of all, there were lots of "slow operation" errors like
and second, I was unable to execute the "mq-stats stor06" command from the manager node. "status stor06" worked fine, "version all" worked fine as well, but "mq-stats" produced no result (it exited immediately, without output or error). Also, on the other nodes, messages started to appear in the replication queue. At this point (after observing the situation long enough) the node was suspended, then rebooted.
"status" output for stor06:
Another (maybe unrelated) note is that the startup of stor06 after that took a very long time (> 40 min to start that node). The symptoms were as usual (highly fragmented leveldb .log files, startup disk-IO bound), but it was even slower than before; maybe it didn't shut down correctly after all and was trying to start up without state files or something. The important point here is that it took up to 20 seconds per 3-4 MB log file (with the
backend_db.eleveldb.write_buf_size = 5242880
setting), so in this configuration the 30-minute startup timeout wasn't enough and systemd kept killing the node until the startup timeout was increased. Since 30 minutes is the default startup timeout in the current repo, maybe it's best to increase that for everybody if there is no way to make the node actually start up faster.

Full logs from when this problem happened: stor06-error-logs.tar.gz
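As a rough illustration of the timeout workaround mentioned above (a sketch only - the unit name leofs-storage.service and the one-hour value are assumptions, not something shipped by the project), the systemd start timeout can be raised with a drop-in file, e.g. /etc/systemd/system/leofs-storage.service.d/override.conf:

[Service]
TimeoutStartSec=3600

followed by systemctl daemon-reload before the next start.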