[leo_manager][leo_storage] Inconsistent RING can happen when rebalancing the cluster #1078
I'd like to summarize what happens when hitting this issue.

What happens

That being said, all we have to do are

TODO

In order to
$ ./package/leo_manager_0/bin/leo_manager remote_console
(manager_0@127.0.0.1)1> leo_redundant_manager_api:create().

The RING hash displayed by leofs-adm status will change to the same value as the storage nodes' one.
@mocchira node01 still refuses to start (with del_dir removed). Maybe the full logs will help. info:
error:
crash.log (last part; before that there are a few other ERROR REPORTs regarding different files, and they look exactly like the one I quoted here):
Should we remove all queues on all nodes, or wait for some other solution?
Yes, as I've answered at #1076 (comment). Below is the relevant information quoted from that comment.
So the procedure you would have to take is:
OR wait for the racy problem I described at #1076 (comment) to get fixed, and cherry-pick the fix into your LeoFS build (the fix just moves one line to another position, so it should be easy to cherry-pick the commit, I think). The latter option will be available three or four days later (my next working day), so if you can't wait for it, the former one would be your choice.
@mocchira (Cherry-picking, building and installing would be pretty hard for this cluster, unfortunately, as I'm not operating it and am just helping to solve this problem. For example, this cluster has never been upgraded since it was initially installed on 1.3.8, because upgrading hasn't been tested by its operator yet.)
Got it.
No. It's not safe so
The procedure you described above would be best if it's acceptable to stop all nodes for a short time.
@mocchira However, the original problem with errors in the logs of node01 has returned. There are messages in the logs like
Whereis output:
Another strange one is
Anyhow, these errors appear only on node01, and there are quite a lot of them. There are no errors on any of the other nodes (other than some related to timeouts during recover-node operations, and cause,not_found ones when objects that don't exist are requested).
Great to hear that.
I see. This might be some other problem happening only on node01. I will dig in from my next working day.
@vstax Given that
That being said,
@mocchira I do not see any difference between the RINGs on storage nodes and gateways; there are some differences between the RING on storage nodes and on the manager, but only in the timestamps:
The other thing I've noticed is that the *_worker.log files are now 1.5 times larger; you can compare them to the older ones posted at #1076 (comment), even though the RING (and .dump) files themselves are essentially the same (at least on the storage nodes). Anyhow, dumps from the manager: manager.tar.gz. Just in case, status output for the manager and the first node:
@vstax Thanks. It seems there are no problems around the RING. Anyway, I've just remembered that there is a drawback to increasing num_of_containers, although it has several advantages like decreasing the AVS file size, minimizing the compaction time, etc. Increasing num_of_containers increases the number of Erlang processes dealing with AVS operations (you can think of it as the degree of concurrency); OTOH, leo_redundant_manager/leo_redundant_manager_worker, which handles RING operations like get_redundancies_by_key, remains a single Erlang process. That being said, as num_of_containers increases, the possibility of a timeout/error on RING operations can also increase. Can you run leofs_doctor on node01 to check whether the above assumption is correct? (If it's correct, then leo_redundant_manager's or leo_redundant_manager_worker's queue size will have grown to a large value.)
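For reference, that queue size can also be checked directly from the storage node's remote console. A minimal sketch, assuming the worker is registered under its module name leo_redundant_manager_worker and that the install path mirrors the manager's one shown above:

$ ./package/leo_storage_0/bin/leo_storage remote_console
(storage_0@127.0.0.1)1> erlang:process_info(whereis(leo_redundant_manager_worker), message_queue_len).

A message_queue_len that keeps growing would support the single-process bottleneck theory; a value near zero would point elsewhere.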
@vstax If the increased num_of_containers is the culprit, then you may decrease its value to a lower one to solve this problem.
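For what it's worth, a sketch of what such a change looks like in leo_storage.conf (the thread refers to the setting as obj.num_of_containers; the key names below are the ones commonly used in that file, and the paths and the value 32 are purely illustrative, one list entry per AVS directory):

## leo_storage.conf (sketch)
obj_containers.path = [/avs/1, /avs/2, /avs/3, /avs/4, /avs/5, /avs/6]
obj_containers.num_of_containers = [32, 32, 32, 32, 32, 32]

The node has to be restarted for the change to take effect.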
@mocchira Actually, now that I've checked, there are 100% CPU spikes in an Eleveldb thread for a second or two every 5-7 seconds or so:
However, these are present on all nodes in the cluster. Still, it's somewhat strange. Anyhow, leofs_doctor output for the problematic node node01: https://pastebin.com/0WxpQF4p
And for node05, which is under recovery atm (as in
And for node09, which is running on a weaker configuration under a VM (once again, no problems in its logs): https://pastebin.com/e9BhE0zx
Queues are empty on all of these nodes. The rest of the nodes (06, 07 and 08) have non-zero queues, as they're pushing data to node05, but node01 and node09 currently don't participate in any such operations.
From what I understand, the load at that moment was quite low (I don't have exact numbers; a quick tcpdump on the gateway shows something around 20 reads per second and almost no writes at all). I think the health check (/_leofs_adm/ping) was misconfigured and is requested 2-3 times per second; I'm not sure whether that can create a problem or not. Another thing I noticed is that "telegraf" is running on each node and gathering stats over SNMP locally, doing it rather often (every second). Not sure if it could be the source of the problem, but it's configured that way on each node as well, so most likely it is not...
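As an aside, a quick way to see which threads inside the beam.smp process are the ones spiking, using standard Linux tools (a sketch; <beam_pid> stands for the storage node's beam.smp PID):

$ pidstat -t -p <beam_pid> 1 10   # per-thread CPU usage, one-second samples
$ top -H -p <beam_pid>            # interactive per-thread view

The thread ids reported there are the ones strace can then be attached to.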
Well, it's an option, but I have doubts about two things. First of all, I managed another cluster in the past and it ran perfectly fine with 4 * 64 containers; there were no such problems even during intense HEAD+PUT operations during data migration, despite somewhat weaker hardware (a 4-core Xeon E3). Here node01 has 6 * 64 containers, a bit more, but the load is really low and the hardware is better. Second, there are these Eleveldb CPU spikes; something just doesn't feel right about them. Maybe they are the original source and, in addition to something else (e.g. leo_redundant_manager, like you mention), create problems for node01? Is it possible to understand the source of these spikes? EDIT: One more thing, I've launched
@vstax Thanks for the quick reply with the results of leofs_doctor on several nodes. It seems there are no problems at all, so my assumption was wrong.
Yes, this can be the original source of the problem, so I will dig into it from now on.
The queue filling-up phase, which happens right after recover-node is started, tends to hog only one OS thread (iterating over eleveldb to find the objects to be sent to the target node is executed one by one); OTOH, the queue consuming phase, which starts right after the filling-up phase finishes, tends to hog as many scheduler threads as possible. This behavior might explain what is happening on your cluster now.
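One way to observe that difference from the remote console is Erlang's scheduler wall-time statistics: roughly one busy scheduler during the filling-up phase versus many during the consuming phase. A minimal sketch (the node name is illustrative):

(storage_0@127.0.0.1)1> erlang:system_flag(scheduler_wall_time, true).
(storage_0@127.0.0.1)2> T0 = erlang:statistics(scheduler_wall_time),
                        timer:sleep(5000),
                        T1 = erlang:statistics(scheduler_wall_time),
                        %% per-scheduler utilization over the 5-second window
                        [{Id, (A1 - A0) / (W1 - W0)} ||
                            {{Id, A0, W0}, {Id, A1, W1}} <-
                                lists:zip(lists:sort(T0), lists:sort(T1))].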
I see. None of the nodes were filling queues in any of the above examples, though. node01 and node09 weren't doing anything with the queues, and node05 was only accepting data pushed by queues from other nodes. I don't know if processing of the recovery queue on e.g. node07, which is pushing data to node05, can cause any kind of load on node01 (which is long done with its recovery queue), though. I wish I could check whether the CPU spikes happen in the absence of any recovery going on at all, but that's not likely to happen (there are 14-24 million messages in the recovery queues right now, as the next recover-node was launched some time ago). In the worst case, if you are unable to find a solution to improve the current situation, do you think that recreating node01 (maybe with a new name) would fix all the problems? (That could be done in 3-4 days from now, I think, after recover-node is finished for the other 5 nodes.)
Processing of the recovery queue on node07, which is pushing data to node05 (triggered by leofs-adm recover-node node05), shouldn't cause any kind of load on node01 (only node07 and node05 can be affected).
I see. Please check it out once every recover-node operation has finished.
Yes, I think so (actually I was going to ask you to do so if I couldn't find any solution within a certain period). Just in case, if you do this, I'd like you to set obj.num_of_containers back to the original value (16 * 6) to be safe. (At this moment I'm still not 100% sure that obj.num_of_containers has affected this problem, but at least I believe there is a possibility that it can.) Anyway, I'm still trying to reproduce this problem on my dev box, and if I succeed in reproducing it and find out the root problem and its solution, I'll get back to you here.
@mocchira Here is a trace from node06, which is not participating in any recovery; its queues are all at zero. The load on the cluster is around 50-60 GETs/sec. But just look at the strace statistics (gathered over 20 seconds), and that's just for the single Eleveldb thread out of the few (2-3 or so) that consume lots of CPU (only one consumes a whole core at any given moment, but it can switch to another one every 20-40 seconds or so):
That's over 2,000 reads per second (generally all from the .sst files of AVS). I have no idea whether that's how leveldb works or this number is too high. The strace statistics for node01 (not participating in recovery, and with queues at zero as well) are very similar:
(The number of reads can vary in such a 20-second run, in a range like 10,000-50,000, depending on the luck of catching the Eleveldb thread that's actually doing the work. I haven't seen more than 3 of these per node, though.) Full strace outputs for node01 and node07, plus lsof output for node01: strace.tar.gz By the way, the queue .sst files are opened and closed quite a lot.
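For reference, roughly how per-thread syscall summaries and the open-file list like the ones above can be gathered (a sketch; <tid> is the busy Eleveldb thread id found via top -H, and <beam_pid> the beam.smp PID):

$ strace -c -p <tid>                  # attach to the single thread; Ctrl-C after ~20 s prints the syscall summary
$ lsof -p <beam_pid> | grep '\.sst'   # .sst files the node currently has open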
@mocchira Since recovery was finished for 5 nodes, I tried some experiments that might show the nature of the problem. I suspended and stopped node01 (it had zero queues at that moment) and wiped all the data from AVS. I launched it without data (same settings, 6*64 AVS files) and resumed it, and the errors started to appear right away:
I wanted to check whether the errors are related to some problem with the metadata on node01. It looks like they aren't, as they happen even when the objects aren't present there at all. Interestingly, read-repair also didn't work for the objects that produced these errors! It works just fine for the objects that are requested without errors, though (here, all 123 objects on the node are from read-repair, obviously).
Then I wiped everything once again and started the node with 6 * 16 containers. No errors! Your guess about 6 * 64 containers being problematic was quite right. I removed everything and started with 6 * 32 containers. No errors so far. So it looks like some kind of problem appears only when the total number of containers is greater than 192, or maybe 256 (what I used in the other cluster), and the 384 containers here were enough for it to appear. Anyhow, unless the errors return, this also lowers the priority of this problem (still, I'm kind of interested in knowing what the original reason for it was!).
Now regarding the second problem, the spiky CPU usage in the Eleveldb thread, which turned out to be unrelated to the errors on node01. Right after the node was launched without data, these 100% CPU spikes appeared again, despite the node holding almost no objects. I suspended node01 from the manager and they still happened! Which seemed quite strange. Here are strace and lsof outputs for the Eleveldb thread from node01 with wiped data, holding only about 200 objects and being in "suspend" state: strace-lsof-node01.tar.gz
So it turns out that it's reading the .sst files from queue/1 like crazy, even though there is nothing in the queues:
All the disk usage is in .sst files; from what I understand they should contain no real data, only some removed entries that haven't been compacted. But every few seconds or so the Eleveldb thread tries to read them and does something with what it reads (using lots of CPU in the process). Note that these queues were wiped just a few days ago when I was trying to fix the launch of node01. I think this problem might also be the reason why
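A quick way to check how much stale data those leftover queue .sst files actually hold (a sketch; the queue path is illustrative and depends on where leo_storage's work directory is configured):

$ du -sh /path/to/leo_storage/work/queue/1
$ ls /path/to/leo_storage/work/queue/1 | grep -c '\.sst$'   # count of leftover .sst files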
Great to hear that! And I see, regarding the previous comment.
Now I'm suspecting that some lock contention has occurred because of scrambling for some shared resource between the underlying leveldb instances. This rarely happens when num_of_containers is relatively small; however, it starts to happen once num_of_containers goes above a certain threshold (192 or 256, as you mentioned). I will look into what the actual resource is next week.
#1060 might be the culprit (there are many queues, and each queue has its own polling Erlang process), so this might be fixed once your cluster gets upgraded to >= 1.4.2.
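If it helps to confirm that once the upgrade happens, the total number of Erlang processes on a storage node (which includes those per-queue polling processes) can be compared before and after from the remote console; a minimal sketch:

(storage_0@127.0.0.1)1> erlang:system_info(process_count).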
Now it turns out that there have been three problems in the original issue filed at #1076.
I'm going to deal with the first problem here, and the second problem at #1086. I think the last one will probably be solved by #1060, as I mentioned in the comment above, so I'm not filing an issue for it at the moment; however, if you still face it after upgrading to >= 1.4.2, I'd like to file an issue for that (or please file one yourself). From now on, let's discuss each topic on the appropriate issue.
Through the investigation for #1076, I've found a way to reproduce this problem.
In short, trying to join a node whose node name was previously used by another node that existed in the cluster can cause this RING inconsistency problem. For example, the procedure below can reproduce it.
Prerequisites
Steps to reproduce
The result
As you can see, the Manager RING hash differs from the one on storage nodes.
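A sketch of how the mismatch shows up (field and column names vary slightly between LeoFS versions): the ring hash reported in the system-configuration section of leofs-adm status reflects the Manager's RING, and it no longer matches the ring-hash columns printed for each storage node.

$ leofs-adm status   # compare 'current ring hash' in the system section with the per-node hash columns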