Recover-node fails to recover all data on storage node #880
It looks similar to the issue I had: #881
@windkit
However, while these objects are usually missing on both stor01 and stor03, there are a few cases among them (fewer than 10) where an object that is missing on stor03 is present on stor01 (primary) and stor02:
When both stor01 and stor03 are being recovered at once, this bug might be causing situations like this as well. I'd say it is probably caused by the same problem as #881; the only (strange) addition here is that on stor03 I'm missing objects specifically from the avs3 directory, even though all four directories were removed and recovered at once. I find it hard to believe that the removed avs3 on stor01 could have caused that, but since I don't know exactly how the AVS directory/file is picked, I can't say anything about it. If, in theory, there were a relation between the third AVS directory on stor01 and stor03 (i.e. objects that go to both of these nodes end up in the same AVS directory?), this could happen. But having such a relation would be really strange, I think?
Objects with the same key belong to the same AVS directory/file on each replica node (if the nodes have the same obj_containers settings), according to https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_api.erl#L352, so yes, there is a relation.
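For illustration, here is a minimal sketch of that relation in Python (not the actual Erlang code linked above; the hash function is an assumption): if every node derives the container index purely from a hash of the object key and its obj_containers count, then identically configured replica nodes always pick the same AVS slot for the same key.

```python
import hashlib

# Illustrative only -- the real selection lives in leo_object_storage_api.erl.
# Assumption: the container index depends only on the object key and the
# number of configured containers, so identically configured nodes agree.
def container_index(key: str, num_containers: int) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_containers

# With 4 containers per node (as on this cluster), the same key maps to the
# same slot on every replica node that stores it:
for key in ("bucket/a.jpg", "bucket/b.jpg", "bucket/c.jpg"):
    print(key, "->", container_index(key, 4))
```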
@mocchira From df on stor03:
i.e. around 47 GB of data arrived there, but I'm still missing around 110 GB. Checking objects with whereis, it looks like the objects that have now arrived on stor03 are the ones that existed on stor01 (with stor01 being primary) but were missing on stor03 (the second example in #880 (comment)). As for the objects that are missing on both stor01 and stor03 and have either stor01 or stor03 as primary (with only the third copy existing somewhere else), none have arrived on stor01/stor03.
@vstax Thanks for checking.
#889 should fix both cases ONLY if you run recover-node one by one.
@mocchira Another note: either because of some of the latest patches, or maybe something else, I get the feeling that the first stage of recover-node (where leo_per_object_queue gets filled) proceeded several times faster than during previous experiments (including the original one, when I ran a single recover-node on a fully consistent node around two weeks ago). It only took minutes to get 2.5 M objects into the queue of each node, after which the queue simply started to be consumed. The consumption itself is much slower with this patch, however (it still saturates the incoming network but apparently has to send much more data now). But, well, it's not a large price to pay for being able to actually get all the objects that are supposed to be on the node :)
Good to hear that.
Yes, I'm also quite surprised. Anyway, thanks for running this test; it will help improve the quality.
Thanks :) Just in case, let us know how much longer recover-node takes with the latest patch. If it takes considerable time then, as I said in the reply above, we will consider a more efficient way to reduce the traffic somehow (actually, I have an idea for that).
@vstax Finally found the culprit; now we are considering the best way to fix this problem. EDIT: currently, if you run 2 recover-node in parallel, the former one can be incomplete (part of the objects will not be recovered), while the latter one completes without any inconsistencies.
Apparently it's about twice as long now. I'm not 100% sure, but I think the number of messages in leo_per_object_queue at its peak was around twice what it was before as well. I see. Regarding two recoveries at once: if there is no easy fix, I guess it should at least be added to the documentation, or the nodes internally shouldn't start filling the queue with objects related to the second recovery before finishing with the first one (probably resulting in the same one-by-one recovery, but the user won't be able to mess it up)?
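As an aside, a hedged sketch of that "serialize the recoveries" idea in Python (purely illustrative; this is not how leo_storage is implemented): requests are deferred so the per-object queue only ever holds messages for one recovery target at a time.

```python
from __future__ import annotations
from collections import deque

# Illustrative sketch: defer recover-node requests instead of interleaving
# two recoveries in the same per-object queue.
class RecoverScheduler:
    def __init__(self) -> None:
        self.pending: deque[str] = deque()
        self.active: str | None = None

    def request(self, node: str) -> None:
        if self.active is None:
            self.active = node
            self.start_filling_queue(node)
        else:
            self.pending.append(node)  # wait until the current recovery drains

    def on_queue_drained(self) -> None:
        self.active = None
        if self.pending:
            self.request(self.pending.popleft())

    def start_filling_queue(self, node: str) -> None:
        print(f"start enqueueing objects to recover for {node}")

scheduler = RecoverScheduler()
scheduler.request("stor01")   # starts immediately
scheduler.request("stor03")   # deferred until stor01's queue drains
scheduler.on_queue_drained()  # now stor03 starts
```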
@vstax just in case, let me clarify
What do you mean "the problem with inconsistent RING"?
OK. We will document it for now and fix it later somehow.
@mocchira Ah, sorry, I wrote it wrong. Yes, I meant #846, and of course it's not "inconsistent"; I was probably thinking of something else when writing and made a mistake. Anyhow, given our current situation (space is running out on the old storage and we don't want to add hardware there since we're migrating to LeoFS), we'll have to start the migration in just a few days (probably on Monday), regardless of whether the situation with uneven distribution can be solved before then or not. (Naturally, we reserve enough space to be able to roll back to the old systems if something goes wrong in a week or a few; it's just that we'll run out of that reserve space if we wait longer.) So I wondered whether it's possible to perform rebalance/recover after fixing the RING distribution if we launch as things are now, or whether that would be too dangerous to do. If it isn't possible, it's not a huge problem per se; it's just that I (liking perfection a bit too much) don't like the current situation, but from an objective point of view the uneven distribution is not at a critical level, just a bit annoying. As I perform the last tests like recover-node and delete-bucket, I can clearly see (from the queues and the load on the servers) how the uneven distribution makes these tasks take longer.
@vstax Thanks for answering. Got it.
We are going to make it possible to fix the uneven distribution safely on a working cluster afterwards.
@mocchira OK, thank you. I managed to create a better RING, so the uneven distribution is much less of a deal now. This issue is fixed now: I managed to recover both nodes and don't see any issues when recovering nodes one by one, so it might be OK to close this. (The issue with two recoveries at once still has to be documented or fixed.)
Got it. I will close this once we reach a consensus on how to fix it (I've now come up with several ways to solve this problem, so we are in the middle of discussing them within the team) and the documentation has been published.
There are 6 storage nodes, N=3, W=2, R=1; everything is consistent at the start of the experiment. Each node has 4 AVS directories (one per drive). There are no deleted/overwritten objects on the cluster (just a tiny amount of multipart headers), and the ratio of active size is 100% on each node. The distribution of data is not even between nodes, but very even between AVS directories on each node (less than 0.3% difference). The cluster is running with the #876 changes.
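As a hedged illustration (a sketch of the quorum arithmetic only, not LeoFS code) of what N=3, W=2, R=1 implies for this experiment: with a write quorum of 2, uploads keep succeeding while one replica node such as stor03 is stopped, and that is exactly the data recover-node later has to push back to it.

```python
# Sketch of the quorum semantics only; not LeoFS internals.
N, W, R = 3, 2, 1  # replicas, write quorum, read quorum

def write_succeeds(replicas_available: int) -> bool:
    # A PUT is acknowledged once W of the N replica writes succeed,
    # so a single stopped replica node does not fail uploads.
    return min(replicas_available, N) >= W

def read_succeeds(replicas_available: int) -> bool:
    # A GET needs only R replica responses.
    return min(replicas_available, N) >= R

assert write_succeeds(2)      # e.g. stor03 stopped, two replicas still up
assert not write_succeeds(1)  # two of the three replicas unavailable
assert read_succeeds(1)
```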
The experiment is as follows:
There are "slow operation" messages in info logs, example log:
This is during step 3), while stor03 was stopped; at other times the number of such messages was much smaller (really insignificant amounts). There are some timeouts in the gateway logs as well.
Soon after stor03 started, I got a bunch of messages like this in the error log on stor01/02/04/05/06:
There weren't terribly many of them; they started to appear when stor03 finished starting up and stopped about 15 minutes later.
Logs on stor03 have just these two warnings:
Expected, I suppose, given that for some objects the copies on both stor01 and stor03 were lost and only the third copy exists.
I don't see any real signs of problems up to this point. However, this is the end result after the recovery was over; df output on stor03:
This doesn't make sense; I'm clearly missing data in avs3. Each of the 64 AVS files there is around 10-11 GB, while it's supposed to be around 13 GB like on the other nodes. This is the first problem.
On stor01, where I removed avs3, I get this:
At first glance, this is expected: since I stopped stor03 in the middle of preparing data to be pushed to stor01 to fill "/mnt/avs3", and deleted the queues there, it couldn't restore everything on stor01. However, technically, since N=3, some other node could have pushed that data in place of stor03. Anyway, the real problem here is that the amount of lost data (>400 GB) can't possibly match the amount of data that stor03 was supposed to push to stor01! With 6 nodes, it should have been at most about 1/6 of the data, not nearly half of it.
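A rough back-of-the-envelope check of that claim, using the approximate numbers quoted in this issue (64 AVS files per directory, ~13 GB each when healthy, 6 nodes):

```python
# All figures are approximations taken from the df numbers in this issue.
files_per_avs_dir = 64
gb_per_healthy_avs_file = 13
expected_avs_dir_size = files_per_avs_dir * gb_per_healthy_avs_file  # ~832 GB

nodes = 6
# If the recovery sources were spread evenly across the cluster, losing one
# source (stor03) should leave at most roughly 1/6 of the directory missing:
max_expected_gap = expected_avs_dir_size / nodes  # ~139 GB

observed_gap = 400  # >400 GB actually missing from /mnt/avs3 on stor01
print(max_expected_gap, observed_gap)  # the observed gap is roughly 3x larger
```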
Another (strange) fact is that stor03 is lacking data in "/mnt/avs3" as well. Which is strange: "/mnt/avs3" is what I removed on stor01, while on stor03 I removed all of the avs directories. I double-checked it now (the creation dates of the surrounding directories and so on); there should be nothing special about avs3 on stor03. No idea whether this is a coincidence or not.
That said, I didn't notice this discrepancy at the time (and didn't notice until now that stor03 is lacking some data), and executed "recover-node stor01" again. There was no other load on the cluster. After it was over, stor01 looked like this:
Which is wrong as well: some data that's supposed to be there clearly isn't, even though there were no problems during "recover-node". stor03 is missing some data too, but I don't know whether that can affect this.
The amount of data missing on stor01 and on stor03 doesn't match the amount of data I was uploading (as I calculated earlier, that should have been 57-67 GB per AVS directory). It also by no means matches the number of objects mentioned in the error/info logs during these experiments, which is relatively small.
I can show that data is missing another way as well. This is the current "du" output:
From #846 I know that before the delete and recover, stor01 had 113% of the objects compared to stor06 and stor03 had 103% compared to stor06. The upload process doesn't change these numbers. But now the distribution is: stor01 has 110% of the objects compared to stor06, and stor03 has 98%. While the counters used for "du" might be wrong under some conditions, here they match the amount of data missing in the AVS files (e.g. stor01 is missing ~110 GB of data, which is 3%).
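A quick consistency check of that 3% figure (approximate numbers only, and assuming object counts and bytes scale together, as the comment above does): the drop from 113% to 110% of stor06's object count corresponding to ~110 GB implies a per-node data volume in the same ballpark as 4 AVS directories of ~832 GB each.

```python
# Rough sanity check; every number here is an approximation from this issue.
missing_on_stor01_gb = 110          # data missing from stor01's AVS files
ratio_drop = 1.13 - 1.10            # stor01 fell from 113% to 110% of stor06

implied_stor06_volume_gb = missing_on_stor01_gb / ratio_drop  # ~3,700 GB
per_node_estimate_gb = 4 * 64 * 13                            # ~3,328 GB

# The two estimates agree to within roughly 10%, so the du percentages and
# the missing-data figure tell the same story.
print(implied_stor06_volume_gb, per_node_estimate_gb)
```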