Recover-node fails to recover all data on storage node #880

Closed
vstax opened this issue Oct 13, 2017 · 16 comments

@vstax
Contributor

vstax commented Oct 13, 2017

There are 6 storage nodes, N=3, W=2, R=1; everything is consistent at the start of the experiment. Each node has 4 AVS directories (one per drive). There are no deleted / overwritten objects on the cluster (just a tiny amount of multipart headers); the ratio of active size is 100% on each node. The distribution of data is not even between nodes, but very even between the AVS directories on each node (less than 0.3% difference). The cluster is running with the #876 changes.

The experiment is as follows:

  1. I start uploading data to the cluster. I will be uploading 500 GB of data, that is, it should increase disk usage on each node by roughly 500 * 3 / 6 = 250 GB, or ~62 GB per AVS directory on each node. In reality, due to the uneven distribution, there will be up to 16% difference between nodes, so it will be roughly 57 to 67 GB per AVS directory (but the increase should be the same for each AVS directory on any given node). See the rough calculation after this list.
  2. I suspend and stop the first node, stor01, and remove one of its AVS directories, simulating a drive failure. I start the node, resume it and execute "recover-node stor01". The queues on the other nodes start filling, and lots of traffic flows to stor01. The lost AVS directory starts filling up.
  3. I suspend stor03 and stop it, simulating another node failure in the middle of the recover-node operation. N=3, so I'm not supposed to lose any data from this. Since the queues on stor03 are large and trying to fill fast, stopping takes some time (plus tons of badarg errors from leveldb right before it stops) - but this is expected.
  4. I wipe all data on stor03, including all AVS files and all queues (including membership). I launch it and resume it - it has the same name, automatically attaches to the cluster, gets the RING and starts receiving the data that is being uploaded to the cluster. The upload of data doesn't stop because W=2 is satisfied at all times.
  5. I execute "recover-node stor03" (with recover-node stor01 still going on). I can see both stor01 and stor03 receiving lots of data. I wait until all queues are empty.
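
For reference, the expected growth from step 1 works out as follows (a rough back-of-the-envelope sketch in Python that just restates the numbers above):

# Rough estimate of the expected disk-usage growth per AVS directory (step 1).
# Assumptions: 500 GB uploaded, N=3 replicas, 6 storage nodes, 4 AVS directories per node.
uploaded_gb = 500
replicas = 3
nodes = 6
avs_dirs_per_node = 4

per_node_gb = uploaded_gb * replicas / nodes     # ~250 GB of new data per node
per_avs_gb = per_node_gb / avs_dirs_per_node     # ~62 GB per AVS directory

# With up to ~16% imbalance between nodes this becomes roughly 57-67 GB per AVS directory.
print(per_node_gb, per_avs_gb)                   # 250.0 62.5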

There are "slow operation" messages in info logs, example log:

[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:10:54.586854 +0300	1507749054	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,get},{key,<<"bod1/00/de/f1/00def1df121b056aba83f3dad7fa75538010002d93e62e31d3b122eb01a8863b426425d77f844c208439fb6f8043ab7688240a0000000000.xz">>},{processing_time,6839}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:10:54.587046 +0300	1507749054	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod1/3f/66/4d/3f664df2f427a0ac57cd3b4dcae7ae2d36a2a295e8f292372f0ea476ce0859ea8a139d2fe5f1b427d01c59c4de1d939de8b40d0000000000.xz">>},{processing_time,6410}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:10:59.729838 +0300	1507749059	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<>>},{processing_time,5197}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:18.728820 +0300	1507749078	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<>>},{processing_time,6609}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:18.728972 +0300	1507749078	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod1/55/55/ba/5555bad0ba80004a65f2be8e8740797517db050e471786913ffb05b5f8d8fdea183844e623cb960cada725e418072bdb009e090000000000.xz">>},{processing_time,5167}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.181764 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<>>},{processing_time,7453}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.181938 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod1/3f/f5/87/3ff5870173ffbd88eb9af762bffc495992ef49b91ecf2568a574d77172fb86084c1b400cec0e3e95fcdbd9317e4e16e828f4600000000000.xz">>},{processing_time,7372}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.182116 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{processing_time,6307}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.182272 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{processing_time,6297}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.182399 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,put},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{processing_time,6840}]

This is from step 3, while stor03 was stopped; at other times the number of such messages was much smaller (really insignificant). There are some timeouts in the gateway logs as well.

Soon after stor03 started, I got a bunch of messages like this in the error logs on stor01/02/04/05/06:

[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:07:21.616638 +0300        1507748841      leo_mq_consumer:consume/4       526     [{module,leo_storage_mq},{id,leo_per_object_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_4_0,{get,{216869700495053906819207869156722907209,<<"bod1/00/42/c7/0042c78f42247f361f70be868be52e8ddbcbad83b57235175f4687be9f20bd1fb95158ffa5a0edb62200a9008c1eda5b00ad070000000000.xz">>},-1,-1,true,666268},30000]}}}]
[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:07:21.757639 +0300        1507748841      leo_mq_consumer:consume/4       526     [{module,leo_storage_mq},{id,leo_per_object_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_4_1,{get,{233468761328939195850605299707763487896,<<"bod1/00/23/b8/0023b899c8affee881b520131c1c1f5c70f8a5a598c31460dd4fc93b4b6a44e95f800257d0c9506f823ae2ec2fe73cfd00da070000000000.xz">>},-1,-1,true,666409},30000]}}}]
[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:07:21.820630 +0300        1507748841      leo_mq_consumer:consume/4       526     [{module,leo_storage_mq},{id,leo_per_object_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_4_1,{get,{188914636427983509554839109418181432768,<<"bod1/00/14/b4/0014b4aa8dd8721b33c1bf871c65a27292390bee5670f31749cfe8c66326ecd24086db43c8c14e90a574d90a67a9631c00a6070000000000.xz">>},-1,-1,true,666472},30000]}}}]
[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:11:26.181940 +0300        1507749086      leo_storage_handler_object:put/4        424     [{from,storage},{method,delete},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{req_id,11017595},{cause,not_found}]

There weren't terribly many of them; they started to appear when stor03 finished starting up and stopped about 15 minutes later.

The logs on stor03 have just these two warnings:

[W]     bodies03@stor03.selectel.cloud.lan     2017-10-12 00:05:32.20685 +0300 1507755932      leo_storage_read_repairer:compare/4     167     [{node,'bodies01@stor01.selectel.cloud.lan'},{addr_id,7946999318617476916794463740239994466},{key,<<"bod13/08/a9/cc/08a9cce0df7a51a65bc1da075a03714f8146cdf732308de9fc24262e67332bad1539086398b1688f8d622a6fc83db077b06e090100000000.xz\neca1aedfb5c7f8bf816f93c4c7f0e848">>},{clock,1507755930845926},{cause,not_found}]
[W]     bodies03@stor03.selectel.cloud.lan     2017-10-11 22:11:26.316603 +0300        1507749086      leo_storage_read_repairer:compare/4     167     [{node,'bodies01@stor01.selectel.cloud.lan'},{addr_id,59684934647226860873027859410263418647},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{clock,1507749079342341},{cause,not_found}]

Expected, I suppose, given that for some objects the copies on both stor01 and stor03 were lost and only the third copy exists.

I don't see any real signs of problems up to this point. However, this is the end result after recovery was over; df output on stor03:

/dev/sdd2                    5,5T         805G  4,7T           15% /mnt/avs4
/dev/sdc2                    5,5T         650G  4,8T           12% /mnt/avs3
/dev/sda4                    5,5T         806G  4,7T           15% /mnt/avs1
/dev/sdb4                    5,5T         807G  4,7T           15% /mnt/avs2

This doesn't make sense. I'm clearly missing data in avs3. Each of the 64 AVS files there is around 10-11 GB, while it's supposed to be around 13 GB as on the other nodes. This is the first problem.

On stor01, where I removed avs3, I get this:

/dev/sdd2                    5,5T         888G  4,6T           17% /mnt/avs4
/dev/sdc2                    5,5T         465G  4,4T            9% /mnt/avs3
/dev/sda4                    5,5T         888G  4,6T           17% /mnt/avs1
/dev/sdb4                    5,5T         889G  4,6T           17% /mnt/avs2

At first glance, this is expected: since I stopped stor03 in the middle of it preparing data to be pushed to stor01 to fill "/mnt/avs3", and deleted its queues, it couldn't restore everything on stor01. However, technically, since N=3, some other node could've pushed that data in place of stor03. Anyway, the real problem here is that the amount of lost data - more than 400 GB - can't possibly match the amount of data that stor03 was supposed to push to stor01. With 6 nodes, it should've been at most 1/6 of the directory, not nearly half of it.
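
A rough check of that claim with the df numbers above (a sketch; it assumes the recovery work is spread roughly evenly across the 6 nodes, so stor03 could have been responsible for at most about 1/6 of the data destined for stor01's avs3):

# Rough check of the avs3 deficit on stor01, based on the df output above.
expected_avs3_gb = 888                             # avs3 should be about the size of avs1/avs2/avs4
observed_avs3_gb = 465                             # what df shows for /mnt/avs3 after recovery
missing_gb = expected_avs3_gb - observed_avs3_gb   # ~423 GB missing

# If queue entries are spread evenly, stopping stor03 mid-recovery should cost
# at most about 1/6 of the data that had to be pushed back into avs3.
upper_bound_from_stor03_gb = expected_avs3_gb / 6  # ~148 GB

print(missing_gb, upper_bound_from_stor03_gb)      # ~423 GB lost vs ~148 GB explainable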

Another strange fact is that stor03 is lacking data in "/mnt/avs3" as well. "/mnt/avs3" is what I removed on stor01; on stor03 I removed all of the AVS directories. I double-checked it now, including the creation dates of the surrounding directories and such - there should be nothing special about avs3 on stor03. No idea whether this is a coincidence or not.

That said, I didn't notice this discrepancy at the time (and didn't notice until now that stor03 is lacking some data) and executed "recover-node stor01" again. There was no other load on the cluster. After it was over, stor01 looked like this:

/dev/sdd2                    5,5T         888G  4,6T           17% /mnt/avs4
/dev/sdc2                    5,5T         777G  4,7T           15% /mnt/avs3
/dev/sda4                    5,5T         888G  4,6T           17% /mnt/avs1
/dev/sdb4                    5,5T         889G  4,6T           17% /mnt/avs2

This is wrong as well. Some data that's supposed to be there clearly isn't, even though there were no problems during "recover-node". stor03 is missing some data too, but I don't know whether that can affect this.

The amount of data missing on either stor01 or stor03 doesn't match the amount of data I was uploading (as I calculated earlier, that should've been 57-67 GB per AVS directory). It also by no means matches the number of objects mentioned in the error/info logs during these experiments, which is relatively small.

I can show that data is missing in another way as well. This is the current "du" output:

[vm@bodies-master ~]$ leofs-adm du bodies01@stor01.selectel.cloud.lan
 active number of objects: 5770311
  total number of objects: 5861284
   active size of objects: 3691180624254
    total size of objects: 3691207752413
     ratio of active size: 100.0%
    last compaction start: 2017-09-24 19:17:31 +0300
      last compaction end: 2017-09-24 19:18:24 +0300

[vm@bodies-master ~]$ leofs-adm du bodies03@stor03.selectel.cloud.lan
 active number of objects: 5144047
  total number of objects: 5149233
   active size of objects: 3291583488114
    total size of objects: 3291585038728
     ratio of active size: 100.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[vm@bodies-master ~]$ leofs-adm du bodies06@stor06.selectel.cloud.lan
 active number of objects: 5254085
  total number of objects: 5359937
   active size of objects: 3360352546015
    total size of objects: 3360384109837
     ratio of active size: 100.0%
    last compaction start: 2017-09-24 19:08:15 +0300
      last compaction end: 2017-09-24 19:09:02 +0300

From #846 I know that, before the deletion and recovery, stor01 had 113% of the objects of stor06 and stor03 had 103% of the objects of stor06. The upload process doesn't change these ratios. But now the distribution is: stor01 has 110% of the objects of stor06, and stor03 has 98%. While the counters used for "du" might be wrong under some conditions, here they match the amount of data missing in the AVS files (e.g. stor01 is missing ~110 GB of data, which is 3%).
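
For reference, these percentages follow directly from the "active number of objects" counters in the du output above:

# Object-count ratios relative to stor06, taken from the "du" output above.
stor01_objects = 5770311
stor03_objects = 5144047
stor06_objects = 5254085

print(round(stor01_objects / stor06_objects * 100))   # ~110 (%), was 113% before the experiment
print(round(stor03_objects / stor06_objects * 100))   # ~98  (%), was 103% before the experiment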

@windkit
Contributor

windkit commented Oct 16, 2017

It looks similar to the issue I had, #881.
Could you try to sample some of the missing objects and see if they all have stor01 / stor03 as the primary (the 1st entry in leofs-adm whereis bucket/object)?
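
Something like the following can automate that check (a sketch only: it assumes leofs-adm is on the manager host's PATH, that keys.txt is a hypothetical file with one bucket/object key per line, and that the first row mentioning a storage node in the whereis output is the primary replica):

# Sketch: report the primary (first whereis entry) for a list of sampled keys.
import subprocess

with open("keys.txt") as f:                       # hypothetical list of missing keys
    keys = [line.strip() for line in f if line.strip()]

for key in keys:
    out = subprocess.run(["leofs-adm", "whereis", key],
                         capture_output=True, text=True).stdout
    # Take the first row that mentions a storage node as the primary replica;
    # the exact table layout may differ between LeoFS versions.
    primary = next((row for row in out.splitlines() if "@stor" in row), "(no replica listed)")
    print(key, "->", primary.strip())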

@vstax
Contributor Author

vstax commented Oct 16, 2017

@windkit
Looks like you're right. I managed to find the names of about 200 missing objects, and they all have either stor01 or stor03 as the primary:

       | bodies03@stor03.selectel.cloud.lan      |                                      |            |              |                |                |                | 
       | bodies01@stor01.selectel.cloud.lan      |                                      |            |              |                |                |                | 
       | bodies04@stor04.selectel.cloud.lan      | cbb793cf5e7dbbed69205e64bff27755     |       287K |   72f56a577c | false          |              0 | 55a3c89d67b93  | 2017-09-28 12:29:00 +0300

However, while these objects are usually missing on both stor01 and stor03, there are a few cases among them (fewer than 10) where an object that is missing on stor03 is present on stor01 (the primary) and stor02:

       | bodies01@stor01.selectel.cloud.lan      | 931bf6ae3ccd0358962a637160b423d2     |       929K |   1c1d75cd00 | false          |              0 | 55a3c883eeb4a  | 2017-09-28 12:28:33 +0300
       | bodies02@stor02.selectel.cloud.lan      | 931bf6ae3ccd0358962a637160b423d2     |       929K |   1c1d75cd00 | false          |              0 | 55a3c883eeb4a  | 2017-09-28 12:28:33 +0300
       | bodies03@stor03.selectel.cloud.lan      |                                      |            |              |                |                |                | 

It could be that when both stor01 and stor03 are being recovered at once, this bug causes situations like this as well.

I'd say it is probably caused by the same problem as #881; the only (strange) addition here is that on stor03 I'm missing objects specifically from the avs3 directory, even though all four were removed and recovered at once. I find it hard to believe that the removed avs3 on stor01 could've caused that, but since I don't know exactly how an AVS directory/file is picked, I can't say anything about it. If, in theory, there were a relation between the third AVS directory on stor01 and stor03 (that is, objects that go to both of these nodes end up in the same AVS directory), this could happen. But having such a relation would be really strange, I think?

@mocchira
Member

@vstax

I'd say it is probably caused by the same problem as #881; the only (strange) addition here is that on stor03 I'm missing objects specifically from the avs3 directory, even though all four were removed and recovered at once. I find it hard to believe that the removed avs3 on stor01 could've caused that, but since I don't know exactly how an AVS directory/file is picked, I can't say anything about it. If, in theory, there were a relation between the third AVS directory on stor01 and stor03 (that is, objects that go to both of these nodes end up in the same AVS directory), this could happen. But having such a relation would be really strange, I think?

Objects with the same key belong to the same AVS directory/file on each replicating node (if the nodes have the same obj_containers settings), according to https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_api.erl#L352 - so yes, there is a relation.
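
In other words, the container index is a pure function of the key (given identical obj_containers settings), so every replica of a key lands in the same AVS directory/file index on every node. A minimal illustration of that idea (the actual hash used by leo_object_storage may differ - see the linked source; the key below is hypothetical):

# Illustration only: a key-deterministic mapping of objects to AVS containers.
# The real hash in leo_object_storage may differ; the point is that the index
# depends only on the key and the obj_containers layout, not on the node.
import zlib

def container_index(key: bytes, num_containers: int) -> int:
    return zlib.crc32(key) % num_containers

num_containers = 4 * 64             # 4 AVS directories x 64 AVS files per node in this setup
key = b"bod1/some/example/object"   # hypothetical key
# Any node with the same obj_containers layout computes the same index for this key,
# which is why losing "avs3" on two nodes affects the same set of objects on both.
print(container_index(key, num_containers))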

mocchira added a commit to mocchira/leofs that referenced this issue Oct 17, 2017
@mocchira
Member

@vstax please give it a try with #886

@vstax
Contributor Author

vstax commented Oct 17, 2017

@mocchira
This fix doesn't seem to change anything for me. From the state described above, I launched "recover-node stor01" and "recover-node stor03" at once. After recovery was finished, not a single extra object was stored on stor01, and only some of the objects that were missing on stor03 have appeared there. No error/info logs during the recovery process.

From df on stor03:

/dev/sdc2                    5,5T         697G  4,8T           13% /mnt/avs3

i.e. around 47 GB of data arrived there, but I'm still missing around 110 GB.
On stor01 the state is exactly the same as before, and the cdate of the AVS files hasn't changed, so there were no writes at all.

Checking objects with whereis, it looks like the objects that have now arrived on stor03 are the ones that existed on stor01, with stor01 being the primary, but were missing on stor03 (the second example in #880 (comment)). As for the objects that are missing on both stor01 and stor03 and have either stor01 or stor03 as the primary (with only the third copy existing somewhere else), none have arrived on stor01/stor03.

@mocchira
Member

@vstax Thanks for checking.

Checking objects with whereis, it looks like the objects that have now arrived on stor03 are the ones that existed on stor01, with stor01 being the primary, but were missing on stor03 (the second example in #880 (comment)). As for the objects that are missing on both stor01 and stor03 and have either stor01 or stor03 as the primary (with only the third copy existing somewhere else), none have arrived on stor01/stor03.

#889 should fix both cases ONLY if you run recover-node one by one.
As you guessed, it turned out that something goes wrong if you run multiple recover-node(s) in parallel; however, I still haven't found the bug, so I'll keep tackling this issue tomorrow.
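
Until that is fixed, the one-by-one workflow can be scripted from the manager host. A sketch (the node list, the mq-stats parsing, and the assumption that an empty leo_per_object_queue everywhere means the stage is done are all mine; the mq-stats column layout may differ between versions):

# Sketch: run recover-node strictly one node at a time, waiting for the
# leo_per_object_queue on all storage nodes to drain before starting the next.
import subprocess, time

NODES_TO_RECOVER = ["bodies01@stor01.selectel.cloud.lan",
                    "bodies03@stor03.selectel.cloud.lan"]
ALL_STORAGE_NODES = NODES_TO_RECOVER      # in a real cluster, list every storage node here

def per_object_queue_empty(node):
    out = subprocess.run(["leofs-adm", "mq-stats", node],
                         capture_output=True, text=True).stdout
    for row in out.splitlines():
        if "leo_per_object_queue" in row:
            # Assumed layout: the first integer on the row is the number of messages.
            counts = [tok for tok in row.replace("|", " ").split() if tok.isdigit()]
            return not counts or int(counts[0]) == 0
    return True

for target in NODES_TO_RECOVER:
    subprocess.run(["leofs-adm", "recover-node", target], check=True)
    while not all(per_object_queue_empty(n) for n in ALL_STORAGE_NODES):
        time.sleep(60)                    # poll until every node's queue is drained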

@vstax
Contributor Author

vstax commented Oct 18, 2017

@mocchira
The recovery of the nodes is still in progress - recovery of each node takes 12-14 hours (EDIT: apparently, it now takes much longer than that) - but from the looks of it the first node is recovering just fine now, so this fix works. Recover-file and read-repair work as well (maybe they worked before the patch too, not sure). I'm surprised that running two recover-node operations creates a different situation compared to recovering the nodes one after another; I thought there would be no difference.

Another note - either because of some of the latest patches, or maybe something else - I get the feeling that the first stage of recover-node (where leo_per_object_queue gets filled) proceeded a few times faster than in previous experiments (including the original one, when I was running a single recover-node on a fully consistent node around two weeks ago). It only took minutes to get 2.5 M objects into the queue of each node, and after that the queue just got consumed. The consumption itself is much slower with this patch, however (it still saturates the incoming network, but apparently much more data has to be sent now). But, well, it's not a large price to pay for actually getting all the objects that are supposed to be on the node :)

@mocchira
Member

@vstax

The recovery of the nodes is still in progress - recovery of each node takes 12-14 hours (EDIT: apparently, it now takes much longer than that) - but from the looks of it the first node is recovering just fine now, so this fix works. Recover-file and read-repair work as well (maybe they worked before the patch too, not sure).

Good to hear that.
As for it taking a much longer time, that's expected, because the latest patch sends more traffic than the previous one to deal with the possibility that other nodes may not have the proper objects. However, if it matters to you, we will consider a more efficient way to handle this case, so please let me know.

I'm surprised that running two recover-node operations creates a different situation compared to recovering the nodes one after another; I thought there would be no difference.

Yes, I'm also quite surprised. Anyway, thanks for doing this test; it will contribute to improving the quality.

Another note - either because of some of the latest patches, or maybe something else - I get the feeling that the first stage of recover-node (where leo_per_object_queue gets filled) proceeded a few times faster than in previous experiments (including the original one, when I was running a single recover-node on a fully consistent node around two weeks ago). It only took minutes to get 2.5 M objects into the queue of each node, and after that the queue just got consumed. The consumption itself is much slower with this patch, however (it still saturates the incoming network, but apparently much more data has to be sent now). But, well, it's not a large price to pay for actually getting all the objects that are supposed to be on the node :)

Thanks :) Just in case, let us know how much longer recover-node takes with the latest patch. If it takes considerable time then, as I said in the reply above, we will consider a more efficient way to reduce the traffic somehow (actually I have an idea for that).

@mocchira
Member

mocchira commented Oct 19, 2017

@vstax Finally found the culprit; now we are considering the best way to fix this problem.

EDIT: as things stand, if you run 2 recover-node operations in parallel, the former can be incomplete (part of the objects will not be recovered), while the latter completes without any inconsistencies.

@vstax
Contributor Author

vstax commented Oct 19, 2017

@mocchira

Thanks :) Just in case, let us know how much longer recover-node takes with the latest patch. If it takes considerable time then, as I said in the reply above, we will consider a more efficient way to reduce the traffic somehow (actually I have an idea for that).

Apparently it's about twice as long now. I'm not 100% sure, but I think the number of messages in leo_per_object_queue at its peak is around twice what it was before as well.
As for priority, I don't think it matters that much. I mean, it's nice if it can be improved, but given the circumstances of that operation, it shouldn't matter much whether it takes 40 or 80 hours in production - and if it does, then something was done incorrectly. As for our systems, only the problem with the inconsistent RING is of high priority now, though we'll probably launch as-is if it can be fixed on a working cluster afterwards?

I see, about two recoveries at once. If there is no easy fix, I guess it should at least be added to the documentation, or the nodes internally shouldn't start filling the queue with objects related to the second recovery before finishing with the first one (probably resulting in the same one-by-one recovery, but the user won't be able to mess it up)?

@mocchira
Member

@vstax just in case, let me clarify

As for our systems, only the problem with the inconsistent RING is of high priority now, though we'll probably launch as-is if it can be fixed on a working cluster afterwards?

What do you mean by "the problem with the inconsistent RING"?
This one (uneven distribution, #846), or something else that causes objects to be in an inconsistent state? If the latter, please point me to where the problem is described in detail (I may have missed it).

I see, about two recoveries at once. If there is no easy fix, I guess it should at least be added to the documentation, or the nodes internally shouldn't start filling the queue with objects related to the second recovery before finishing with the first one (probably resulting in the same one-by-one recovery, but the user won't be able to mess it up)?

OK, we will document it for now and fix it later somehow.

@vstax
Contributor Author

vstax commented Oct 19, 2017

@mocchira Ah, sorry, I wrote that wrong. Yes, I meant #846, and of course it's not "inconsistent"; I was probably just thinking of something else while writing and made a mistake.

Anyhow, given our current situation (space is running out on the old storage, and we don't want to add hardware there since we're migrating to LeoFS), we'll have to start the migration in just a few days (probably on Monday), regardless of whether the situation with the uneven distribution can be solved before then or not. (Naturally, we reserve enough space to be able to roll back to the old systems if something goes wrong in the first week or few; it's just that we'll run out of that reserve space if we wait longer.) So I wondered whether it's possible to perform rebalance/recover after fixing the RING distribution if we launch as it is now, or whether that would be too dangerous. If it isn't possible, it's not a huge problem per se - it's just that I (liking perfection a bit too much) don't like the current situation; from an objective point of view, the uneven distribution is not at a critical level, just a bit annoying. As I perform the last tests, like recover-node and delete-bucket, I can clearly see (from the queues and the load on the servers) how the uneven distribution makes these tasks take longer.

@mocchira
Member

@vstax Thanks for answering.

Got it.
Regardless of how the uneven distribution happened (bugs like #846, updating the same object frequently, bad luck, or something else), we are now considering implementing a sort of RING rebalance as a new feature for those who are facing uneven distribution right now.

though we'll probably launch as-is if it can be fixed on a working cluster afterwards?

We are going to enable you to fix the uneven distribution safely on a working cluster afterwards.
We will post progress updates in the #846 comments, so please keep an eye on it.

@vstax
Contributor Author

vstax commented Oct 23, 2017

@mocchira OK, thank you. I managed to create a better RING, so the uneven distribution is much less of an issue now.

This issue is fixed now - I managed to recover both nodes and don't see any issues when recovering nodes one by one, so it might be OK to close this. (The issue with two recoveries at once still has to be documented or fixed.)

@mocchira
Member

@vstax

This issue is fixed now - I managed to recover both nodes and don't see any issues when recovering nodes one by one, so it might be OK to close this. (The issue with two recoveries at once still has to be documented or fixed.)

Got it. I will close this once we reach a consensus on how to fix it (I've come up with several ways to solve this problem, so it's currently being discussed within the team) and the documentation has been published.

@mocchira
Member

Documentation #902 will be published once 1.3.8 has landed.
Multiple recover-node operations at once is filed as #910.
