feat: add migration error statistic #4643

BorysTheDev · 2025-02-21T10:25:46Z

added migration_errors_total statistic
fixed data race

kostasrim · 2025-02-21T11:10:15Z

src/server/cluster/outgoing_slot_migration.cc

+      auto err = cntx_.GetError();
+      LOG(ERROR) << err.Format();
+      ReportError(std::move(err));
+      cntx_.Reset(nullptr);


This will preempt the current fiber because it will need to Join on the error handler fiber in ReportError. If that is the case, do we really need to sleep below?

It reduces the load on the instance. I don't see any issues with it and would prefer to leave it as is.

src/server/cluster/outgoing_slot_migration.cc

tests/dragonfly/cluster_test.py

kostasrim · 2025-02-21T11:16:56Z

src/server/cluster/cluster_family.cc

@@ -1038,6 +1038,22 @@ void ClusterFamily::PauseAllIncomingMigrations(bool pause) {
  }
 }

+size_t ClusterFamily::MigrationsErrorNum() const {
+  util::fb2::LockGuard lk(migration_mu_);


Why do we need this mutex? GetErrorNum() is an atomic operation, so what's the reason for the extra synchronization here?

Jobs can be removed from incoming_migrations_jobs_ and outgoing_migration_jobs_, so the lock protects the containers while we iterate over them.

Oh, now I understand. And the atomic is there so the increment can happen without the lock.
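To illustrate the split discussed in this thread, here is a minimal sketch, not the code from this PR. It assumes a plain `std::mutex` in place of `util::fb2::Mutex`, and `MigrationJob` and `MigrationRegistry` are hypothetical stand-ins for the migration classes and `ClusterFamily`. Each job bumps its own atomic counter without any lock, while the reader takes the lock only to iterate safely over a list whose entries may be removed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a single migration job (not the Dragonfly class).
class MigrationJob {
 public:
  // Called from the job's own execution context; no container lock is needed.
  void ReportError() {
    error_num_.fetch_add(1, std::memory_order_relaxed);
  }

  std::size_t GetErrorNum() const {
    return error_num_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<std::size_t> error_num_{0};
};

// Hypothetical owner of the job list.
class MigrationRegistry {
 public:
  std::shared_ptr<MigrationJob> AddJob() {
    std::lock_guard<std::mutex> lk(mu_);
    jobs_.push_back(std::make_shared<MigrationJob>());
    return jobs_.back();
  }

  void RemoveJob(const std::shared_ptr<MigrationJob>& job) {
    assert(job != nullptr);
    std::lock_guard<std::mutex> lk(mu_);
    for (auto it = jobs_.begin(); it != jobs_.end(); ++it) {
      if (*it == job) {
        jobs_.erase(it);
        return;
      }
    }
  }

  // The lock protects iteration over jobs_, not the counters themselves.
  std::size_t ErrorsTotal() const {
    std::lock_guard<std::mutex> lk(mu_);
    std::size_t total = 0;
    for (const auto& job : jobs_) {
      total += job->GetErrorNum();
    }
    return total;
  }

 private:
  mutable std::mutex mu_;
  std::vector<std::shared_ptr<MigrationJob>> jobs_;
};
```

In this sketch, removing a job also drops its errors from the total, because the total is a sum over the jobs that currently exist.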

kostasrim · 2025-02-21T11:18:31Z

src/server/cluster/incoming_slot_migration.h

@@ -48,6 +48,7 @@ class IncomingSlotMigration {
  }

  void ReportError(dfly::GenericError err) ABSL_LOCKS_EXCLUDED(error_mu_) {
+    error_num_.fetch_add(1, std::memory_order_relaxed);


That's per migration flow, right?

Per migration, not per migration flow.

Yes, that's what I meant. It makes sense.

src/server/cluster/outgoing_slot_migration.h

src/server/cluster/outgoing_slot_migration.cc

adiholden · 2025-02-23T16:10:35Z

src/server/cluster/outgoing_slot_migration.cc

+      LOG(ERROR) << err.Format();
+      ReportError(std::move(err));
+      cntx_.Reset(nullptr);
+      ThisFiber::SleepFor(500ms);  // wait some time before next retry


Why did you change the sleep time?

I decided that it's enough, because we are going to implement force migration finalization.

src/server/cluster/cluster_family.cc

adiholden · 2025-02-23T16:21:35Z

src/server/cluster/outgoing_slot_migration.h

-  dfly::GenericError last_error_;
+  mutable util::fb2::Mutex error_mu_;
+  dfly::GenericError last_error_ ABSL_GUARDED_BY(error_mu_);
+  std::atomic<size_t> error_num_ = 0;


I think error count is better.
A reconnect count could be even better, since having an error does not mean that we restarted, but in our flow I think each error results in restarting the migration.

I would prefer to keep the error count. It is more flexible, because we can improve our logic in the future and, for example, restart only one flow if it fails.

src/server/cluster/outgoing_slot_migration.cc

adiholden · 2025-02-26T10:08:39Z

tests/dragonfly/cluster_test.py

+
+    await push_config(json.dumps(generate_config(nodes_info)), [c_nodes[0]])
+
+    await wait_for_errors_num(c_nodes[0], 1)


You don't have another push_config after you edit the config, so what are we checking?

I check for the missing-config error on the target node.

Sorry, I don't understand.
In line 3026 you edit the config, but if you don't push it, it has no effect.

You are right, I will check.

The test works correctly because we restart the migration process and get the error one more time.

adiholden · 2025-02-26T12:11:31Z

src/server/cluster/outgoing_slot_migration.cc

@@ -223,13 +220,7 @@ void OutgoingMigration::SyncFb() {
    }

    if (!CheckRespIsSimpleReply("OK")) {
-      if (CheckRespIsSimpleReply(kUnknownMigration)) {
-        VLOG(2) << "Target node does not recognize migration; retrying";


I see that before your change, when CheckRespIsSimpleReply(kUnknownMigration) was true, we did not call cntx_.ReportError. Was this a bug?

No, it wasn't a bug. As I remember, @andydunstall removed this error because they parsed the status to understand the migration error status. Now that I've added an error statistic, the control plane team can ignore this error, so I restored it.

But we will get alerted by that from Grafana once they move to tracking this migration error count.

No, we discussed with Andy that we generate alerts only if the number of errors exceeds some threshold, because we restart the migration automatically if we get an error.

@andydunstall I want to make sure the control plane tests will not fail due to this change.

I would prefer not to remove it because it is an error: we don't know whether you forgot to send the config or it's only a delay.

@andydunstall Is it possible to send a config to the target node first?
Also, as an option, we can report the error only after 10 seconds, for example.

> we don't know do you forget to send the config

The control plane monitors that every node is configured, so from the data plane's perspective it's not an error.

> Is it possible to send a config to the target node first?

It would add a lot of complexity, which doesn't seem worth it?

@andydunstall What if I add a timeout or 10 attempts before reporting the error?

Sure. As discussed in chat, I don't really see the point given that the control plane monitors whether configuration is propagated, but a 30s timeout sounds fine.
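The timeout idea agreed on above could be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation merged in this PR: `GracePeriodReporter` is a hypothetical helper, and the 30-second value in the usage note is only the figure mentioned in the thread. The helper suppresses a transient error, such as the target node not yet recognizing the migration, until it has persisted for the whole grace period.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical helper: suppresses a transient error until it has persisted
// for at least a grace period.
class GracePeriodReporter {
 public:
  using Clock = std::chrono::steady_clock;

  explicit GracePeriodReporter(Clock::duration grace) : grace_(grace) {
    assert(grace >= Clock::duration::zero());
  }

  // Call each time the transient error is observed. Returns true once the
  // error has been seen for at least the grace period since its first
  // occurrence in the current window.
  bool ShouldReport(Clock::time_point now) {
    if (!first_seen_) {
      first_seen_ = now;
    }
    return now - *first_seen_ >= grace_;
  }

  // Call after a successful attempt so the next failure starts a new window.
  void Reset() { first_seen_.reset(); }

 private:
  Clock::duration grace_;
  std::optional<Clock::time_point> first_seen_;
};
```

A retry loop would construct it with, for example, `std::chrono::seconds(30)`, call `ShouldReport(Clock::now())` on every unknown-migration reply, and call `Reset()` once the target answers with OK. Passing the time point in explicitly keeps the helper testable without sleeping.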

Signed-off-by: Borys <borys@dragonflydb.io>

BorysTheDev requested review from kostasrim and andydunstall February 21, 2025 10:25

feat: add migration error statistic

0e68416

BorysTheDev force-pushed the add_migration_error_statistic branch from b4755e9 to 0e68416 Compare February 21, 2025 11:17

kostasrim reviewed Feb 21, 2025

refactor: address comments

67d575c

BorysTheDev force-pushed the add_migration_error_statistic branch from 57d82ec to 67d575c Compare February 21, 2025 11:33

BorysTheDev added 2 commits February 21, 2025 22:33

feat: added metric

9d31542

fix:remove extra sleep

d0857b6

BorysTheDev requested a review from adiholden February 22, 2025 09:04

adiholden reviewed Feb 23, 2025

src/server/cluster/outgoing_slot_migration.h

adiholden reviewed Feb 23, 2025

src/server/cluster/outgoing_slot_migration.cc

adiholden reviewed Feb 23, 2025

src/server/cluster/cluster_family.cc

adiholden reviewed Feb 23, 2025

src/server/cluster/cluster_family.cc

adiholden reviewed Feb 23, 2025

src/server/cluster/outgoing_slot_migration.cc

refactor: address comments

1d6466e

adiholden reviewed Feb 26, 2025

BorysTheDev added 2 commits February 26, 2025 12:10

refactor: address comments

acc429a

refactor: address comments

b31a3a2

BorysTheDev force-pushed the add_migration_error_statistic branch from 108d25d to b31a3a2 Compare February 26, 2025 12:06

adiholden reviewed Feb 26, 2025

Merge branch 'main' into add_migration_error_statistic

5b16fd0

Signed-off-by: Borys <borys@dragonflydb.io>

BorysTheDev force-pushed the add_migration_error_statistic branch from 8f06bb6 to 5b16fd0 Compare February 26, 2025 12:22

feat: add timeout for UnknownMigration error report

c063ca2

BorysTheDev requested a review from adiholden February 27, 2025 11:17

adiholden approved these changes Feb 27, 2025

BorysTheDev merged commit 4da1678 into main Feb 27, 2025
10 checks passed

BorysTheDev deleted the add_migration_error_statistic branch February 27, 2025 11:38
