Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add migration error statistic #4643

Merged
merged 9 commits into from
Feb 27, 2025
Merged

Conversation

BorysTheDev
Copy link
Contributor

fixes: #4572

added migration_errors_total statistic
fixed data race

@BorysTheDev BorysTheDev force-pushed the add_migration_error_statistic branch from b4755e9 to 0e68416 Compare February 21, 2025 11:17
auto err = cntx_.GetError();
LOG(ERROR) << err.Format();
ReportError(std::move(err));
cntx_.Reset(nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will preempt the current fiber because it will need to Join on the error handler fiber ReportError. So if this is the case, do we really need to sleep below ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it reduces a load on the instance. I don't see any issues with it and prefer to leave as is

@@ -1038,6 +1038,22 @@ void ClusterFamily::PauseAllIncomingMigrations(bool pause) {
}
}

size_t ClusterFamily::MigrationsErrorNum() const {
util::fb2::LockGuard lk(migration_mu_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this mutex ? GetErrorNum() is an atomic operation so what's the reason for the extra synchronization here ?

Copy link
Contributor Author

@BorysTheDev BorysTheDev Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incoming_migrations_jobs_ and outgoing_migration_jobs_ can be removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh now I understand. And then the atomic is to do the increment without the lock

@@ -48,6 +48,7 @@ class IncomingSlotMigration {
}

void ReportError(dfly::GenericError err) ABSL_LOCKS_EXCLUDED(error_mu_) {
error_num_.fetch_add(1, std::memory_order_relaxed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's per migration flow right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per every migration, not migration flow

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes yes that's what I meant. Lit makes sense

@BorysTheDev BorysTheDev force-pushed the add_migration_error_statistic branch from 57d82ec to 67d575c Compare February 21, 2025 11:33
LOG(ERROR) << err.Format();
ReportError(std::move(err));
cntx_.Reset(nullptr);
ThisFiber::SleepFor(500ms); // wait some time before next retry
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change the sleep time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided that it's enough, because we are going to implement force migration finalization

dfly::GenericError last_error_;
mutable util::fb2::Mutex error_mu_;
dfly::GenericError last_error_ ABSL_GUARDED_BY(error_mu_);
std::atomic<size_t> error_num_ = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error count is better I think
but could be that reconnect count is even better as the fact that we have error does not say that we did restarted , but in our flow I think that each error results in restarting the migration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to leave the errors count, it is more flexible, because we can improve our logic in the future and for example, restart only one flow if it fails.


await push_config(json.dumps(generate_config(nodes_info)), [c_nodes[0]])

await wait_for_errors_num(c_nodes[0], 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you dont have another push_config after you edit the config so what are we checking ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I check the missing config on the target node error

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I dont understand.
in line 3026 you edit the config but if you dont push it, it has no affect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right I will check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tests works correctly because we restart the migration process and get the error one more time.

@BorysTheDev BorysTheDev force-pushed the add_migration_error_statistic branch from 108d25d to b31a3a2 Compare February 26, 2025 12:06
@@ -223,13 +220,7 @@ void OutgoingMigration::SyncFb() {
}

if (!CheckRespIsSimpleReply("OK")) {
if (CheckRespIsSimpleReply(kUnknownMigration)) {
VLOG(2) << "Target node does not recognize migration; retrying";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that before your change when CheckRespIsSimpleReply(kUnknownMigration) we did not do cntx_.ReportError. was this a bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it wasn't a bug. As I remember @andydunstall removed this error because they parsed the status to understand the migration error status. Now I've added an error statistic so the control plane team can ignore this error, so I returned this error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we will get alerted by that from graffana once they move to track this migration error count

Copy link
Contributor Author

@BorysTheDev BorysTheDev Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we have discussed with Andy that we generate alerts only if a number of errors is more than some threshold because we restart migration automatically if we get an error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andydunstall I want to make sure control plane test will not fail due to this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer do not remove it because it's error, we don't know do you forget to send the config or it's only a delay

Copy link
Contributor Author

@BorysTheDev BorysTheDev Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andydunstall Is it possible to send a config to the target node first?
Also as an option we can make an error after 10 seconds for example

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't know do you forget to send the config

The control plane monitors that every node is configured, so from the data planes perspective it's not an error

Is it possible to send a config to the target node first?

It would add a lot of complexity which doesn't seem worth it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andydunstall What if I add timeout or 10 attempts before report the error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure - as discussed in chat i don't really see the point given the control plane monitors whether configuration is propagated - but a 30s timeout sounds fine

Signed-off-by: Borys <borys@dragonflydb.io>
@BorysTheDev BorysTheDev force-pushed the add_migration_error_statistic branch from 8f06bb6 to 5b16fd0 Compare February 26, 2025 12:22
@BorysTheDev BorysTheDev merged commit 4da1678 into main Feb 27, 2025
10 checks passed
@BorysTheDev BorysTheDev deleted the add_migration_error_statistic branch February 27, 2025 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add metric for migration errors
4 participants