-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add migration error statistic #4643
Conversation
b4755e9
to
0e68416
Compare
auto err = cntx_.GetError(); | ||
LOG(ERROR) << err.Format(); | ||
ReportError(std::move(err)); | ||
cntx_.Reset(nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will preempt the current fiber because it will need to Join on the error handler fiber ReportError
. So if this is the case, do we really need to sleep below ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it reduces a load on the instance. I don't see any issues with it and prefer to leave as is
@@ -1038,6 +1038,22 @@ void ClusterFamily::PauseAllIncomingMigrations(bool pause) { | |||
} | |||
} | |||
|
|||
size_t ClusterFamily::MigrationsErrorNum() const { | |||
util::fb2::LockGuard lk(migration_mu_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this mutex ? GetErrorNum()
is an atomic operation
so what's the reason for the extra synchronization here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incoming_migrations_jobs_ and outgoing_migration_jobs_ can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh now I understand. And then the atomic is to do the increment without the lock
@@ -48,6 +48,7 @@ class IncomingSlotMigration { | |||
} | |||
|
|||
void ReportError(dfly::GenericError err) ABSL_LOCKS_EXCLUDED(error_mu_) { | |||
error_num_.fetch_add(1, std::memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's per migration flow right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per every migration, not migration flow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes yes that's what I meant. Lit makes sense
57d82ec
to
67d575c
Compare
LOG(ERROR) << err.Format(); | ||
ReportError(std::move(err)); | ||
cntx_.Reset(nullptr); | ||
ThisFiber::SleepFor(500ms); // wait some time before next retry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change the sleep time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided that it's enough, because we are going to implement force migration finalization
dfly::GenericError last_error_; | ||
mutable util::fb2::Mutex error_mu_; | ||
dfly::GenericError last_error_ ABSL_GUARDED_BY(error_mu_); | ||
std::atomic<size_t> error_num_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error count is better I think
but could be that reconnect count is even better as the fact that we have error does not say that we did restarted , but in our flow I think that each error results in restarting the migration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to leave the errors count, it is more flexible, because we can improve our logic in the future and for example, restart only one flow if it fails.
tests/dragonfly/cluster_test.py
Outdated
|
||
await push_config(json.dumps(generate_config(nodes_info)), [c_nodes[0]]) | ||
|
||
await wait_for_errors_num(c_nodes[0], 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you dont have another push_config after you edit the config so what are we checking ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I check the missing config on the target node error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I dont understand.
in line 3026 you edit the config but if you dont push it, it has no affect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right I will check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tests works correctly because we restart the migration process and get the error one more time.
108d25d
to
b31a3a2
Compare
@@ -223,13 +220,7 @@ void OutgoingMigration::SyncFb() { | |||
} | |||
|
|||
if (!CheckRespIsSimpleReply("OK")) { | |||
if (CheckRespIsSimpleReply(kUnknownMigration)) { | |||
VLOG(2) << "Target node does not recognize migration; retrying"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that before your change when CheckRespIsSimpleReply(kUnknownMigration) we did not do cntx_.ReportError. was this a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it wasn't a bug. As I remember @andydunstall removed this error because they parsed the status to understand the migration error status. Now I've added an error statistic so the control plane team can ignore this error, so I returned this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we will get alerted by that from graffana once they move to track this migration error count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we have discussed with Andy that we generate alerts only if a number of errors is more than some threshold because we restart migration automatically if we get an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andydunstall I want to make sure control plane test will not fail due to this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer do not remove it because it's error, we don't know do you forget to send the config or it's only a delay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andydunstall Is it possible to send a config to the target node first?
Also as an option we can make an error after 10 seconds for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't know do you forget to send the config
The control plane monitors that every node is configured, so from the data planes perspective it's not an error
Is it possible to send a config to the target node first?
It would add a lot of complexity which doesn't seem worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andydunstall What if I add timeout or 10 attempts before report the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure - as discussed in chat i don't really see the point given the control plane monitors whether configuration is propagated - but a 30s timeout sounds fine
Signed-off-by: Borys <borys@dragonflydb.io>
8f06bb6
to
5b16fd0
Compare
fixes: #4572
added migration_errors_total statistic
fixed data race