
Refactors is_banned logic and forces health check on unban #288

Merged
merged 6 commits into from
Jan 20, 2023

Conversation

zainkabani
Contributor

@zainkabani zainkabani commented Jan 18, 2023

A banned instance stays banned only for the duration of the ban_time setting. If an instance's problems last longer than ban_time but less than the health check delay, the instance returns to the pool of available instances while it is still unhealthy, and clients connect to the bad replica.

This PR also makes the is_banned function more lightweight (it's used by the admin database) and forces a health check after an instance is unbanned.
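As a rough illustration of the problem described above, here is a minimal sketch, assuming a single-shard ban list keyed by a simplified Address type and a caller-supplied health-check closure (BanList, check_out, and the closure are all hypothetical; this is not pgcat's actual code). When a ban expires, the replica must pass a health check before it is handed to clients again, so a replica whose problem outlasts ban_time is re-banned instead of silently rejoining the pool.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for a replica address.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Address {
    host: String,
    port: u16,
}

/// Each banned address maps to the instant it was banned.
struct BanList {
    banned: HashMap<Address, Instant>,
    ban_time: Duration,
}

impl BanList {
    fn ban(&mut self, address: &Address) {
        self.banned.insert(address.clone(), Instant::now());
    }

    /// Lightweight, read-only check: no logging, no side effects.
    fn is_banned(&self, address: &Address) -> bool {
        self.banned.contains_key(address)
    }

    /// Returns true if `address` may be handed to a client. An expired ban is
    /// not lifted blindly: the replica must pass `health_check` first, and a
    /// failed check restarts its ban.
    fn check_out<F>(&mut self, address: &Address, health_check: F) -> bool
    where
        F: FnOnce(&Address) -> bool,
    {
        let banned_at = match self.banned.get(address) {
            Some(at) => *at,
            None => return true, // Not banned.
        };
        if banned_at.elapsed() < self.ban_time {
            return false; // Ban still in effect.
        }
        if health_check(address) {
            self.banned.remove(address);
            true
        } else {
            self.ban(address); // Still unhealthy: ban again from now.
            false
        }
    }
}
```

Keeping is_banned read-only matches the goal of making that check cheap for the admin database; the side effects (unbanning and health checking) live in a separate method.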

Refactor of: #184

@zainkabani zainkabani marked this pull request as ready for review January 18, 2023 19:42
src/pool.rs Outdated
warn!("Unbanning all replicas.");
return false;
write_guard[address.shard].clear();
drop(write_guard);
Contributor

@levkk levkk Jan 19, 2023

I drop it before issuing a warn! because warn! is technically IO, so it's pretty slow. In this case, though, it may be okay to wait the 0.0001 s it takes to print something to the screen before unlocking the mutex. If you're dropping it just before a return, you don't need an explicit drop, because the guard is dropped automatically when the function returns.

Contributor Author

The idea here is to time the log closer to the actual banning event.
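The trade-off in this thread can be sketched with std::sync::RwLock, with eprintln! standing in for warn! (assumption: the snippets call read() without unwrap(), which suggests a non-poisoning lock such as parking_lot's, but std keeps the example dependency-free). The Pool layout and method names are hypothetical. The first variant releases the write lock before doing IO; the second logs while holding it and relies on the guard being dropped automatically when the function returns.

```rust
use std::sync::RwLock;

/// Hypothetical pool: one Vec of banned replica ids per shard.
struct Pool {
    banlist: RwLock<Vec<Vec<usize>>>,
}

impl Pool {
    /// Variant 1: release the lock explicitly, then do the slow logging IO.
    fn unban_all_then_log(&self, shard: usize) -> bool {
        let mut write_guard = self.banlist.write().unwrap();
        write_guard[shard].clear();
        drop(write_guard); // Other threads can take the lock from here on.
        eprintln!("Unbanning all replicas."); // IO outside the critical section.
        false // Caller treats the replica as not banned.
    }

    /// Variant 2: log while still holding the lock, so the message is emitted
    /// right next to the state change; the guard is dropped implicitly when
    /// it goes out of scope at the end of the function.
    fn log_then_unban_all(&self, shard: usize) -> bool {
        let mut write_guard = self.banlist.write().unwrap();
        eprintln!("Unbanning all replicas.");
        write_guard[shard].clear();
        false
    }
}
```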

src/pool.rs Outdated
debug!("{:?} is ok", address);
false
}
let now = chrono::offset::Utc::now().naive_utc();
Contributor

It would be cool to just use std::time instead of chrono, since we don't really care about timezones. We assume that the timezone of the machine won't change between invocations of Instant::now().

Contributor Author

This is used consistently across all the banning logic, so I can open a new PR to change it later.
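Following the suggestion to use std::time instead of chrono, a ban-expiry check could look like the sketch below. std::time::Instant is monotonic (a later measurement is never earlier than a previous one) and carries no time zone, so the machine's time zone does not enter into it. The function name ban_expired and the explicit now parameter are hypothetical choices made for testability; in practice now would be Instant::now().

```rust
use std::time::{Duration, Instant};

/// Returns true once `ban_time` has passed since `banned_at`.
fn ban_expired(banned_at: Instant, now: Instant, ban_time: Duration) -> bool {
    // saturating_duration_since returns a zero duration if `now` is
    // earlier than `banned_at`, so a ban can never look expired early.
    now.saturating_duration_since(banned_at) >= ban_time
}
```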

src/pool.rs Outdated

true
} else {
warn!("{:?} is banned", address);
Contributor

This will spam the log massively on loaded systems, think 4,000 times per second and more. It should be debug imo.

Contributor Author

Yeah, that's a fair point.

src/pool.rs
let guard = self.banlist.read();
/// Determines if we can try to unban this server
pub async fn can_unban(&self, address: &Address) -> bool {
// If somehow primary ends up being banned we should return true here
Contributor

We should keep the "why", i.e. the primary can never and should never be banned.
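A minimal sketch of keeping the "why" next to the primary short-circuit, assuming simplified hypothetical Role and Address types and a plain HashSet as the ban list (ban expiry is omitted). The function name usable is also hypothetical; the reason given in the comment about a single primary per shard is an assumption about the topology, not a quote from the code.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Role {
    Primary,
    Replica,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Address {
    id: usize,
    role: Role,
}

/// Returns true if the address may be used.
fn usable(banlist: &HashSet<Address>, address: &Address) -> bool {
    // The primary can never be banned, and should never be banned: assuming
    // a single primary per shard, there is no other instance to send its
    // traffic to. If it somehow ends up in the ban list, treat it as usable.
    if address.role == Role::Primary {
        return true;
    }
    // Replicas are usable only if they are not in the ban list.
    !banlist.contains(address)
}
```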

src/pool.rs Outdated
return true;
}

// Check if all instances are banned, in that case unban everything
Contributor

Suggested change
// Check if all instances are banned, in that case unban everything
// Check if all replicas are banned, in that case unban all of them
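The suggested comment describes a fallback: if every replica in a shard is banned, the whole ban list for that shard is cleared. A minimal sketch, assuming the shard's ban list is a set of replica indices and its replica count is known (the function name and both parameters are hypothetical):

```rust
use std::collections::HashSet;

/// Clears `banned` if it covers every replica in the shard.
/// Returns true if the list was cleared.
fn unban_all_if_all_banned(banned: &mut HashSet<usize>, replica_count: usize) -> bool {
    if replica_count > 0 && banned.len() >= replica_count {
        // Every replica is banned: unban all of them rather than leave the
        // shard with no replica to route to.
        banned.clear();
        return true;
    }
    false
}
```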

let read_guard = self.banlist.read();
let banned_timestamp = match read_guard[address.shard].get(address) {
Some(timestamp) => timestamp.clone(),
None => return true,
Contributor

The debug! log here was useful during development for knowing that the instance is not banned. It would be nice to log "address is ok" here.

Contributor Author

This function is not responsible for reporting that the address is okay; the is_banned function already does that and logs it.

src/pool.rs Outdated
// Check if ban time is expired
let read_guard = self.banlist.read();
let banned_timestamp = match read_guard[address.shard].get(address) {
Some(timestamp) => timestamp.clone(),
Contributor

Don't need to clone, you can keep a reference.

Contributor Author

Since get returns a reference, and we want to drop the guard immediately after reading this value, I clone it. I could do the operations I need before dropping the guard, though.
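One way to reconcile both comments: copy the timestamp out of the map inside a short block, so the read guard is released as soon as the value has been read. For Copy types such as std::time::Instant (or chrono's NaiveDateTime, which also implements Copy), dereferencing copies the value just as cheaply as clone(), and the copy no longer borrows from the guard. The Pool layout and the ban_expired method are hypothetical simplifications, and std::sync::RwLock stands in for whatever lock the real code uses.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Hypothetical pool: one ban map per shard, keyed by replica index.
struct Pool {
    banlist: RwLock<Vec<HashMap<usize, Instant>>>,
    ban_time: Duration,
}

impl Pool {
    /// None if the replica is not banned, otherwise whether its ban has expired.
    fn ban_expired(&self, shard: usize, replica: usize) -> Option<bool> {
        let banned_at: Instant = {
            let read_guard = self.banlist.read().unwrap();
            let timestamp = match read_guard[shard].get(&replica) {
                Some(timestamp) => *timestamp, // Copy the value out of the map.
                None => return None,           // Not banned.
            };
            timestamp
        }; // read_guard is dropped here, before any further work.
        Some(banned_at.elapsed() >= self.ban_time)
    }
}
```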

src/pool.rs Outdated

let guard = self.banlist.read();
/// Determines if we can try to unban this server
pub async fn can_unban(&self, address: &Address) -> bool {
Contributor

@levkk levkk Jan 19, 2023

I don't think this is the right name for this function. can_unban tells me that you're checking if you're allowed to unban the instance given some condition, and then you can decide whether to do so or not. In this implementation, you're unbanning the instance if you can in the same function. Maybe a better name could be is_unbanned?

Contributor Author

Changed to try_unban.

@levkk
Contributor

levkk commented Jan 20, 2023

Sweet.

@levkk levkk merged commit a0e740d into postgresml:main Jan 20, 2023