ensure we reconnect on failure #173

Merged 3 commits on May 10, 2024
Changes from 1 commit
27 changes: 26 additions & 1 deletion — src/extensions/client/endpoint.rs

@@ -20,8 +20,10 @@ pub struct Endpoint {
url: String,
health: Arc<Health>,
client_rx: tokio::sync::watch::Receiver<Option<Arc<Client>>>,
reconnect_tx: tokio::sync::mpsc::Sender<()>,
on_client_ready: Arc<tokio::sync::Notify>,
background_tasks: Vec<tokio::task::JoinHandle<()>>,
connect_counter: Arc<AtomicU32>,
}

impl Drop for Endpoint {
@@ -38,19 +40,23 @@ impl Endpoint {
health_config: HealthCheckConfig,
) -> Self {
let (client_tx, client_rx) = tokio::sync::watch::channel(None);
let (reconnect_tx, mut reconnect_rx) = tokio::sync::mpsc::channel(1);
Collaborator: tokio::sync::Notify may be a better option.

Member Author: Notify is a one-off signal, but we may need to reconnect multiple times.
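The trade-off in this exchange can be shown with a std-only sketch (std::sync::mpsc stands in for the tokio channel; names are illustrative): a bounded channel of capacity 1 can deliver a reconnect signal repeatedly over time, which is what the connection loop needs.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // A channel can carry a reconnect signal more than once, which is
    // why it was preferred here over a one-shot style notification.
    let (reconnect_tx, reconnect_rx) = mpsc::sync_channel::<()>(1);

    let worker = thread::spawn(move || {
        let mut reconnects = 0u32;
        // The connection loop: each received signal triggers one reconnect.
        while reconnect_rx.recv().is_ok() {
            reconnects += 1;
        }
        reconnects
    });

    // Request a reconnect twice, at different times.
    reconnect_tx.send(()).unwrap();
    reconnect_tx.send(()).unwrap();
    drop(reconnect_tx); // close the channel so the worker exits

    assert_eq!(worker.join().unwrap(), 2);
}
```

The capacity of 1 matches the real code: a second signal arriving while one is already pending coalesces with it, since one reconnect services both.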

let on_client_ready = Arc::new(tokio::sync::Notify::new());
let health = Arc::new(Health::new(url.clone(), health_config));
let connect_counter = Arc::new(AtomicU32::new(0));

let url_ = url.clone();
let health_ = health.clone();
let on_client_ready_ = on_client_ready.clone();
let connect_counter_ = connect_counter.clone();

// This task will try to connect to the endpoint and keep the connection alive
let connection_task = tokio::spawn(async move {
let connect_backoff_counter = Arc::new(AtomicU32::new(0));

loop {
tracing::info!("Connecting endpoint: {url_}");
connect_counter_.fetch_add(1, Ordering::Relaxed);

let client = WsClientBuilder::default()
.request_timeout(request_timeout.unwrap_or(Duration::from_secs(30)))
@@ -68,7 +74,15 @@ impl Endpoint {
on_client_ready_.notify_waiters();
tracing::info!("Endpoint connected: {url_}");
connect_backoff_counter.store(0, Ordering::Relaxed);
-            client.on_disconnect().await;
+
+            tokio::select! {
+                _ = reconnect_rx.recv() => {
+                    tracing::debug!("Endpoint reconnect requested: {url_}");
+                },
+                _ = client.on_disconnect() => {
+                    tracing::debug!("Endpoint disconnected: {url_}");
+                }
+            }
}
Err(err) => {
health_.on_error(&err);
@@ -88,8 +102,10 @@ impl Endpoint {
url,
health,
client_rx,
reconnect_tx,
on_client_ready,
background_tasks: vec![connection_task, health_checker],
connect_counter,
}
}

@@ -108,6 +124,10 @@ impl Endpoint {
self.on_client_ready.notified().await;
}

pub fn connect_counter(&self) -> u32 {
self.connect_counter.load(Ordering::Relaxed)
}
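The connect counter added above is a plain atomic shared between the background connection task and callers of the getter. A std-only sketch of the same pattern (a thread stands in for the tokio task):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // Shared counter: incremented by the background "connection" task,
    // read by the owner. Relaxed ordering suffices because it is a
    // statistic, not a synchronization point.
    let connect_counter = Arc::new(AtomicU32::new(0));
    let connect_counter_ = connect_counter.clone();

    let task = thread::spawn(move || {
        for _ in 0..3 {
            // one increment per (re)connect attempt
            connect_counter_.fetch_add(1, Ordering::Relaxed);
        }
    });

    task.join().unwrap();
    assert_eq!(connect_counter.load(Ordering::Relaxed), 3);
}
```

The tests below rely on exactly this getter to assert how many times each endpoint connected.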

pub async fn request(
&self,
method: &str,
@@ -165,4 +185,9 @@ impl Endpoint {
}
}
}

pub async fn reconnect(&self) {
// notify the client to reconnect
self.reconnect_tx.send(()).await.unwrap();
}
}
18 changes: 13 additions & 5 deletions — src/extensions/client/mod.rs

@@ -261,13 +261,17 @@ impl Client {
}
// wait for at least one endpoint to connect
futures::future::select_all(endpoints.iter().map(|x| x.connected().boxed())).await;
// Sort by health score
endpoints.sort_by_key(|endpoint| std::cmp::Reverse(endpoint.health().score()));
// Pick the first one
endpoints[0].clone()

endpoints
.iter()
.max_by_key(|endpoint| endpoint.health().score())
.expect("No endpoints")
.clone()
};

let mut selected_endpoint = healthiest_endpoint(None).await;
Collaborator: This is important: it ensures at least one endpoint is connected. Selecting just the first one may pick an endpoint whose connection fails and never recovers, so selected_endpoint.connected().await would never resolve.

Member Author (@xlc, May 10, 2024): In that case we need a test. The current behaviour makes unit tests non-deterministic, since the client may connect to any of the dummy servers, so it is best to fix the wait-for-connect behaviour anyway.
let mut selected_endpoint = endpoints[0].clone();

selected_endpoint.connected().await;

let handle_message = |message: Message, endpoint: Arc<Endpoint>, rotation_notify: Arc<Notify>| {
let tx = message_tx_bg.clone();
@@ -422,6 +426,10 @@ impl Client {
_ = selected_endpoint.health().unhealthy() => {
// Current selected endpoint is unhealthy, try to rotate to another one.
// In case of all endpoints are unhealthy, we don't want to keep rotating but stick with the healthiest one.

// The ws client may be in a state that requires a reconnect
selected_endpoint.reconnect().await;
Collaborator: This will execute the moment the endpoint becomes unhealthy, and when that happens it will already try to reconnect. I don't think this extra reconnect will help.

Member Author: There is no reconnect currently; we have to drop and re-create the client to actually reconnect. As it stands, if the remote drops the connection, the endpoint will always fail and can never connect again.

let new_selected_endpoint = healthiest_endpoint(None).await;
if new_selected_endpoint.url() != selected_endpoint.url() {
tracing::warn!("Switch to endpoint: {new_url}", new_url=new_selected_endpoint.url());
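The author's point that "reconnect" here means dropping the old client and building a fresh one can be sketched with a std-only stand-in (this Client and its Drop hook are illustrative, not the real jsonrpsee types):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Flag set when the old client's connection is torn down.
static OLD_CLIENT_DROPPED: AtomicBool = AtomicBool::new(false);

struct Client;

impl Drop for Client {
    fn drop(&mut self) {
        // Real code would close the websocket connection here.
        OLD_CLIENT_DROPPED.store(true, Ordering::Relaxed);
    }
}

fn main() {
    let client = Client;
    // There is no in-place reconnect: to reconnect, drop the broken
    // client and construct a fresh one, which is what the connection
    // loop does after reconnect_rx fires.
    drop(client);
    assert!(OLD_CLIENT_DROPPED.load(Ordering::Relaxed));
    let _fresh = Client; // new connection takes the old one's place
}
```

This is also why the connection task increments connect_counter on every loop iteration: each pass corresponds to one freshly built client.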
43 changes: 43 additions & 0 deletions — src/extensions/client/tests.rs

@@ -290,3 +290,46 @@ async fn health_check_works() {
handle1.stop().unwrap();
handle2.stop().unwrap();
}

#[tokio::test]
async fn reconnect_on_disconnect() {
let (addr1, handle1, mut rx1, _) = dummy_server().await;
let (addr2, handle2, mut rx2, _) = dummy_server().await;

let client = Client::new(
[format!("ws://{addr1}"), format!("ws://{addr2}")],
Some(Duration::from_millis(100)),
None,
Some(2),
None,
)
.unwrap();

let h1 = tokio::spawn(async move {
let _req = rx1.recv().await.unwrap();
// no response, let it timeout
Collaborator: A request timeout will make the endpoint unhealthy, so it will try to reconnect itself.
tokio::time::sleep(Duration::from_millis(200)).await;
});

let h2 = tokio::spawn(async move {
let req = rx2.recv().await.unwrap();
req.respond(json!(1));
});

let h3 = tokio::spawn(async move {
let res = client.request("mock_rpc", vec![]).await;
assert_eq!(res.unwrap(), json!(1));

tokio::time::sleep(Duration::from_millis(2000)).await;

assert_eq!(client.endpoints()[0].connect_counter(), 2);
assert_eq!(client.endpoints()[1].connect_counter(), 1);
});

h3.await.unwrap();
h1.await.unwrap();
h2.await.unwrap();

handle1.stop().unwrap();
handle2.stop().unwrap();
}