Suspected busy-wait on a Mutex blocking bot #69
Comments
Log from it going down a couple of minutes ago:
I'm still none too sure as to the cause, but I have a gut feeling it's due to ShardHandle (de)registration in songbird::shards::SerenityShardHandle. From your trace there, it looks as though Shard 6's reconnect triggers it, but the surrounding events should be hitting Shard 5...
I'm wondering if it's related to the OOM issue I had the other day on a non-songbird serenity bot, and whether something to do with songbird is just exacerbating it (i.e., that songbird has a higher CPU footprint). Either way, the issue seems to come and go a lot. Last week it was down almost daily, but I've yet to run into an issue this week, and the issue never seems to happen on the weekends. Possibly it's something to do with when Discord restarts their gateways?
Adds some additional logging around some critical sections, rarely hit (i.e., during shard reconnections) in pursuit of issue serenity-rs#69. It's strongly suspected to lie here, at any rate...
I shoved some extra logging around the suspected issue spot in both …
Alright, thank you. I'll try running with that and see if I get any useful output.
Died again; logging on songbird has the following:
Thanks; did you filter these logs to only have songbird entries? Some of what is relevant might have ended up in serenity:
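For reference, a minimal sketch of a tracing-subscriber setup that keeps both crates' entries visible; the directive string and the env-filter feature used here are assumptions for illustration, not something prescribed by either library:

```rust
// Illustrative only: emit log output from both songbird and serenity so that
// neither crate's entries are filtered out while chasing this issue.
use tracing_subscriber::EnvFilter;

fn main() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::new("info,songbird=trace,serenity=debug"))
        .init();

    // ...start the bot as usual...
}
```

The same directives can also come from RUST_LOG if EnvFilter::from_default_env() is used instead of a hard-coded string.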
Sorry, the serenity logging was off at the time since my journal was being annoying, but I've got it running with it now, so I'll just wait for it to die again.
Apologies for that; hopefully those logs will be a bit more revealing. It looks as though there were a few channel update events near the end of the logs you've submitted which might be the root cause.
Fingers crossed; thank you for your patience with this! Really appreciated.
dead.log
I can't find any mention of shard handle (de/re-)registration, which I thought were the most likely causes (i.e., in and around complete shard reconnect events). There are a few resumes, but nothing weird there (and nothing indicating a shard death due to that). What's puzzling me is that the last events are all …
dead.log
I've also been hitting this issue and might be able to shed some light on it.

Information

What Happens

Varies a little by environment, but I'll focus on my Mac since that's where I'm able to gather the most info. The memory pressure on my system will spike, eventually causing it to freeze if I don't kill the bot fast enough.

What I learned

While it was running on my live server I noticed the process was switching between the … Now that I knew it was an allocation issue, I used the wonderful cargo instruments package with the macOS Instruments tool to profile the allocations while running it locally. The resulting trace from my first successful capture (after freezing my computer a few times) revealed the issue was an allocation of 384 bytes being allocated and freed super fast. If you want to see the trace, let me know; I'll have to find a place to upload it, since it's over 4 GB though.

Compiling in debug mode, I was able to get a useful stack trace, and since I knew the allocation size I was able to filter out irrelevant allocations. The problematic ones came from an … Searching through the Songbird code, the only place where a … is:

songbird/src/driver/tasks/ws.rs, line 72 (at 7e47d8b)

which you'll notice happens in a … My hypothesis is that somehow one of those branches is triggering the loop to run with very little to no delay, causing the memory and CPU issue.

Next steps

Hopefully this is enough detail to chase down the problem on your side. If not, then I think the next move would be to add a bunch of logging/tracing to figure out which path in that loop is causing the issue and wait for it to happen again. I don't understand enough of what's going on in that loop to spot a possible problematic path, but it looks super state-full, so logging might be needed anyway.

Final thoughts

…
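That hypothesis can be illustrated with a minimal, self-contained sketch. This is not songbird's code, and a plain tokio channel stands in for the voice websocket, but it shows how a select! branch that resolves instantly on every iteration turns the loop into a busy-wait that maxes a core and churns short-lived allocations:

```rust
use std::time::Duration;
use tokio::{select, sync::mpsc, time::sleep};

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(8);
    drop(tx); // simulate the remote side going away: recv() now resolves instantly

    let mut spins: u64 = 0;
    loop {
        select! {
            // "Heartbeat" branch: effectively never taken, because the other
            // branch always completes first once the channel is dead.
            _ = sleep(Duration::from_secs(5)) => {}
            msg = rx.recv() => {
                match msg {
                    Some(buf) => { let _ = buf; /* normal processing */ }
                    // The bug being illustrated: treating a dead source as
                    // "no message yet" and looping again with no delay or break.
                    None => { spins += 1; }
                }
            }
        }
        if spins > 5_000_000 {
            println!("spun {} times doing no useful work: busy-wait", spins);
            break;
        }
    }
}
```

On a real system the equivalent would be a receive that keeps reporting "nothing to do" (or a non-fatal error) without ever delaying or breaking out of the loop.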
The timings of this issue seem to strongly indicate this. The issue has never happened for me on a weekend, and very rarely in the early morning, which suggests that something the system admins at Discord are doing is causing this to happen.
Thanks very much for looking into this; I still haven't had it happen to me yet, so this extra input is a godsend.
There are two potential call sites:
Ah, you're right, I missed that there were two call sites. Looking back at my traces I see both locations, so either call site is possible. Here is the stack trace for the other one. Unfortunately, since my release-mode traces are useless at identifying the issue, it seems I can't tell you which one is at fault (at least not yet). Given that the UDP TX task is using a compile-time timeout, it seems likely the WS task is to blame.
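For contrast, here is a sketch of why a compile-time timeout bounds the damage. The function, channel, and timeout value are hypothetical stand-ins rather than the actual UDP TX task, but the pattern is the point: a receive wrapped in tokio::time::timeout can wake at most about once per timeout period, while a timeout-less receive that resolves immediately is free to spin:

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::timeout};

// Illustrative timeout; the real task's value will differ.
const UDP_TIMEOUT: Duration = Duration::from_millis(1000);

// Even if the channel is idle or misbehaving, this loop wakes at most about
// once per UDP_TIMEOUT, so it cannot busy-wait the way a timeout-less
// receive can.
async fn timeout_bounded_rx(rx: &mut mpsc::Receiver<Vec<u8>>) {
    loop {
        match timeout(UDP_TIMEOUT, rx.recv()).await {
            Ok(Some(packet)) => { let _ = packet; /* do the send/keepalive work */ }
            Ok(None) => break,    // channel closed: leave the loop cleanly
            Err(_elapsed) => {}   // timed out: periodic housekeeping only
        }
    }
}
```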
Had a chance to look at this some more. I downloaded songbird and added a bunch of logging to the loop in question. Running this morning, I managed to capture a failure. Here's the instrumented loop:

```rust
#[instrument(skip(self))]
async fn run(&mut self, interconnect: &mut Interconnect) {
let mut next_heartbeat = Instant::now() + self.heartbeat_interval;
info!("Starting main ws loop");
loop {
let mut ws_error = false;
let mut should_reconnect = false;
let hb = sleep_until(next_heartbeat);
info!("created heartbeat timeout");
select! {
_ = hb => {
info!("heartbeat timeout reached, sending heartbeat");
ws_error = match self.send_heartbeat().await {
Err(e) => {
error!("heartbeat error: {:?}", e);
should_reconnect = ws_error_is_not_final(e);
info!("should reconnect is {}", should_reconnect);
true
},
_ => {
info!("heartbeat successful");
false
},
};
info!("ws_error is {}", ws_error);
next_heartbeat = self.next_heartbeat();
info!("next_heartbeat is {:?}", next_heartbeat);
info!("done sending heartbeat");
}
ws_msg = self.ws_client.recv_json_no_timeout(), if !self.dont_send => {
info!("in recv_json_no_timeout, dont_send={}", self.dont_send);
ws_error = match ws_msg {
Err(WsError::Json(e)) => {
error!("Unexpected JSON {:?}.", e);
false
},
Err(e) => {
error!("Got web socket error: {:?}", e);
should_reconnect = ws_error_is_not_final(e);
info!("should_reconnect={}", should_reconnect);
true
},
Ok(Some(msg)) => {
info!("got valid ws message, processing");
self.process_ws(interconnect, msg);
false
},
Ok(None) => {
info!("got empty ws message");
false
},
};
info!("ws_error={}", ws_error);
info!("done handling json receive no timeout");
}
inner_msg = self.rx.recv_async() => {
info!("handling inner_msg");
match inner_msg {
Ok(WsMessage::Ws(data)) => {
info!("received data");
self.ws_client = *data;
next_heartbeat = self.next_heartbeat();
info!("next_heartbeat={:?}", next_heartbeat);
self.dont_send = false;
info!("dont_send={}", self.dont_send);
},
Ok(WsMessage::ReplaceInterconnect(i)) => {
info!("replacing interconnect");
*interconnect = i;
},
Ok(WsMessage::SetKeepalive(keepalive)) => {
info!("handling keepalive");
self.heartbeat_interval = Duration::from_secs_f64(keepalive / 1000.0);
info!("heartbeat_interval={:?}", self.heartbeat_interval);
next_heartbeat = self.next_heartbeat();
info!("next_heartbeat={:?}", next_heartbeat);
},
Ok(WsMessage::Speaking(is_speaking)) => {
info!("handling speaking");
if self.speaking.contains(SpeakingState::MICROPHONE) != is_speaking && !self.dont_send {
self.speaking.set(SpeakingState::MICROPHONE, is_speaking);
info!("Changing to {:?}", self.speaking);
let ssu_status = self.ws_client
.send_json(&GatewayEvent::from(Speaking {
delay: Some(0),
speaking: self.speaking,
ssrc: self.ssrc,
user_id: None,
}))
.await;
info!("sent speaking update");
ws_error |= match ssu_status {
Err(e) => {
error!("Error while trying to send speaking state: {:?}", e);
should_reconnect = ws_error_is_not_final(e);
info!("should_reconnect={}", should_reconnect);
true
},
_ => false,
};
info!("ws_error={}", ws_error);
}
},
Err(e) => {
error!("inner messge error: {:?}", e);
}
Ok(WsMessage::Poison) => {
error!("got poison message, breaking");
break;
},
}
info!("done handling inner_msg");
}
}
if ws_error {
self.dont_send = true;
if should_reconnect {
let _ = interconnect.core.send(CoreMessage::Reconnect);
} else {
break;
}
}
}
}
```

The logs change at timestamp … Also, from a general safety perspective, it seems like this loop (and maybe a few others) can potentially be problematic like this. Even if they're bug-free, Discord won't be, and could still cause some undesired behavior. It seems like it would be good to define some generic safety parameters (e.g. the loop must run no more than ~100 times per second, or if there's a vector or queue, check that it isn't getting really large when it shouldn't be) and bail out back to the initial state and reconnect from there. That might help keep these issues from becoming such a big pain.
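As a rough sketch of that guard-rail idea (the names and thresholds here are illustrative, not an existing songbird API), an event loop could count its own iterations per time window and treat a runaway rate as a fault to recover from:

```rust
use std::time::{Duration, Instant};

// Illustrative guard: trips if the loop runs more than `max_per_window`
// iterations inside one `window`.
struct LoopRateGuard {
    window_start: Instant,
    iterations: u32,
    max_per_window: u32,
    window: Duration,
}

impl LoopRateGuard {
    fn new(max_per_window: u32, window: Duration) -> Self {
        Self {
            window_start: Instant::now(),
            iterations: 0,
            max_per_window,
            window,
        }
    }

    // Call once per loop iteration; returns true when the loop is runaway.
    fn tick(&mut self) -> bool {
        self.iterations += 1;
        if self.window_start.elapsed() >= self.window {
            // New window: reset the counter.
            self.window_start = Instant::now();
            self.iterations = 0;
            false
        } else {
            self.iterations > self.max_per_window
        }
    }
}

// Possible placement inside an event loop (pseudo-usage):
//
// let mut guard = LoopRateGuard::new(100, Duration::from_secs(1));
// loop {
//     if guard.tick() {
//         // Treat as a fault: request a reconnect / reset to the initial state.
//         break;
//     }
//     /* select! { ... } */
// }
```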
Thanks very much. I think the fault may lie here:
Would you be able to test out a fix? I have a branch here that applies the above modification; ideally, whatever WS issue is cropping up should actually prompt a reconnect now.
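For context, the eventual fix is described later in the thread as making the timeout-less receive propagate failures. A minimal sketch of that shape, using hypothetical types rather than songbird's actual internals, looks like this:

```rust
// Hypothetical stand-ins; the real change lives in songbird's WS client.
enum RecvError {
    Closed,
    Transport(String),
}

struct WsClient;

impl WsClient {
    // Stand-in for the underlying websocket stream read.
    async fn read_frame(&mut self) -> Option<Result<String, String>> {
        None
    }

    async fn recv_json_no_timeout(&mut self) -> Result<Option<String>, RecvError> {
        match self.read_frame().await {
            Some(Ok(frame)) => Ok(Some(frame)),
            // Previously a failure here could surface as "no message", letting
            // the caller loop again immediately; returning Err propagates it so
            // the error arm can set should_reconnect or break.
            Some(Err(e)) => Err(RecvError::Transport(e)),
            None => Err(RecvError::Closed),
        }
    }
}
```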
Yeah, happy to try it out. I actually added the fix to my branch yesterday and left it running overnight. What I found was it does correctly return an error, but the arm that handles it in my instrumented copy was missing a break:

```rust
Err(e) => {
error!("inner messge error: {:?}", e);
// missing break here
}
Ok(WsMessage::Poison) => {
error!("got poison message, breaking");
break;
},
```

That said, in your code it would have hit the break statement. Trying it again as we speak, and assuming it works for a while, I'll run on your branch without the logging as well, just to be doubly sure I didn't miss something else.
I've rebuilt on that branch and pushed it now. Will see how it goes for the next week or so. Thank you!
...A full week now with no issues; looking very good. Thank you very much to both @jtscuba for the far superior debugging and @FelixMcFelix for sorting out a fix for this and having patience with the little information I could gather. Much appreciated!
Thanks immensely to you both, and my apologies that work's been too hectic for me to seriously investigate. I'll merge it in shortly.
Debugging work put forth by JellyWx and jtscuba suggests that this WS thread receive is ending up in a permanent failure loop. This adapts the timeout-less receive such that receive failures are properly propagated. This was tested using `cargo make ready` and using the `voice_storage` example (although I cannot repro the locking behaviour myself). Comments in serenity-rs#69 suggest this is fixed -- closes serenity-rs#69.
An issue I keep running into.
Information:
next
next
What happens:
After running for a while, the bot goes offline and begins to consume the maximum CPU allotted to it. Using perf top, this consumption can be traced back to a mutex and a number of other calls (one called 'Sleep::new_timeout', from tokio). The memory also spikes a little, but there appears to be no active leaking occurring.
The bot is in around 8,000 guilds now and runs autosharded. It usually goes down once every 12-36 hours. The server it runs on has more than enough memory/CPU available to the bot (average total system load 0.5-2.0 out of 12.0) and only runs other Discord bots and a couple of websites.
The bot having this issue is at https://github.com/jellywx/soundfx-rs