Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suspected busy-wait on a Mutex blocking bot #69

Closed
JellyWX opened this issue Apr 29, 2021 · 25 comments · Fixed by #78
Closed

Suspected busy-wait on a Mutex blocking bot #69

JellyWX opened this issue Apr 29, 2021 · 25 comments · Fixed by #78
Labels
bug Something isn't working

Comments

@JellyWX
Copy link
Contributor

JellyWX commented Apr 29, 2021

An issue I keep running into.

Information:

  • serenity 0.10.5 next
  • songbird 0.1.6 next
  • tokio 1.5.0
  • Rustc 1.51.0
  • glibc 2.31
  • Ubuntu 20.04 Linux 5.4.0-70-generic

What happens:

After running for a while, the bot goes offline and begins to consume the maximum CPU allotted to it. Using perf top, this consumption can be fed back to a mutex, and a number of other calls (one called 'Sleep::new_timeout' from tokio):

a
b

The memory also spikes a little bit, but there appears to be no active leaking occuring.

The bot is in around 8'000 guilds now, and runs autosharded. It usually goes down once every 12-36 hours. The server it is running on has more than enough memory/CPU available to the bot (average total system load 0.5-2.0 out of 12.0) and only runs other Discord bots and a couple of websites

My bot that is having this is at https://github.com/jellywx/soundfx-rs

@FelixMcFelix FelixMcFelix added the bug Something isn't working label Apr 29, 2021
@JellyWX
Copy link
Contributor Author

JellyWX commented Apr 29, 2021

Log from it going down a couple of minutes ago:

Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::client::bridge::gateway::shard_runner] handle_voice_event; event=VoiceStateUpdate(VoiceStateUpdateEvent { guild_id: Some(GuildId(829618232096587796)), voice_state: VoiceState { channel_id: Some(ChannelId(829618232306171956)), deaf: false, guild_id: Some(GuildId(829618232096587796)), member: Some(Member { deaf: false, guild_id: GuildId(829618232096587796), joined_at: Some(2021-04-08T09:49:09.745312Z), mute: false, nick: Some("Padre Buon Signore"), roles: [RoleId(829657588404584518), RoleId(829668076835307542)], user: User { id: UserId(766392003381690369), avatar: None, bot: false, discriminator: 2821, name: "Simone", public_flags: None }, pending: false, premium_since: None }), mute: false, self_deaf: false, self_mute: true, self_stream: None, self_video: false, session_id: "8be6a7d229377b203f71b3fe09b509d8", suppress: false, user_id: UserId(766392003381690369), request_to_speak_timestamp: None } })
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::client::dispatch] handle_event; runner_tx=UnboundedSender(Some(UnboundedSenderInner { inner: UnboundedInner { state: 9223372036854775808, message_queue: Queue { head: 0x7f68f7e9e460, tail: UnsafeCell }, num_senders: 3, recv_task: AtomicWaker } })) shard_id=5
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::gateway::shard] handle_event; event=Err(Tungstenite(Protocol(ResetWithoutClosingHandshake)))
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z WARN  serenity::gateway::shard] [Shard [6, 8]] Websocket error: Protocol(ResetWithoutClosingHandshake)
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::gateway::shard] [Shard [6, 8]] Will attempt to auto-reconnect
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::internal::ws_impl] create_rustls_client; url=Url { scheme: "wss", username: "", password: None, host: Some(Domain("gateway.discord.gg")), port: None, path: "/", query: Some("v=8"), fragment: None }
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::gateway::shard] handle_event; event=Ok(Dispatch(8811, MessageCreate(MessageCreateEvent { message: Message { id: MessageId(837304999943733268), attachments: [], author: User { id: UserId(424410634919084032), avatar: Some("7c7bc69d13d92a3097a8ffee5c88d2ea"), bot: false, discriminator: 2453, name: "Adokus", public_flags: Some((empty)) }, channel_id: ChannelId(705957876232880174), content: "https://youtu.be/neqa3ih2KHY", edited_timestamp: None, embeds: [Embed { author: Some(EmbedAuthor { icon_url: None, name: "\u{64b}", proxy_icon_url: None, url: Some("https://www.youtube.com/channel/UCw7CjJ7lFfpEC5FKmIS1ySw") }), colour: Colour(16711680), description: Some("\u{200f}\u{200f}\u{200e}"), fields: [], footer: None, image: None, kind: "video", provider: Some(EmbedProvider { name: "YouTube", url: Some("https://www.youtube.com") }), thumbnail: Some(EmbedThumbnail { height: 720, proxy_url: "https://images-ext-1.discordapp.net/external/QMI7MZcMl1jzCRCl7wdB2ANZjm4cBhT_-Pf9fsuPbfo/https/i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", url: "https://i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", width: 1280 }), timestamp: None, title: Some("\u{200f}\u{200f}\u{200e}"), url: Some("https://www.youtube.com/watch?v=neqa3ih2KHY"), video: Some(EmbedVideo { height: 720, url: "https://www.youtube.com/embed/neqa3ih2KHY", width: 1280 }) }], guild_id: Some(GuildId(705957876232880169)), kind: Regular, member: Some(PartialMember { deaf: false, joined_at: Some(2020-05-02T01:45:50.897Z), mute: false, nick: Some("The Sad Kid"), roles: [RoleId(707072429935296564)], pending: false, premium_since: None }), mention_everyone: false, mention_roles: [], mention_channels: [], mentions: [], nonce: String("837304994687352832"), pinned: false, reactions: [], timestamp: 2021-04-29T12:31:08.361Z, tts: false, webhook_id: None, activity: None, application: None, message_reference: None, flags: Some((empty)), stickers: [], referenced_message: None } })))
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::gateway::shard] handle_gateway_dispatch; seq=8811 event=MessageCreate(MessageCreateEvent { message: Message { id: MessageId(837304999943733268), attachments: [], author: User { id: UserId(424410634919084032), avatar: Some("7c7bc69d13d92a3097a8ffee5c88d2ea"), bot: false, discriminator: 2453, name: "Adokus", public_flags: Some((empty)) }, channel_id: ChannelId(705957876232880174), content: "https://youtu.be/neqa3ih2KHY", edited_timestamp: None, embeds: [Embed { author: Some(EmbedAuthor { icon_url: None, name: "\u{64b}", proxy_icon_url: None, url: Some("https://www.youtube.com/channel/UCw7CjJ7lFfpEC5FKmIS1ySw") }), colour: Colour(16711680), description: Some("\u{200f}\u{200f}\u{200e}"), fields: [], footer: None, image: None, kind: "video", provider: Some(EmbedProvider { name: "YouTube", url: Some("https://www.youtube.com") }), thumbnail: Some(EmbedThumbnail { height: 720, proxy_url: "https://images-ext-1.discordapp.net/external/QMI7MZcMl1jzCRCl7wdB2ANZjm4cBhT_-Pf9fsuPbfo/https/i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", url: "https://i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", width: 1280 }), timestamp: None, title: Some("\u{200f}\u{200f}\u{200e}"), url: Some("https://www.youtube.com/watch?v=neqa3ih2KHY"), video: Some(EmbedVideo { height: 720, url: "https://www.youtube.com/embed/neqa3ih2KHY", width: 1280 }) }], guild_id: Some(GuildId(705957876232880169)), kind: Regular, member: Some(PartialMember { deaf: false, joined_at: Some(2020-05-02T01:45:50.897Z), mute: false, nick: Some("The Sad Kid"), roles: [RoleId(707072429935296564)], pending: false, premium_since: None }), mention_everyone: false, mention_roles: [], mention_channels: [], mentions: [], nonce: String("837304994687352832"), pinned: false, reactions: [], timestamp: 2021-04-29T12:31:08.361Z, tts: false, webhook_id: None, activity: None, application: None, message_reference: None, flags: Some((empty)), stickers: [], referenced_message: None } })
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO  serenity::client::bridge::gateway::shard_runner] handle_voice_event; event=MessageCreate(MessageCreateEvent { message: Message { id: MessageId(837304999943733268), attachments: [], author: User { id: UserId(424410634919084032), avatar: Some("7c7bc69d13d92a3097a8ffee5c88d2ea"), bot: false, discriminator: 2453, name: "Adokus", public_flags: Some((empty)) }, channel_id: ChannelId(705957876232880174), content: "https://youtu.be/neqa3ih2KHY", edited_timestamp: None, embeds: [Embed { author: Some(EmbedAuthor { icon_url: None, name: "\u{64b}", proxy_icon_url: None, url: Some("https://www.youtube.com/channel/UCw7CjJ7lFfpEC5FKmIS1ySw") }), colour: Colour(16711680), description: Some("\u{200f}\u{200f}\u{200e}"), fields: [], footer: None, image: None, kind: "video", provider: Some(EmbedProvider { name: "YouTube", url: Some("https://www.youtube.com") }), thumbnail: Some(EmbedThumbnail { height: 720, proxy_url: "https://images-ext-1.discordapp.net/external/QMI7MZcMl1jzCRCl7wdB2ANZjm4cBhT_-Pf9fsuPbfo/https/i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", url: "https://i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", width: 1280 }), timestamp: None, title: Some("\u{200f}\u{200f}\u{200e}"), url: Some("https://www.youtube.com/watch?v=neqa3ih2KHY"), video: Some(EmbedVideo { height: 720, url: "https://www.youtube.com/embed/neqa3ih2KHY", width: 1280 }) }], guild_id: Some(GuildId(705957876232880169)), kind: Regular, member: Some(PartialMember { deaf: false, joined_at: Some(2020-05-02T01:45:50.897Z), mute: false, nick: Some("The Sad Kid"), roles: [RoleId(707072429935296564)], pending: false, premium_since: None }), mention_everyone: false, mention_roles: [], mention_channels: [], mentions: [], nonce: String("837304994687352832"), pinned: false, reactions: [], timestamp: 2021-04-29T12:31:08.361Z, tts: false, webhook_id: None, activity: None, application: None, message_reference: None, flags: Some((empty)), stickers: [], referenced_message: None } })

perf top:
2021-04-29_13-48

@FelixMcFelix
Copy link
Member

I'm still none too sure as to the cause, but I have a gut feeling it's due to ShardHandle (de)registration in songbird::shards::SerenityShardHandle. From your trace there, it looks as though Shard 6's reconnect triggers it, but the surrounding events should be hitting shard 5...

@JellyWX
Copy link
Contributor Author

JellyWX commented May 11, 2021

I'm wondering if it's related to the OOM issue I had the other day on a non-songbird serenity bot, and something to do with songbird is just exacerbating it (i.e that songbird has a higher CPU footprint).

Either way, the issue seems to come and go a lot. Last week it was down almost daily, but I'm yet to run into an issue this week and the issue never seems to happen on the weekends. Possibly it's something to do with when Discord restarts their gateways?

FelixMcFelix added a commit to FelixMcFelix/songbird that referenced this issue May 16, 2021
Adds some additional logging around some critical sections, rarely hit (i.e., during shard reconnections) in pursuit of issue serenity-rs#69. It's strongly suspected to lie here, at any rate...
FelixMcFelix added a commit to FelixMcFelix/songbird that referenced this issue May 16, 2021
Adds some additional logging around some critical sections, rarely hit (i.e., during shard reconnections) in pursuit of issue serenity-rs#69. It's strongly suspected to lie here, at any rate...
@FelixMcFelix
Copy link
Member

I shoved some extra logging around the suspected issue spot in both v0.2.0-beta.1 (crates.io) and next. I haven't seen any issues on my side, I'll try this and force explicit sharding to see if I can repro.

@JellyWX
Copy link
Contributor Author

JellyWX commented May 17, 2021

Alright, thank you. I'll try running with that and see if I get any useful output

@JellyWX
Copy link
Contributor Author

JellyWX commented May 20, 2021

Died again; logging on songbird has the following:
dead.log

@FelixMcFelix
Copy link
Member

Thanks; did you filter these logs to only have songbird entries? Some of what is relevant might have ended up in serenity:

May 21 09:49:49.771 DEBUG run:recv_event:handle_voice_event{event=Ready(ReadyEvent { ready: Ready { application: PartialCurrentApplicationInfo { id: UserId(458276777832480769), flags: 262144 }, guilds: [Offline(GuildUnavailable { id: GuildId(192582352067624960), unavailable: true })], presences: {}, private_channels: {}, session_id: "4c3b49f3d950cc8d0e1de623779a2fba", shard: Some([0, 1]), trace: ["[\"gateway-prd-main-fw5s\",{\"micros\":37058,\"calls\":[\"discord-sessions-green-prd-2-100\",{\"micros\":33584,\"calls\":[\"start_session\",{\"micros\":29090,\"calls\":[\"api-prd-main-xngt\",{\"micros\":26241,\"calls\":[\"get_user\",{\"micros\":2332},\"add_authorized_ip\",{\"micros\":1230},\"get_guilds\",{\"micros\":4172},\"coros_wait\",{\"micros\":0}]}]},\"guilds_connect\",{\"micros\":2,\"calls\":[]},\"presence_connect\",{\"micros\":3952,\"calls\":[]}]}]}]"], user: CurrentUser { id: UserId(458276777832480769), avatar: None, bot: true, discriminator: 7265, email: None, mfa_enabled: false, name: "Test-small", verified: Some(true), public_flags: None }, version: 8 } })}: songbird::shards: Cleared queued messages.
May 21 09:49:49.772 DEBUG run:recv_event:handle_voice_event{event=Ready(ReadyEvent { ready: Ready { application: PartialCurrentApplicationInfo { id: UserId(458276777832480769), flags: 262144 }, guilds: [Offline(GuildUnavailable { id: GuildId(192582352067624960), unavailable: true })], presences: {}, private_channels: {}, session_id: "4c3b49f3d950cc8d0e1de623779a2fba", shard: Some([0, 1]), trace: ["[\"gateway-prd-main-fw5s\",{\"micros\":37058,\"calls\":[\"discord-sessions-green-prd-2-100\",{\"micros\":33584,\"calls\":[\"start_session\",{\"micros\":29090,\"calls\":[\"api-prd-main-xngt\",{\"micros\":26241,\"calls\":[\"get_user\",{\"micros\":2332},\"add_authorized_ip\",{\"micros\":1230},\"get_guilds\",{\"micros\":4172},\"coros_wait\",{\"micros\":0}]}]},\"guilds_connect\",{\"micros\":2,\"calls\":[]},\"presence_connect\",{\"micros\":3952,\"calls\":[]}]}]}]"], user: CurrentUser { id: UserId(458276777832480769), avatar: None, bot: true, discriminator: 7265, email: None, mfa_enabled: false, name: "Test-small", verified: Some(true), public_flags: None }, version: 8 } })}: songbird::manager: Registered shard handle 0.

@JellyWX
Copy link
Contributor Author

JellyWX commented May 22, 2021

Sorry, the serenity logging was off at the time since my journal was being annoying but I've got it running with it now so will just wait for it to die again

@FelixMcFelix
Copy link
Member

Apologies for that; hopefully those logs will be a bit more revealing. It looks as though there were a few channel update events near the end of the logs you've submitted which might be the root cause.

@JellyWX
Copy link
Contributor Author

JellyWX commented May 23, 2021 via email

@JellyWX
Copy link
Contributor Author

JellyWX commented May 24, 2021

dead.log
Here's 1000 lines of dead log after it went down, with serenity on debug mode and songbird too

@FelixMcFelix
Copy link
Member

I can't find any mention of shard handle (de/re-)registration, which I thought were the most likely causes (i.e., in and around complete shard reconnect events). There are a few resumes, but nothing weird there (and nothing indicating a shard death due to that).

What's puzzling me is that the last events are all MessageCreates, which won't ever call anything to do with songbird (tracing pings them up, but they obviously fail to match Ready/VoiceStateUpdate/VoiceServerUpdate). There are some voice events nearby, but they seem to match other users/bots, and almost all shards are fine in those last few seconds. Maybe I'll spot something tomorrow. Might be something after serenity's shard_runner.rs:119, but before the next iteration begins.

@JellyWX
Copy link
Contributor Author

JellyWX commented May 25, 2021

dead.log
Another one here. Interestingly, another bot panicked at the exact same time with a BrokenPipe error (a bot running serenity 0.10.5), and a service that interacts with discord hit a 'Connection reset by peer'.
2021-05-25_22-16

@jtscuba
Copy link

jtscuba commented May 30, 2021

I've also been hitting this issue and might be able to shed some light on it.

Information

  • serenity 0.10.5 next
  • songbird 0.2.0-beta.2 next
  • tokio 0.2.25
  • Rustc 1.52.1
  • MacOS and Ubuntu

What Happens

Varies a little by environment but I'll focus on my Mac since that's where I'm able to gather the most info. The memory pressure on my system will spike, eventually causing it to freeze if I don't kill the bot fast enough.

What I learned

While it was running on my live server I noticed the process was switching between the D and R states was able to use strace to figure out what was going on. In the 1.5 seconds I traced it for, it allocated over 600MB of memory, without exceeding the memory limit of 180 MB. Then it finally did hit that limit, at which point the OOM killer restarted the process.

Now that I knew it was an allocation issue I used the the wonderful cargo instruments package with the MacOS instruments tool to profile the allocations while running it locally. The resulting trace from my first successful capture (after freezing my computer a few times) revealed the issue was an allocation of 384 bytes being allocated and freed super fast. If you want to see the trace, let me know, I'll have to find a place to upload it since it's over 4GB though

Compiling in debug mode, I was able to get a useful stack trace and since I knew the allocation size I was able to filter out irrelevant allocations. The problematic ones came from an __rdl_alloc primitive in called from somewhere in tokio and I was able to figure out the only place that many bytes is allocated on a regular basis with the __rd_alloc primitive is here.

Screen Shot 2021-05-30 at 12 52 39 PM

Searching through the Songbird code, the only place where a delay_until call is made (sleep_until tokio 1.x) is here

let hb = sleep_until(next_heartbeat);

Which you'll notice happens in a loop.

My hypothesis is that somehow one of those branches is triggering the loop to run with very little to no delay and causing the memory and CPU issue.

Next steps

Hopefully this is enough detail to chase down the problem on your side. If not then I think the next move would be to add a bunch of logging/tracing to figure out which path in that loop is causing the issue and wait for it to happen again. I don't understand enough of what's going on in that loop to spot a possible problematic path but it looks super state-full so logging might be needed anyway.

Final thoughts

  1. Reproducing seems to depend on something kinda rare happening on the discord side, which means I've had times where it failed in a few minutes and times where it ran with no issues in several days
  2. Running in release mode seems to make the issue worse, or at least I never ran into it until I ran locally in release mode.
  3. @FelixMcFelix thanks for all your hard work maintaining the serenity/songbird libraries. I know how much of a pain it can be maintaining this stuff and appreciate your efforts.

@JellyWX
Copy link
Contributor Author

JellyWX commented May 30, 2021

Reproducing seems to depend on something kinda rare happening on the discord side, which means I've had times where it failed in a few minutes and times where it ran with no issues in several days

The timings of this issue seems to strongly indicate this. The issue has never happened for me on a weekend, and very rarely in the early morning, which seems to indicate something the system admins are doing at Discord is causing this to happen

@FelixMcFelix
Copy link
Member

FelixMcFelix commented May 30, 2021

Thanks very much for looking into this; I still haven't had it happen to me yet, so this extra input is a godsend.

@FelixMcFelix
Copy link
Member

There are two potential delay_until sites:

  1. In the WS task as you suggest. It could be a reconnect setting the heartbeat value to something insane? This also opens up the possibility that something subtle could be going wrong in the select statement.
  2. The UDP TX task as your logs show (which uses the timeout future). The only issue is that the heartbeat timing there is a sane compile-time constant.

@jtscuba
Copy link

jtscuba commented May 31, 2021

Ah you're right I missed that there were two call sites. Looking back at my traces I see both locations so either call site is possible. Here is the stack trace for the other one. Unfortunately since my release mode traces are useless at identifying the issue it seems that I can't tell you which one is at fault (at least not yet).

Screen Shot 2021-05-30 at 7 55 08 PM

Given the udp TX task is using a compile time timeout it seems likely the ws task is to blame.

@jtscuba
Copy link

jtscuba commented Jun 2, 2021

Had a chance to look at this some more. I downloaded songbird and added a bunch of logging to the loop in question. Running this morning I managed to capture a 139411450 line, (14GB) log file from the run. Here's the relevant function with the extra logging. I also clipped the relevant section of the logs.
short.txt.zip

#[instrument(skip(self))]
    async fn run(&mut self, interconnect: &mut Interconnect) {
        let mut next_heartbeat = Instant::now() + self.heartbeat_interval;

        info!("Starting main ws loop");
        loop {
            let mut ws_error = false;
            let mut should_reconnect = false;

            let hb = sleep_until(next_heartbeat);
            info!("created heartbeat timeout");

            select! {
                _ = hb => {
                    info!("heartbeat timeout reached, sending heartbeat");
                    ws_error = match self.send_heartbeat().await {
                        Err(e) => {
                            error!("heartbeat error: {:?}", e);
                            should_reconnect = ws_error_is_not_final(e);
                            info!("should reconnect is {}", should_reconnect);
                            true
                        },
                        _ => {
                            info!("heartbeat successful");
                            false
                        },
                    };
                    info!("ws_error is {}", ws_error);
                    next_heartbeat = self.next_heartbeat();
                    info!("next_heartbeat is {:?}", next_heartbeat);
                    info!("done sending heartbeat");
                }
                ws_msg = self.ws_client.recv_json_no_timeout(), if !self.dont_send => {
                    info!("in recv_json_no_timeout, dont_send={}", self.dont_send);

                    ws_error = match ws_msg {
                        Err(WsError::Json(e)) => {
                            error!("Unexpected JSON {:?}.", e);
                            false
                        },
                        Err(e) => {
                            error!("Got web socket error: {:?}", e);
                            should_reconnect = ws_error_is_not_final(e);
                            info!("should_reconnect={}", should_reconnect);
                            true
                        },
                        Ok(Some(msg)) => {
                            info!("got valid ws message, processing");
                            self.process_ws(interconnect, msg);
                            false
                        },
                        Ok(None) => {
                            info!("got empty ws message");
                            false
                        },
                    };
                    info!("ws_error={}", ws_error);
                    info!("done handling json receive no timeout");
                }
                inner_msg = self.rx.recv_async() => {
                    info!("handling inner_msg");
                    match inner_msg {
                        Ok(WsMessage::Ws(data)) => {
                            info!("received data");
                            self.ws_client = *data;
                            next_heartbeat = self.next_heartbeat();
                            info!("next_heartbeat={:?}", next_heartbeat);
                            self.dont_send = false;
                            info!("dont_send={}", self.dont_send);
                        },
                        Ok(WsMessage::ReplaceInterconnect(i)) => {
                            info!("replacing interconnect");
                            *interconnect = i;
                        },
                        Ok(WsMessage::SetKeepalive(keepalive)) => {
                            info!("handling keepalive");
                            self.heartbeat_interval = Duration::from_secs_f64(keepalive / 1000.0);
                            info!("heartbeat_interval={:?}", self.heartbeat_interval);
                            next_heartbeat = self.next_heartbeat();
                            info!("next_heartbeat={:?}", next_heartbeat);
                        },
                        Ok(WsMessage::Speaking(is_speaking)) => {
                            info!("handling speaking");
                            if self.speaking.contains(SpeakingState::MICROPHONE) != is_speaking && !self.dont_send {
                                self.speaking.set(SpeakingState::MICROPHONE, is_speaking);
                                info!("Changing to {:?}", self.speaking);

                                let ssu_status = self.ws_client
                                    .send_json(&GatewayEvent::from(Speaking {
                                        delay: Some(0),
                                        speaking: self.speaking,
                                        ssrc: self.ssrc,
                                        user_id: None,
                                    }))
                                    .await;
                                info!("sent speaking update");

                                ws_error |= match ssu_status {
                                    Err(e) => {
                                        error!("Error while trying to send speaking state: {:?}", e);
                                        should_reconnect = ws_error_is_not_final(e);
                                        info!("should_reconnect={}", should_reconnect);
                                        true
                                    },
                                    _ => false,
                                };
                                info!("ws_error={}", ws_error);
                            }
                        },
                        Err(e) => {
                            error!("inner messge error: {:?}", e);
                        }
                        Ok(WsMessage::Poison) => {
                            error!("got poison message, breaking");
                            break;
                        },
                    }
                    info!("done handling inner_msg");
                }
            }

            if ws_error {
                self.dont_send = true;

                if should_reconnect {
                    let _ = interconnect.core.send(CoreMessage::Reconnect);
                } else {
                    break;
                }
            }
        }
    }

The logs change at timestamp2021-06-01T10:59:05 and based on what I can tell recv_json_no_timeout repeatedly returns Ok(None) with no delay. I don't know what that case signifies or what the right way of handling it would be but hopefully that helps narrow down the issue. Is there a way to trigger a reset of the voice connections/songbird state and restart like it's a fresh start or something? Maybe that would fix it?

Also from a general safety perspective it seems like this loop and maybe a few others can potentially be problematic like this. Even if they're bug free discord won't be and could still cause some undesired behavior. It seems like it would be good to define some generic safety parameters (must run no more than 100 times per second or something or if there's vector or queue check that it isn't getting really large when it shouldn't be, etc) and bail out back to the initial state and reconnect from there. Might help keep these issues from becoming such a big pain.

@FelixMcFelix
Copy link
Member

Thanks very much. I think the fault may lie here:
https://github.com/serenity-rs/songbird/blob/current/src/ws.rs#L103-L105

self.try_next().await.ok() is, I think, silently converting WS receive failures into None. self.try_next().await? is likely the correct substitution here. This is an educated guess, and seems to be the most likely explanation given all your debugging work.

@FelixMcFelix
Copy link
Member

Would you be able to test out a test fix? I have a branch here that applies the above modification---ideally, whatever WS issue is cropping up should actually prompt a reconnect now.

@jtscuba
Copy link

jtscuba commented Jun 3, 2021

Yeah happy to try it out. I actually added the fix to my branch yesterday and left it running overnight. What I found was it does correctly return a Disconnected error. Though due to a missed break statement when I spit this branch into two cases it ran away in an infinite loop again.

Err(e) => {
    error!("inner messge error: {:?}", e);
    // missing break here
}
Ok(WsMessage::Poison) => {
    error!("got poison message, breaking");
    break;
},

That said in your code it would have hit the break statement. Trying it again as we speak and assuming it works for a while I'll run on your branch without the logging as well. Just to be doubly sure I didn't miss something else.

@JellyWX
Copy link
Contributor Author

JellyWX commented Jun 3, 2021

I've rebuilt on that branch and pushed it now. Will see how it goes for the next week or so. Thank you!

@JellyWX
Copy link
Contributor Author

JellyWX commented Jun 11, 2021

...A full week now with no issues, looking very good. Thank you very much both @jtscuba for the far superior debugging and @FelixMcFelix for sorting out a fix for this and having patience with the little information I could gather. Much appreciated!

@FelixMcFelix
Copy link
Member

Thanks immensely to you both, and my apologies that work's been too hectic for me to seriously investigate. I'll merge it in shortly.

FelixMcFelix added a commit to FelixMcFelix/songbird that referenced this issue Jun 14, 2021
Debugging work put forth by JellyWx and jtscuba suggests that this WS thread receive is ending up in a permanent failure loop. This adapts the timeout-less receive such that receive failures are properly propagated.

This was tested using `cargo make ready` and using the `voice_storage` example (although I cannot repro the locking behaviour myself).

Comments in serenity-rs#69 suggest this is fixed -- closes serenity-rs#69.
FelixMcFelix added a commit that referenced this issue Jun 14, 2021
Debugging work put forth by JellyWx and jtscuba suggests that this WS thread receive is ending up in a permanent failure loop. This adapts the timeout-less receive such that receive failures are properly propagated.

This was tested using `cargo make ready` and using the `voice_storage` example (although I cannot repro the locking behaviour myself).

Comments in #69 suggest this is fixed -- closes #69.
FelixMcFelix added a commit to FelixMcFelix/songbird that referenced this issue Jun 14, 2021
Adds some additional logging around some critical sections, rarely hit (i.e., during shard reconnections) in pursuit of issue serenity-rs#69. It's strongly suspected to lie here, at any rate...
FelixMcFelix added a commit to FelixMcFelix/songbird that referenced this issue Jul 1, 2021
Adds some additional logging around some critical sections, rarely hit (i.e., during shard reconnections) in pursuit of issue serenity-rs#69. It's strongly suspected to lie here, at any rate...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants