Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle dropped gamestate after UDP download #286

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Chomenor
Copy link
Contributor

@Chomenor Chomenor commented Jun 1, 2024

Currently, when a UDP download completes, there is a possibility that the gamestate message from the server is dropped due to network packet loss. Normally dropped gamestates are detected and resent by checking if client messages contain an old serverid value, but after downloads the client can actually have the correct serverid (if the map didn't change). As a result, the gamestate is never resent and the client can hang with "awaiting gamestate" message.

This fix adds an extra check for dropped gamestate for clients that just finished a UDP download, based on detecting if the client is sending messages with no move commands. Since this check is not quite as standard as the serverid method, I added some extra checks to minimize any chance of false positives in non-standard clients. This fix has no effect on non-UDP downloading clients.

@Chomenor Chomenor force-pushed the download_dropped_gamestate_fix branch from a875f25 to b318df2 Compare June 1, 2024 05:25
@ec-
Copy link
Owner

ec- commented Jun 7, 2024

I'm reworked retransmission and bring back serverId update, please check f8e7b42

@Chomenor
Copy link
Contributor Author

Chomenor commented Jun 7, 2024

One issue I can see is that this forces a gamestate retransmit if a map restart occurs while the client is loading the map. For example, on a server with a short warmup, this may cause slow loading clients to be sent for a second map load after the warmup completes.

May I ask what is the compatibility reason for bringing back the serverid change during map restart?

@ec-
Copy link
Owner

ec- commented Jun 8, 2024

May I ask what is the compatibility reason for bringing back the serverid change during map restart?

It is needed to discard outdated commands for CS_ACTIVE clients

One issue I can see is that this forces a gamestate retransmit if a map restart occurs while the client is loading the map

This is expected and intentional: client must acknowledge gamestate, new map / map restart == new gamestate

For example, on a server with a short warmup, this may cause slow loading clients to be sent for a second map load after the warmup completes

Is this a real issue? Because there is no map restart from server-side at warmup end, it just sends map_restart command to all clients

@Chomenor
Copy link
Contributor Author

Chomenor commented Jun 8, 2024

It is needed to discard outdated commands for CS_ACTIVE clients

I thought we had already addressed this with client->lastUsercmd.serverTime = sv.time - 1 to prevent pre-restart move commands being executed. Can you be more specific about what was wrong with that approach?

Is this a real issue? Because there is no map restart from server-side at warmup end, it just sends map_restart command to all clients

That behavior appears to be specific to baseq3a. I assume it is an issue to some degree for most other mods.

@ec-
Copy link
Owner

ec- commented Jun 8, 2024

I thought we had already addressed this with client->lastUsercmd.serverTime = sv.time - 1 to prevent pre-restart move commands being executed. Can you be more specific about what was wrong with that approach?

Move commands only, there's also client commands clc_clientCommand so for example you might execute outdated kill command when you shouldn't, or execute SV_ClientEnterWorld() relying on old serverId/wrong gamestate

That behavior appears to be specific to baseq3a. I assume it is an issue to some degree for most other mods.

That's correct, although that extra re-load is very unlikely and not a big deal comparing to maintaining overall coherency

@Chomenor
Copy link
Contributor Author

Chomenor commented Jun 8, 2024

you might execute outdated kill command when you shouldn't

It's a reliable command, so if you happened to (coincidentally) call it right before a map restart, won't the client just keep resending and it will still be executed post-restart anyway?

execute SV_ClientEnterWorld() relying on old serverId/wrong gamestate

Commands from wrong gamestate were already dropped, I don't understand how it relates to map restart.

@Chomenor Chomenor force-pushed the download_dropped_gamestate_fix branch from b318df2 to 42725fa Compare June 18, 2024 05:23
@Chomenor
Copy link
Contributor Author

The current q3e version does seem to have fixed the issue with dropped gamestates after downloads, but it does it in a way that may not be ideal for client compatibility. The server currently expects the client to acknowledge the exact message number of the gamestate, or a gamestate retransmit is forced. Normally the client does sent the correct acknowledge, due to these lines;

Quake3e/code/client/cl_main.c

Lines 2056 to 2058 in 9fd02e4

CL_WritePacket();
CL_WritePacket();
CL_WritePacket();

However, if you comment out these lines, the client will go into an infinite loop receiving gamestates when trying to join a server running the current (post f8e7b42) version of q3e. If you add NET_Sleep(0); ahead of these lines it also causes a loop.

I'm concerned this is adding an unnecessary dependency on the client, where potentially subtle changes can break compatibility with newer q3e versions (but still work on every other server). Sometimes players have old versions of unusual clients and it's hard to account for the behavior of every possible client.

I've updated this PR to restore a more traditional serverid-based gamestate check for non-downloading clients, with a separate message acknowledge-based check for downloading clients (with guardrails to avoid gamestate loops).

@ec-
Copy link
Owner

ec- commented Jun 19, 2024

  • At User_Move() you might have serverId = sv.serverId but incorrect messageAcknowledge which will place client into game so exact gamestate acknowledge is important
  • Issue with multiple CL_WritePacket() has been fixed there 8567b04
  • Issue with drifting chan->outgoingSequence which complicates gamestate acknowledge has beed fixed there 909ac2b

@Chomenor
Copy link
Contributor Author

I'm just looking at the first point to make sure we're on the same page, ignoring the no-snapshot or client changes for now. Are you saying the current version of this PR has a problem with incorrectly placing clients into the game?

As far as I know correct serverID + clc_move should always mean client is ready to be placed in the game. It's not apparent to me there is any need or even value in additional messageAcknowledge based checks before spawning the client. Other engines (vanilla, ioq3, etc.) have no such check either and I'm not aware of any issues linked to it.

Could you provide steps for how this "serverId = sv.serverId but incorrect messageAcknowledge" scenario would unfold? It's possible I missed something but at the moment I don't understand how this would happen.

@ec-
Copy link
Owner

ec- commented Jun 24, 2024

It is not correct to allow client to join by simple serverId == sv.serverId check as it may be a reply to any previous gamestate client received - so exact serverId == sv.serverId && cl->messageAcknowledge == cl->gamestateMessageNum check is mandatory

@Chomenor
Copy link
Contributor Author

Chomenor commented Jun 24, 2024

My understanding of the connection process is like this:

  1. Client connects or map changes
  2. Client is set to CS_PRIMED. Gamestate is sent, and resent 1 at a time if previous one is confirmed dropped.
  3. Correct serverId + clc_move confirms gamestate for current map has been loaded. Client is set to CS_ACTIVE. Pending configstring updates are sent.

I still don't understand the problem. The serverId does not allow just any gamestate, it has to be one from the current map. The way the connection process works, it should generally always be the most recent one sent, but even if an older version did somehow get loaded it would likely still be fine because of the way SV_UpdateConfigstrings takes care of modified configstrings.

@ec-
Copy link
Owner

ec- commented Jun 24, 2024

  • client could receive incorrect configstrings because csUpdated[] now is reset after sending (to avoid potential overflows)
  • dedicated gamestateAcked variable is needed to skip sending snapshots to clients

@Chomenor
Copy link
Contributor Author

Chomenor commented Jun 24, 2024

client could receive incorrect configstrings because csUpdated[] now is reset after sending (to avoid potential overflows)

Seems questionable this is very useful. For an overflow to happen because of extra csUpdated values accumulated prior to a gamestate resend, but not due to the (likely much longer) map loading time, seems very unlikely...

dedicated gamestateAcked variable is needed to skip sending snapshots to clients

Why is it needed to skip sending snapshots to clients? Why is dedicated variable needed instead of just skipping for non CS_ACTIVE?

@ec-
Copy link
Owner

ec- commented Jun 24, 2024

Seems questionable this is very useful. For an overflow to happen because of extra csUpdated values accumulated prior to a gamestate resend, but not due to the (likely much longer) map loading time, seems very unlikely...

  • Gamestate1:
    CS 1 "0"
    ...
    CS 1 "1" - csUpdated[1] = qtrue
    ...
    CS X "1" - csUpdated[X] = qtrue -> potential overflow till client confirms Gamestate#1 so all csUpdated[X] must be reset after sending Gamestate1

  • Gamestate2:
    CS 1 "1"

  • Client received Gamestate1, achnowledge to server is lost, server sends Gamestate2, if server accepts Gamestate1 acknowledge then client will have incorrect CS 1 "0"

Why is it needed to skip sending snapshots to clients?

To avoid drifting netchan.outgoingSequence

Why is dedicated variable needed instead of just skipping for non CS_ACTIVE?

It might be not needed but potential side-effects needs to be analyzed

@Chomenor
Copy link
Contributor Author

Chomenor commented Jun 24, 2024

potential overflow till client confirms Gamestate#1 so all csUpdated[X] must be reset after sending Gamestate1

Could you explain what you mean by the potential overflow? I know there is always a remote chance of an overflow when using csUpdated, since it causes commands to be sent when the client goes to CS_ACTIVE which technically could cause a command count or snapshot size overflow. Clearing csUpdated values between gamestate retransmits could help reduce the amount of commands in heavy packet loss (repeated gamestate retransmit) situations, but I suspect the chance of an overflow is so rare to begin with that it doesn't make more than a marginal difference.

Client received Gamestate1, achnowledge to server is lost, server sends Gamestate2, if server accepts Gamestate1 acknowledge

With the original gamestate check it shouldn't really be possible for the acknowledge to be "lost". The way the check works is:

  • Server sends gamestate. cl->gamestateMessageNum is set to the outgoing message number of the gamestate.
  • Server sends a stream of snapshots containing higher message numbers to the client. Client sends back a stream of messages containing the serverID and acknowledge number of most recent server message.
  • If the client sends a message with the acknowledge number of one of the snapshots rather than the gamestate (higher than cl->gamestateMessageNum), and the wrong serverID, then the server knows the gamestate is dropped. There is no way for the client to receive the old gamestate at this point because the network code ignores out of order messages and the old gamestate has a lower message number than the acknowledged snapshot.

To avoid drifting netchan.outgoingSequence

That is part of how the gamestate drop detection works. Without some form of incremented message number you can't reliably tell the difference between a dropped gamestate and one that is just delayed. It may be possible to do a different implementation that periodically retransmits the gamestate (with the same message number) instead of relying on drop detection, but I wouldn't assume it's worth the trouble of using a new untested approach if the original way works well enough.

@ec-
Copy link
Owner

ec- commented Jun 24, 2024

Could you explain what you mean by the potential overflow?

every csUpdated is +1 to cl->reliableSequence -> more chance to overflow

With the original gamestate check it shouldn't really be possible for the acknowledge to be "lost"

Packet drop could happen on client -> server side and vice versa so server could send first gamestate, client receives that, ack to server is lost, server sends second gamestate, second gamestate is lost but client sends command again outgoingSequence + 1 -- so in total client received gamestate1 but server thinks its acknowledged gamestate2 - which is totally wrong so exact gamestate ack is mandatory

That is part of how the gamestate drop detection works

There is no need to send snapshots till client acknowledges gamestate, it just disrupts gamestate acknowledge, it also seems to help with delta compression disruption as I no longer see Delta request from out of date packet messages during map spawn/map restart

@Chomenor
Copy link
Contributor Author

every csUpdated is +1 to cl->reliableSequence -> more chance to overflow

Yes but there is still the question of rarity. Clearing csUpdated between gamestate retransmit may make negligible practical difference in overflow chances.

ack to server is lost, server sends second gamestate

In general, server sending second gamestate should imply the first gamestate is actually lost, not just the "ack" is lost. I thought I explained that in my message?

There is no need to send snapshots till client acknowledges gamestate

Dropped gamestate handling appears to be broken since 909ac2b. Any dropped gamestate results in a stuck client, as retransmit condition cl->messageAcknowledge - cl->gamestateMessageNum > 0 is unreachable. As noted an alternative implementation could work without sending snapshots but that hasn't been done.

I no longer see Delta request from out of date packet messages

I suspect the issue may be somewhere else. It could be related to changes in q3e to checks for creating delta frames here.

@ensiform
Copy link
Contributor

Wasn't csupdated created because of changes made by ioq3 in the way that commands happen at certain time, I don't think the vanilla had that.

@Chomenor
Copy link
Contributor Author

Here is the commit and bug tracker link. It's an optimization so that if the same configstring changes multiple times while a client is loading the map, the server only sends the final version and can skip the intermediate versions.

@Chomenor Chomenor force-pushed the download_dropped_gamestate_fix branch from 42725fa to 8135f46 Compare August 18, 2024 04:24
@Chomenor
Copy link
Contributor Author

I reviewed the latest Q3e changes as of f0c43d3. It appears the issue of hypothetical clients that don't acknowledge the exact gamestate number has been improved; such clients will now only go into a gamestate loop if there is a dropped gamestate or UDP download, not every time connecting the server.

While much better than before, I believe this can be improved further because in the normal gamestate retransmit case (wrong serverid + post-gamestate message acknowledge) the gamestate is verifiably dropped and it's not possible for the client to load an old gamestate at any point in the future regardless of any packet loss or reordering. Therefore the exact message number ack requirement is unnecessary even if the gamestate has been retransmitted and csUpdated cleared.

I've updated this PR so that dropped gamestate is always detected by the original method (wrong serverid and messageAcknowledge > gamestateMessageNum). For UDP downloads there is an extra check; gamestate must be acknowledged by either a move command or exact message number ack. I think this implementation is simpler and more robust and it fixes the gamestate loop cases noted above. csUpdated is still cleared in cases where it is safe to do so.

@ec-
Copy link
Owner

ec- commented Aug 26, 2024

@Chomenor
Copy link
Contributor Author

Chomenor commented Aug 26, 2024

Could you explain how that code is not correct? It looks correct to me and seems consistent with ioq3, but I don't have much experience with oldServerTime behavior so there might be something I'm missing.

What really seems incorrect to me is this:

// if a map_restart occurs while a client is changing maps, we need
// to give them the correct time so that when they finish loading
// they don't violate the backwards time check in cl_cgame.c
for ( i = 0; i < sv.maxclients; i++ ) {
if ( svs.clients[i].state == CS_PRIMED ) {
svs.clients[i].oldServerTime = sv.restartTime;
}
}

@ec-
Copy link
Owner

ec- commented Aug 27, 2024

it says

// this client has acknowledged the new gamestate

which is NOT correct

Removed unnecessary checks for client acknowledging the exact gamestate message number.
@Chomenor Chomenor force-pushed the download_dropped_gamestate_fix branch from 8135f46 to 3708e7c Compare August 27, 2024 04:57
@Chomenor
Copy link
Contributor Author

Okay, so nothing wrong functionally, just the comment? I moved it under the if statement so it should be more correct now.

@Chomenor
Copy link
Contributor Author

Some of this might not be a big deal functionally, but it still can matter for code quality. If there are misleading assumptions it can cause trouble for somebody trying to understand the code or build on top of it even if it mostly works currently.

typedef enum {
GSA_INIT = 0, // gamestate never sent with current sv.serverId
GSA_SENT_ONCE, // gamestate sent once, client can reply with any (messageAcknowledge - gamestateMessageNum) >= 0 and correct serverId
GSA_SENT_MANY, // gamestate sent many times, client must reply with exact gamestateMessageNum == gamestateMessageNum and correct serverId
GSA_ACKED // gamestate acknowledged, no retansmissions needed
} gameStateAck_t;

I'm not a fan of this for several reasons, from least to most significant:

  1. GSA_INIT and GSA_ACKED seem redundant to CS_CONNECTED and CS_ACTIVE respectively, for practical purposes. Duplicated state isn't ideal for complexity, obviously.

  2. The premise that after a non-download gamestate retransmit (GSA_SENT_MANY) you need messageAcknowledge == gamestateMessageNum is wrong as far as I can tell. Despite a lot of attempts I haven't been able to find any good reason for an implementation to require this, and it increases the risk of compatibility issues leading to gamestate loops.

  3. The post-download behavior (which is currently shared with GSA_SENT_MANY) also seems to be problematic. In my testing vanilla Q3 clients don't always acknowledge messageAcknowledge == gamestateMessageNum exactly leading to occasional unnecessary map reloads, sometimes multiple times in a row. It's probably some kind of race condition, but without thorough investigation it's hard to say for sure how stable it is or if there are any settings or conditions where the loop could be more persistent. That's why my version uses a move command as download acknowledge instead, which is always reliable for typical clients, with exact message ack allowed as a fallback just to be extra safe.

Conversely, the reasons I like my approach with post-download specific check:

  1. There is less state complexity (2 states instead of 4).

  2. For non-downloading clients there is no need to ever require exact gamestate message ack, which is both simpler and potentially safer.

  3. For downloading clients move command acknowledge is more reliable then exact gamestate message ack, and fixes unnecessary reload issues.

@ec-
Copy link
Owner

ec- commented Aug 27, 2024

There is less state complexity (2 states instead of 4).

I wouldn't agree because my approach is more universal and verbose

For non-downloading clients there is no need to ever require exact gamestate message ack, which is both simpler and potentially safer

Practically there is no difference between downloading and non-downloading clients: the only thing that matter is that if we sent gamestate once within new serverId (so we can accept any messageAcknowledge) or many times (we must accept exact messageAcknowledge number), I wouldn't trade that consistency for any "optimization" or modded client(s)

For downloading clients move command acknowledge is more reliable then exact gamestate message ack, and fixes unnecessary reload issues.

Again, from server perspective there is no difference between downloading and non-downloading clients: if we sent gamestate once with pacticular serverId then we can accept any messageAcknowledge, otherwise exact gamestateMessageNum must be received - this is a strict logic with no flaws, and handles every possibe packet loss on both sides

Original code also implies serverId & gamestateMessageNum match for gamestate acknowledge, I see nothing there to "optimize" except GSA_SENT_ONCE which is already implemented

@Chomenor
Copy link
Contributor Author

sent gamestate once within new serverId (so we can accept any messageAcknowledge) or many times (we must accept exact messageAcknowledge number), I wouldn't trade that consistency

I think this is where we have a fundamental difference in understanding. I do not see any consistency benefit at all from requiring exact gamestate messageAcknowledge number at any time. It just doesn't make sense to me. Is this based on a problem seen in practice or strictly theoretical?

Original code also implies serverId & gamestateMessageNum match

Could you point to where in the original code gamestateMessageNum match is expected?

@Chomenor
Copy link
Contributor Author

This was your last explanation on the matter. Is this still your current understanding? I don't think it's correct as I've tried to explain before.

Packet drop could happen on client -> server side and vice versa so server could send first gamestate, client receives that, ack to server is lost, server sends second gamestate, second gamestate is lost but client sends command again outgoingSequence + 1 -- so in total client received gamestate1 but server thinks its acknowledged gamestate2 - which is totally wrong so exact gamestate ack is mandatory

The problem starts at "ack to server is lost, server sends second gamestate". What does it mean "ack to server is lost"?

Let's say the server sent the gamestate, with messageNum = 50. It continues to send SNAPFLAG_NOT_ACTIVE snapshots with messageNum = 51, 52, 53, ...

As stated the client has received the gamestate successfully. That means CL_ParseGamestate has been called, cl.serverId has been set to the server's current serverId, and clc.serverMessageSequence has been set to 50. The message sequence may continue to increment (51, 52, ...) as snapshots arrive.

It's also stated that client to server packet loss is in effect. Note the gamestate retransmit condition: serverId != sv.serverId && cl->messageAcknowledge > cl->gamestateMessageNum. It doesn't matter how many packets are lost, the first message that makes it through will have the correct serverId and the gamestate will be considered acknowledged by the server. There is no retransmit, which means no second gamestate to be lost.

@ec-
Copy link
Owner

ec- commented Aug 28, 2024

The problem starts at "ack to server is lost, server sends second gamestate". What does it mean "ack to server is lost"?

Correction: download/whatever completes, server sends new gamestate, client packet with old messageAcknowledge (for whatever reason) arrives, server should not accept it because client may not receive gamestate at all - this is handled by GSA_SENT_MANY (second and further submissions)

As stated the client has received the gamestate successfully. That means CL_ParseGamestate has been called, cl.serverId has been set to the server's current serverId, and clc.serverMessageSequence has been set to 50. The message sequence may continue to increment (51, 52, ...) as snapshots arrive.

This case already handled by GSA_SENT_ONCE (initial submission)

Note the gamestate retransmit condition: serverId != sv.serverId && cl->messageAcknowledge > cl->gamestateMessageNum. It doesn't matter how many packets are lost, the first message that makes it through will have the correct serverId and the gamestate will be considered acknowledged by the server

It was like that when sv.serverId was unique across map restarts, now that condition just skips whole scope that was active before.

With recent state optimizations I see nothing to discuss there, original issue is already resolved

@Chomenor
Copy link
Contributor Author

Correction: download/whatever completes

My point is, if you exclude the UDP download case, there never is or was a problem with the client receiving an older gamestate than the last one the server sent. As I understand it, it is impossible as long as serverId != sv.serverId && cl->messageAcknowledge > cl->gamestateMessageNum is required to retransmit gamestate. That implies all the logic and optimizations around GSA_SENT_ONCE/GSA_SENT_MANY, tracking csUpdated changes, etc. are just unnecessary clutter.

UDP downloads are a special case due to some unique client behavior. After a download completes the client hits this block here:

Quake3e/code/client/cl_main.c

Lines 2010 to 2021 in 959876a

if ( clc.downloadRestart ) {
clc.downloadRestart = qfalse;
FS_Restart(clc.checksumFeed); // We possibly downloaded a pak, restart the file system to load it
// inform the server so we get new gamestate info
CL_AddReliableCommand( "donedl", qfalse );
// by sending the donedl command we request a new gamestate
// so we don't want to load stuff yet
return;
}

The client is now at CA_CONNECTING waiting the gamestate triggered by the "donedl" command. If the gamestate is dropped, the standard retransmit method would normally work, but there is a slight problem unique to UDP downloads: if the map didn't change the client will already have the correct serverId from before the download started and will not pass the serverId != sv.serverId condition to trigger the retransmit.

There's a few ways you can address this:

  1. You could send a different serverId to downloading/post-download clients. There's probably a few variations on how you might try to implement this. Then dropped gamestate can be handled by the normal check.
  2. You could add an extra retransmit condition (for UDP downloads only) based on a sequence of client messages with no move commands, taking advantage of the fact that the client only issues move commands after it has a gamestate. I used this method for this PR on the basis that it was easier to implement than the first method and perfectly adequate for this edge case. I even skipped the csUpdated clearing during this retransmit just to be ultra safe.
  3. You could add an extra retransmit condition based on exact gamestate message ack. This is the method q3e currently uses (or did; it seems broken in latest commits), but is potentially a bit flaky on compatibility and has some issues with spurious reloads on vanilla Q3 1.32 clients.

So basically UDP downloads are a unique special case, the only one where a client can send the correct serverId but not actually have the gamestate, and that's why this PR reflects that with a specific variable and check just for this condition. Otherwise the normal retransmit check with serverId != sv.serverId && cl->messageAcknowledge > cl->gamestateMessageNum is sufficient and you don't need any of the exact message ack stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants