Fix loophole in game server initialization #354
Comments
Aaaaah! That's what is causing the e2e tests to fail! Nice catch, will look into it!!!
This looks to be the cause of the issues.

```json
{
  "error": "error getting external address for GameServer simple-fleet-xl2n9-gl8ws-nbtrl: error retrieving node for Pod simple-fleet-xl2n9-gl8ws-nbtrl-8j5b6: node \"\" not found",
  "level": "error",
  "msg": "",
  "obj": "default/simple-fleet-xl2n9-gl8ws-nbtrl",
  "queue": "stable.agones.dev.GameServerController",
  "source": "*gameservers.Controller",
  "time": "2018-09-12T16:32:00Z"
}
```
```json
{
  "level": "error",
  "msg": "error getting external address for GameServer simple-fleet-xl2n9-gl8ws-nbtrl: error retrieving node for Pod simple-fleet-xl2n9-gl8ws-nbtrl-8j5b6: node \"\" not found",
  "stack": [
    "agones.dev/agones/pkg/gameservers.(*Controller).applyGameServerAddressAndPort\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:506",
    "agones.dev/agones/pkg/gameservers.(*Controller).syncGameServerCreatingState\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:394",
    "agones.dev/agones/pkg/gameservers.(*Controller).syncGameServer\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:283",
    "agones.dev/agones/pkg/gameservers.(*Controller).(agones.dev/agones/pkg/gameservers.syncGameServer)-fm\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:115",
    "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).processNextWorkItem\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:107",
    "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).runWorker\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:83",
    "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).(agones.dev/agones/pkg/util/workerqueue.runWorker)-fm\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:135",
    "agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133",
    "agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134",
    "agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88",
    "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).run\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:135",
    "runtime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2361"
  ],
  "time": "2018-09-12T16:32:00Z"
}
```

Basically, a Pod can be created, but the backing container may fail (or just be slow) to be initially scheduled. If it isn't scheduled, it doesn't have a node. For example, under resource contention there can be a delay between the container being declared and it actually running on the cluster. This wasn't an issue before, because we were doing node detection on

Initial thoughts for a fix are to potentially create a new state for Address population (Status:
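To illustrate the failure mode, here is a minimal, self-contained sketch. The `Pod` type and `externalAddress` function are simplified stand-ins, not the real Agones or Kubernetes code; the one grounded detail is that a Pod's spec only carries a node name once the scheduler has assigned one, so a node lookup for an unscheduled Pod fails exactly as in the log above:

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down shape of the Kubernetes object.
// NodeName stays empty until the scheduler assigns the Pod to a node.
type Pod struct {
	Name     string
	NodeName string
}

// externalAddress mimics the failing lookup: resolving the GameServer's
// external address requires the Pod's node, which can't be retrieved
// while the Pod is still unscheduled.
func externalAddress(p Pod) (string, error) {
	if p.NodeName == "" {
		return "", fmt.Errorf("error retrieving node for Pod %s: node %q not found", p.Name, p.NodeName)
	}
	return "10.0.0.1", nil // placeholder for looking up the node's external IP
}

func main() {
	unscheduled := Pod{Name: "simple-fleet-xl2n9-gl8ws-nbtrl-8j5b6"}
	if _, err := externalAddress(unscheduled); err != nil {
		fmt.Println(err) // the same "node \"\" not found" shape as the log
	}

	scheduled := Pod{Name: "simple-fleet-xl2n9-gl8ws-nbtrl-8j5b6", NodeName: "gke-node-1"}
	addr, _ := externalAddress(scheduled)
	fmt.Println(addr)
}
```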
I think that will work, but I will want to try it out and see how it goes.
@markmandel I had a simpler solution in mind: just populate the address info when setting the GS as Ready, if it's not set already. Why add an extra delay? If we set it as Ready, we definitely have a node. But yes, what you proposed makes sense.
That's what we used to do - but we changed it so that the game server process could reliably get its IP and port before declaring itself as Ready, if it needs to. If we revert this, that goes away. Unless I misunderstood?
Yes, it looks like you misunderstood 😕. Anyway, your solution of blocking Ready does not allow the server to
Ah, okay. Still not quite sure I follow 😣 But re: blocking, that's mainly to ensure that a game server can't mark itself as Ready before it has an IP and port -- which stops a gameserver being allocated with (potentially) no IP and port. For the game server to reliably get its IP and port, it needs to listen to the configuration update notifications via the SDK until the status values are populated, before marking itself as Ready (if that is something it needs to do). If it doesn't, it won't break anything else, because of the blocking behaviour (which should be relatively short/non-existent). Does that make sense?
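The pattern described above - watch configuration updates until the status carries an address and port, only then declare Ready - can be sketched roughly like this. The types and the channel-based "watch" are illustrative mocks, not the real Agones SDK surface:

```go
package main

import "fmt"

// GameServerStatus is a hypothetical, simplified stand-in for the
// status fields the SDK's configuration watch delivers.
type GameServerStatus struct {
	Address string
	Port    int
}

// watchUntilAddressed drains configuration updates until the status
// carries an address and port; at that point it is safe to call Ready,
// because the controller has resolved the node and populated the status.
func watchUntilAddressed(updates <-chan GameServerStatus) GameServerStatus {
	for s := range updates {
		if s.Address != "" && s.Port != 0 {
			return s
		}
	}
	return GameServerStatus{}
}

func main() {
	updates := make(chan GameServerStatus, 2)
	updates <- GameServerStatus{}                                // first update: status not populated yet
	updates <- GameServerStatus{Address: "10.0.0.1", Port: 7777} // controller caught up
	close(updates)

	s := watchUntilAddressed(updates)
	fmt.Printf("declaring Ready, serving on %s:%d\n", s.Address, s.Port)
}
```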
Ok, this means that my idea would work.

Okay cool - I'm working on my strategy 😄 At least one of them should definitely work! 👍
Work started here: https://github.com/markmandel/agones/tree/bug/no-node-pod |
Ah, I forgot that I can show code that way. |
Ooooh! I follow now! 🤦‍♂️ Not an "instead of", but an "as well as". Got it. Code definitely made it clearer - thanks for that. (I'm blaming jet lag! 🤣) With this solution though - if the initial pass to set the address fails - then how will it end up getting to the SDK's config listener? Will it just get skipped and never be populated? Now I'm thinking a solution that's in between will be best:
Or is that what you were already thinking? 😄

Updated with the first pass on the above approach: (Still need to fix tests and make sure everything works -- but you can get the gist)

Yes, I took the liberty of leaving a couple of comments 😄
This fix does several things:

1. Breaks out IP and port setting of the `Status` into its own state, to make implementation and testing easier.
2. Listens for Pod updates, and if a Pod is a `GameServer` pod that has had its `NodeName` set, queues the backing `GameServer`.
3. Finally - if Ready() gets called before any of this, populates the `Status` IP and port values at this time as well.

Fixes googleforgames#354
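Step 2 of that fix - react to a Pod gaining a `NodeName` by queueing its backing `GameServer` - can be sketched as below. All names here are illustrative stand-ins, not the real Agones controller types:

```go
package main

import "fmt"

// Pod is a hypothetical, minimal shape: GameServerName is empty when
// the Pod does not back a GameServer.
type Pod struct {
	NodeName       string
	GameServerName string
}

// Queue is a toy stand-in for the controller's workqueue.
type Queue struct{ items []string }

func (q *Queue) Enqueue(name string) { q.items = append(q.items, name) }

// onPodUpdate mirrors the idea in step 2: when a GameServer Pod gains a
// NodeName (i.e. the scheduler placed it), queue the backing GameServer
// so the controller can retry populating Status address and port.
func onPodUpdate(old, new Pod, q *Queue) {
	if new.GameServerName == "" {
		return // not a GameServer pod
	}
	if old.NodeName == "" && new.NodeName != "" {
		q.Enqueue(new.GameServerName)
	}
}

func main() {
	q := &Queue{}
	before := Pod{GameServerName: "simple-fleet-xl2n9-gl8ws-nbtrl"}
	after := before
	after.NodeName = "gke-node-1" // the scheduler assigned a node
	onPodUpdate(before, after, q)
	fmt.Println(q.items)
}
```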
The normal server init flow is creating -> starting -> requestready -> ready.
In the creating state we do the following:
If, after creating a pod, we can't retrieve its address for some reason (func applyGameServerAddressAndPort fails), we stop, and the game server remains in the creating state with its status unpopulated.
The pod is created though, so the game server starts and sets itself as ready.
Therefore, we jump directly from the creating state to requestready, without going through starting.
The status is never populated again, because the prerequisite is being in the creating state.
The result is gameservers that look like this:
/cc @markmandel
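The broken flow described above can be modelled as a minimal state machine. This is a self-contained illustration with hypothetical names, not the real controller code:

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	Creating     State = "Creating"
	Starting     State = "Starting"
	RequestReady State = "RequestReady"
)

// syncCreating models the buggy controller step: if address resolution
// fails (the Pod has no node yet), the GameServer stays in Creating
// with an empty address, and nothing ever retries the lookup.
func syncCreating(nodeName string) (State, string, error) {
	if nodeName == "" {
		return Creating, "", errors.New("node \"\" not found")
	}
	return Starting, "10.0.0.1", nil
}

func main() {
	// Pod created but not yet scheduled: address resolution fails,
	// so the GameServer is stuck in Creating with no address.
	state, addr, err := syncCreating("")
	fmt.Println(state, addr, err)

	// Meanwhile the game process comes up and calls Ready() via the
	// SDK, jumping the GameServer straight to RequestReady. Address
	// population only happens in Creating, so the status stays empty.
	state = RequestReady
	fmt.Println(state, "- address still empty:", addr == "")
}
```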