Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix loophole in game server initialization #354

Closed
victor-prodan opened this issue Sep 12, 2018 · 13 comments · Fixed by #357
Closed

Fix loophole in game server initialization #354

victor-prodan opened this issue Sep 12, 2018 · 13 comments · Fixed by #357
Assignees
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Milestone

Comments

@victor-prodan
Copy link
Contributor

victor-prodan commented Sep 12, 2018

The normal server init flow is creating->starting->requestready->ready

In creating state we do the following:

  • createGameServerPod
  • applyGameServerAddressAndPort

If after creating a pod, we can't retrieve its address for some reason (func applyGameServerAddressAndPort fails), we stop and the game remains in creating state without status unpopulated.

The pod is created though, and the game server starts and sets itself as ready
Therefore, we jump directly from creating to requestready state, without going through starting
The status is never populated again, because the prerequisite is to be in creating state.

The result is gameservers that look like this:

Status:
  Address:
  Node Name:
  Ports:      <nil>
  State:      Ready

/cc @markmandel

@markmandel
Copy link
Member

Aaaaah! That's what is causing the e2e tests to fail! Nice catch, will look into it!!!

@markmandel markmandel added area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs. labels Sep 12, 2018
@markmandel markmandel added this to the 0.5.0 milestone Sep 13, 2018
@markmandel
Copy link
Member

This looks to be the cause of the issues.

{
    "error": "error getting external address for GameServer simple-fleet-xl2n9-gl8ws-nbtrl: error retrieving node for Pod simple-fleet-xl2n9-gl8ws-nbtrl-8j5b6: node \"\" not found",
    "level": "error",
    "msg": "",
    "obj": "default/simple-fleet-xl2n9-gl8ws-nbtrl",
    "queue": "stable.agones.dev.GameServerController",
    "source": "*gameservers.Controller",
    "time": "2018-09-12T16:32:00Z"
}
{
    "level": "error",
    "msg": "error getting external address for GameServer simple-fleet-xl2n9-gl8ws-nbtrl: error retrieving node for Pod simple-fleet-xl2n9-gl8ws-nbtrl-8j5b6: node \"\" not found",
    "stack": [
        "agones.dev/agones/pkg/gameservers.(*Controller).applyGameServerAddressAndPort\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:506",
        "agones.dev/agones/pkg/gameservers.(*Controller).syncGameServerCreatingState\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:394",
        "agones.dev/agones/pkg/gameservers.(*Controller).syncGameServer\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:283",
        "agones.dev/agones/pkg/gameservers.(*Controller).(agones.dev/agones/pkg/gameservers.syncGameServer)-fm\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:115",
        "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).processNextWorkItem\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:107",
        "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).runWorker\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:83",
        "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).(agones.dev/agones/pkg/util/workerqueue.runWorker)-fm\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:135",
        "agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133",
        "agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134",
        "agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88",
        "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).run\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:135",
        "runtime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2361"
    ],
    "time": "2018-09-12T16:32:00Z"
}

Basically, a Pod can be created, and the backing container may(s) fail (or just be slow) to be initially scheduled. If they aren't scheduled, then they don't have a node. For example, there could be resource contention, and there could be a delay in the container being able to be scheduled on the cluster between being declared and actually running.

This wasn't an issue before, because we were doing node detection on Ready() which would only be called once the container was fully operational.

Initial thoughts for a fix are to potentially create a new state for Address population (Status: AwaitingAddress ?), which was something I had considered when implementing #293. Then what I think will need to happen is:

  1. track pod updates in the GameServer controller, and if a gameserver pod, and if has gone from not having a node, to having a node - populate the details of the GameServer, if not already present. This will catch anything that doesn't already have an address for any reason.
  2. The Ready() SDK will need to block, with a 30 second timeout, until the ports and address are populated, before marking the GameServer itself as Ready.

I think that will work, but will want to try it out and see how it goes.

@markmandel markmandel self-assigned this Sep 13, 2018
@victor-prodan
Copy link
Contributor Author

victor-prodan commented Sep 13, 2018

@markmandel I had simpler solution in mind. Just populate the adress info when setting the GS as ready, If not set already. Why add an extra delay? If we set as ready, we definitely have a node.

But yes, what you proposed makes sense.

@markmandel
Copy link
Member

That's what we used to do - but we changed it so that the game server process could reliably get its ip and port if it needs to before declaring itself as Ready. If we revert this, that goes away.

Unless I misunderstood?

@victor-prodan
Copy link
Contributor Author

Yes, it looks like you misunderstood 😕.
I did not say to revert, but to merge.

Anyway, your solution of blocking ready does not allow the server to reliably get its own ip before declaring itself as ready. You need an extra blocking call before that. Maybe the equivalent of GameLift's Init?

@markmandel
Copy link
Member

Ah, okay. Still not quite sure I follow 😣

But re: blocking, that's mainly to ensure that a game server can't mark itself as Ready before it has an ip and port -- which stops a gameserver with (potentially) no IP and port being allocated.

The game server can reliably get its ip and port, it needs to listen to the configuration update notifications via the sdk until the status values are populated, before marking itself as Ready (if that is something it needs to do). If it doesn't, it won't break anything else, because of the blocking behaviour (which should be relatively short/non existing).

That make sense?

@victor-prodan
Copy link
Contributor Author

Ok, this means that my idea would work.
And because a PR is better than 1000 words, I'll make one soon...

@markmandel
Copy link
Member

Okay cool - I'm working on my strategy 😄 at least one of them should definitely work! 👍

@markmandel
Copy link
Member

@victor-prodan
Copy link
Contributor Author

victor-prodan commented Sep 13, 2018

Ah, I forgot that I can show code that way.
No PR then, just have a look here: https://github.com/victor-prodan/agones/tree/serverloopholefix.
Mine is more of a "the smallest set of changes to fix the biggest isssue"

@markmandel
Copy link
Member

markmandel commented Sep 13, 2018

Ooooh! I follow now! 🤦‍♂️ not "a instead of", but an "as well as". Got it. Code definitely made it clearer - thanks for that. (I'm blaming jet lag! 🤣)

With this solution though - if the initial pass to set the address fails -- then how will it end up getting to the SDK's config listener? It will just get skipped, and never get populated?

Now I'm thinking a solution that's in between will be best:

  1. Keep the extra Game State for Status > address+post updating (with the listener for pod events, so that if the Pod address get's populated later, it gets picked up)
  2. Rather than block the Ready() - use your approach, which is to just populate the address details on RequestReady just like you do now -- no reason to block at all -- then if the container comes up faster than the Pod update event, then the Status > Address etc still get's populated, and everything is fine -- which is definitely much simpler.

Or is that what you were already thinking? 😄

@markmandel
Copy link
Member

Updated with the first pass on the above approach:
https://github.com/markmandel/agones/tree/bug/no-node-pod

(Still need to fix tests and make sure everything works -- but you can get the gist)

@victor-prodan
Copy link
Contributor Author

Yes, I took the liberty of leaving a couple of comments 😄

markmandel added a commit to markmandel/agones that referenced this issue Sep 18, 2018
This fix does several things:

1. Breaks out ip and port setting of the `Status` into its own state, to make
    implementation and testing easier.
1. Listen for Pod updates, and if a Pod is a `GameServer` pod that has had
    its `NodeName` set, then queue the backing `GameServer`
1. Finally - if Ready() gets called before any of this, populate the `Status` ip
   and port values at this time as well.

Fixes googleforgames#354
markmandel added a commit to markmandel/agones that referenced this issue Sep 19, 2018
This fix does several things:

1. Breaks out ip and port setting of the `Status` into its own state, to make
    implementation and testing easier.
1. Listen for Pod updates, and if a Pod is a `GameServer` pod that has had
    its `NodeName` set, then queue the backing `GameServer`
1. Finally - if Ready() gets called before any of this, populate the `Status` ip
   and port values at this time as well.

Fixes googleforgames#354
cyriltovena pushed a commit that referenced this issue Sep 20, 2018
This fix does several things:

1. Breaks out ip and port setting of the `Status` into its own state, to make
    implementation and testing easier.
1. Listen for Pod updates, and if a Pod is a `GameServer` pod that has had
    its `NodeName` set, then queue the backing `GameServer`
1. Finally - if Ready() gets called before any of this, populate the `Status` ip
   and port values at this time as well.

Fixes #354
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants