Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeouts on calls to docker api #12310

Merged
merged 4 commits into from
Jun 4, 2019
Merged

Conversation

marqc
Copy link
Contributor

@marqc marqc commented May 28, 2019

Adds timeouts on calls to docker api:

  • 10 seconds on each list and inspect containers call (on start and discovery container event)
  • 60 minute for watch call
  • active ticker restarting watch call when no events are received within 10 minutes

Docker has long history of problems with unresponsive API. If unlucky docker watcher can stuck silently forever on calls to docker API. Adding some timeouts should prevent possibility of this happening.

@marqc marqc requested a review from a team as a code owner May 28, 2019 10:08
@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@marqc marqc force-pushed the docker_timeouts branch from dfe3af9 to 68a3133 Compare May 29, 2019 08:48
@@ -316,7 +338,10 @@ func (w *watcher) listContainers(options types.ContainerListOptions) ([]*Contain
// If there are no network interfaces, assume that the container is on host network
// Inspect the container directly and use the hostname as the IP address in order
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumption is actually incorrect. Docker can have some of known network modes: default (bridge), host and container (usually used in cloud platforms like kubernetes).

It can be determined by checking .HostConfig.NetworkMode property:

$ sudo curl -s --fail -m3 -XGET --unix-socket /var/run/docker.sock http://localhost/containers/json | jq -r .[].HostConfig.NetworkMode
container:b79efb8d2710b3fb2ae6b9fff5e1cf6f8e841301df9a355371c197156d72c3d4
container:a7fb40e3f13684e91f26c1ed26096e0d39df5d7860db31bab5f924cec5838366
default
default
default
host
container:9f187b5c30cb2ceb43907a2513e093f1039be42c230c33d70a601e9c3173dc5d
default
default

if network mode is container:<ID> this <ID> should be inspected to fetch from standard config .NetworkSettings.Networks like above.

But this is subject for separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds about right! are you willing to open a new PR? 😇

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to take a shot on this next week

case <-tickChan.C:
if time.Since(time.Unix(w.lastValidTimestamp, 0)) > 10*time.Minute {
logp.Info("No events received withing 10 minutes, restarting watch call")
w.lastValidTimestamp = time.Now().Unix()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why updating this here? If watch call has gone mad maybe we missed some previous events?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed code to use separate var for tracking watch activity

return
}
}

tickChan.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't be called if return is hit before (on watcher stop), you could probably use defer stop after starting it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

}
}

func (w *watcher) listContainers(options types.ContainerListOptions) ([]*Container, error) {
containers, err := w.client.ContainerList(w.ctx, options)
logp.Debug("docker", "List containers")
ctx, cancel := context.WithTimeout(w.ctx, 10*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please make all new timeout constants? this should make it easier to fine tune

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Copy link
Contributor

@exekias exekias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking this! really appreciated 🎉 🎉

@exekias exekias added Team:Integrations Label for the Integrations team containers Related to containers use case enhancement review labels May 31, 2019
@exekias exekias self-assigned this May 31, 2019
@exekias
Copy link
Contributor

exekias commented Jun 4, 2019

Thank you for contributing!! 🎉

jenkins, test this please

@exekias
Copy link
Contributor

exekias commented Jun 4, 2019

waiting for green

@exekias exekias merged commit 1ade617 into elastic:master Jun 4, 2019
andrewvc pushed a commit to andrewvc/beats that referenced this pull request Jun 12, 2019
* Add timeouts on calls to docker api
@marqc
Copy link
Contributor Author

marqc commented Jul 2, 2019

@exekias how about including this in next 7.x release?

@exekias
Copy link
Contributor

exekias commented Jul 3, 2019

definitely! this made the cut for 7.3, so it will be part of it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
containers Related to containers use case enhancement review Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants