
UDP connection returns "write: connection refused" #393

Closed
masterada opened this issue May 30, 2019 · 3 comments · Fixed by #520

Comments

@masterada

Requirement - what kind of business use case are you trying to solve?

We use Jaeger in a production Kubernetes cluster. Agents run as a DaemonSet with a UDP hostPort for 6831. Sometimes when an agent is restarted, the service that was calling it gets a "write: connection refused" error, and from then on it keeps returning the same error until the service is restarted.

Problem - what in Jaeger blocks you from solving the requirement?

This causes the service to stop reporting spans correctly. The error log comes from:

r.logger.Error(fmt.Sprintf("error when flushing the buffer: %s", err.Error()))

Proposal - what do you suggest to solve the problem or improve the existing situation?

I suggest recreating the connection in https://github.com/jaegertracing/jaeger-client-go/blob/master/utils/udp_client.go#L91 when an error is returned (or maybe only after a certain number of errors). Calling Temporary() on the error and recreating the connection only if the error is not temporary might work even better.
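Roughly what I have in mind (a minimal sketch, not the actual udp_client.go code; reconnectingUDPConn and its methods are hypothetical names):

```go
package main

import "net"

// reconnectingUDPConn is a hypothetical wrapper sketching the proposal:
// if a write fails with a non-temporary error (e.g. "write: connection
// refused"), re-dial the agent and retry once on the fresh socket.
type reconnectingUDPConn struct {
	hostPort string
	conn     *net.UDPConn
}

func newReconnectingUDPConn(hostPort string) (*reconnectingUDPConn, error) {
	c := &reconnectingUDPConn{hostPort: hostPort}
	if err := c.dial(); err != nil {
		return nil, err
	}
	return c, nil
}

func (c *reconnectingUDPConn) dial() error {
	addr, err := net.ResolveUDPAddr("udp", c.hostPort)
	if err != nil {
		return err
	}
	conn, err := net.DialUDP("udp", nil, addr)
	if err != nil {
		return err
	}
	if c.conn != nil {
		c.conn.Close()
	}
	c.conn = conn
	return nil
}

func (c *reconnectingUDPConn) Write(b []byte) (int, error) {
	n, err := c.conn.Write(b)
	if err == nil {
		return n, nil
	}
	// Keep the existing socket for errors that report themselves as temporary.
	if nerr, ok := err.(net.Error); ok && nerr.Temporary() {
		return n, err
	}
	// Non-temporary error: recreate the connection and retry once.
	if derr := c.dial(); derr != nil {
		return n, derr
	}
	return c.conn.Write(b)
}

func main() {
	conn, err := newReconnectingUDPConn("127.0.0.1:6831")
	if err != nil {
		panic(err)
	}
	defer conn.conn.Close()
	conn.Write([]byte("emitBatch payload would go here"))
}
```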

Any open questions to address

It's a rare issue and I can't really reproduce it. On the other hand, I don't think my suggestion would break anything. I'm also open to any suggestions on how I could handle this issue in my own code.

@yurishkuro
Member

UDP is a connectionless protocol. The fact that Go has such a thing as UDPConn doesn't mean there's an actual connection to the server; it's just a representation of the local socket. There's an explanation on Stack Overflow of how such a connectionless protocol can still report errors like "connection refused".

So I am skeptical about "it keeps returning the same error until the service is restarted" - it may just be a temporary condition. I just tried it locally, and I did get the errors for some spans after restarting all-in-one, but then those errors went away and traces were reported fine.
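You can see the same behavior with a small self-contained program (a sketch; nothing is listening on the port, and the exact timing of the error is OS-dependent - observed on Linux):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// "Dialing" UDP never contacts the server; it only creates a local
	// socket with a default destination, so this succeeds even though
	// nothing is listening on 127.0.0.1:6831.
	addr, err := net.ResolveUDPAddr("udp", "127.0.0.1:6831")
	if err != nil {
		panic(err)
	}
	conn, err := net.DialUDP("udp", nil, addr)
	fmt.Println("dial error:", err) // nil: no listener is required
	if err != nil {
		return
	}
	defer conn.Close()

	// The first write typically succeeds as well: the datagram goes out,
	// and only afterwards does the kernel receive an ICMP "port
	// unreachable" reply for it.
	_, err = conn.Write([]byte("hello"))
	fmt.Println("first write error:", err)

	// Once that ICMP error has been queued on the socket, the next write
	// reports it as "write: connection refused" - even though no
	// connection ever existed.
	time.Sleep(100 * time.Millisecond)
	_, err = conn.Write([]byte("hello again"))
	fmt.Println("second write error:", err)
}
```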

@cuihaikuo

I have the same problem. It's hard to tell how long this "temporary condition" will last; I believe it is related to some system UDP configuration.

@cuihaikuo

This problem is caused by conntrack. The Linux kernel keeps track of each connection; when you restart Jaeger, the cluster IP of jaeger-agent changes, but the old track is still stored on your service's host, so your service keeps sending spans to the non-existing IP. There is a conntrack configuration called ip_conntrack_udp_timeout whose default value is 30s. That is to say, if the entry sees no traffic for 30s, the track expires and the service returns to normal - that is the "temporary condition". You can also manually delete the track with sudo conntrack -D -p udp.
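A possible client-side mitigation, sketched below (hypothetical code, not part of jaeger-client-go): re-dialing the agent on an interval makes writes go out from a fresh source port, so the kernel creates a new conntrack entry instead of reusing the stale one that still points at the old pod IP.

```go
package main

import (
	"log"
	"net"
	"sync"
	"time"
)

// redialingWriter periodically re-dials the agent so that a new source port
// (and therefore a new conntrack entry) is used. Names are hypothetical.
type redialingWriter struct {
	mu       sync.Mutex
	hostPort string
	conn     *net.UDPConn
}

func newRedialingWriter(hostPort string, every time.Duration) (*redialingWriter, error) {
	w := &redialingWriter{hostPort: hostPort}
	if err := w.redial(); err != nil {
		return nil, err
	}
	go func() {
		for range time.Tick(every) {
			if err := w.redial(); err != nil {
				log.Printf("re-dial failed: %v", err)
			}
		}
	}()
	return w, nil
}

func (w *redialingWriter) redial() error {
	addr, err := net.ResolveUDPAddr("udp", w.hostPort)
	if err != nil {
		return err
	}
	conn, err := net.DialUDP("udp", nil, addr)
	if err != nil {
		return err
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.conn != nil {
		w.conn.Close()
	}
	w.conn = conn
	return nil
}

func (w *redialingWriter) Write(b []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.conn.Write(b)
}

func main() {
	// In a cluster this would be the agent's host:port, e.g. the node IP
	// with hostPort 6831.
	w, err := newRedialingWriter("127.0.0.1:6831", 30*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write([]byte("spans")); err != nil {
		log.Println(err)
	}
}
```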
