Added close on the channel after a port failure #276

mmichailidis · 2021-11-19T09:29:51Z

During a kubernetes deployment graylog may change location if the original pod closes and a new one opens. The DNS assigned to it will be updated appropriately but the gelf will never try to reconnect.

After a lot of debugging using the latest logstash-gelf and

openjdk 11.0.9.1 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.9.1+1-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

It was identified that in the UDPSender.java and lines 54-58 there was a strange issue. The first isConnected() was correctly reporting false but the inner isConnected() under the connect() function was incorrectly reporting true. The strange part was that the line that was forcing isConnected() to return false was the myChannel.read(byteBuffer) which was throwing an exception.

In order to resolve this, the channel will be closed as soon as the port is unreachable (as there is no need to keep it open with an unreachable port ).

In order to test this ( as I couldn't simple test it with code. ) we did test it outside kubernetes for easier control.
2 graylog installations were deployed on 2 separate machines ( A , B ) and a new hosts entry was inserted into the machine running the java app with the gelf. the hosts entry was pointing to one of the machines (in which graylog was running) and in the other graylog was off. After the first 3-4 logs produced and correctly published to graylog the hosts entry was changed to point to the B machine and closed the graylog of A / opened graylog of B. As a result the messages were never redirected to the new machine without this change.

mp911de · 2021-11-22T08:18:39Z

Thanks for submitting a pull request. Would it generally make sense to close the socket if an exception has happened?

mmichailidis · 2021-11-22T08:35:09Z

Was actually thinking about it but the IOException is quite generic and it could be something random which would not require re-opening the channel. But at the same time close() is a no-op if the channel is already closed resulting in 0 overhead from this extra call.
As a result thinking about it now with clear mind it makes sense to close the channel on exception. Should I update the PR ?

mp911de · 2021-11-22T08:48:45Z

Thanks for your opinion. I primarily wanted to get another perspective, I'm happy to update the PR on my side.

I/O exceptions can happen due to obvious network reasons (connection reset by peer, TCP retransmit timeout) but also some server-side lagging (read timeout). I assume that read timeouts are a fine indicator for closing the connection. Network partitions or forced disconnects are as well a good justification to close an open socket early on.

mmichailidis · 2021-11-22T08:51:49Z

Yes, you are right. Its better to early close and retry the connection instead of hanging to a problematic one.

Added close on the channel after a port failure

db0bbb3

mp911de added the type: bug A general bug label Nov 22, 2021

mp911de added this to the 1.15.0 milestone Jan 21, 2022

mp911de merged commit fa6b152 into mp911de:main Jan 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added close on the channel after a port failure #276

Added close on the channel after a port failure #276

mmichailidis commented Nov 19, 2021

mp911de commented Nov 22, 2021

mmichailidis commented Nov 22, 2021

mp911de commented Nov 22, 2021

mmichailidis commented Nov 22, 2021

Added close on the channel after a port failure #276

Added close on the channel after a port failure #276

Conversation

mmichailidis commented Nov 19, 2021

mp911de commented Nov 22, 2021

mmichailidis commented Nov 22, 2021

mp911de commented Nov 22, 2021

mmichailidis commented Nov 22, 2021