
Backport TCP window, client closing fixes to 6.0.x #50

Merged

Conversation


@yaauie yaauie commented Nov 3, 2022

Because 6.1.0 changed this plugin's dependency on Logstash to require Logstash 8 and we still commercially support Logstash 7.13+, I have branched from 6.0.2 and am backporting relevant fixes to the 6.0.x series.

This is not typical SemVer, and I would separately like to find a way to get 6.1.x of this plugin running on Logstash 7, even if that means going without one or more of the features introduced in the 6.1 series (e.g., it works unless you try to use the new feature on LS 7).


Here, we manually backport two main fixes, each in its own commit for clarity:

We also:

Unfortunately, in addition to breaking compatibility with LS 7.x, #47 also introduced two bugs that were fixed in #49 (repeating thread names, logger access) and one more that is first addressed here (a crash while closing the plugin) and will need to be forward-ported, so the lines are muddled.

I have elected to group whole fixes by commit as much as possible, instead of doing true backports.

@yaauie yaauie changed the title from "Backport tcp window support to 6 0 x" to "Backport TCP window, client closing fixes to 6.0.x" on Nov 3, 2022
@yaauie yaauie marked this pull request as draft November 3, 2022 17:59
@yaauie yaauie force-pushed the backport-tcp-window-support-to-6-0-x branch from a4c726e to 879be71 on November 3, 2022 18:10

@yaauie yaauie left a comment


I added a minimal set of specs and found multiple additional bugs around shutdown sequencing.

They are all fixed here, and will need to be forward-ported to main/6.1.

if server?
  # server-mode clean-up
  @closed.make_true
  @server_socket.shutdown rescue nil if @server_socket

Will need to be forward-ported to the main/6.1 series, which sends TCPServer#close instead of this TCPServer#shutdown, and causes the plugin to crash during shutdown because a blocking call on TCPServer#accept throws an IOError when the server is closed out from under it.
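For illustration, the failure mode is roughly the following (a minimal sketch with made-up names, not the plugin's actual code):

require 'socket'

server = TCPServer.new('127.0.0.1', 0)

acceptor = Thread.new do
  loop do
    begin
      client = server.accept # blocks waiting for the next connection
      client.close
    rescue IOError
      # closing the server from another thread raises IOError in the blocked
      # accept; without rescuing it, shutdown crashes the plugin
      break
    end
  end
end

server.close # 6.1-style shutdown: closes the listener out from under accept
acceptor.join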

  end
else
  # client-mode clean-up
  @client_socket&.close

All related @client_socket bits will need to be forward-ported to the main/6.1 series. It cannot be a local variable in register, because we need to be able to close it in our close method.
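For illustration only, the shape being described is roughly this (simplified, with the plugin's reconnect and error handling omitted):

def register
  # keep the socket in an instance variable so other methods can reach it
  @client_socket = TCPSocket.new(@host, @port) unless server?
end

def close
  # if the socket were a local variable inside register, there would be
  # nothing here for close to call #close on
  @client_socket&.close
end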


# Now send the payload
client_socket.syswrite(payload) if w.any?
@logger.trace("transmitting #{payload.bytesize} bytes", host: @host, port: @port, socket: @client_socket&.peer) if @logger.trace? && payload && !payload.empty?

Logging should be forward-ported to the main/6.1 series.

:exception => e, :backtrace => e.backtrace)
client_socket.close rescue nil
client_socket = nil
log_warn "client socket failed:", e, host: @host, port: @port, socket: @client_socket&.peer

TCPSocket#to_s is useless object inspection, so we should send TCPSocket#peer, which gives us useful information about the socket we are connected to. This will need to be forward-ported to main/6.1 too.
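For illustration, a peer description can be derived from the stdlib's Socket#peeraddr; something along these lines (the helper name here is hypothetical, not necessarily the plugin's implementation):

# derive a useful "ip:port" description of the remote end of a socket
def peer_for(socket)
  _family, port, _hostname, ip = socket.peeraddr
  "#{ip}:#{port}"
rescue IOError, Errno::ENOTCONN
  nil # socket already closed, or never connected
end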

@@ -73,4 +73,109 @@
end
end
end

context 'client mode' do

All new specs will need to be forward-ported to the main/6.1 series as well.

@yaauie yaauie force-pushed the backport-tcp-window-support-to-6-0-x branch from f0b8b20 to ea5c5c9 on November 3, 2022 23:25
@yaauie yaauie force-pushed the backport-tcp-window-support-to-6-0-x branch 2 times, most recently from 8f6b61a to dd06530 on November 4, 2022 00:30
@yaauie yaauie force-pushed the backport-tcp-window-support-to-6-0-x branch from dd06530 to 3a6113d on November 4, 2022 00:33

yaauie commented Nov 4, 2022

CI is failing intermittently for the cross-ported fixes for server mode, and I am having a very difficult time figuring out why, as everything is consistently green in local Docker tests (on my M1 Mac) with identical parameters. Every single IO we have control over is written in a way that makes it tolerant of partial writes, and still we are occasionally seeing silent truncation on the 100MB test.

Since 3a6113d all of the failures have been in the Logstash 7 series, so I am beginning to think that I have found a bug in JRuby.


yaauie commented Nov 4, 2022

🤦🏼 When I reduced the amount of memory available to my local Docker, I was able to reproduce.

In the test code I was reading until EOF instead of until the IO was closed, and the memory pressure slowed down reads enough that I hit EOF before the writer was done writing.

EDIT: that wasn't it either. Because server-mode's Outputs::TCP#receive does not block, we need to wait to send close to the plugin until the payload is "probably" fully-written. We do this by blocking until the StringIO that is receiving the payload is approximately the right size.

Because server-mode's TCP#receive does not block, we need to wait until the client thread has popped the payload from its queue and sent it over the socket before telling the TCP output to close, or else our close interrupts an in-flight event and our specs get a partial payload.
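For illustration, the wait the specs ended up needing looks roughly like this (variable names here are hypothetical, not the actual spec code):

# block until the StringIO collecting the server-mode output has received the
# whole payload, then close the plugin, so close doesn't interrupt an
# in-flight event
deadline = Time.now + 30
until received_io.size >= payload.bytesize
  raise 'timed out waiting for the payload to arrive' if Time.now > deadline
  sleep 0.1
end
plugin.close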
@yaauie yaauie marked this pull request as ready for review November 4, 2022 07:55
@andsel andsel self-requested a review November 4, 2022 08:02

@andsel andsel left a comment


LGTM; I've left a comment about one doubt I had.


andsel commented Nov 4, 2022

This is a follow-up thought I had while reviewing the code, but it's more related to the plugin implementation than to this PR. I would really appreciate hearing your thoughts.

I think we have a problem in the shutdown of the client threads in server mode.
Each thread executes a busy loop in Client#run, without any guard on an exit condition, essentially doing:

loop do
  begin
    payload = @queue.pop
    @socket.write(payload)
  rescue 
    break  
  end
end

The close method of the Client is:

def close
  @socket.close
end

The client threads could be blocked in two places:

  • queue.pop
  • socket.write

If a thread is blocked on queue.pop because the queue is empty, then the close method doesn't have any effect. In this case the thread should be interrupted (with something like Java's Thread.interrupt()) to force it to exit, and the loop should also test for interruption so that it can exit.
If it's blocked on socket.write, then socket.close generates an exception in the loop, which makes it exit.
Also, the plugin's stop method shuts down the server socket, stopping the acceptance of new connections and the spawning of new client threads, but it doesn't join the already-created threads. That is implicitly done by the close of the socket: being a blocking call, we know that when it returns the socket is closed, the Client received an error, and it exited. But I think we depend too much on the scheduler here; it could happen that the plugin has terminated while one of the client threads has yet to reach the socket.write that generates the exception and forces it to break the loop.

So my point is that we need to improve the shutdown sequence, interrupting the thread and joining it.
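For illustration, one possible shape for that is sketched below (not the plugin's current code; in Ruby, Queue#close can play the role of the interruption, since it wakes a thread blocked in Queue#pop and makes pop return nil once the queue is drained):

class Client
  def initialize(socket)
    @socket = socket
    @queue  = Queue.new
    @thread = Thread.new { run }
  end

  def run
    while (payload = @queue.pop) # nil once the queue is closed and drained
      @socket.write(payload)
    end
  rescue IOError, SystemCallError
    # a closed or failed socket also ends the loop
  end

  def close
    @queue.close           # wake a thread blocked in @queue.pop
    @thread.join(5)        # bounded wait for the worker to drain and exit
    @socket.close rescue nil
  end
end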


yaauie commented Nov 4, 2022

So my point is that we need to improve the shutdown sequence, interrupting the thread and joining it.

I agree. There are a lot of places in this plugin that need improvement; in server mode, for example, the queue for each client is also unbounded.

In the end, though, this PR makes the plugin significantly better than it was before and I would rather deliver these improvements immediately and chase down the further fixes separately.


It's not clear to me why we need to sleep 10ms if the payload contains data

You are correct; both IO#write and IO#syswrite calls should block until the underlying IO is writable, and so the backoff sleep isn't strictly necessary. I'll back that out.
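For illustration, a partial-write-tolerant loop of the kind described needs no extra sleep precisely because of that blocking behaviour (a sketch with a made-up helper name, not the plugin's exact code):

# keep writing until the whole payload has been accepted; syswrite may write
# only part of the buffer, but it blocks until it can write something
def write_fully(socket, payload)
  until payload.empty?
    written = socket.syswrite(payload)
    payload = payload.byteslice(written..-1)
  end
end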


I'll then merge this PR, forward-port the relevant bits to the 6.1 series, and then file separate work to chase down these other issues (including, hopefully, making 6.1 compatible with Logstash 7.x again so that we don't have to maintain two significantly-diverged branches).


andsel commented Nov 4, 2022

I'm OK to merge this PR as it is. A lot of work has been done!

My last comment was intended to open a discussion on what I saw while reviewing it, but it's not related to this specific PR.

@yaauie yaauie merged commit 379d6ca into logstash-plugins:6.0.x Nov 4, 2022
@yaauie yaauie deleted the backport-tcp-window-support-to-6-0-x branch November 4, 2022 17:23
@yaauie yaauie mentioned this pull request Nov 4, 2022