-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Add full compatibility to Fiber.scheduler of Ruby-3.0 #397
Conversation
lib/pg/connection.rb
Outdated
if Fiber.respond_to?(:scheduler) && Fiber.scheduler | ||
# If a scheduler is set use it directly. | ||
# This is necessary since IO.select isn't passed to the scheduler. | ||
events = Fiber.scheduler.io_wait(socket_io, IO::READABLE | IO::WRITABLE, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use socket_io.wait
here? You shouldn't need to multiplex (i.e. Fiber.respond_to?(:scheduler)
) here if you use the general method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried IO#wait, but although the documentation says that it returns the event mask, it returns the IO object instead. Since I have to differentiate between writable and readable, this interface is bad. Also I noticed some differences between Ruby versions that I didn't investigated deeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think it should return the event mask, I’ll need to check.
end | ||
|
||
module Helpers | ||
class Scheduler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While copying the implementation is okay, I'd suggest just using a gem like async
here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like neither of them. Copying is bad, since bugs like this are distributed and fixed several times in different ways. And on the other hand Async is a big gem with a whole ecosystem and ruby-2.x compatibility, although I just need a simple scheduler of around 250 lines that I'm able to understand in case of failures.
IMHO it's a faulty design to add a scheduler callback API without adding a simple reference implementation to ruby. Don't get me wrong - it's a great design to have a simple callback API which allows varying implemenations and Async is a great production ready foundation. But there should be a simple scheduler class in the stdlib of ruby. I added my thought to: https://bugs.ruby-lang.org/issues/18004#note-4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matz wanted to incorporate Async into Ruby as a default gem but I said no. I want everyone to have equal opportunity to implement the scheduler. Maybe it was a bad choice but I want to encourage other engineers (including yourself) to create their own schedulers.
We could try to make a reference design but I think it would create soft-engineering challenges, including encouraging people to use it in a way it would not be suitable, also ossifying a particular implementation into Ruby core.
# ---------------------------------------- | ||
|
||
module Helpers | ||
class TcpGateScheduler < Scheduler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an interesting idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it works pretty well. It showed for instance that IO#readpartial
on Windows doesn't trigger the scheduler here in this failing test case. It's a cool feature that the scheduler callback interface allows "abuse" like this.
In fact, not having a reliable test method that shows me potential blocking issues in a timing insensitive way, was a blocker for me to implement this PR. How do you test for blocking issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you cannot know for certain that some code does or does not block. But we do have one "metric" for blocking, rb_nogvl
. But this isn't reliable, even if it can be useful.
If we are comfortable with assuming "blocking" = "release GVL", then this will catch those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is some room for improvement but it looks good so far.
One high level design issue for your consideration. The goal of async is to make asynchronous usage transparent to the user. i.e. there should not need to be a toggle or flag to control whether you use async
or sync
implementations. Essentially, when the fiber scheduler is not active, the operations naturally block, but when running in an asynchronous context, the operations become non-blocking. I see that you have some kind of flag to control this as well as separate implementations of the methods.
I do understand there can be performance trade-offs, but I believe it would be better if you have a single design which can handle both ways. It is almost certainly a bug to use the asynchronous interface in a non-scheduler context, and equally it's almost certainly a bug to use the synchronous interface in an asynchronous context. Even in that case, you want to avoid calling into a FFI which performs blocking operations without releasing the GVL - so even in synchronous operations, you are better off using things like IO#wait
before invoking the FFI since it will allow the Ruby VM to do a better job of scheduling threads.
@ioquatix Thank you for reviewing my PR!
I think this is misunderstood. There is only one implementation that is intended to be active - the fiber compatible one. This flag is only for debugging and comparison (performance and semantics) of nonblocking vs. blocking implementations. It is not intended for any production use (that's why it is a big global switch). I'll add some documentation to make that clear. This design of async/sync method pairs is well proved and tested, since it was first released 3 years ago. And it helped to track down some issues with the async implementation and get it in line with the sync behavior. This PR completes the work for all potentially blocking methods beyond just exec and co. |
spec/helpers/tcp_gate_scheduler.rb
Outdated
@@ -107,15 +105,19 @@ def write(amount=5) | |||
read_str = @internal_io.read_nonblock(len) | |||
print_data("write fd:#{@internal_io.fileno}->#{@external_io.fileno}", read_str) | |||
@external_io.write(read_str) | |||
if until_writeable | |||
res = IO.select(nil, [until_writeable], nil, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IO.select
is not compatible with the fiber scheduler FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but the timeout is zero here, so that it's always nonblocking. The mechanism is so, that in the write direction only as much data is transferred as necessary to make the observed IO writable again. This should trigger any blocking issues while writing at best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just use until_writeable.wait_writable(0)
?
ea99287
to
c634707
Compare
055f158
to
ca1bc72
Compare
lib/pg/connection.rb
Outdated
# Use pure Ruby address resolver to avoid blocking of the scheduler. | ||
# `IPSocket.getaddress` isn't fiber aware before ruby-3.1. | ||
require "resolv" | ||
Resolv.getaddress(host) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting for compatibility.
c07b753
to
ef65121
Compare
908c8e3
to
f7cc2f1
Compare
As happend on Windows: 1) with a Fiber scheduler can send lots of data per put_copy_data Failure/Error: super ThreadError: deadlock; recursive locking ./spec/helpers/tcp_gate_scheduler.rb:209:in `write' ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts' ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts' ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts' ./spec/helpers/tcp_gate_scheduler.rb:213:in `io_wait' ./spec/helpers/tcp_gate_scheduler.rb:209:in `write' ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts' ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts' ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts' ./spec/helpers/tcp_gate_scheduler.rb:213:in `io_wait' ./lib/pg/connection.rb:405:in `wait_for_flush' ./lib/pg/connection.rb:405:in `async_put_copy_data' ./spec/pg/scheduler_spec.rb:185:in `block (5 levels) in <top (required)>' ./spec/pg/scheduler_spec.rb:184:in `times' ./spec/pg/scheduler_spec.rb:184:in `block (4 levels) in <top (required)>' ./lib/pg/connection.rb:242:in `copy_data' ./spec/pg/scheduler_spec.rb:181:in `block (3 levels) in <top (required)>' ./spec/pg/scheduler_spec.rb:52:in `block (2 levels) in run_with_scheduler'
... and respect them after accept. It can happen, that a connection is established and data is sent, before the TcpGateScheduler accepts the internal_io. In this case the events got lost and data wasn't transferred.
It is questionable how valueable Fiber.scheduler is on Windows, given that most IO doesn't go through the scheduler. But this way we satisfy our test suite at least.
This is a workaround for truffleruby < 21.3.0. The proposed upstream fix is here: oracle/truffleruby#2444 Although it works with any of the Socket classes, revert to BasicSocket, since this is the common base class of TCPSocket and UNIXSocket.
It retrieves the passowrd algorithm in a scheduler compatible way, if it isn't passed as the third parameter.
Option --disable-gvl-unlock is more understandable. Benchmarks didn't show a measurable difference between unlocking enabled/disabled, so keeping it enabled is safer. Should there still be any blocking function calls, GVL unlocking allows to run ruby threads in parallel.
73042e7
to
6b4cd3f
Compare
It happens on Windows sometimes, when writing to stdout per puts. In this case `io` isn't a socket and treating it as such results in EBADF: Errno::EBADF: Bad file descriptor - not a socket file descriptor ./spec/helpers/tcp_gate_scheduler.rb:223:in `for_fd' ./spec/helpers/tcp_gate_scheduler.rb:223:in `io_wait' ./spec/helpers/tcp_gate_scheduler.rb:214:in `write' ./spec/helpers/tcp_gate_scheduler.rb:214:in `puts'
The resolv library resolves differently than the system resolver. In general it prefers IPv4 over IPv6. This leads to the situation, that the server socket of the TcpGateScheduler is bound to IPv6, but PG.connect tries to use IPv4 instead. This happened on Windows-10 and Windows-Server 2016 with Ruby-3.0 and with no entries in the /etc/hosts file. I tried to align the results of resolv and system library, but didn't find out how the system resolver decides between IPv4 and IPv6. So the workaround is to use a second thread and use the system resolver on Ruby-3.0.
Truffleruby-21.1.0 currently fails on Github Actions like here: https://github.com/larskanis/ruby-pg/runs/3766520041?check_suite_focus=true However it works with the same version on my local laptop, on travis-ci and with the current truffleruby-head version on Github. Since it looks like some issue that's already fixed, this commit allows truffleruby to fail on github for now.
Wow what an awesome improvement! I learned a ton about Ruby 3 just from reading this PR. We are now way past due for a release; should I put one together this weekend? |
This is amazing well done! |
end | ||
oopts[:hostaddr] = hostaddrs.join(",") if hostaddrs.any? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is moving hostname lookups out of libpq which the docs say "may cause PQconnectPoll
to block for a significant amount of time."
If so, it might be good to document that reset() does no lookups when it reconnects.
iopts = URI.decode_www_form(uri_match['query']).to_h.transform_keys(&:to_sym) | ||
end | ||
# extract "host1,host2" from "host1:5432,host2:5432" | ||
iopts[:host] = uri_match['hostports'].split(",").map { |hp| hp.split(":", 2)[0] }.join(",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this do the right thing when the URI has an IPv6 address that contains :
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for checking this! You're right, just splitting on ":" was too opportunistic. #402 should fix this.
PQflush() changes behaviour depending on PQsetnonblocking(), so we should change it accordingly. This removes the need for the private method wait_for_flush, which did essentially the same as PQflush in blocking mode. This is a leftover of #397.
Okay, sounds good. |
rb_warning( "Failed to set the default_internal encoding to %s: '%s'", | ||
encname, PQerrorMessage(conn) ); | ||
pgconn_set_internal_encoding_index( self ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larskanis Sorry for resurrecting this ancient PR. But was removing pgconn_set_internal_encoding_index
here intentional?
So this has the side effect of internal_encoding
not being set correctly in the case the set_client_encoding
call fails with an error for any reason.
This results in UTF-8 strings being interpreted as SQL_ASCII e.g
™™™™™™
becomes
\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2
This not the case in version 1.2.3
The aim of this PR is to make PG::Connection fully Fiber.scheduler compatible. That means that all methods that possibly could block, block through the scheduler
io_wait
callback. Internally we switched to async-methods in pg-1.1 for allexec
methods. This was a good preparation and half of the work for this PR.This PR follows the strategy of pg-1.1 and adds more
async_*
methods, renames the blocking methods tosync_*
and usesPG.async_api
to switch between them.async_*
versions will be the default in the future.Compatibility to ruby < 3.0 is not expected for Fiber.scheduler or Async gem compatibility (but for the rest of the gem). Planned to be merged to pg-1.3.
IO#readpartial
doesn't call the scheduler -> useread_nonblock
instead,PG::Connection#block
implemented in ruby)async_send
methods (async_send_query_prepared
etc.)async_exec
methods (async_prepare
etc.)async_put_copy_data
,async_put_copy_end
async_get_copy_data
async_get_result
,async_get_last_result
discard_results
async_reset
reset the backend connectionasync_set_client_encoding
async_cancel
PQencryptPasswordConn
can block if no algorithm is givenPQping
- impossible to implement with nonblocking IO -> uses thread and releases GVLPQconnectStartParams
can block if optionhostaddr
isn't set--disable-gvl-unlock
Fixes #342
/cc @ioquatix