
1.3.x no longer honors connect_timeout when server is unresponsive #450

Closed
drdrsh opened this issue Apr 7, 2022 · 5 comments · Fixed by #459

Comments

drdrsh (Contributor) commented Apr 7, 2022

Hello,

We encountered an issue where version 1.3.4 blocks for almost exactly 130 seconds before timing out when the server is unresponsive. This behavior was not present in 1.2.3.

I created a docker-compose stack that demonstrates the issue; here is some output first, before the code.

For a connection string that looks like this
postgres://main_user:@pg:5432/main_db?connect_timeout=5

We get

# 1.2.3
app_1_2 | Encountered #<PG::ConnectionBad: timeout expired
app_1_2 | > after 5.010162585997023 seconds
# 1.3.4
app_1_3-1  | Encountered #<PG::ConnectionBad: could not connect to server: Connection timed out
app_1_3-1  |  Is the server running on host "192.168.240.2" and accepting
app_1_3-1  |  TCP/IP connections on port 5432?
app_1_3-1  | > after 131.28304585101432 seconds

All this time is spent waiting on this line

How to reproduce

You will need Docker for this.

  1. Create test.rb and docker-compose.yml
# test.rb

require 'pg'

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
  PG::connect("postgres://main_user:@pg:5432/main_db?connect_timeout=5").close
rescue StandardError => e
  total_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  puts "Encountered #{e.inspect} after #{total_time} seconds"
end
# docker-compose.yml
version: '3'
services:
  pg:
    image: postgres:14-bullseye
    cap_add:
      - NET_ADMIN
    environment:
      POSTGRES_HOST_AUTH_METHOD: trust
      POSTGRES_DB: main_db
      POSTGRES_USER: main_user

  app_1_2:
    image: ruby
    restart: always
    volumes:
      - ./test.rb:/home/test.rb
    links:
      - pg
    command: bash -c "gem install pg --version=1.2.3 && ruby /home/test.rb"

  app_1_3:
    image: ruby
    restart: always
    volumes:
      - ./test.rb:/home/test.rb
    links:
      - pg
    command: bash -c "gem install pg && ruby /home/test.rb"
  2. In one terminal run docker-compose up

  3. In another terminal run docker-compose exec --user root pg bash -c "apt-get update && apt-get install iptables -y"

  4. Then run docker-compose exec --privileged --user root pg iptables -A INPUT -p tcp --destination-port 5432 -j DROP to render the server unresponsive.

1.2.3 times out after 5 seconds, honoring the connect_timeout parameter
1.3.4 times out after 130 seconds

drdrsh (Contributor) commented Apr 8, 2022

The very specific 130-second wait is likely the time it takes the OS to run through the default six tcp_syn_retries.
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
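The ~130 seconds line up with the kernel's exponential SYN-retransmission backoff. A quick sketch of the arithmetic (assuming Linux's initial retransmission timeout of roughly 1 second and the default net.ipv4.tcp_syn_retries = 6):

```ruby
# The kernel sends the initial SYN, then retransmits it 6 times, doubling
# the wait each time before giving up with ETIMEDOUT:
#   1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds,
# which matches the ~130 seconds observed above.
retries = 6
timeouts = (0..retries).map { |n| 2**n }  # => [1, 2, 4, 8, 16, 32, 64]
puts "SYN timeouts: #{timeouts.inspect}, total ~ #{timeouts.sum}s"
```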

p64 commented Apr 13, 2022

I'm getting the same behavior. Oddly though, on a local net where ICMP HOSTUNREACH comes back, it times out in a more timely fashion. It's still not right, but it's definitely more tolerable, for completely different reasons. I also tried the host OS binary, and it does in fact respect connect_timeout.

Also, this unit test doesn't seem to check the timeout properly:

    it "honors the connect_timeout connection parameter", :postgresql_93 do
      conn = PG.connect( port: @port, dbname: 'test', connect_timeout: 11 )
      begin
        expect( conn.conninfo_hash[:connect_timeout] ).to eq( "11" )
      ensure
        conn.finish
      end
    end

larskanis (Collaborator) commented

I missed the connect_timeout when implementing the new Fiber.scheduler-compatible PG.connect. It should work, however, if you change the calls of wait_readable and wait_writable to wait_readable(1) and wait_writable(1) here. This way the timeout ought to be checked once per second, which is the granularity of connect_timeout.

Unfortunately I'm currently away from home, so this issue has to wait another week until I can fix it.
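The suggested fix amounts to waiting at most one second per call and re-checking an overall deadline on each iteration. A minimal sketch of that loop in plain Ruby, with an IO.pipe standing in for the connection socket (the helper name wait_with_deadline is made up for illustration):

```ruby
require 'io/wait'

# Instead of blocking indefinitely in wait_readable, wait at most 1 second
# per call and re-check the deadline, mirroring connect_timeout's
# one-second granularity.
def wait_with_deadline(io, connect_timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + connect_timeout
  loop do
    # wait_readable(1) returns the IO when readable, nil on timeout
    return :ready if io.wait_readable(1)
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise "timeout expired" if remaining <= 0
  end
end

r, w = IO.pipe          # r never becomes readable; stands in for the socket
begin
  wait_with_deadline(r, 2)
rescue RuntimeError => e
  puts "Encountered #{e.message}"
ensure
  r.close; w.close
end
```

Because wait_readable wakes up every second, a Fiber.scheduler can still interleave other work while the overall connect_timeout is enforced.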

drdrsh (Contributor) commented Apr 15, 2022

Thanks @larskanis!

larskanis (Collaborator) commented

There is a problem: my assumption in the comment above was that the timeout handling is done in PQconnectPoll() / PG::Connection.connect_poll. Unfortunately this is not true, so the idea above doesn't work. Instead, the timeout is calculated and enforced in the blocking libpq functions around PQconnectPoll().

So we have to do the same in ruby-pg to make it Fiber.scheduler-compatible in pg-1.3.x. But there is a problem doing so: if multiple hosts are specified in the connection string, the timeout is applied separately to each of the hosts. And in case of a timeout, libpq switches hosts by controlling the state machine of PQconnectPoll() through internal struct members. Since they are internal, these struct members aren't available to ruby-pg. So we can apply the timeout only in the single-host case. If multiple hosts are specified, only the first host is tried, and so far there seems to be no way to switch to the next one.

I have to talk with the PostgreSQL people to resolve this issue.

larskanis added a commit to larskanis/ruby-pg that referenced this issue Oct 8, 2022
.. in favor of passing all hosts to libpq at once, and instead adjust connect_timeout handling roughly to how libpq handles it.

The problem is that libpq aborts connecting to multiple hosts if there's an authentication failure. But if pg imitates this behaviour, the libpq API doesn't give an exact indication of whether the connection aborted due to an authentication error or due to some other error, which would continue the host iteration. So we can not distinguish between an authentication error and other types of errors, other than by the error message. And there's a further problem: the error message is locale-dependent, and when both client and server are running on Windows, the error message is often not delivered correctly, which is a known long-standing PostgreSQL issue.

This commit therefore changes the execution back to how multiple hosts were handled in pg-1.3.x, but with two fixes:
1. Multiple IP addresses for one hostname are handled correctly (fixes ged#452),
2. and connect_timeout is handled roughly like libpq does (fixes ged#450).

It's only roughly, since the timeout is not strictly per host, but per single socket event, with a total timeout multiplied by the number of hosts. Exact handling of connect_timeout like libpq is only possible if we connect host-by-host.
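The timeout rule the commit message describes can be sketched as a total deadline of connect_timeout multiplied by the number of hosts (the helper name overall_connect_deadline is made up for illustration; the clamping of small values to 2 seconds follows libpq's documented minimum for connect_timeout):

```ruby
# Rather than a strict per-host timeout, the fix applies connect_timeout
# per socket event, bounded by an overall deadline of
# connect_timeout * number_of_hosts.
def overall_connect_deadline(connect_timeout, hosts)
  # libpq treats connect_timeout values below 2 seconds as 2 seconds
  timeout = [connect_timeout, 2].max
  Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout * hosts.size
end

hosts = %w[db1.example.com db2.example.com]   # hypothetical hosts
deadline = overall_connect_deadline(5, hosts)
remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
puts "total budget ~ #{remaining.round}s for #{hosts.size} hosts"
```

With connect_timeout=5 and two hosts, the total budget is about 10 seconds, which is looser than libpq's strict 5 seconds per host but bounded, unlike the 130-second hang reported above.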
3 participants