1.3.x no longer honors connect_timeout when server is unresponsive #450

drdrsh · 2022-04-07T02:27:31Z

Hello,

We encountered an issue where version 1.3.4 would block almost exactly 130 seconds before timing out if the server is unresponsive. This behavior was not present in 1.2.3.

I created a docker-compose stack that demonstrates this issue but here is some output first before sharing the code

For a connection string that looks like this
postgres://main_user:@pg:5432/main_db?connect_timeout=5

We get

# 1.2.3
app_1_2 | Encountered #<PG::ConnectionBad: timeout expired
app_1_2 | > after 5.010162585997023 seconds
# 1.3.4
app_1_3-1  | Encountered #<PG::ConnectionBad: could not connect to server: Connection timed out
app_1_3-1  |  Is the server running on host "192.168.240.2" and accepting
app_1_3-1  |  TCP/IP connections on port 5432?
app_1_3-1  | > after 131.28304585101432 seconds

All this time is spent waiting on this line

How to reproduce

You will need docker for that

Create test.rb and docker-compose.yml

# test.rb

require 'pg'

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
  PG::connect("postgres://main_user:@pg:5432/main_db?connect_timeout=5").close
rescue StandardError => e
  total_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  puts "Encountered #{e.inspect} after #{total_time} seconds"
end

# docker-compose.yml
version: '3'
services:
  pg:
    image: postgres:14-bullseye
    cap_add:
      - NET_ADMIN
    environment:
      POSTGRES_HOST_AUTH_METHOD: trust
      POSTGRES_DB: main_db
      POSTGRES_USER: main_user

  app_1_2:
    image: ruby
    restart: always
    volumes:
      - ./test.rb:/home/test.rb
    links:
      - pg
    command: bash -c "gem install pg --version=1.2.3 && ruby /home/test.rb"

  app_1_3:
    image: ruby
    restart: always
    volumes:
      - ./test.rb:/home/test.rb
    links:
      - pg
    command: bash -c "gem install pg && ruby /home/test.rb"

In one terminal run docker-compose up
In another terminal run docker-compose exec --user root pg bash -c "apt-get update && apt-get install iptables -y"
Then run docker-compose exec --privileged --user root pg iptables -A INPUT -p tcp --destination-port 5432 -j DROP to render the server unresponsive.

1.2.3 times out after 5 seconds, honoring the connect_timeout parameter
1.3.4 times out after 130 seconds

The text was updated successfully, but these errors were encountered:

drdrsh · 2022-04-08T12:53:58Z

The very specific 130-second wait is likely the time that takes the OS to run through the default six tcp_syn_retries.
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

p64 · 2022-04-13T23:40:22Z

I'm getting the same behavior. Oddly tho on a local net where ICMP HOSTUNREACH comes back, it times out more "timely". It's still not right, but it's definitely more tolerable for complete different reasons. I also tried the OS host binary to see and it does in fact respect connect_timeout.

also this unit test doesn't seem like it checks the timeout properly.

    it "honors the connect_timeout connection parameter", :postgresql_93 do
            conn = PG.connect( port: @port, dbname: 'test', connect_timeout: 11 )
            begin
                    expect( conn.conninfo_hash[:connect_timeout] ).to eq( "11" )
            ensure
                    conn.finish
            end
    end

larskanis · 2022-04-14T15:12:24Z

I missed the connect_timeout when implementing the new Fiber.scheduler compatible PG.connect . It should work however, if you change the calls of wait_readable and wait_writable to wait_readable(1) and wait_writable(1) here. This way the timeout ought to be checked once per second, which is the granularity of connect_timeout.

Unfortunately I'm correctly away from home, so that this issue has to wait another week until I can fix it.

drdrsh · 2022-04-15T02:01:21Z

Thanks @larskanis!

larskanis · 2022-05-05T20:21:49Z

There is a problem: My assumption in the comment above was that the timeout handling is done in PQconnectPoll() / PG::Connection.connect_poll. Unfortunately this is not true, so that the idea above doesn't work. Instead the timeout is calculated and enforced in the blocking libpq function around PQconnectPoll().

So we have to do the same in ruby-pg to make it Fiber.scheduler compatible in pg-1.3.x. But there is a problem doing so: If multiple hosts are specified in the connection string, the timeout is applied separately for each of the hosts. And in case of a timeout, libpq switches the hosts by controlling the state machine of PQconnectPoll() through internal struct members. Since they are internal, these struct members aren't available for ruby-pg. So we can apply the timeout only in the case of a single host. If multiple hosts are specified, only the first host is tried and so far there seems to be no way to switch to the next one.

I have to talk with the PostgreSQL people to resolve this issue.

.. in favor of passing all hosts to libpq at once and instead adjust connect_timeout handling roughtly to how libpq handles it. The problem is that libpg aborts connecting to multiple hosts, if there's a authentication failure. But if pg imitates this behaviour, the libpq API doesn't give an exact indication, whether the connection aborted due to an authentication error or due to some other error, which continues the host iteration. So we can not distinguish between an authentication error and other types of errors, other then by the error message. But there's the next problem, that the error message is locale dependent and that when both client and server are running on Windows, the error message is often not correctly delivered, which is a known long standing PostgreSQL issue. This commit therefore changes the execution back to how multiple hosts were handled similar to pg-1.3.x, but with two fixes: 1. Multiple IP addresses to one hostname are handled correctly, (fixes ged#452) 2. and connect_timeout is handled roughly like libpq. (fixes ged#450) It's only roughly, since the timeout is not strictly per host, but per single socket event, but with a total timeout multiplied with the number-of-hosts. Exact handling of connect_timeout like libpq is only possible if we connect host-by-host.

larskanis mentioned this issue Jun 8, 2022

Handle connection timeout and multiple addresses per host #459

Merged

larskanis closed this as completed in #459 Jun 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.3.x no longer honors connect_timeout when server is unresponsive #450

1.3.x no longer honors connect_timeout when server is unresponsive #450

drdrsh commented Apr 7, 2022 •

edited

Loading

drdrsh commented Apr 8, 2022 •

edited

Loading

p64 commented Apr 13, 2022 •

edited

Loading

larskanis commented Apr 14, 2022

drdrsh commented Apr 15, 2022

larskanis commented May 5, 2022

1.3.x no longer honors connect_timeout when server is unresponsive #450

1.3.x no longer honors connect_timeout when server is unresponsive #450

Comments

drdrsh commented Apr 7, 2022 • edited Loading

How to reproduce

drdrsh commented Apr 8, 2022 • edited Loading

p64 commented Apr 13, 2022 • edited Loading

larskanis commented Apr 14, 2022

drdrsh commented Apr 15, 2022

larskanis commented May 5, 2022

drdrsh commented Apr 7, 2022 •

edited

Loading

drdrsh commented Apr 8, 2022 •

edited

Loading

p64 commented Apr 13, 2022 •

edited

Loading