Socket removed after reload, part II #1842

nateberkopec · 2019-07-11T20:05:55Z

Continuing the discussion from #1775, it seems like the fix in 4.0.1 may not have done the trick for everyone.

I made a mistake here by not forking the discussion into two parts, because some people were reporting problems on 3.12 and some were not. We may have two separate issues here. I'd like to reset the discussion to here with what people are experiencing on 4.0.1.

nateberkopec · 2019-07-11T20:07:31Z

@mcg and anyone else experiencing this on 4.0.1/4.0.0 - please drop a log in the discussion here so we can get a better idea of what we're dealing with.

mcg · 2019-07-11T20:16:07Z

What logs do you need? Using 4.0.1 and having Puma restart the socket is still missing.

calmyournerves · 2019-07-11T20:18:43Z

I just did a quick test on our staging server and it looks like 4.0.1 fixed the issue for us. To clarify: For us, it worked on 3.12, it didn't work on 4.0.0, and it works again on 4.0.1.

Will do some more tests tomorrow and report back if we see any issues.

Edit: We're using Ruby 2.6, if that could be of relevance.

mcg · 2019-07-11T20:36:56Z

To clarify our setup, we are using Puma socket activation as described(https://github.com/puma/puma/blob/master/docs/systemd.md#socket-activation) and issuing a hot restart as:

# A "hot" restart, with systemd keeping puma.socket listening and
# providing to the new puma (master) instance.
systemctl restart puma.service

This worked with 3.12, didn't with 4.0.0 and 4.0.1. Socket is gone with the restart. Ruby is 2.6.3.

Fudoshiki · 2019-07-12T03:11:47Z

Can you provide puma.service config

systemctl restart puma.service works fine with 4.0.1 with my config

ruby 2.6.3
rails 6.1
puma 4

config/puma.rb

# frozen_string_literal: true

# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
max_threads_count = ENV.fetch('RAILS_MAX_THREADS') { 5 }
min_threads_count = ENV.fetch('RAILS_MIN_THREADS') { max_threads_count }
threads min_threads_count, max_threads_count

# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch('RAILS_ENV') { 'development' }
# early_hints true

# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch('WEB_CONCURRENCY') { 2 }

# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!

# Allow workers to reload bundler context when master process is issued
# a USR1 signal. This allows proper reloading of gems while the master
# is preserved across a phased-restart. (incompatible with preload_app)
# (off by default)
#
prune_bundler

# Bind unix domain socket
bind ENV.fetch('PUMA_SOCK') { 'unix://tmp/sockets/puma.sock' }

# Use `path` as the file to store the server info state. This is
# used by `pumactl` to query and control the server.
#
state_path ENV.fetch('SERVER_STATE') { 'tmp/pids/puma.state' }

# === Puma control rack application ===

# Start the puma control rack application on `url`. This application can
# be communicated with to control the main server. Additionally, you can
# provide an authentication token, so all requests to the control server
# will need to include that token as a query parameter. This allows for
# simple authentication.
#
# Check out https://github.com/puma/puma/blob/master/lib/puma/app/status.rb
# to see what the app has available.
#
activate_control_app ENV.fetch('PUMACTL_SOCK') { 'unix://tmp/sockets/pumactl.sock' }

# Store the pid of the server in the file at `path`.
#
pidfile 'tmp/pids/puma.pid'

# Change the default timeout of workers
#
worker_timeout 10

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart

adamcreekroad · 2019-07-12T18:05:40Z

With the latest 4.0.1 I still get the missing socket file error

Errno::ENOENT: No such file or directory @ apply2files - /home/deploy/sockets/puma.app-staging.sock
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:222:in `unlink'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:222:in `block in close_binder_listeners'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:218:in `each'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:218:in `close_binder_listeners'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:414:in `block in setup_signals'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:62:in `sleep'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:62:in `block in stop_workers'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:50:in `loop'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:50:in `stop_workers'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:546:in `run'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:186:in `run'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cli.rb:80:in `run'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/bin/puma:10:in `<top (required)>'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/bin/puma:23:in `load'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/bin/puma:23:in `<top (required)>'

But the systemd service reports all is well

● puma.app-staging.service - puma (app staging)
   Loaded: loaded (/etc/systemd/system/puma.app-staging.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-07-12 17:58:20 UTC; 2min 35s ago
  Process: 12819 ExecStop=/home/deploy/.rbenv/shims/bundle exec pumactl -S /home/deploy/pids/puma.app-staging.state stop (code=exited, status=0/SUCCESS)
 Main PID: 12833 (bundle)
    Tasks: 54 (limit: 4915)
   CGroup: /system.slice/puma.app-staging.service
           ├─12833 puma 4.0.1 (unix:///home/deploy/sockets/puma.app-staging.sock) [app-staging]
           ├─12889 puma: cluster worker 0: 12833 [app-staging]
           └─12905 puma: cluster worker 1: 12833 [app-staging]

Jul 12 17:58:20 ip-172-30-1-149 systemd[1]: Started puma (app staging).
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] Puma starting in cluster mode...
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Version 4.0.1 (ruby 2.6.3-p62), codename: 4 Fast 4 Furious
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Min threads: 16, max threads: 16
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Environment: staging
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Process workers: 2
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Preloading application
Jul 12 17:58:27 ip-172-30-1-149 puma[12833]: [12833] * Activated unix:///home/deploy/sockets/puma.app-staging.sock
Jul 12 17:58:27 ip-172-30-1-149 puma[12833]: [12833] Use Ctrl-C to stop

mcg · 2019-07-12T19:08:36Z

@Fudoshiki noticed you're using prune_bundler using preload_app! here.

puma.rb:

directory '/var/app/current'
environment 'staging'

pidfile '/var/app/shared/tmp/pids/puma.pid'
state_path '/var/app/shared/tmp/pids/puma.state'
stdout_redirect '/var/app/shared/log/puma_access.log', '/var/app/shared/log/puma_error.log', true

workers 2
threads 1, 2

bind 'unix:///var/app/shared/tmp/sockets/puma.sock'
activate_control_app 'tcp://127.0.0.1:9293', { auth_token: '...' }

preload_app!

on_restart do
  puts 'Refreshing Gemfile'
  ENV["BUNDLE_GEMFILE"] = '/var/app/current/Gemfile'
end

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
end

on_worker_boot do
  SemanticLogger.reopen if defined?(SemanticLogger)

  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end
end

puma.service:

[Unit]
Description=Puma HTTP Server
After=network.target

Requires=puma.socket

[Service]
Type=simple

User=appuser
Group=appuser

WorkingDirectory=/var/app/current

Environment=PUMA_DEBUG=1

ExecStart=/var/app/rvm1scripts/rvm-auto.sh . bundle exec --keep-file-descriptors puma -C /var/app/shared/config/puma.rb

Restart=always

TimeoutStopSec=180

[Install]
WantedBy=multi-user.target

puma.socket:

[Unit]
Description=Puma HTTP Server Accept Sockets

[Socket]
SocketUser=appuser
SocketGroup=appuser
ListenStream=/var/app/shared/tmp/sockets/puma.sock

NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target

puma.service logs on a restart:

Jul 12 19:00:54: % Registered unix:/var/app/shared/tmp/sockets/puma.sock for activation from LISTEN_FDS
Jul 12 19:00:54: [1424] Puma starting in cluster mode...
Jul 12 19:00:54: [1424] * Version 4.0.1 (ruby 2.6.3-p62), codename: 4 Fast 4 Furious
Jul 12 19:00:54: [1424] * Min threads: 1, max threads: 2
Jul 12 19:00:54: [1424] * Environment: staging
Jul 12 19:00:54: [1424] * Process workers: 2
Jul 12 19:00:54: [1424] * Preloading application
Jul 12 19:01:05: [1424] * Activated unix:///var/app/shared/tmp/sockets/puma.sock
Jul 12 19:01:05: [1424] Use Ctrl-C to stop

nateberkopec · 2019-07-18T01:25:05Z

Our tests are sort of screwed up now because we never unlink any created unix socket FDs.

I feel like we probably should have reverted #1685 rather than merged #1829.

See for example https://travis-ci.org/puma/puma/jobs/557439675#L490

Fudoshiki · 2019-07-18T05:28:11Z

Maybe revert both (#1685 and #1829) and rewrite #1685

nateberkopec · 2019-07-18T11:55:43Z

#1851

MSP-Greg · 2019-07-18T18:55:52Z

Nate and others,

Re 1851, I think there are four general issues to be dealt with, which may require additional PR's:

A. fixing the issue here
B. getting tests to show breakage re this issue
C. making sure Puma is unlinking files when it should
D. getting tests to be accurate, clean, repeatable, immune to 'leakage' from test to test, and removing all test artifacts.

By accurate and clean, right now some tests are passing with IO errors, but the tests don't reflect that. Similar to what you said in issue #1841.

If PR #1851 actually fixes A, working on the remaining three becomes easier.

So, if you think the changes to binder.rb may resolve the issue, is it time to plead with some people to test this issue with the PR branch?

rromanchuk · 2019-07-19T01:37:18Z

Looks like you guys are on it, but just another report...Just discovered this today, using a pretty vanilla setup with simple type systemd, it's a little bit subtle because it does eventually recover when systemd starts it back up again. I only noticed a bunch request failing and then recovering soon after. This type of error is the most common user config error so i just assumed i regressed something in my deploy. Let me know if theres a branch you guys want tested and I'll throw some traffic on it.

[Unit]
Description=Puma HTTP Server
After=network.target

[Service]
Type=simple

User=ubuntu

Environment=STATSD_HOST=127.0.0.1

WorkingDirectory={{deploy_directory}}/current
ExecStart=/home/ubuntu/.rbenv/bin/rbenv exec bundle exec puma -C {{deploy_directory}}/shared/config/puma/production.rb

SyslogIdentifier=puma
StandardOutput=syslog
StandardError=syslog

Restart=always

[Install]
WantedBy=multi-user.target

#! /usr/bin/env puma

{% if inventory_hostname in groups['tag_Role_web_worker']|default([]) %}
workers 3
{% else %}
workers 4
{% endif %}


threads_count = Integer({{max_puma_thread}})
threads threads_count, threads_count

plugin :statsd

bind 'unix://{{deploy_directory}}/shared/tmp/sockets/puma_{{app_name}}.sock'
environment 'production'

directory '{{deploy_directory}}/current'

rackup      DefaultRackup

preload_app!

# [.....trimmed....]

MSP-Greg · 2019-07-19T04:05:19Z

@rromanchuk

The time stamps seem to indicate so, but just to clarify, your post here #1851 (comment) shows we've now got the above example working correctly? Thanks again.

nateberkopec · 2019-08-06T18:22:51Z

Linking #1886

nateberkopec · 2019-08-16T09:51:33Z

So, to summarize:

#1775 reported that, during phased restart w/Unix sockets, Puma blew up with ENOENT on this line. The puma.sock socket was also removed (not just closed). A minimal reproduction case was provided by @Fudoshiki: pumactl -S tmp/pids/puma.state -F config/puma.rb phased-restart removes puma.sock.

#1685 was pointed to as the cause.

We merged #1829 as a fix, but that was raising errors in the tests and now some people are getting a new ENOENT on this line.

It feels like Binder#close is overloaded with responsibility, and is being called in different scenarios which it isn't intended for. That's why the fix at #1829 "fixed" the issue for one use case but broke it (or didn't fix it) for others.

So, we need to a) close sockets during restarts and not accept new requests but also b) not remove the socket file during a restart.

Options:

Merge fix Binder#close #1886.
Merge fix Binder#close via rescue addition #1899.
Revert Do not accept new requests on shutdown #1685 and fix #1775 #1829, and rewrite Do not accept new requests on shutdown #1685. This is currently my favored option.
Something else?

rromanchuk · 2019-08-22T15:47:23Z

Ok I just came back to this and and can confirm this is still happening on * Version 4.1.0 (ruby 2.6.3-p62), codename: Fourth and One, with active traffic from the load balancer sudo systemctl restart puma begins dropping connections.

When using 3.12.1, with the same exact configuration, no connections are dropped.

puma.service.j2

[Unit]
Description=Puma HTTP Server
After=network.target
Requires=puma.socket

[Service]
Type=simple

User=ubuntu

Environment=STATSD_HOST=127.0.0.1
Environment=PUMA_DEBUG=1
Environment=PATH=/home/ubuntu/.rbenv/shims:/bin:/usr/bin

WorkingDirectory={{deploy_directory}}/current
ExecStart={{deploy_directory}}/current/bin/puma -C {{deploy_directory}}/shared/config/puma/production.rb

SyslogIdentifier=puma
StandardOutput=syslog
StandardError=syslog


Restart=always

[Install]
WantedBy=multi-user.target

puma.socket.j2

[Unit]
Description=Puma HTTP Server Accept Sockets

[Socket]
ListenStream={{deploy_directory}}/shared/tmp/sockets/puma.sock


[Install]
WantedBy=sockets.target

production.rb.j2

#! /usr/bin/env puma
environment 'production'
directory '{{deploy_directory}}/current'
rackup      DefaultRackup

# for t3.medium optimum configuration is 3 workers 10 threads
{% if inventory_hostname in groups['tag_Role_web_worker']|default([]) %}
workers 3
{% else %}
workers 3
{% endif %}


threads_count = Integer({{max_puma_thread}})
threads threads_count, threads_count

bind 'unix://{{deploy_directory}}/shared/tmp/sockets/puma.sock'
pidfile "{{deploy_directory}}/shared/tmp/pids/puma.pid"
state_path "{{deploy_directory}}/shared/tmp/pids/puma.state"



#plugin :statsd

preload_app!


# Code to run immediately before the master starts workers.
#
before_fork do
  puts "Starting workers..."
end


# Code to run in a worker before it starts serving requests.
#
# This is called everytime a worker is to be started.
#
on_worker_boot do
  puts 'On worker boot...'
  ActiveRecord::Base.establish_connection
end

# Code to run in a worker right before it exits.
#
# This is called everytime a worker is to about to shutdown.
#
on_worker_shutdown do
  puts 'On worker shutdown...'
end

# Code to run in the master right before a worker is started. The worker's
# index is passed as an argument.
#
# This is called everytime a worker is to be started.
#
on_worker_fork do
  puts 'Before worker fork...'
end

# Code to run in the master after a worker has been started. The worker's
# index is passed as an argument.
#
# This is called everytime a worker is to be started.
#
after_worker_fork do
  puts 'After worker fork...'
end

Logs on restart

Aug 22 15:30:17 web_worker2-0c2 puma[12869]: On worker shutdown...
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker status: pid 13000 exit 0
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker status: pid 13006 exit 0
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker status: pid 13012 exit 0
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker shutdown time:   0.50
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: After worker fork...
Aug 22 15:30:17 web_worker2-0c2 systemd[1]: Stopped Puma HTTP Server.
-- Subject: Unit puma.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit puma.service has finished shutting down.
Aug 22 15:30:17 web_worker2-0c2 systemd[1]: Started Puma HTTP Server.
-- Subject: Unit puma.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit puma.service has finished starting up.
-- 
-- The start-up result is RESULT.
Aug 22 15:30:17 web_worker2-0c2 sudo[16625]: pam_unix(sudo:session): session closed for user root
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: % Registered unix:/var/www/foo/shared/tmp/sockets/puma.sock for activation from LISTEN_FDS
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] Puma starting in cluster mode...
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Version 4.1.0 (ruby 2.6.3-p62), codename: Fourth and One
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Min threads: 10, max threads: 10
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Environment: production
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Process workers: 3
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Preloading application

ug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/status", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/500.html", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/500.html", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/500.html", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: x.x.x.x - - [22/Aug/2019:15:30:20 +0000] "GET /status HTTP/1.1" 500 186 "-" "ELB-HealthChecker/2.0"
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] * Activated unix:///var/www/foo/shared/tmp/sockets/puma.sock
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] ! WARNING: Detected 1 Thread(s) started in app boot:
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] ! #<Thread:0x00005583a11d4b90@/var/www/foo/shared/vendor/bundle/ruby/2.6.0/gems/activerecord-6.0.0/lib/active_record/connection_adapters/abstract/connection_pool.rb:335 sleep> - /var/www/foo/shared/vendor/bundle/ruby/2.6.0/gems/activerecord-6.0.0/lib/active_record/connection_adapters/abstract/connection_pool.rb:337:in `sleep'
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] Use Ctrl-C to stop
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Starting workers...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Before worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: After worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Before worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: After worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Before worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] - Worker 0 (pid: 16752) booted, phase: 0
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] - Worker 1 (pid: 16766) booted, phase: 0
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] - Worker 2 (pid: 16773) booted, phase: 0
Aug 22 15:30:21 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:21 [crit] 5009#5009: *29818 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: x, server: , request: "GET /api/v2/challenges/index?page=1 HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/api/v2/challenges/index?page=1", host: "api.getfoo.com"

rromanchuk · 2019-08-22T16:02:55Z

The one thing i should probably look into though, is first allowing the gem to update get and get a clean start. i'm going to try one more time upgrading, but removing traffic and shutting down services, and give it a clean start

nateberkopec · 2019-09-18T11:22:58Z

Fixed in #1970

rromanchuk · 2019-09-19T21:19:57Z

Confirmed fixed here. Thanks y'all

mcg · 2019-09-24T16:37:22Z

Not fixed here. Socket is still missing on restart.

MSP-Greg · 2019-09-24T16:50:59Z

@mcg

Are you having the same issue as #1976 ? You mentioned systemd above.

I did write some tests for 'pre-existing' UNIX listeners (path or socket), I need to review that...

mcg · 2019-09-24T17:02:49Z

We are using a UNIX socket but not having the same error as #1976 ("There is already a server bound to: puma.sock"). Just that systemctl restart puma.service removes the socket.

nateberkopec · 2019-09-24T17:39:55Z

@mcg please open a new issue with repro steps along the lines of #1842 (comment)

Systemd issues are a bit harder for us to debug as they're often highly config-dependent and we need linux VMs to test them.

gingerlime · 2019-09-24T18:14:00Z

I reported #1962 and I think the same still happens. systemctl restart puma.service removes the socket. It's also hard for me to debug on our staging environment, because it's used for other things and I can't break things for too long unfortunately.

I'm not 100% sure it relates to the upgrade. It might also happen with 3.12, but that's partly because it's hard to test unfortunately.

The only solution (workaround) I found so far is to always restart both systemd services, i.e. systemctl restart puma.service puma.socket. We're also usually doing a phased-reload rather than a full restart (systemctl reload puma.service), and that seems to work ok so far.

Sorry I can't shed much more light on this. I can try a thing or two if you have any suggestions, but my experimentation window is limited at the moment.

mcg · 2019-09-24T18:32:45Z

@gingerlime We run 3.12 and haven't had this issue at all.

@nateberkopec #1988 created

gingerlime · 2019-09-24T18:37:42Z

@mcg I thought so too, but then I downgraded back to 3.12 and tried to restart but had the same issue. Perhaps it was still running 4.2.0 when the gem version changed, so it didn't pick this up or something? as I mentioned, it was a bit tricky to test, so did this in a bit of a rush.

I'll try to see if I can re-test with 3.12 in a cleaner state if it helps. But I definitely can reproduce the missing socket with 4.2.0

rromanchuk · 2019-09-24T21:08:52Z

@nateberkopec I cant remember if i posted this yet but: i'm also still not convinced this isn't my own configuration error and that there may be something subtle between the releases that expose the problem. My biggest suspicion is around my ExecStart coupled with rbenv + bundle. There is a lot of discussion around this with systemd, and I have a feeling i'm not following directions.

Note: Any wrapper scripts which exec, or other indirections in ExecStart, may result in activated socket file descriptors being closed before they reach the puma master process. For example, if using bundle exec, pass the --keep-file-descriptors flag. bundle exec can be avoided by using a puma executable generated by bundle binstubs puma. This is tracked in [#1499].

One important caveat that makes this reproduction difficult is that it also requires active inbound connections. I've made the mistake twice now where i test a deploy that is not behind a load balancer, think it's fixed then deploy to machine behind a load balancer and it explodes. From what i can tell systemctl restart puma will not reveal this issue if it's not serving any requests.

I want to create my exact ec2 setup with an AMI, just replacing with a "hello world" app, so at least someone can actually poke around. I'll just add a bash script to mock inbound requests during the restart.

nateberkopec · 2019-09-24T21:33:49Z

@rromanchuk it certainly could be. Like I said, I'd like to better understand how people use systemd with Puma so I can have a better sense of what to recommend people to make repros with. i.e. "If you do XYZ and it still reproduces, it's probably a Puma issue, if you do ABC and it doesn't reproduce, it's probably a problem with your env/config/app."

AwolDes · 2019-09-24T22:55:49Z

My company is also having a similar problem, using Ubuntu 16.04.

In v3.12, we were able to stop the puma service and start it again without issue. Using v4.2 our app serves a 500 even though the puma service is running and is stable. When the server originally boots up, puma will serve the app as expected.

The reload for Puma works fine with our config.

These are the snippets of systemd services we're using, let me know if you need any other info.

puma.service

[Unit]
Description=Puma HTTP Server - serves the core application

# After=network.target cloud-final.service puma.socket
After=cloud-final.service puma.socket

# Socket activation prior to boot, introduces a delay
Requires=puma.socket

# The service can start N many times in the interval of N length
StartLimitBurst=5
# How long the interval to count restart attempts within
StartLimitInterval=30

ConditionPathExists=/home/{{ ruby_user }}/shared/core-systemd-switches/puma-service
ConditionPathExists=/var/lib/cloud/instance/boot-finished

[Service]
# Foreground process (do not use --daemon in ExecStart or config.rb)
Type=simple

# Preferably configure a non-privileged user
User={{ ruby_user }}
Group={{ ruby_user }}

# Time to wait for puma to signal boot
TimeoutStartSec=60

# Specify the path to your puma application root
WorkingDirectory=/home/{{ ruby_user }}/app

PIDFile=/home/{{ ruby_user }}/shared/puma.pid

EnvironmentFile=/home/{{ ruby_user }}/shared/userdata.env

# The command to start Puma
ExecStart=/bin/bash -lc 'bundle exec --keep-file-descriptors puma \
                            -C /home/{{ ruby_user }}/app/config/puma.rb \
                            --pidfile /home/{{ ruby_user }}/shared/puma.pid'

ExecStop=/bin/bash -lc 'cat /home/{{ ruby_user }}/shared/puma.pid | xargs kill -s SIGTERM && \
                        while ps axg | grep -vw grep | grep -w puma > /dev/null; do sleep 1; done && \
                        exit 0'

# The command to hot (online) reload Puma
ExecReload=/bin/bash -lc 'cat /home/{{ ruby_user }}/shared/puma.pid | xargs kill -s SIGUSR1'

KillMode=control-group
# Important to keep this inline with the timeouts of the entire system (see ELB, Rack & Puma timeouts)
TimeoutStopSec=175
KillSignal=SIGTERM
SendSIGKILL=yes

StandardOutput=journal
StandardError=journal
SyslogIdentifier=puma

# Restart on process failure, good for prod reliability
Restart=always
# The time to sleep before restarting a service
RestartSec=1

[Install]
WantedBy=multi-user.target

puma.socket

[Unit]
Description=Puma HTTP Server Accept Sockets - network bindings for the core application

After=cloud-final.service

PartOf=puma.service

# Notify via email when service fails to start
OnFailure=unit-status-mail@%n.service

ConditionPathExists=/home/{{ruby_user}}/shared/core-systemd-switches/puma-socket
ConditionPathExists=/var/lib/cloud/instance/boot-finished

[Socket]
# AF_UNIX domain socket
ListenStream=/home/{{ ruby_user }}/shared/puma.sock

SocketUser={{ ruby_user }}
SocketGroup={{ ruby_user }}

# Might resolve keep_alive_issues
# KeepAlive=
# KeepAliveTimeSec=
# KeepAliveIntervalSec=
# KeepAliveProbes=

# Queueing Mechanisms

# Socket options matching Puma defaults - {{ socket_backlog }} is passed through as a query param in puma.rb
Backlog={{ socket_backlog }}
# Puma's documentation recommends same port & nodelay which only apply to TCP mode, we're using domain sockets

[Install]
WantedBy=sockets.target

mcg · 2019-09-25T01:24:48Z

@AwolDes Is the socket /home/{{ ruby_user }}/shared/puma.sock, removed after a Puna restart?

MSP-Greg · 2019-09-25T01:31:48Z

@AwolDes

Are you in a position to test from a branch/PR here?

The reload for Puma works fine with our config.

phased-restart/USR1 shouldn't have any issues with UNIX listeners.

Stop/Start or restart/USR2 may.

@mcg Can you check against a branch/PR?

gingerlime · 2019-09-25T07:11:32Z

Another problem I reported on #1962 and I see again with 4.2.0 is seeing these errors in the puma log after a couple of deploys, slightly sanitized:

[32189] + Gemfile in context: /var/local/app/app.d/843/Gemfile
[32189] Early termination of worker
[32266] + Gemfile in context: /var/local/app/app.d/843/Gemfile
[32266] Early termination of worker
[32290] + Gemfile in context: /var/local/app/app.d/843/Gemfile
/usr/lib/ruby/site_ruby/2.5.0/bundler/spec_set.rb:91:in `block in materialize': Could not find nio4r-2.5.1 in any of the sources (Bundler::GemNotFound)
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/spec_set.rb:85:in `map!'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/spec_set.rb:85:in `materialize'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/definition.rb:170:in `specs'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/definition.rb:237:in `specs_for'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/definition.rb:226:in `requested_specs'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/runtime.rb:108:in `block in definition_method'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/runtime.rb:20:in `setup'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler.rb:107:in `setup'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/setup.rb:20:in `<top (required)>'
        from /usr/lib/ruby/site_ruby/2.5.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from /usr/lib/ruby/site_ruby/2.5.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/configuration.rb:302:in `rack_builder'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/configuration.rb:321:in `load_rackup'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/configuration.rb:246:in `app'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/runner.rb:158:in `app'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/runner.rb:165:in `start_server'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:269:in `worker'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:138:in `block (2 levels) in spawn_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:138:in `fork'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:138:in `block in spawn_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:134:in `times'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:134:in `spawn_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:204:in `check_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:487:in `run'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/launcher.rb:172:in `run'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cli.rb:80:in `run'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/bin/puma-wild:31:in `<main>'

It seems like the app is running fine though, but those errors keep appearing.

Another odd thing here is that those logs appear in the already logrotated puma.log.1 file, and the new puma.log file is empty... So something definitely looks odd here.

My postrotate in logrotate.d simply does systemctl reload puma.service ...

gingerlime · 2019-09-25T10:46:27Z

@nateberkopec I managed to create a reproducible environment with systemd that exhibits the issue with the restart. Please take a look at https://github.com/gingerlime/puma-test-socket - it's basically a bash script that bootstraps a new Ubuntu box with ruby, puma and a sample sinatra app. It takes a bit of time to run everything, but hope it helps to then debug things etc.

(tested on Digital Ocean)

mcg · 2019-09-25T13:07:33Z

@gingerlime in your repro setup, is the socket file(/var/run/puma/puma.sock) deleted/missing on restart?

gingerlime · 2019-09-25T13:23:30Z

@mcg yep.

root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# systemctl start puma.socket puma.service
root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# ps -ef |grep pum
root       373  1183  0 10:43 pts/0    00:00:00 grep --color=auto pum
root     32747     1 10 10:43 ?        00:00:00 puma 4.2.0 (unix:///var/run/puma/puma.sock) [app]
root     32759 32747  5 10:43 ?        00:00:00 puma: cluster worker 0: 32747 [app]
root     32760 32747  5 10:43 ?        00:00:00 puma: cluster worker 1: 32747 [app]
root     32761 32747  5 10:43 ?        00:00:00 puma: cluster worker 2: 32747 [app]
root     32762 32747  5 10:43 ?        00:00:00 puma: cluster worker 3: 32747 [app]
root     32766 32747  5 10:43 ?        00:00:00 puma: cluster worker 4: 32747 [app]
root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# socat TCP-LISTEN:1234,reuseaddr UNIX-CLIENT:/var/run/puma/puma.sock
^Croot@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# systemctl restart puma.service
root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# socat TCP-LISTEN:1234,reuseaddr UNIX-CLIENT:/var/run/puma/puma.sock
2019/09/25 10:44:30 socat[415] E connect(5, AF=1 "/var/run/puma/puma.sock", 25): No such file or directory

mcg · 2019-09-25T13:35:24Z

@gingerlime thanks for confirming.

I assume since this is closed we should discuss on #1988 ?

gingerlime · 2019-09-25T13:41:34Z

@mcg yes, things got a little scattered across several tickets, including #1988 and #1962 and here... I'm pretty sure #1988 and #1962 are duplicates anyway, although not 100% sure yet. Happy to consolidate whichever way the project thinks makes sense, and I'll try to post further comments on whichever issue is "the chosen one" :)

mcg · 2019-09-25T13:43:17Z

@MSP-Greg Can probably test yeah.

@AwolDes

Are you in a position to test from a branch/PR here?

The reload for Puma works fine with our config.

phased-restart/USR1 shouldn't have any issues with UNIX listeners.

Stop/Start or restart/USR2 may.

@mcg Can you check against a branch/PR?

nateberkopec · 2019-09-25T13:55:20Z

Yup, let's keep conversation on #1988 and #1962 respectively for my sanity. If one is confirmed as a dupe I'll close and merge the convos. But until then it's much easier for me to keep them separate.

nateberkopec added the bug label Jul 11, 2019

nateberkopec mentioned this issue Jul 11, 2019

Socket removed after reload #1775

Closed

nateberkopec mentioned this issue Sep 2, 2019

Fix #1842 #1945

Closed

1 task

This was referenced Sep 12, 2019

puma keeps writing to the logrotated file, even after being issued a phased-restart #1962

Closed

unlink UNIXSocket files, fix Binder#close #1970

Merged

nateberkopec closed this as completed Sep 18, 2019

dentarg mentioned this issue Sep 18, 2019

systemd AF_UNIX domain socket activation not working #1976

Closed

MSP-Greg mentioned this issue Sep 24, 2019

Socket removed after restart #1988

Closed

puma locked as resolved and limited conversation to collaborators Sep 25, 2019

Socket removed after reload, part II #1842

Socket removed after reload, part II #1842

Comments

nateberkopec commented Jul 11, 2019

nateberkopec commented Jul 11, 2019

mcg commented Jul 11, 2019

calmyournerves commented Jul 11, 2019 • edited Loading

mcg commented Jul 11, 2019

Fudoshiki commented Jul 12, 2019 • edited Loading

adamcreekroad commented Jul 12, 2019

mcg commented Jul 12, 2019 • edited Loading

nateberkopec commented Jul 18, 2019

Fudoshiki commented Jul 18, 2019

nateberkopec commented Jul 18, 2019

MSP-Greg commented Jul 18, 2019

rromanchuk commented Jul 19, 2019

MSP-Greg commented Jul 19, 2019

nateberkopec commented Aug 6, 2019

nateberkopec commented Aug 16, 2019

rromanchuk commented Aug 22, 2019 • edited Loading

rromanchuk commented Aug 22, 2019

nateberkopec commented Sep 18, 2019

rromanchuk commented Sep 19, 2019

mcg commented Sep 24, 2019

MSP-Greg commented Sep 24, 2019

mcg commented Sep 24, 2019

nateberkopec commented Sep 24, 2019

gingerlime commented Sep 24, 2019

mcg commented Sep 24, 2019

gingerlime commented Sep 24, 2019

rromanchuk commented Sep 24, 2019

nateberkopec commented Sep 24, 2019

AwolDes commented Sep 24, 2019

puma.service

puma.socket

mcg commented Sep 25, 2019

MSP-Greg commented Sep 25, 2019 • edited Loading

gingerlime commented Sep 25, 2019

gingerlime commented Sep 25, 2019 • edited Loading

mcg commented Sep 25, 2019

gingerlime commented Sep 25, 2019 • edited Loading

mcg commented Sep 25, 2019

gingerlime commented Sep 25, 2019

mcg commented Sep 25, 2019

nateberkopec commented Sep 25, 2019

calmyournerves commented Jul 11, 2019 •

edited

Loading

Fudoshiki commented Jul 12, 2019 •

edited

Loading

mcg commented Jul 12, 2019 •

edited

Loading

rromanchuk commented Aug 22, 2019 •

edited

Loading

MSP-Greg commented Sep 25, 2019 •

edited

Loading

gingerlime commented Sep 25, 2019 •

edited

Loading

gingerlime commented Sep 25, 2019 •

edited

Loading