Long running locks on latest good job #480

Closed
morgoth opened this issue Jan 6, 2022 · 5 comments

@morgoth
Collaborator

morgoth commented Jan 6, 2022

I upgraded good_job from 2.7.4 to the latest 2.8.1 and noticed long-running exclusive locks. It looks something like this:

[screenshot: long-running ExclusiveLocks shown in the locks view]
(UI provided by https://github.com/defkode/rails-pg-extras-web)

The number of locks corresponds to the number of good_job worker processes.

I'm not sure yet whether this has any real negative effects, but we do have alerts set up for long-running locks, and that's how this surfaced.

After downgrading to 2.7.4 it went back to normal (i.e. no long-running exclusive locks).

@bensheldon
Owner

This is expected as a result of #472 (that PR originally had no notes; I just added some). I used an advisory lock to detect dangling process records left behind by SIGKILLed processes, instead of a heartbeat/TTL strategy.
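
For context, the underlying idea is roughly this (a sketch of the technique only, not GoodJob's exact internals; the key derivation, table name, and helper below are illustrative): a live process holds a session-level advisory lock tied to its process record, so if another session can acquire that same lock, the owning process must be gone (e.g. SIGKILLed) and the record is dangling.

require "zlib"

# Illustrative liveness check (hypothetical key derivation and model names).
def dangling_process_record?(process_id)
  key = Zlib.crc32("good_job_processes-#{process_id}") # stand-in for the real lock key
  acquired = ActiveRecord::Base.connection.select_value("SELECT pg_try_advisory_lock(#{key})")
  if acquired == true || acquired == "t" # boolean casting differs across Rails versions
    # Nobody was holding the lock, so the owning process died; release it again.
    ActiveRecord::Base.connection.execute("SELECT pg_advisory_unlock(#{key})")
    true
  else
    false # the lock is still held, so the process is alive
  end
end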

It's great to know that you're monitoring for long-lived advisory locks. That seems like good ops.

Idea: I could have the advisory lock released/renewed on a regular basis (every 5 minutes?). By the way, how is the lock age computed? (I can also look into rails-pg-extras-web myself, but I wondered how your monitoring is set up.)

@morgoth
Collaborator Author

morgoth commented Jan 7, 2022

This is how it's read from the DB: https://github.com/pawurb/ruby-pg-extras/blob/master/lib/ruby-pg-extras/queries/all_locks.sql#L10

This is exactly how we monitor it:

Clockwork.every(1.minute, "monitor.pg_locks") do
  pg_results = ActiveRecord::Base.connection.execute <<-SQL
    SET intervalstyle = iso_8601;

    SELECT
      pg_stat_activity.pid,
      pg_locks.transactionid,
      pg_locks.granted,
      pg_locks.mode,
      pg_stat_activity.query AS query_snippet,
      AGE(NOW(), pg_stat_activity.query_start)
    FROM pg_locks
    JOIN pg_stat_activity ON (pg_stat_activity.pid = pg_locks.pid)
    WHERE
      pg_locks.mode IN ('ExclusiveLock', 'AccessExclusiveLock', 'RowExclusiveLock')
      AND pg_stat_activity.pid <> pg_backend_pid()
      AND pg_stat_activity.query_start < NOW() - INTERVAL '5 seconds'
    ORDER BY
      age DESC;
  SQL

  # avoid false alarms (preparing dump, autovacuums)
  filtered = pg_results.reject do |result|
    result["query_snippet"].starts_with?("select pg_start_backup") ||
    result["query_snippet"].starts_with?("COPY public") ||
    result["query_snippet"].starts_with?("autovacuum: VACUUM") ||
    result["query_snippet"].starts_with?("LISTEN good_job")
  end

  if filtered.any?
    max_duration = filtered.map { |result| ActiveSupport::Duration.parse(result["age"]) }.max.seconds.round(1)
    pids = filtered.map { |result| result["pid"] }.uniq.sort

    puts "measure#pg_locks.duration=#{max_duration} pids=#{pids}"
    Rails.logger.warn(filtered.to_a)
  else
    puts "measure#pg_locks.duration=0"
  end
end

and then, based on that metric, we alert if it's too high.

Note: we set this up in the past when we were having problems with Que and locks, which eventually caused deadlocks and hung the whole job processing. When we switched to GoodJob it was no longer an issue, but we left the alerting in place just in case.

I don't know much about PG locks, but isn't it holding a whole DB connection just for this?

@morgoth
Collaborator Author

morgoth commented Jan 7, 2022

An additional issue is that the "query snippet" for this case shows only "COMMIT", so it's not even possible to filter it out, as it's too generic.

@bensheldon
Owner

Super helpful! The query is asking, in other words, "for a connection that holds an active advisory lock, show when its last query was performed". So it's not quite the age of the advisory lock, but rather a way of detecting inactive connections that still own advisory locks.

isn't it holding a whole DB connection just for this?

The Process advisory lock uses the LISTEN connection, which requires a dedicated connection. It makes sense that this would look inactive, because PG's LISTEN doesn't issue any additional queries (JDBC does: #479).
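
Concretely, the dedicated connection ends up doing roughly this (a simplified sketch of the LISTEN pattern, not GoodJob's exact code; handle_notification is a stand-in), so after the initial LISTEN it never issues another query and pg_stat_activity stops seeing activity from it:

raw = ActiveRecord::Base.connection.raw_connection # a PG::Connection
raw.async_exec("LISTEN good_job")
loop do
  # Block waiting for NOTIFY; no SQL is sent while waiting, so the connection
  # looks idle/inactive to pg_stat_activity.
  raw.wait_for_notify(10) do |channel, _pid, payload|
    handle_notification(channel, payload) # stand-in for the real handler
  end
end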

For your monitoring, if you're able to also check and filter on pg_stat_activity.application_name, you'll find that this connection is named GoodJob::Notifier, which gets set when the dedicated connection is checked out (I'm also very open to suggestions for a better static/canonical name, like "GoodJob LISTEN" or something):

def with_connection
  # Check out a connection and remove it from the pool so it stays dedicated
  # to the Notifier and is never handed back out to other threads.
  self.connection = Execution.connection_pool.checkout.tap do |conn|
    Execution.connection_pool.remove(conn)
  end
  # Name the connection so it is identifiable in pg_stat_activity.application_name.
  connection.execute("SET application_name = #{connection.quote(self.class.name)}")
  yield
ensure
  connection&.disconnect!
  self.connection = nil
end

(aside: I think showing pg_stat_activity.application_name in rails-pg-extras-web might be a good PR).

And I also think this code should update the process record every few minutes, just to keep it fresh. I have an ingrained worry that "doing lots and lots of updates to good_job_processes puts strain on Postgres VACUUM", which I admit is kind of irrational given that we already accept thrashing the good_jobs table.
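
Something in this direction is what I have in mind (a hypothetical sketch, not current GoodJob code; the method name, interval, and process_record_id are assumptions):

require "concurrent"

# Hypothetical heartbeat: periodically touch the process record so updated_at
# stays fresh, independent of the long-lived advisory lock.
def start_process_heartbeat(process_record_id, interval: 5.minutes)
  timer = Concurrent::TimerTask.new(execution_interval: interval.to_i) do
    GoodJob::Process.where(id: process_record_id).update_all(updated_at: Time.current)
  end
  timer.execute
  timer
end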

@morgoth
Collaborator Author

morgoth commented Jan 10, 2022

@bensheldon Thanks for the tips. With application_name I'm able to filter it out.

morgoth closed this as completed Jan 10, 2022