Long running locks on latest good job #480

Closed
morgoth opened this issue Jan 6, 2022 · 5 comments

@morgoth
Collaborator

morgoth commented Jan 6, 2022

I upgraded good_job from 2.7.4 to the latest 2.8.1 and noticed long-running exclusive locks. It looks something like this:

[screenshot: long-running ExclusiveLocks shown in the locks view]
(UI provided by https://github.com/defkode/rails-pg-extras-web)

The number of locks corresponds to the number of good_job worker processes.

I'm not sure yet whether this has any real negative effects, but we do have alerts set up for long-running locks, and that's how this surfaced.

After downgrading to 2.7.4 it went back to normal (i.e. no long-running exclusive locks).

@bensheldon
Owner

This is expected as a result of #472 (that PR originally had no notes; I just added some). I used an advisory lock to detect dangling process records left behind by SIGKILLed processes, instead of a heartbeat/TTL strategy.
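
For context, the underlying idea is roughly this (a sketch of the technique only, not GoodJob's exact internals; the key derivation, table name, and helper below are illustrative): a live process holds a session-level advisory lock tied to its process record, so if another session can acquire that same lock, the owning process must be gone (e.g. SIGKILLed) and the record is dangling.

require "zlib"

# Illustrative liveness check (hypothetical key derivation and model names).
def dangling_process_record?(process_id)
  key = Zlib.crc32("good_job_processes-#{process_id}") # stand-in for the real lock key
  acquired = ActiveRecord::Base.connection.select_value("SELECT pg_try_advisory_lock(#{key})")
  if acquired == true || acquired == "t" # boolean casting differs across Rails versions
    # Nobody was holding the lock, so the owning process died; release it again.
    ActiveRecord::Base.connection.execute("SELECT pg_advisory_unlock(#{key})")
    true
  else
    false # the lock is still held, so the process is alive
  end
end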

It's great to know that you're monitoring for long-lived advisory locks. That seems like good ops.

Idea: I could have the advisory lock released/renewed on a regular basis (every 5 minutes?). By the way, how is the lock age computed? (I can also look into rails-pg-extras-web myself, but I wondered how your monitoring is set up.)

@morgoth
Collaborator Author

morgoth commented Jan 7, 2022

This is how it's read from the DB: https://github.com/pawurb/ruby-pg-extras/blob/master/lib/ruby-pg-extras/queries/all_locks.sql#L10

This is exactly how we monitor it:

Clockwork.every(1.minute, "monitor.pg_locks") do
  pg_results = ActiveRecord::Base.connection.execute <<-SQL
    SET intervalstyle = iso_8601;

    SELECT
      pg_stat_activity.pid,
      pg_locks.transactionid,
      pg_locks.granted,
      pg_locks.mode,
      pg_stat_activity.query AS query_snippet,
      AGE(NOW(), pg_stat_activity.query_start)
    FROM pg_locks
    JOIN pg_stat_activity ON (pg_stat_activity.pid = pg_locks.pid)
    WHERE
      pg_locks.mode IN ('ExclusiveLock', 'AccessExclusiveLock', 'RowExclusiveLock')
      AND pg_stat_activity.pid <> pg_backend_pid()
      AND pg_stat_activity.query_start < NOW() - INTERVAL '5 seconds'
    ORDER BY
      age DESC;
  SQL

  # avoid false alarms (preparing dump, autovacuums)
  filtered = pg_results.reject do |result|
    result["query_snippet"].starts_with?("select pg_start_backup") ||
    result["query_snippet"].starts_with?("COPY public") ||
    result["query_snippet"].starts_with?("autovacuum: VACUUM") ||
    result["query_snippet"].starts_with?("LISTEN good_job")
  end

  if filtered.any?
    max_duration = filtered.map { |result| ActiveSupport::Duration.parse(result["age"]) }.max.seconds.round(1)
    pids = filtered.map { |result| result["pid"] }.uniq.sort

    puts "measure#pg_locks.duration=#{max_duration} pids=#{pids}"
    Rails.logger.warn(filtered.to_a)
  else
    puts "measure#pg_locks.duration=0"
  end
end

and then, based on that metric, we alert if it's too high.

Note: we set this up in the past when we were having problems with Que and locks, which eventually caused deadlocks and hung the whole job processing. When we switched to GoodJob it was no longer an issue, but we left the alerting in place just in case.

I don't know much about PG locks, but isn't it holding a whole DB connection just for this?

@morgoth
Collaborator Author

morgoth commented Jan 7, 2022

An additional issue is that the "query snippet" for this case shows only "COMMIT", so it's not even possible to filter it out, as it's too generic.

@bensheldon
Owner

Super helpful! The query is asking, in other words, "for a connection that holds an active advisory lock, show when its last query was performed". So it's not quite the age of the advisory lock, but rather a way of detecting inactive connections that still own advisory locks.

isn't it holding a whole DB connection just for this?

The Process advisory lock uses the LISTEN connection, which requires a dedicated connection. It makes sense that this would look inactive, because PG's LISTEN doesn't issue any additional queries (JDBC does: #479).
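
Concretely, the dedicated connection ends up doing roughly this (a simplified sketch of the LISTEN pattern, not GoodJob's exact code; handle_notification is a stand-in), so after the initial LISTEN it never issues another query and pg_stat_activity stops seeing activity from it:

raw = ActiveRecord::Base.connection.raw_connection # a PG::Connection
raw.async_exec("LISTEN good_job")
loop do
  # Block waiting for NOTIFY; no SQL is sent while waiting, so the connection
  # looks idle/inactive to pg_stat_activity.
  raw.wait_for_notify(10) do |channel, _pid, payload|
    handle_notification(channel, payload) # stand-in for the real handler
  end
end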

For your monitoring, if you're able to also check and filter on pg_stat_activity.application_name, you'll find that this connection is named GoodJob::Notifier, which gets set when the dedicated connection is checked out (I'm also very open to suggestions for a better static/canonical name, like "GoodJob LISTEN" or something):

def with_connection
  # Check out a connection and remove it from the pool so it stays dedicated
  # to the Notifier and is never handed back out to other threads.
  self.connection = Execution.connection_pool.checkout.tap do |conn|
    Execution.connection_pool.remove(conn)
  end
  # Name the connection so it is identifiable in pg_stat_activity.application_name.
  connection.execute("SET application_name = #{connection.quote(self.class.name)}")
  yield
ensure
  connection&.disconnect!
  self.connection = nil
end

(aside: I think showing pg_stat_activity.application_name in rails-pg-extras-web might be a good PR).

And I also think this code should update the process record every few minutes, just to keep it fresh. I have an ingrained worry that "doing lots and lots of updates to good_job_processes puts strain on Postgres VACUUM", which I admit is kind of irrational given that we already accept thrashing the good_jobs table.
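
Something in this direction is what I have in mind (a hypothetical sketch, not current GoodJob code; the method name, interval, and process_record_id are assumptions):

require "concurrent"

# Hypothetical heartbeat: periodically touch the process record so updated_at
# stays fresh, independent of the long-lived advisory lock.
def start_process_heartbeat(process_record_id, interval: 5.minutes)
  timer = Concurrent::TimerTask.new(execution_interval: interval.to_i) do
    GoodJob::Process.where(id: process_record_id).update_all(updated_at: Time.current)
  end
  timer.execute
  timer
end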

@morgoth
Collaborator Author

morgoth commented Jan 10, 2022

@bensheldon Thanks for the tips. With application_name I'm able to filter it out.

morgoth closed this as completed Jan 10, 2022