Long running locks on latest good job #480
This is expected as a result of #472 (that PR originally had no notes; I just added some). I used an Advisory Lock to detect dangling SIGKILLed process records, instead of a heartbeat/TTL strategy. It's great to know that you're monitoring for long-lived advisory locks; that seems like good ops. Idea: I can have the advisory lock released/renewed on a regular basis (every 5 minutes?). btw, how is lock age computed? (I can also look into rails-pg-extras-web myself, but I wondered how your monitoring is set up.)
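As a rough sketch of that pattern (illustrative only, not the exact code from #472; the class name, key derivation, and queries here are made up):

require "zlib"

# Sketch: each worker takes a session-level advisory lock keyed to its process
# record. PostgreSQL drops session-level advisory locks automatically when the
# backend connection goes away, even after a SIGKILL, so a process record whose
# lock is gone can be treated as dangling.
class ProcessPresence
  def initialize(connection, process_record_id)
    @connection = connection                        # the worker's long-lived connection
    @lock_key = Zlib.crc32(process_record_id.to_s)  # 32-bit advisory-lock key
  end

  # Taken once when the worker registers its process record.
  def register
    @connection.select_value("SELECT pg_try_advisory_lock(#{@lock_key})")
  end

  # True if no session holds the record's advisory lock any more.
  def dangling?
    !@connection.select_value(<<~SQL)
      SELECT EXISTS (
        SELECT 1
        FROM pg_locks
        WHERE locktype = 'advisory' AND objid = #{@lock_key} AND granted
      )
    SQL
  end
end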
This is how it's taken from the DB: https://github.com/pawurb/ruby-pg-extras/blob/master/lib/ruby-pg-extras/queries/all_locks.sql#L10

And this is exactly how we monitor it:

Clockwork.every(1.minute, "monitor.pg_locks") do
  pg_results = ActiveRecord::Base.connection.execute <<-SQL
    SET intervalstyle = iso_8601;
    SELECT
      pg_stat_activity.pid,
      pg_locks.transactionid,
      pg_locks.granted,
      pg_locks.mode,
      pg_stat_activity.query AS query_snippet,
      AGE(NOW(), pg_stat_activity.query_start)
    FROM pg_locks
    JOIN pg_stat_activity ON (pg_stat_activity.pid = pg_locks.pid)
    WHERE
      pg_locks.mode IN ('ExclusiveLock', 'AccessExclusiveLock', 'RowExclusiveLock')
      AND pg_stat_activity.pid <> pg_backend_pid()
      AND pg_stat_activity.query_start < NOW() - INTERVAL '5 seconds'
    ORDER BY
      age DESC;
  SQL

  # avoid false alarms (preparing dump, autovacuums)
  filtered = pg_results.reject do |result|
    result["query_snippet"].starts_with?("select pg_start_backup") ||
      result["query_snippet"].starts_with?("COPY public") ||
      result["query_snippet"].starts_with?("autovacuum: VACUUM") ||
      result["query_snippet"].starts_with?("LISTEN good_job")
  end

  if filtered.any?
    max_duration = filtered.map { |result| ActiveSupport::Duration.parse(result["age"]) }.max.seconds.round(1)
    pids = filtered.map { |result| result["pid"] }.uniq.sort
    puts "measure#pg_locks.duration=#{max_duration} pids=#{pids}"
    Rails.logger.warn(filtered.to_a)
  else
    puts "measure#pg_locks.duration=0"
  end
end

and then, based on that metric, we alert if it's too high.

Note: we set this up in the past because we were having problems with Que and locks, which eventually caused deadlocks and hung the whole job processing. When we switched to GoodJob it was no longer an issue, but we left the alerting in place just in case.

I don't know much about PG locks, but isn't this taking up a whole DB connection just for this?
An additional issue is that the "query snippet" in this case shows only "COMMIT", so it's not even possible to filter it out because it's too generic.
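One possible workaround (untested here, just a suggestion): advisory locks are reported in pg_locks with locktype = 'advisory', so the monitoring query above could exclude them by lock type instead of by query text, e.g. by extending its WHERE clause:

    WHERE
      pg_locks.mode IN ('ExclusiveLock', 'AccessExclusiveLock', 'RowExclusiveLock')
      -- skip advisory locks (such as the ones GoodJob holds for its process records)
      AND pg_locks.locktype <> 'advisory'
      AND pg_stat_activity.pid <> pg_backend_pid()
      AND pg_stat_activity.query_start < NOW() - INTERVAL '5 seconds'

The trade-off is that genuinely stuck advisory locks would no longer trigger the alert, so this only makes sense if long-lived advisory locks are expected and harmless.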
Super helpful! In other words, the query is asking "for a connection that has an active advisory lock, show when the last query was performed". So it's not quite the age of the advisory lock, but rather detects inactive connections that still own advisory locks.
The Process advisory lock is taken on the LISTEN connection, which requires a dedicated connection. It makes sense that this would look inactive, because PG's LISTEN doesn't make any additional queries (JDBC does: #479). For your monitoring, see if you're able to also check and filter on this snippet: good_job/lib/good_job/notifier.rb, lines 199 to 209 at bf5d512.
(aside: I think showing […])

I also think that this code should update the process record every few minutes, just to keep it fresh. I have an ingrained worry of "doing lots and lots of updates to […]"
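A rough sketch of what that periodic refresh could look like (illustrative only, not the actual implementation; `process_record` and the 5-minute interval are assumptions):

require "concurrent"

# Heartbeat sketch: refresh the worker's process record every few minutes so
# its updated_at can be used as a freshness signal alongside the advisory lock.
# `process_record` is assumed to be the worker's own GoodJob::Process-like
# ActiveRecord row.
heartbeat = Concurrent::TimerTask.new(execution_interval: 5 * 60) do
  Rails.application.executor.wrap do
    process_record.touch   # a single small UPDATE that bumps updated_at
  end
end
heartbeat.execute

A single-row UPDATE per process every few minutes should be a tiny write load compared to the job rows themselves, so the "lots and lots of updates" worry seems manageable.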
@bensheldon Thanks for the tips. With […]
I upgraded good_job from 2.7.4 to the latest 2.8.1 and noticed long-running exclusive locks.

It looks something like this:

[screenshot of the lock list] (UI provided by https://github.com/defkode/rails-pg-extras-web)

The number of locks corresponds to the number of good_job worker processes.

I'm not sure yet whether this has any real negative effects, but we do have alerts set for long-running locks, and that is how this surfaced.

After downgrading to 2.7.4 it went back to normal (i.e., no long-running exclusive locks).