
GoodJob is using a lot of memory / RAM #613

Closed · rgaufman opened this issue May 29, 2022 · 7 comments

rgaufman commented May 29, 2022

Hi there,

I have my good_job set up like this:

    config.good_job = {
      preserve_job_records: false,
      retry_on_unhandled_error: false,
      on_thread_error: ->(exception) { Airbrake.notify(exception) },
      execution_mode: :external,
      queues: '*',
      max_threads: 10,
      poll_interval: 15,
      shutdown_timeout: 25,
      enable_cron: true,
      logger: Rails.logger
    }

Here is a screenshot of it in action:

[Screenshot 2022-05-29 at 20 01 51]

There are rarely more than 3-4 jobs running in parallel.

Compared to Sidekiq running the same jobs, it seems to take almost double the RAM, and looks like this in top:

[Screenshot 2022-05-29 at 20 02 55]

Any ideas what could be responsible for this? Should I disable the dashboard? How would I troubleshoot and reduce RAM usage?

rgaufman (Author) commented May 29, 2022

I noticed strange behaviour: when starting Puma and good_job, both jump to 30% RAM use. Puma then drops quite quickly and settles at around 9%, while good_job stays at around 29% (Sidekiq previously used roughly the same as Puma with the same workers). This is 10 minutes later:

[Screenshot 2022-05-29 at 20 38 41]

(This is with the dashboard disabled and preserve_job_records: false)

Any ideas how to troubleshoot this and figure out why so much RAM is being used?

bensheldon (Owner) commented

@rgaufman thanks for sharing this! I haven't done much memory profiling of GoodJob, though I'm surprised that GoodJob would perform significantly differently than Puma. Take the following as speculation (because I haven't looked)...

GoodJob's initialization process is pretty simple:

  1. Boots the Rails framework:
    require RAILS_ENVIRONMENT_RB
    return unless GoodJob::CLI.log_to_stdout? && !ActiveSupport::Logger.logger_outputs_to?(GoodJob.logger, $stdout)
  2. Initializes a bunch of Concurrent Ruby objects (ThreadExecutors, Timers):
    notifier = GoodJob::Notifier.new
    poller = GoodJob::Poller.new(poll_interval: configuration.poll_interval)
    scheduler = GoodJob::Scheduler.from_configuration(configuration, warm_cache_on_initialize: true)
    notifier.recipients << [scheduler, :create_thread]
    poller.recipients << [scheduler, :create_thread]
    cron_manager = GoodJob::CronManager.new(configuration.cron_entries, start_on_initialize: true) if configuration.enable_cron?
  3. Queries for existing and future jobs to cache <-- this is a potential culprit, but when I built the feature I saw it use ~20MB per 10k cached items.
  4. Anytime an execution thread is awakened (e.g. to execute a job immediately over LISTEN/NOTIFY, a scheduled job from the cache, or every poll_interval), the work takes place inside a Rails Executor, which should be similar to how Puma works. One thing to flag here is that the poll_interval wakes up a thread and makes an ActiveRecord query every N seconds (see the configuration sketch below).
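For reference, a minimal sketch of reducing that polling overhead using the same config.good_job options shown earlier in this thread; treat the exact defaults and the negative-value behaviour as things to verify against the README for your installed version:

# config/initializers/good_job.rb -- sketch only
Rails.application.configure do
  # Widen the interval so the fallback poller wakes (and queries) less often;
  # LISTEN/NOTIFY still picks up newly enqueued jobs immediately.
  config.good_job.poll_interval = 30

  # A negative value is documented as disabling polling entirely,
  # leaving LISTEN/NOTIFY and the scheduled-job cache:
  # config.good_job.poll_interval = -1
end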

Because I haven't done any memory profiling, I can't give you anything specific about how to dig into the problem. I think wrapping the initialization with stackprof and looking at allocated objects would be a good place to start (a sketch follows). It doesn't sound like the problem is object retention (i.e. memory usage increasing over time), just that more memory is used than expected in the first place.
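For anyone following that suggestion, here is a minimal sketch of allocation profiling around boot and around a single job. It assumes the stackprof and memory_profiler gems are added to the Gemfile (neither ships with GoodJob), and MyJob is a hypothetical stand-in for whatever job looks heavy:

# profile_memory.rb -- run from the application root, sketch only
require "stackprof"
require "memory_profiler"

# Object-allocation profile of booting the Rails environment:
StackProf.run(mode: :object, raw: true, out: "tmp/stackprof-boot.dump") do
  require_relative "config/environment"
end
# Inspect afterwards with: bundle exec stackprof tmp/stackprof-boot.dump --text

# Allocated/retained memory report around a single job execution:
report = MemoryProfiler.report do
  MyJob.perform_now # hypothetical job class
end
report.pretty_print(scale_bytes: true)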

BTW, here's memory usage for my reference application.

  • Web: Puma with web concurrency 1, Rails max threads 5:
    [Screen Shot 2022-05-30 at 7 53 32 AM]
  • Worker: GoodJob with 4 threads (good job queues = -newsletters:2;newsletters:2):
    [Screen Shot 2022-05-30 at 7 53 41 AM]

sedubois commented Jul 2, 2022

Since upgrading our app from GoodJob 2.15.1 to 3.0.0, our Heroku worker memory has been spiking.

Worker memory, past 2 hours:
[Screenshot 2022-07-02 at 13 26 02]
Worker memory, past 24 hours:
[Screenshot 2022-07-02 at 13 22 48]
Worker memory, past 72 hours:
[Screenshot 2022-07-02 at 12 56 15]

The web dyno's memory usage hasn't changed. Other dependencies were also upgraded (Zeitwerk, Rack, ...), but GoodJob is the main suspect: it's the worker memory that has increased, and an inspection of the latest memory spikes shows a close overlap with the execution times of the latest jobs in the GoodJob dashboard:

GoodJob 'Scheduled At' time (UTC)    Memory spike time (1 min or 10 min resolution) (UTC)
10:49:18                             10:49-10:51
10:36:50                             10:37-10:39
09:52:47                             09:52-09:59:00
09:28:26, 09:29:44                   09:20-09:30
08:21, 08:22, 08:25, 08:31           08:20-08:30
07:42:40, 07:43:10, 07:46:38         07:40-07:50
07:12:07, 07:19:40                   07:10-07:20
06:25:34, 06:27:04                   06:20-06:30
05:45:43, 05:50:42                   05:40-05:50

The above table shows that jobs are infrequent. They either send some emails or make some API calls.

Here is the GoodJob initializer, for reference (in this release we also changed the syntax from GoodJob.X = Y to Rails.application.configure do ... end):

# app/config/good_job.rb

Rails.application.configure do
  config.good_job.preserve_job_records = true
  config.good_job.retry_on_unhandled_error = false
  config.good_job.on_thread_error = -> (exception) do
    Sentry.set_extras(good_job_on_thread_error: true)
    Sentry.capture_exception(exception)
  end
end

ActionMailer::MailDeliveryJob.retry_on StandardError,
                                       wait: :exponentially_longer, attempts: Float::INFINITY
ActionMailer::MailDeliveryJob.around_perform do |_job, block|
  block.call
rescue StandardError => e
  Sentry.set_extras(action_mailer_standard_error: true)
  Sentry.capture_exception(e)
  raise
end

Any idea if/how GoodJob could cause this?

bensheldon (Owner) commented

@sedubois yikes! That's not good.

I can't think of anything off the top of my head that would have blown out memory.

Looking through the Changelog, the major changes are:

  • Converting GoodJob to a singular engine. I don't think this would vastly affect memory, though it does mean a few more controllers might be eager-loaded into memory at boot.
  • Zeitwerk configuration changes. I can't imagine anything here changing significantly unless Zeitwerk itself was doing something very unexpected, and I wouldn't expect that.
  • Automatic deletion of old jobs became enabled by default in GoodJob 3.0; every 10 minutes it deletes old jobs. I think this could be the culprit, but I tried to do as much of that work as possible in the database rather than in the app (see the configuration sketch below).
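To help rule that cleanup feature out, here is a minimal sketch of tuning it. The option names are taken from the GoodJob 3.x configuration accessors and the values mirror the defaults described above; double-check them against the README for your installed version:

# config/initializers/good_job.rb -- sketch only
Rails.application.configure do
  # How old a preserved job must be before cleanup deletes it (seconds).
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 14.days.to_i

  # How often cleanup runs, by number of executed jobs and by elapsed time.
  # Raising these makes the worker perform cleanup work less frequently.
  config.good_job.cleanup_interval_jobs = 1_000
  config.good_job.cleanup_interval_seconds = 10 * 60
end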

Let me do another sweep through the code and see what I can find.

bensheldon (Owner) commented

@sedubois I found a possible culprit in #652, which has been released in v3.0.1. Please give that a try and let me know if it fixes your problem.

sedubois commented Jul 3, 2022

@bensheldon yesterday I downgraded GoodJob from 3.0.0 to 2.15.1 and the memory issue disappeared. Today I upgraded to 3.0.1 and there haven't been any memory spikes since either. So it seems solved, thanks! 👍

bensheldon (Owner) commented

@sedubois fantastic! 🎉
