Release v1.3

@jthomale released this 08 May 17:30

This release comprises changes that improve how Exporters and Celery tasks run: tracking tasks in Redis via export.tasks.JobPlan, improved prefetching, new .env settings for better memory management, better DB connection management, and better handling of return values from exporter jobs.

Changes that May Affect You or Your Custom Catalog API Code:

  • The export.exporter.Exporter class's parallel attribute was removed. All exporters are now assumed to run asynchronously via Celery chords.

  • The signature of the export_records and delete_records methods on export.exporter.Exporter and its subclasses has changed: vals is no longer a valid kwarg.

  • New export.exporter.Exporter method: compile_vals. The default implementation should continue to behave much as the previous release did. Subclasses may override it to control how the lists of return values from that class's export_records and delete_records calls are combined before being passed to final_callback (a brief sketch appears after this list).

  • export.batch_exporters.AllToSolr is now deprecated. A warning will be written to the exporter log if you try to use it. It will be removed entirely in v1.4.

  • New Django and .env settings: EXPORTER_MAX_RC_CONFIG and EXPORTER_MAX_DC_CONFIG. You can now set the max_rec_chunk and/or max_del_chunk parameters for each exporter on an env-by-env basis, to manage memory differently in environments that are more or less memory-constrained. The new settings are optional: if provided, they override the class attributes, which are still used by default (a sketch of that precedence also follows below).
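
For orientation, here is a minimal sketch of a compile_vals override, as mentioned above. The subclass name and the shape of each chunk's return value are assumptions for illustration; only compile_vals, export_records/delete_records, and final_callback come from the actual Exporter API described in these notes.

```python
from export.exporter import Exporter


class MyCustomToSolr(Exporter):
    """Hypothetical subclass; only the compile_vals override is shown."""

    def compile_vals(self, results):
        # `results` is assumed to be the list of per-chunk return values from
        # export_records / delete_records. Each chunk is assumed (purely for
        # illustration) to return a dict like {'records': [...], 'errors': [...]}.
        compiled = {'records': [], 'errors': []}
        for result in results or []:
            if result:
                compiled['records'].extend(result.get('records', []))
                compiled['errors'].extend(result.get('errors', []))
        # Whatever this returns is what gets passed along to final_callback.
        return compiled
```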

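The exact format of the new settings lives in the project's settings and .env documentation; purely as a sketch of the precedence described above (the helper function and the dict-of-overrides shape are assumptions, not the project's actual code), the env-provided values win over the class attributes:

```python
from django.conf import settings


def effective_max_rec_chunk(exporter):
    # Hypothetical helper illustrating the precedence rule: an env-derived
    # override (assumed here to be a dict keyed by exporter class name) wins
    # over the exporter class attribute, which remains the default.
    overrides = getattr(settings, 'EXPORTER_MAX_RC_CONFIG', None) or {}
    return overrides.get(type(exporter).__name__, exporter.max_rec_chunk)
```
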
Other General Changes from v1.2 to v1.3

  • Pernicious issues with the reliability of large, long-running export jobs are finally resolved. The task that chops a job into smaller chunks and dispatches those to individual Celery tasks now creates and uses a JobPlan, cached in Redis (separately from the Celery result backend), to manage that work. Jobs are divided into one or more chords of predictable size (200 chunks per chord by default) rather than a single chord of potentially thousands of chunks, which ensures that errors are handled appropriately and the final_callback method always runs (see the dispatch sketch at the end of these notes). And if errors cause an entire chunk to be skipped, that portion of the JobPlan persists in Redis, so you can see exactly which record PKs were skipped after the job completes.

  • Most exporter classes in export.basic_exporters were not using comprehensive enough lists of prefetch_related and select_related relationships, which led to some inefficiencies. BibsAndAttachedToSolr, the job we run nightly to sync bib records, was especially problematic. This release updates the relevant exporter attributes, which yields some large performance gains (see the prefetch sketch at the end of these notes). (To help offset the increased memory use, new Django/.env settings let you change the max_rec_chunk and max_del_chunk configuration on an env-by-env, exporter-by-exporter basis.)

  • I've always had occasional weird problems with Celery tasks trying to reuse stale database connections, which would raise exceptions. Because I didn't understand the problem well originally, my previous solution had been pretty awful: drop the default connection at the beginning of each task and then use extensive try/except blocks to catch OperationalError exceptions and simply retry whatever caused them. After some research I found this is not an uncommon problem with Celery: Celery tasks fall outside the normal Django request cycle, so Django doesn't manage connections for you the same way. This release fixes the problem more completely: each task is wrapped in a function that calls close_if_unusable_or_obsolete on each connection before and after the task runs, so stale connections are cleaned up rather than reused (see the connection-handling sketch at the end of these notes).

  • Overall, with the improvements in this release, we have seen a 4-5X increase in our production exporters' throughput. We can run BibsToSolr over our entire database (~3.2 million records) in around 5 hours. This used to be a multi-day operation.
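
For readers less familiar with Celery chords, the chunk-dispatch pattern described in the first bullet above looks roughly like the sketch below. The function and task names are illustrative, not the project's actual export.tasks code, and 200 is just the default batch size mentioned above.

```python
from celery import chord


def dispatch_in_chords(record_chunks, do_chunk_task, callback_task,
                       chunks_per_chord=200):
    """Illustrative only: split a job's chunks into fixed-size batches and
    run each batch as its own chord, so errors in one chunk can't keep the
    callback from running for the rest of the job."""
    for i in range(0, len(record_chunks), chunks_per_chord):
        batch = record_chunks[i:i + chunks_per_chord]
        header = [do_chunk_task.s(chunk) for chunk in batch]
        chord(header)(callback_task.s())
```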
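
The prefetching change in the second bullet amounts to giving each exporter a more complete set of relationship paths. The class name and paths below are made-up placeholders; the real ones come from the Sierra models each exporter touches.

```python
from export.exporter import Exporter


class SomeBibExporter(Exporter):
    # Illustrative only: the actual relationship names may differ. The point
    # is that related records are fetched in a handful of queries per chunk
    # rather than one query per record.
    select_related = ['record_metadata']
    prefetch_related = [
        'record_metadata__varfield_set',
        'bibrecorditemrecordlink_set__item_record',
    ]
```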
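
Finally, the connection-handling fix in the third bullet follows a common pattern for long-running Celery workers. close_if_unusable_or_obsolete is the real Django connection method; the decorator itself is a sketch, not the project's actual wrapper.

```python
from functools import wraps

from django.db import connections


def with_fresh_connections(task_func):
    """Sketch of the wrapper described above: close any stale or unusable
    DB connections before and after the wrapped task runs, so the task
    never tries to reuse a dead connection."""
    @wraps(task_func)
    def wrapper(*args, **kwargs):
        for conn in connections.all():
            conn.close_if_unusable_or_obsolete()
        try:
            return task_func(*args, **kwargs)
        finally:
            for conn in connections.all():
                conn.close_if_unusable_or_obsolete()
    return wrapper
```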