Parallelize scrubbing a single model #13

adamstegman · 2019-06-11T17:35:34Z

When a table has tens of millions of records, scrubbing it can take a long time. I added a rake task that parallelizes the scrubbing of a single model so it can go much faster.

I'm assuming the model has an ID column, which may not always be true. For now, this felt good enough. We can always add another PR!

I arbitrarily chose 256 as the number of batches; in my testing it didn't really help to go up or down from there. I'm working on getting a larger data set to test with, though, so I may update that.
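As a rough illustration of working through a fixed set of batches concurrently, here is a minimal sketch using only Ruby's standard library. The helper name `process_in_parallel`, the `workers:` parameter, and the choice of threads are all assumptions made for illustration; the rake task in this PR may well parallelize differently (for example with processes or a gem).

```ruby
# Hypothetical helper (not from this PR): run the given block over every
# batch using a fixed number of worker threads, and collect the results.
def process_in_parallel(batches, workers: 4)
  queue = Queue.new
  batches.each { |batch| queue << batch }
  queue.close # once closed and drained, Queue#pop returns nil

  results = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      # Each worker keeps pulling batches until the queue is exhausted.
      while (batch = queue.pop)
        results << yield(batch)
      end
    end
  end
  threads.each(&:join)

  # Results arrive in completion order, not in batch order.
  Array.new(results.size) { results.pop }
end
```

A call such as `process_in_parallel(ranges, workers: 8) { |range| scrub(range) }` would hand each range to whichever worker is free next; `scrub` is a placeholder here, not a method from the gem.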

I wasn't able to add tests for this because the NullDB adapter doesn't let you actually find records, so I only tested it on a real application.

When a table has tens of millions of records, scrubbing it can take a long time. I added a rake task to parallelize that task so it can go much faster. I'm assuming the model has an ID column, which may not always be true. [#166599401]

FlaviaCarioca · 2019-06-11T17:41:58Z

lib/acts_as_scrubbable/tasks.rb

-    # if the ENV variable is set
-
-    unless ENV["SCRUB_CLASSES"].blank?
+    if ENV["SCRUB_CLASSES"].present?


jessie00chen · 2019-06-11T17:55:45Z

lib/acts_as_scrubbable/parallel_table_scrubber.rb

+
+    # create even ID ranges for the table
+    def id_ranges_for(ar_class:, count:)
+      last_id = 2147483647 # naive maximum ID value for sorting


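adamstegman · 2019-06-11

I pulled this from the MySQL docs (2147483647 is the maximum value of a signed INT). As the comment says, it is naive, because the real maximum depends on the column's data type and the database server. If we find a way to discover that information, we can add it in another PR.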
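To make the idea of "even ID ranges" concrete, here is a minimal sketch that splits the ID space from 1 up to a maximum into a given number of contiguous ranges. It is not the body of `id_ranges_for` from this PR (only its first lines are quoted above); the helper name `even_id_ranges` and its `max_id:` parameter are hypothetical, and it ignores how densely the IDs are actually populated.

```ruby
# Hypothetical helper (not the PR's id_ranges_for): split 1..max_id into
# at most `count` contiguous, non-overlapping ranges of equal width.
def even_id_ranges(max_id:, count:)
  raise ArgumentError, "count must be positive" unless count.positive?

  # Integer ceiling division, so the ranges always cover max_id.
  step = (max_id + count - 1) / count

  ranges = (0...count).map do |i|
    first = i * step + 1
    last = [(i + 1) * step, max_id].min
    first..last
  end

  # When count exceeds max_id, the trailing ranges would be empty; drop them.
  ranges.reject { |range| range.first > range.last }
end
```

With the naive maximum of 2147483647 and 256 batches, each range spans 8388608 IDs except the last, which is one shorter. If the real IDs occupy only a small part of that space, most ranges would match no rows, which is one reason the maximum is described as naive.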

anujbiyani · 2019-06-11T20:58:00Z

lib/acts_as_scrubbable/tasks.rb

+      answer = ask("Type '#{db_host}' to continue. \n".red + '-> '.white)
+      unless answer == db_host
+        @logger.error "exiting ...".red
+        exit


Do we want this rake task to be chainable? exit breaks chaining ☹️

adamstegman · 2019-06-12

This is an error case, so probably not.
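For readers unfamiliar with why `exit` breaks chaining: `Kernel#exit` raises `SystemExit`, and unless something rescues it, the process ends before any later step runs. The sketch below simulates that with plain methods; `confirm!` and `run_chain` are hypothetical stand-ins, not code from this PR, and the `rescue` exists only so the effect can be observed.

```ruby
# Hypothetical stand-in for the confirmation step quoted above.
def confirm!(answer, db_host)
  exit unless answer == db_host # Kernel#exit raises SystemExit
  :confirmed
end

# Hypothetical chain of steps that records which ones actually ran.
def run_chain(answer, db_host)
  ran = []
  ran << confirm!(answer, db_host)
  ran << :later_task
  { ran: ran, exited: false }
rescue SystemExit
  # In an ordinary run this exception ends the process, so :later_task
  # would never get a chance to run.
  { ran: ran, exited: true }
end
```

With a matching answer both steps run; with a mismatch the chain stops at the confirmation, which is the behavior the reply above accepts for an error case.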

Parallelize scrubbing a single model

c5d6d57

adamstegman requested review from anujbiyani, Inglorion-G and FlaviaCarioca June 11, 2019 17:35

FlaviaCarioca reviewed Jun 11, 2019

FlaviaCarioca approved these changes Jun 11, 2019

jessie00chen reviewed Jun 11, 2019

jessie00chen approved these changes Jun 11, 2019

Inglorion-G approved these changes Jun 11, 2019

anujbiyani reviewed Jun 11, 2019

PR feedback

6dcc122

adamstegman merged commit 577bc8b into master Jun 12, 2019

adamstegman deleted the chore/166599401-parallel-one branch June 12, 2019 15:29

adamstegman mentioned this pull request Jun 12, 2019

Bumping version to 1.1.0 #14

Merged
