Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UniqueGenerator.clear is very slow #1186

Closed
kevinburkeomg opened this issue Apr 17, 2018 · 7 comments
Closed

UniqueGenerator.clear is very slow #1186

kevinburkeomg opened this issue Apr 17, 2018 · 7 comments

Comments

@kevinburkeomg
Copy link

kevinburkeomg commented Apr 17, 2018

Faker::UniqueGenerator.clear calls ObjectSpace.each_object, which takes about 70ms per invocation, even if there are no objects. If you are calling Faker::UniqueGenerator.clear after every test, the 70ms penalty adds up pretty quickly.

Perhaps we could store an object_created boolean, and only call ObjectSpace.each_object if that boolean indicates an object has been created.

For reference, here's the current contents of unique_generator.rb.

module Faker
  class UniqueGenerator
    def initialize(generator, max_retries)
      @generator = generator
      @max_retries = max_retries
      @previous_results = Hash.new { |hash, key| hash[key] = Set.new }
    end

    def method_missing(name, *arguments)
      @max_retries.times do
        result = @generator.public_send(name, *arguments)

        next if @previous_results[[name, arguments]].include?(result)

        @previous_results[[name, arguments]] << result
        return result
      end

      raise RetryLimitExceeded
    end

    RetryLimitExceeded = Class.new(StandardError)

    def clear
      @previous_results.clear
    end

    def self.clear
      ObjectSpace.each_object(self, &:clear)
    end
  end
end
@knittingdev
Copy link

Hiya @kevinburkeomg 👋

This problem really intrigued me. I liked trying to dive into performance problems, and this seemed like a great way to really understand how Faker works.

I've tried a few different methods to try and test possible performance enhancements (including one that I think was similar to what you recommended):

[NO CLEAR EFFECT] Test 1 - Short Circuiting Clear if @previous_results is empty

    def clear
      return @previous_results if @previous_results.empty?
      @previous_results.clear
    end

    def self.clear
      ObjectSpace.each_object(self, &:clear)
    end

[NO CLEAR EFFECT] Test 2 - Skip Calling Clear if @previous_results.empty?
This was obviously a similar test to Test 1, but worth a shot to see if there was any change in performance, and also to be more clear about how many objects were actually cleared

    def clear
      return @previous_results if @previous_results.empty?
      @previous_results.clear
    end

    def self.clear
      cleared_obj = 0
      ObjectSpace.each_object(self) do |uniq_gen|
        next if uniq_gen.instance_variable_get(:@previous_results).empty?
        uniq_gen.clear
        cleared_obj += 1
      end
      cleared_obj
    end

[NO CLEAR EFFECT & TEST FAILURE] Test 3 - Change Hash Init and Clearing
This test was the brainchild of my benchmarking when I looked at various ways to clear the previous_results hash.

Benchmark.bm do |x|
  x.report  {  nums = 1000.times.map  {  |n| [n.to_s, n]  }.to_h; nums.clear } 
  x.report  {  nums = 1000.times.map  {  |n| [n.to_s, n]  }.to_h; nums = {} }  
  x.report  {  nums = 1000.times.map  {  |n| [n.to_s, n]  }.to_h; nums = Hash.new }  
  x.report  {  nums = 1000.times.map  {  |n| [n.to_s, n]  }.to_h; nums = Hash.new(Set.new) }  
  x.report  {  nums = 1000.times.map  {  |n| [n.to_s, n]  }.to_h; nums = Hash.new { |hash, key| hash[key] = Set.new } }  
end

Come to find out, changing this segment of code (clear and init of previous_results) appears to cause some test failures. I abandoned this after realizing it wasn't going to provide much benefit.

Sadly I can't seem to get around the current performance need.

Is there a reason you need to run Faker::UniqueGenerator.clear after each test? Can you do it after a specific batch of tests (much like before(:all) in a group when testing in rspec)? This seems like it might save you some time on your test suite as there should be enough Faker data to accommodate a few tests before needing to clear.

Hopefully this was helpful. I'd love to hear any other ideas you have on how this might be accomplished.

Thanks~
Mary

@kevinburkeomg
Copy link
Author

The intent was to clean up after any test that had a unique block; I ended up fixing the problem as you suggested by clearing in a before hook in some cases, and by removing Faker in a few others.

It's not obvious that it's going to be very slow to clear a few objects from a few hashes.

@kevinburkeomg
Copy link
Author

Perhaps Faker could store a list of pointers to every object that it's seen that had unique called on it, and then when you call clear it could traverse that list, instead of calling ObjectSpace.

@knittingdev
Copy link

I think that's probably a much better course of action. I had done a lot of experimenting on trying to filter the ObjectSpace to just handle the Faker Objects. Using a pattern where it can best identify all the objects it's created might be the best. I'll try to spend an hour on this tonight to see if I can see any measurable difference. Thanks for the idea!

@MarcPer
Copy link
Contributor

MarcPer commented May 24, 2018

Hi, @knittingdev and @kevinburkeomg . Sorry to barge in, but I was interested in speeding up Faker::UniqueGenerator.clear as well, and I ended up implementing on #1246 what was suggested here: keeping a set with the generators that have used unique and then only calling clear on those.

Below are some benchmarks (in all cases, I've labeled the current method using ObjectSpace as old_clear and the new method with the Set as just clear):

  1. First case is calling Faker::Names.unique.name once every iteration and then using Faker::UniqueGenerator.clear (or Faker::UniqueGenerator.old_clear):
require "benchmark"
require "faker"

def create_clear(num_times, num_unique = 1)
  num_times.times.each do
    num_unique.times { Faker::Name.unique.name }
    Faker::UniqueGenerator.clear
  end
end

def create_old_clear(num_times, num_unique = 1)
  num_times.times.each do
    num_unique.times { Faker::Name.unique.name }
    Faker::UniqueGenerator.old_clear
  end
end

Benchmark.bm do |x|
  x.report { create_clear(10000, 1) }
  x.report { create_old_clear(10000, 1) }
end
 
# user     system      total        real
# 1.306549   0.021276   1.327825 (  1.328652)
# 31.035302   0.555962  31.591264 ( 31.602384)
  1. Second case is calling clear without any unique method call on the generator:
Benchmark.bm do |x|
  x.report { create_clear(10000, 0) }
  x.report { create_old_clear(10000, 0) }
end

# user     system      total        real
# 0.007437   0.000000   0.007437 (  0.007428)
# 27.514668   0.024085  27.538753 ( 27.545627)
  1. In the third case, I created multiple classes and called unique on each one, therefore adding multiple objects to the UniqueGenerator set:
require "benchmark"
require "faker"

def create_clear(num_times, num_unique = 1)
  num_times.times.each do
    num_unique.times { Class.new(Faker::Base).unique.name }
    Faker::UniqueGenerator.clear
  end
end

def create_old_clear(num_times, num_unique = 1)
  num_times.times.each do
    num_unique.times { Class.new(Faker::Base).unique.name }
    Faker::UniqueGenerator.old_clear
  end
end

Benchmark.bm do |x|
  x.report { create_clear(1000, 20) }
  x.report { create_old_clear(1000, 20) }
end

# user     system      total        real
# 1.405916   0.000000   1.405916 (  1.407707)
# 9.118875   0.004000   9.122875 (  9.124552)

Please let me know if any more tests could be useful.

@vbrazo
Copy link
Member

vbrazo commented May 26, 2018

Hmm interested and following the issue.

@MarcPer
Copy link
Contributor

MarcPer commented Jun 16, 2018

@knittingdev, @kevinburkeomg, @vbrazo, should the PR be merged? The change is simple enough and the performance improvement looks consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants