Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(refactor): refactor toESDocument() to be more efficient #68

Merged
merged 3 commits into from
Jul 7, 2017

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Jul 7, 2017

refactor code for performance & readability

this PR removes the slow JSON serialize/deserialize code. that code converts each object to a string and then back in to an object in order to take advantage of a feature in JSON.stringify which removes empty properties.

it turns out this is potentially slow, so I refactored the code to do the same thing using a different method.

benchmarks:

#### calling constructor, all setter methods and toESDocument()

# before (after the previous PR was merged)
imported 10000 docs in 1124 ms
average speed: 0.1124 ms

# after
imported 10000 docs in 290 ms
average speed: 0.0290 ms

works out to be ~75% faster, multiplied out to 500 million records, this change could save 10+ hrs

Copy link
Member

@orangejulius orangejulius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holy 💩! This is huge

Document.js Outdated
};

// remove empty properties
if( _.isEmpty( doc.parent ) ){ delete doc.parent; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be faster to conditionally add these properties than add and then conditionally remove them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test it but I think doing something like:

if( !doc.hasOwnProperty( 'parent' ) ){ doc.parent = []; }

would be less efficient as it would need to be executed once per call to addParent()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean don't set parent when initializing doc, then doing

if (!_.isEmpty(this.parent) { 
  doc.parent = this.parent 
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_.isEmpty is somewhat inefficient in this case because it checks for 5 other types of empty conditions before it finally gets to plain old objects. You might see better performance by just checking Object.keys(this.parent).length since you know it's always going to be an plain old object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trescube arguing AGAINST using lodash? you feelin ok bud?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know! It pained me to type that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agh I see what you mean, to be honest I did it like this so that the key ordering of the output matched the fixtures, I suspect that if we change the key ordering then there will be a big cleanup effort with all the unit tests fixtures across all the importers.

I don't think the delete call is very expensive so I'm happy to pay that cost in order to avoid having to fix all the fixture files :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do it like this? it's not so pretty :(

Document.prototype.toESDocument = function() {

  var doc = {
    name: this.name,
    phrase: this.phrase
  };

  if( !_.isEmpty( doc.parent ) ){ doc.parent = this.parent; }
  if( !_.isEmpty( doc.address_parts ) ){ doc.address_parts = this.address_parts; }
  doc.center_point = this.center_point;
  if( !_.isEmpty( doc.category ) ){ doc.category = this.category; }
  doc.category = this.category;
  doc.source = this.source;
  doc.layer = this.layer;
  doc.source_id = this.source_id;

  return {
    _index: config.schema.indexName,
    _type: this.getType(),
    _id: this.getId(),
    data: doc
  };
};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do the performance tests for that most recent version you linked as well? i've read (can't find the link now but will look for it) that modifying an object many times like this, rather than setting as many properties as possible when the object is first created, is bad for performance. it would be interesting to see what the actual difference (if any) is

@missinglink missinglink merged commit 8048884 into master Jul 7, 2017
@tadjik1 tadjik1 mentioned this pull request Aug 7, 2017
@orangejulius orangejulius deleted the refactor branch May 16, 2018 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants