Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use less memory while parsing initial large payloads #32

Merged
merged 1 commit into from
Jun 15, 2022

Conversation

sw-square
Copy link
Contributor

Brief Summary

Similar to #20, the Ruby Server SDK calls Eventsource with large payloads of JSON data, large amount of initial memory usage. The culprit seems to be the use of regex to match the data.

Regex

The cause of the memory issues seems to be here

else
  case line
    when /^(\w+): ?(.*)$/
      item = process_field($1, $2)
      gen.yield item if !item.nil?
  end
end

It seems that about 20x line's length is allocated when parsing the regex, which when given 10MB+ of data can easily top 200MB.

The solution is to instead find the index of the colon itself, and then modify the line in-place, avoiding the need to match the regex on the entire input:

elsif (colon = line.index(':'))
  name = line.slice!(0...colon)
  next unless name.match?(/^\w+$/)

  # delete the colon followed by an optional space
  line.delete_prefix!(':')
  line.delete_prefix!(' ')

  item = process_field(name, line)
  gen.yield item if !item.nil?

Duplicating the line

In the process of finding the regex memory issue, another, more minor one came up here:

@data << "\n" if @have_data
@data << value
@have_data = true

When this code is executed and @have_data is false, then the entire value string is copied into @data. A simple fix for this is just:

if @have_data
	@data << "\n" << value
else
	@data = value
end
@have_data = true

This way we reuse the original string which won't end up duplicating it

@keelerm84
Copy link
Member

Thank you for this contribution!

I will try to get to this PR today, but it might not be until Monday. I didn't want to leave you hanging over the weekend without a response though.

Comment on lines 41 to 46
name = line.slice!(0...colon)
next unless name.match?(/^\w+$/)

# delete the colon followed by an optional space
line.delete_prefix!(':')
line.delete_prefix!(' ')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely see how this change will greatly reduce the memory usage, particularly when the payload isn't a comment.

However, if a comment is sent, I think the next bit of logic can have its memory consumption further reduced.

Mutating the string in place seems to take more memory than calling the non-destructive delete_prefix. Maybe something like this?

Suggested change
name = line.slice!(0...colon)
next unless name.match?(/^\w+$/)
# delete the colon followed by an optional space
line.delete_prefix!(':')
line.delete_prefix!(' ')
name = line.slice(0...colon)
next unless name.match?(/^\w+$/)
# delete the colon followed by an optional space
line = line.slice(colon...).delete_prefix(':').delete_prefix(" ")

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating the string in place seems to take more memory than calling the non-destructive delete_prefix

I accept that that's a finding you've seen, but it's very counter-intuitive to me in terms of how allocating a brand new string could possibly take less memory than mutating a string in place, so I'm puzzled.

Copy link
Contributor

@eli-darkly eli-darkly Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if we are OK with allocating a brand-new string, then I think it's safe to say that the most efficient approach would be to find out how many characters we want to exclude and then just call substring with an offset that is past those characters, rather than making these three separate calls to compute intermediate strings. Similarly, if we want to mutate the existing string, I would favor getting the name substring first, and then computing the offset past the colon/space and calling slice! once, instead of calling slice! followed by two delete_prefix! calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's really interesting that mutating in-place takes more memory, i would never have guessed that. I'm ok with this change, the regex was the big culprit, the strings were more minor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still prefer to be doing a single slice! here, rather than slice! followed by delete_prefix! twice. If line contains a very large piece of data (which is the use case we're concerned about in the first place) then those two delete_prefix! calls are each causing the whole buffer to be shifted left one character at a time. It should be simple enough to instead just add 1 to colon (there is no real decision being made by that first delete_prefix, we already know for sure that the character at that position is ':'), then check if the next character (if any) is a space and increment it again if so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I understand that the regex usage was a bigger problem and this is better than that, but that's no reason to still be incurring unnecessary overhead in the improved code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, we could remove one of the delete_prefix!s by simply using a ternary operator. Something along the lines of delete_prefix! space ? ': ' : ':'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm saying is that you can remove both of them. There is no need to mutate the string more than once, and I'd prefer that we not do so, given that the point of these changes is to reduce unnecessary overhead when the string is very long.

name = line.slice(0...colon)
colon += 1
colon += 1 if colon < line.length && line[colon] == ' '
line.slice!(colon...)

Copy link
Contributor

@eli-darkly eli-darkly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point we can merge the PR and take care of any final polishing ourselves.

@eli-darkly
Copy link
Contributor

The 6.3.3 release of the Ruby SDK includes this fix. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants