Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[THOG-315] Replace bytes.buffer with strings.builder. #533

Merged
merged 6 commits into from
May 10, 2022

Conversation

ahrav
Copy link
Collaborator

@ahrav ahrav commented May 9, 2022

What?

Replace the use of bytes.buffer with strings.Builder.

Why?

strings.Builder is a decent bit more efficient. It uses less memory and performs better overall.

How?

Replace usage directly.

Testing?

After running some profiling tests on some of the larger repos/orgs I was able to determine during the gitScan as we were adding git diffs to be scanned the memory allocation for bytest.buffer was ballooning. After changing it to use strings.builder that huge memory allocation was gone.

Screenshots (optional)

Benchmark of the two.
Screen Shot 2022-05-09 at 7 39 15 AM

Anything Else?

@ahrav ahrav requested a review from a team May 9, 2022 14:59
@@ -61,7 +61,7 @@ func (s Scanner) FromData(ctx context.Context, verify bool, data []byte) (result
continue
}
writer.Close()
req, err := http.NewRequestWithContext(ctx, "POST", "https://api.deepai.org/api/text-tagging", bytes.NewReader(body.Bytes()))
req, err := http.NewRequestWithContext(ctx, "POST", "https://api.deepai.org/api/text-tagging", bytes.NewReader([]byte(body.String())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the fact that we're now converting a string back into bytes? I suspect the benchmarks did not capture the full picture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before it was just copying bytes around I think

Copy link
Collaborator Author

@ahrav ahrav May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea actually the performance benefit here might not be worth the change.

I was more curious about this portion:

for _, frag := range file.TextFragments {
var sb strings.Builder
newLineNumber := frag.NewPosition
for _, line := range frag.Lines {
if line.Op == gitdiff.OpAdd {
sb.WriteString(line.Line)
}
}
log.WithField("fragment", sb.String()).Trace("detecting fragment")
metadata := s.sourceMetadataFunc(fileName, email, hash, when, urlMetadata, newLineNumber)
chunksChan <- &sources.Chunk{
SourceName: s.sourceName,
SourceID: s.sourceID,
SourceType: s.sourceType,
SourceMetadata: metadata,
Data: []byte(sb.String()),
Verify: s.verify,
}

Because when i was profiling with large repos the WriteString was eating a lot of memory in comparison to strings.Builder.

Copy link
Collaborator Author

@ahrav ahrav May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here even with the additional conversion to []byte(sb.String()) benchmarks show it being significantly faster.
This converts the strings.Builder back into []bytes as part of the function.
Screen Shot 2022-05-09 at 10 05 39 AM

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screen Shot 2022-05-09 at 10 16 59 AM

@mcastorina
Copy link
Collaborator

It looks like some profiling files got committed by accident: pkg/engine/profile.out and pkg/engine/memprofile.out.

@ahrav
Copy link
Collaborator Author

ahrav commented May 9, 2022

It looks like some profiling files got committed by accident: pkg/engine/profile.out and pkg/engine/memprofile.out.

THanks! Not sure how those slipped in.

@dustin-decker
Copy link
Contributor

dustin-decker commented May 9, 2022

I could imagine the the git source change being a small improvement because it's handling strings multiple times, but probably not the detector changes, because those are pure byte operations.
Hard to be certain without seeing the benchmarking code.
The detector code path is only hit when the keyword match passes anyway, so not worth much effort to shave a few nano seconds off - we have much bigger things to address first. Optimizing the git scan path is far more payoff than that.

Also, engine.test binary file slipped in. We should add the afformentioned files and *.test to gitignore.

@bill-rich
Copy link
Collaborator

Thats interesting that strings.Builder is so much better on memory. Do you know what its doing differently from bytes.Buffer?

@ahrav
Copy link
Collaborator Author

ahrav commented May 9, 2022

Thats interesting that strings.Builder is so much better on memory. Do you know what its doing differently from bytes.Buffer?

I believe one of the major changes is a copy check mechanism which saves a good deal of space.
https://github.com/golang/go/blob/3058d38632aea679c96cd41156b2751c97578a2d/src/strings/builder.go#L20-L26

It also prevents it's internal byte slice from escaping such that it can only grow and reset making it immutable.

@dustin-decker
Copy link
Contributor

dustin-decker commented May 9, 2022

Makes me wonder if we'd benefit from reusing the buffer and calling Reset().
Right now there are allocations for each fragment.

https://pkg.go.dev/bytes#Buffer.Reset

@ahrav
Copy link
Collaborator Author

ahrav commented May 10, 2022

Makes me wonder if we'd benefit from reusing the buffer and calling Reset(). Right now there are allocations for each fragment.

https://pkg.go.dev/bytes#Buffer.Reset

Ohhhh indeed. I think that actually might even be better. Let me submit this and test that as well.

@ahrav ahrav merged commit e12432c into main May 10, 2022
@ahrav ahrav deleted the THOG-315-replace-bytes-buffer branch May 10, 2022 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants