[pkg/stanza] - Use trie for previous poll readers #29106

VihasMakwana · 2023-11-10T19:13:04Z

Description: This is my attempt to use a trie to store readers from previous poll cycles.

VihasMakwana · 2023-11-14T07:01:47Z

@djaglowski I was thinking, can we constrain our trie to accept only comparable types? That would make the code more readable.
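
For illustration only, here is a minimal sketch of what a trie constrained to comparable element types could look like with Go generics. It is not the code from this PR; the names `Trie`, `NewTrie`, `Put`, and `LongestPrefix` are hypothetical, and the string value stands in for whatever reader type would actually be stored.

```go
package main

import "fmt"

// node is one level of the trie; children are keyed by a single element.
type node[K comparable, V any] struct {
	children map[K]*node[K, V]
	value    V
	hasValue bool
}

// Trie maps sequences of comparable elements to values (hypothetical type).
type Trie[K comparable, V any] struct {
	root *node[K, V]
}

// NewTrie returns an empty trie.
func NewTrie[K comparable, V any]() *Trie[K, V] {
	return &Trie[K, V]{root: &node[K, V]{children: map[K]*node[K, V]{}}}
}

// Put stores value under key, creating intermediate nodes as needed.
func (t *Trie[K, V]) Put(key []K, value V) {
	cur := t.root
	for _, k := range key {
		next, ok := cur.children[k]
		if !ok {
			next = &node[K, V]{children: map[K]*node[K, V]{}}
			cur.children[k] = next
		}
		cur = next
	}
	cur.value, cur.hasValue = value, true
}

// LongestPrefix returns the value stored under the longest stored key
// that is a prefix of key, if any.
func (t *Trie[K, V]) LongestPrefix(key []K) (V, bool) {
	var best V
	found := false
	cur := t.root
	for _, k := range key {
		next, ok := cur.children[k]
		if !ok {
			break
		}
		cur = next
		if cur.hasValue {
			best, found = cur.value, true
		}
	}
	return best, found
}

func main() {
	t := NewTrie[byte, string]()
	t.Put([]byte("hello1"), "reader-for-file1")

	// The file has grown, but its old fingerprint is still a prefix.
	v, ok := t.LongestPrefix([]byte("hello1hello2"))
	fmt.Println(v, ok)
}
```

Because `K` is `comparable`, elements can be used directly as map keys, which avoids manual byte comparisons in the traversal.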

swiatekm · 2023-11-14T10:45:05Z

What is the actual benefit of this? Are we looking for performance improvements in looking for existing readers? Intuitively, the fastest way of doing this would be to drop the bytes completely and just store a hash and the length of the file prefix it was calculated from. Then we'd also skip having to base64 encode all of these fingerprints to be able to store them in JSON, which I suspect costs more CPU time than the actual matching.

VihasMakwana · 2023-11-14T11:33:15Z

@swiatekm-sumo how would we determine the length of the file prefix when we rediscover a file in a future poll cycle? We do have hashes stored in history, but how would we determine a "match"?

I'm trying to understand the following scenario. Let's say we have:
file1: "hello1"
hash1: "abcdxyz", length of prefix: 6

We then store the hash and prefix length in our pretty little array and move on to the next poll cycle.

file1 becomes: "hello1hello2"
The updated hash would be completely different.

In this poll cycle, how would we establish that we have already seen the file?
I can think of only one way, i.e. loop through the array, calculate the hash up to the previous length, and see if there's any match, right? Am I understanding it correctly?
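
The scenario above can be reproduced with a small sketch. It assumes FNV-1a from the Go standard library as the hash, which is an illustrative choice and not necessarily what fileconsumer would use; `hashPrefix` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashPrefix returns the FNV-1a hash of the first n bytes of data,
// or of all of data if it is shorter than n.
func hashPrefix(data []byte, n int) uint64 {
	if n > len(data) {
		n = len(data)
	}
	h := fnv.New64a()
	h.Write(data[:n])
	return h.Sum64()
}

func main() {
	old := []byte("hello1")
	grown := []byte("hello1hello2")

	// Stored during the previous poll cycle: the hash and the prefix length.
	storedHash, storedLen := hashPrefix(old, len(old)), len(old)

	// Hashing the whole grown file compares different content.
	fmt.Println("full-file hash matches:", hashPrefix(grown, len(grown)) == storedHash)

	// Hashing only up to the stored length compares the same bytes.
	fmt.Println("prefix hash matches:", hashPrefix(grown, storedLen) == storedHash)
}
```

The second comparison succeeds because both hashes cover the same six bytes, which is why the stored length has to travel with the stored hash.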

VihasMakwana · 2023-11-14T11:34:16Z

@djaglowski I will add benchmark comparisons in PR descriptions.

swiatekm · 2023-11-14T12:46:29Z

> I can think of only one way i.e. loop through the array, calculate hash till the previous length and see if there's any match, right? Am I understanding it correctly?

That's the basic idea, yeah. You have a set of old readers from the previous cycle, and those readers have fingerprints with lengths {x, y, z}, ordered by size. So you calculate fingerprints for your new readers up to x, y and z lengths respectively, and compare at each level. This may seem wasteful, but I think it'd be more performant in practice:

- Hashes are calculated iteratively byte-by-byte anyway, so you don't incur any cost for stopping at a particular length.
- Hashes are actually just int64s, so comparisons are very fast.
- In the vast majority of cases, the set of lengths will be very small. It's very rare to have a lot of files smaller than the fingerprint size.

Admittedly, I haven't tested this, but I wanted to point out it's an option if we're going down the path of adding a trie just to be able to compare fingerprints more efficiently. Storing the whole fingerprint is an awkward solution to begin with in my opinion, but its primary value is that it's very simple. If we're willing to make it more complex, then we should consider alternatives.
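
As a rough, untested-in-production sketch of the approach described above: hash the new file once, pausing at each distinct stored length to compare against the stored hashes of that length. The `fingerprint` type, `newFingerprint`, and `findMatch` are hypothetical names, and FNV-1a is an assumed hash chosen only because it is in the standard library.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint is a hypothetical compact form: a hash plus the
// length of the file prefix it was calculated from.
type fingerprint struct {
	hash   uint64
	length int
}

// newFingerprint hashes the given prefix bytes.
func newFingerprint(prefix []byte) fingerprint {
	h := fnv.New64a()
	h.Write(prefix)
	return fingerprint{hash: h.Sum64(), length: len(prefix)}
}

// findMatch feeds content into a single running hash and checks the
// intermediate sum at each known length, shortest first. It returns
// the first fingerprint that matches.
func findMatch(content []byte, known []fingerprint) (fingerprint, bool) {
	byLength := make(map[int][]fingerprint)
	for _, fp := range known {
		if fp.length == 0 {
			continue // an empty prefix would match every file
		}
		byLength[fp.length] = append(byLength[fp.length], fp)
	}
	lengths := make([]int, 0, len(byLength))
	for l := range byLength {
		lengths = append(lengths, l)
	}
	sort.Ints(lengths)

	h := fnv.New64a()
	pos := 0
	for _, l := range lengths {
		if l > len(content) {
			break // content is shorter than every remaining fingerprint
		}
		h.Write(content[pos:l]) // only hash the bytes not yet consumed
		pos = l
		sum := h.Sum64() // reading the sum does not reset the hash state
		for _, fp := range byLength[l] {
			if fp.hash == sum {
				return fp, true
			}
		}
	}
	return fingerprint{}, false
}

func main() {
	known := []fingerprint{
		newFingerprint([]byte("hello1")),
		newFingerprint([]byte("a much longer first line")),
	}
	fp, ok := findMatch([]byte("hello1hello2"), known)
	fmt.Println(fp.length, ok)
}
```

Each byte of the new file is hashed at most once regardless of how many lengths are stored, so the extra cost is one integer comparison per stored fingerprint of a matching length.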

VihasMakwana · 2023-11-14T13:23:55Z

You know what, now that you've pointed out this idea, I might give it a try and compare the results. Thanks for suggesting it.
Alternatively, I can close this PR if you're already planning to work on this. Just let me know!

swiatekm · 2023-11-14T13:30:34Z

I hadn't started anything, and won't this week, so feel free to give it a shot! I'm cool with this PR staying, though we should probably move this discussion to an issue.

github-actions · 2023-11-29T05:20:33Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2023-12-14T05:19:36Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

init: trie

a5331d2

github-actions bot added the pkg/stanza label Nov 10, 2023

github-actions bot requested a review from djaglowski November 10, 2023 19:13

Merge main

5231d39

VihasMakwana force-pushed the add-trie-whilereading branch from fda15dc to 5231d39 Compare November 11, 2023 21:37

fix: unit tests

34afe7d

VihasMakwana changed the title ~~[WIP][pkg/stanza] - Use trie for previous poll readers~~ [pkg/stanza] - Use trie for previous poll readers Nov 14, 2023

Merge branch 'main' into add-trie-whilereading

821a629

VihasMakwana force-pushed the add-trie-whilereading branch from 01a3bdd to 821a629 Compare November 14, 2023 06:50

VihasMakwana marked this pull request as ready for review November 14, 2023 06:51

VihasMakwana requested a review from a team November 14, 2023 06:51

github-actions bot assigned bogdandrutu Nov 14, 2023

VihasMakwana mentioned this pull request Nov 14, 2023

[pkg/stanza] - Performance improvements while comparing fingerprints in fileconsumer #29273

Closed

github-actions bot added the Stale label Nov 29, 2023

github-actions bot closed this Dec 14, 2023
