-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pkg/stanza] - Use trie for previous poll readers #29106
[pkg/stanza] - Use trie for previous poll readers #29106
Conversation
fda15dc
to
5231d39
Compare
01a3bdd
to
821a629
Compare
@djaglowski I was thinking, can we constraint our trie to accept only comparable types? This way it would lead to read-friendly code. |
What is the actual benefit of this? Are we looking for performance improvements in looking for existing readers? Intuitively, the fastest way of doing this would be to drop the bytes completely and just store a hash and the length of the file prefix it was calculated from. Then we'd also skip having to base64 encode all of these fingerprints to be able to store them in JSON, which I suspect costs more CPU time than the actual matching. |
@swiatekm-sumo how would we determine the length of file prefix when we re-discover a file in future poll cycles? We do have hashes stored in history, but how would we determine a "match"? I'm trying to understand following scenario: we then store the hash and prefix in our pretty little array and move to next poll cycle. file1 becomes: "hello1hello2" in this poll cycle, how would we establish that we have already seen the file? |
@djaglowski I will add benchmark comparisons in PR descriptions. |
That's the basic idea, yeah. You have a set of old readers from the previous cycle, and those readers have fingerprints with lengths {x, y, z}, ordered by size. So you calculate fingerprints for your new readers up to x, y and z lengths respectively, and compare at each level. This may seem wasteful, but I think it'd be more performant in practice:
Admittedly, I haven't tested this, but I wanted to point out it's an option if we're going down the path of adding a trie just to be able to compare fingeprints more efficiently. Storing the whole fingerprint is an awkward solution to begin with in my opinion, but its primary value is that it's very simple. If we're willing to make it more complex, then we should consider alternatives. |
you know what, now that you've pointed this idea I might give it a try and compare results. Thanks for pointing this out. |
I hadn't started anything, and won't this week, so feel free to give it a shot! I'm cool with this PR staying, though we should probably move this discussion to an issue. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Description: This is my attempt to use Trie to store readers from previous poll cycles.