This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

feat: modify pool tests #153

Merged
willscott merged 44 commits into feat/adaptive from aa/test-simulator on Oct 10, 2023

Conversation

AmeanAsad
Contributor

No description provided.

node_heap.go Outdated
    for i := 0; i < n && i < len(nh.Nodes); i++ {
        node := nh.Nodes[i]
        m = append(m, node)
    for i := 0; i < n; i++ {
Contributor Author

@willscott I noticed a slight issue with the use of heaps here.

The heap keeps itself ordered on insertion and removal of nodes. However, the existing nodes in the heap have dynamically changing values, and the heap does not re-sort itself every time one of those values changes.

The strategy I've implemented here is to select the top N nodes by calling .Pop() n times to retrieve them, then adding them back using .Push(). This is a similar flow to the heap's .Fix() method, which repositions an item when its value changes. Not the most efficient, but we are dealing with fairly small heap sizes here.
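
Roughly, the selection I have in mind looks like this; a minimal sketch where Node, NodeHeap, and the priority field are simplified stand-ins for the real types, and locking is left out:

    package main

    import (
        "container/heap"
        "fmt"
    )

    type Node struct {
        priority float64
    }

    // NodeHeap is a max-heap of nodes by priority (highest priority on top).
    type NodeHeap struct {
        Nodes []*Node
    }

    func (nh *NodeHeap) Len() int           { return len(nh.Nodes) }
    func (nh *NodeHeap) Less(i, j int) bool { return nh.Nodes[i].priority > nh.Nodes[j].priority }
    func (nh *NodeHeap) Swap(i, j int)      { nh.Nodes[i], nh.Nodes[j] = nh.Nodes[j], nh.Nodes[i] }
    func (nh *NodeHeap) Push(x any)         { nh.Nodes = append(nh.Nodes, x.(*Node)) }
    func (nh *NodeHeap) Pop() any {
        old := nh.Nodes
        last := old[len(old)-1]
        nh.Nodes = old[:len(old)-1]
        return last
    }

    // TopN pops the n best nodes and then pushes them back, so the heap still
    // holds the full set. It assumes the heap invariant currently holds.
    func (nh *NodeHeap) TopN(n int) []*Node {
        m := make([]*Node, 0, n)
        for i := 0; i < n && nh.Len() > 0; i++ {
            m = append(m, heap.Pop(nh).(*Node))
        }
        for _, node := range m {
            heap.Push(nh, node)
        }
        return m
    }

    func main() {
        nh := &NodeHeap{Nodes: []*Node{{priority: 0.2}, {priority: 0.9}, {priority: 0.5}}}
        heap.Init(nh)
        top := nh.TopN(2)
        fmt.Println(top[0].priority, top[1].priority) // 0.9 0.5
    }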

Also wondering if we still think that a priority queue is the best data structure to store the AllNodes tier? My understanding of the code/implementation is that ActiveNodes is always a subset of AllNodes, and they are not mutually exclusive sets like the previous "Main" and "Unknown" tier design.

Contributor Author

@willscott One issue I noticed with my changes is that, since the nodes are temporarily removed from the heap, any stats being written to them concurrently will fail.

Contributor

What do you mean by 'fail'? A temporarily dropped metric isn't the end of the world, potentially.

If not a priority queue for all nodes, how would you imagine keeping track of the whole set?

Contributor Author

What do you mean by 'fail'? A temporarily dropped metric isn't the end of the world, potentially.

What I mean by "fail" is that any concurrent operation that accesses the heap while the topN selection is occurring might get an incorrect list. I do agree with you though, it doesn't seem that damaging, especially since the "AllNodes" set is just used for testing nodes.

If not a priority queue for all nodes, how would you imagine keeping track of the whole set?

Wondering if a regular array structure would work better here. It's simpler and gives us the functionality we need (top n nodes, random nodes, node removal). We're not really using the properties of a priority queue because the values of each node in the queue are changing constantly and we have to basically re-sort the nodes every time we need to make a selection.
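
Something like the following is what I'm picturing; a rough sketch where NodeList, Node, and the priority field are placeholders rather than the actual pool types:

    package main

    import (
        "fmt"
        "math/rand"
        "sort"
    )

    type Node struct {
        URL      string
        priority float64
    }

    type NodeList struct {
        nodes []*Node
    }

    func (nl *NodeList) Add(n *Node) {
        nl.nodes = append(nl.nodes, n)
    }

    func (nl *NodeList) Remove(url string) {
        for i, n := range nl.nodes {
            if n.URL == url {
                nl.nodes = append(nl.nodes[:i], nl.nodes[i+1:]...)
                return
            }
        }
    }

    // TopN sorts on demand; with small pools the O(n log n) cost is negligible,
    // and we never rely on the list staying sorted between calls.
    func (nl *NodeList) TopN(n int) []*Node {
        sort.Slice(nl.nodes, func(i, j int) bool {
            return nl.nodes[i].priority > nl.nodes[j].priority
        })
        if n > len(nl.nodes) {
            n = len(nl.nodes)
        }
        return nl.nodes[:n]
    }

    // RandomN returns up to n nodes in random order, e.g. for probing.
    func (nl *NodeList) RandomN(n int) []*Node {
        idx := rand.Perm(len(nl.nodes))
        if n > len(idx) {
            n = len(idx)
        }
        out := make([]*Node, 0, n)
        for _, i := range idx[:n] {
            out = append(out, nl.nodes[i])
        }
        return out
    }

    func main() {
        nl := &NodeList{}
        nl.Add(&Node{URL: "a", priority: 0.3})
        nl.Add(&Node{URL: "b", priority: 0.8})
        nl.Remove("a")
        fmt.Println(nl.TopN(1)[0].URL, len(nl.RandomN(1))) // b 1
    }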

Contributor

  • We can add add/remove methods that are a simple enough API.
  • I think keeping the nodes in a roughly sorted order for consideration isn't a bad idea, as we'll want to access the better ones.
  • If we run into performance issues with the push/pop semantics on updates, we can get lazy about it and use heap.Fix when we need to get the top n.

Contributor

One note on Go semantics here: you have the nh.lk.RLock() above - that indicates you're taking a "read-only lock" over the data structure - but the pop and push you're proposing in your change will do a read-write operation.
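
For illustration, a small sketch of the distinction; NodeHeap, Best, and PopBest here are simplified stand-ins, not the actual node_heap.go code:

    package main

    import (
        "fmt"
        "sync"
    )

    type Node struct {
        priority float64
    }

    type NodeHeap struct {
        lk    sync.RWMutex
        Nodes []*Node
    }

    // Best only reads the structure, so the shared read lock is sufficient.
    func (nh *NodeHeap) Best() *Node {
        nh.lk.RLock()
        defer nh.lk.RUnlock()
        if len(nh.Nodes) == 0 {
            return nil
        }
        return nh.Nodes[0]
    }

    // PopBest mutates Nodes, much as heap.Pop / heap.Push would, so it needs
    // the exclusive write lock; holding only RLock here would be a data race.
    func (nh *NodeHeap) PopBest() *Node {
        nh.lk.Lock()
        defer nh.lk.Unlock()
        if len(nh.Nodes) == 0 {
            return nil
        }
        best := nh.Nodes[0]
        nh.Nodes = nh.Nodes[1:]
        return best
    }

    func main() {
        nh := &NodeHeap{Nodes: []*Node{{priority: 0.9}, {priority: 0.4}}}
        fmt.Println(nh.Best().priority)    // 0.9
        fmt.Println(nh.PopBest().priority) // 0.9
        fmt.Println(len(nh.Nodes))         // 1
    }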

Contributor Author

Thanks for the tip here Will, always good to learn :)

@AmeanAsad
Contributor Author

AmeanAsad commented Sep 1, 2023

Hey @willscott, I'm noticing that the ConsiderUpdateWeightedNode function in the hashring lib is not working entirely as intended. It's returning negative deltas for new nodes that are just getting added to the pool.

@willscott
Contributor

Hey @willscott, I'm noticing that the ConsiderUpdateWeightedNode function in the hashring lib is not working entirely as intended. It's returning negative deltas for new nodes that are just getting added to the pool.

You believe this is a bug in ConsiderUpdateWeightedNode? Do you have a failing test against it that you can point me to, so that I can help debug it?

@AmeanAsad
Contributor Author

@willscott

You believe this is a bug in ConsiderUpdateWeightedNode? Do you have a failing test against it that you can point me to, so that I can help debug it?

I think it would be helpful to just get an idea of the expected behavior here, to understand what I'm testing against. The example I'm going off of: let's say I have a new node that is not currently in the hash ring and I run ConsiderUpdateWeightedNode("nodeUrl", 1), what do I expect to receive?

@willscott
Contributor

I would expect ConsiderUpdateWeightedNode for a new node to return a negative value - you're taking part of the hash ring that would currently send its requests to a node with known latency/performance characteristics, and proposing giving that space instead to a node with unknown characteristics. The expected outcome of that is worse.

One way that confidence gets better is if mirroring has already sent some test requests to the new node, so we have some sense that it could be good.

The other question is how to handle initial startup - we probably need to enforce some sort of minimum reasonable pool size, where we're okay with negative updates if we fall below it.
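
Roughly what I have in mind, as a sketch; minPoolSize and shouldAdmit are made-up names, not the actual pool config:

    package main

    import "fmt"

    // minPoolSize is a hypothetical floor below which candidates are accepted
    // even when the projected delta is negative, so the pool can bootstrap.
    const minPoolSize = 20

    func shouldAdmit(delta float64, poolSize int) bool {
        if poolSize < minPoolSize {
            return true
        }
        return delta > 0
    }

    func main() {
        fmt.Println(shouldAdmit(-0.3, 5))  // true: still below the minimum size
        fmt.Println(shouldAdmit(-0.3, 50)) // false: healthy pool, negative delta
        fmt.Println(shouldAdmit(0.2, 50))  // true: positive delta
    }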

@AmeanAsad
Contributor Author

AmeanAsad commented Sep 4, 2023

@willscott the reason the tests are passing is that PoolConsiderationCount is 30, so all the nodes in the test automatically get added to the pool. I had temporarily changed that value earlier for testing with smaller groups of nodes, to make debugging easier.

@AmeanAsad
Contributor Author

@willscott a few things I found:

  • The MaybeSubstituteOrAdd uses the neighbor's Rate() as part of the delta formula:
		neighbor = nr.Nodes[n]
		neighborVolume := neighbor.Rate()

		// how much worse is candidate?
		diff := candidate.Priority() - neighbor.Priority()
		delta += diff * neighborVolume * float64(v)

The Rate() for nodes with no stats is always zero, so the first test always fails: the delta is always zero when comparing a node that has stats against one that has none (see the sketch after this list).

  • The ConsiderUpdateWeightedNode call in the first test is returning negative values when comparing a node in the hashring that has no stats against a new node that has recorded a bunch of successes. Based on your description above, this seems like unexpected behavior.
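
To make the first point concrete, here is the same arithmetic with made-up numbers standing in for the real stats:

    package main

    import "fmt"

    func main() {
        // Hypothetical values: the neighbor already in the ring has no stats,
        // so its Rate() is zero, while the candidate has recorded successes.
        candidatePriority := 0.9
        neighborPriority := 0.0
        neighborVolume := 0.0 // neighbor.Rate() for a node with no stats
        v := 1                // weight under consideration

        diff := candidatePriority - neighborPriority
        delta := diff * neighborVolume * float64(v)
        fmt.Println(delta) // 0: the candidate's advantage is wiped out
    }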

node_heap.go Outdated
    }
    for _, n := range m {
        heap.Push(nh, n)
    for i := 0; i < n && i < len(nh.Nodes); i++ {
Contributor Author

I thought this returns an unsorted list of candidates?

Contributor

I think if we always use heap.Push when adding new elements, then nh.Nodes remains in sorted order?

Contributor Author

My train of thought is that if the nodes' stats change after they have been pushed, then nh.Nodes doesn't remain sorted, because heap.Push only sorts on insertion.

Also, if the stats of already ordered nodes change, then I think heap.Push would probably never keep a sorted list, because heap.Push only works properly if the heap is correctly ordered when the insertion occurs.

Not sure if I'm missing something here?

Contributor

Not sure how much churn we expect - hopefully not too much in the steady state.

The answer to node priorities changing and leaving the heap unsorted is to use heap.Init at the beginning of the call to TopN - or to use heap.Fix whenever a specific node's priority changes.

Since we only call TopN in the periodic pool refresh, i.e. once every 5 minutes, it's probably no problem to re-establish the heap ordering each time.

@AmeanAsad
Contributor Author

@aarshkshah1992 test flakiness addressed.

cc @willscott I was able to reproduce the flakiness locally and made changes to the topN selection. I believe just using heap.Init only guarantees the ordering of the first element.
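
The revised selection looks roughly like this, building on the NodeHeap sketch from the earlier comment (illustrative only, not the exact change):

    // TopN, revised to tolerate priorities drifting after insertion.
    func (nh *NodeHeap) TopN(n int) []*Node {
        // Re-establish the heap invariant over the whole slice first, since
        // priorities may have changed since the nodes were pushed.
        heap.Init(nh)

        // heap.Init only guarantees that Nodes[0] is the best element, so pop
        // n times to get the true top n rather than reading Nodes[:n] directly.
        m := make([]*Node, 0, n)
        for i := 0; i < n && nh.Len() > 0; i++ {
            m = append(m, heap.Pop(nh).(*Node))
        }
        for _, node := range m {
            heap.Push(nh, node) // put them back so the heap keeps the full set
        }
        return m
    }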

@AmeanAsad
Contributor Author

@willscott I added a test that targets cache affinity. The relevant commit is here: 1975f49

The test tries to ensure that if we have a set of well-performing nodes that we have previously made cid requests to, we still hit those nodes for those cids even if other similarly performing nodes join the pool. Currently that test is failing, and we are seeing new nodes that join the pool show up as first-candidate nodes for those cids instead.

@@ -15,8 +17,9 @@ import (
 )

 const (
-	nodesSize = 200
+	nodesSize = 6
 )
Contributor

aarshkshah1992 commented Sep 19, 2023

@AmeanAsad @willscott

Why do we need to do this?

Contributor Author

This is used to change the desired pool size for testing. I reduced the pool size for the cache-affinity test to make it easier to debug and get the tests passing. Once we get that passing, we can set it to a high / realistic number again.

Contributor

Okay, yeah that makes sense.

Contributor

This has already been structured so that the lower pool target is only used in the test context. A larger pool size target is used when running normally.

@willscott
Contributor

I'm going to merge this into adaptive so we have one PR tracking all of this and we can get to the final merge.

willscott merged commit 33db988 into feat/adaptive Oct 10, 2023
13 of 16 checks passed
willscott deleted the aa/test-simulator branch October 10, 2023 09:18