-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: pubsub monitoring #400
Conversation
ea31255
to
d3d06ae
Compare
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
The monitor component should be in charge of deciding how it is best to send metrics to other peers and what that means. This adds the PublishMetric() method to the component interface and moves that functionality from Cluster main component to the basic monitor. There is a behaviour change. Before, the metrics where sent only to the leader, while the leader was the only peer to broadcast them everywhere. Now, all peers broadcast all metrics everywhere. This is mostly because we should not rely on the consensus layer providing a Leader(), so we are taking the chance to remove this dependency. Note that in any-case, pubsub monitoring should replace the existing basic monitor. This is just paving the ground. Additionally, in order to not duplicate the multiRPC code in the monitor, I have moved that functionality to go-libp2p-gorpc and added an rpcutil library to cluster which includes useful methods to perform multiRPC requests (some of them existed in util.go, others are new and help handling multiple contexts etc). License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
GetTTL returns duration. SetTTL should take duration too, not seconds. This removes the original SetTTL method which used seconds. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
e450b5c
to
805356a
Compare
…rics License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
This makes pubsubmon the default. The basic monitor is still usable with a hidden --monitor basic flag. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
805356a
to
8c8487d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM, just a bit of concern about the metricsMux
and util.Metrics
in both the peer_monitors
. I would like to see if it can get to the point where no lock has to be acquired in the Monitor
struct before calling checker.CheckMetrics
.
// push metrics loops and pushes metrics to the leader's monitor | ||
// pushInformerMetrics loops and publishes informers metrics using the | ||
// cluster monitor. Metrics are pushed normally at a TTL/2 rate. If an error | ||
// occurs, they are pushed at a TTL/4 rate. | ||
func (c *Cluster) pushInformerMetrics() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hsanjuan thoughts on adding jitter to both the time between broadcasts and the time between retries on error? See https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ for an explanation of jitter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure but this may fit the bill: https://github.com/jpillora/backoff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm for it, but we can make it in a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing
rpcutil/rpcutil.go
Outdated
|
||
// Multicancel calls all the provided CancelFuncs. It | ||
// is useful with "defer Multicancel()" | ||
func Multicancel(cancels []context.CancelFunc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Multicancel/MultiCancel
cluster_test.go
Outdated
@@ -106,7 +106,7 @@ func testingCluster(t *testing.T) (*Cluster, *mockAPI, *mockConnector, *mapstate | |||
monCfg.CheckInterval = 2 * time.Second | |||
|
|||
raftcon, _ := raft.NewConsensus(host, consensusCfg, st, false) | |||
mon, _ := basic.NewMonitor(monCfg) | |||
mon, _ := pubsubmon.New(host, monCfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The setupMonitor
function should also be used here so that we know that we know the tests pass with both implementations moving forward.
ipfscluster.go
Outdated
@@ -152,18 +152,26 @@ type PinAllocator interface { | |||
Allocate(c *cid.Cid, current, candidates, priority map[peer.ID]api.Metric) ([]peer.ID, error) | |||
} | |||
|
|||
// PeerMonitor is a component in charge of monitoring the peers in the cluster | |||
// and providing candidates to the PinAllocator when a pin request arrives. | |||
// PeerMonitor is a component in charge of publishing peers metrics and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/publishing peers metrics/publishing a peer's metrics
ipfscluster.go
Outdated
// use the metrics provided by the monitor as candidates for Pin allocations. | ||
// | ||
// The PeerMonitor component also provides an Alert channel which is signaled | ||
// when a metric is no longer received and the monitor judges identifies it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either judges
or identifies
needs to be removed, individually, both work but not together 😉
monitor/pubsubmon/pubsubmon.go
Outdated
close(mon.rpcReady) | ||
|
||
// not necessary as this just removes subscription | ||
// mon.subscription.Cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this commented code still here?
monitor/pubsubmon/pubsubmon.go
Outdated
|
||
// getPeers gets the current list of peers from the consensus component | ||
func (mon *Monitor) getPeers() ([]peer.ID, error) { | ||
// Ger current list of peers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Ger/Get
monitor/pubsubmon/pubsubmon.go
Outdated
} | ||
|
||
// LastMetrics returns last known VALID metrics of a given type. A metric | ||
// is only valid if it has not expired and belongs to a current cluster peers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/peers/peer
monitor/util/metrics_window.go
Outdated
type MetricsWindow struct { | ||
last int | ||
|
||
safe bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case do we want unsafe access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, always, since metrics window is only written when the outer wrappers (MetricStore now) are locked, so no need to double locking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, is the safe
field and all the if mw.safe { acquire lock }
statements necessary?
monitor/pubsubmon/pubsubmon.go
Outdated
continue | ||
} | ||
metrics = append(metrics, last) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: remove blank line please
01ec64b
to
90b166e
Compare
@lanzafame I think all comments are addressed or answered so this is ready for another round. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just some comments.
monitor/util/metrics_window.go
Outdated
type MetricsWindow struct { | ||
last int | ||
|
||
safe bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, is the safe
field and all the if mw.safe { acquire lock }
statements necessary?
} | ||
} | ||
|
||
func TestLogMetricConcurrent(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is really flaky when I run them locally (MacOS).
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
The monitors now do broadcasting and we can get metrics from the local one. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
Do it in additional stage in Travis. Also, test fixes. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
0589d7c
to
5ca8ca3
Compare
This provides a new
pubsubmon
component that implementsPeerMonitor
, but before, it makes a number of changes to thePeerMonitor
interface,basic
monitor implementation:util
module: MetricsWindow and MetricsChecker.PublishMetric
method to the interface and moving publish functionality to the monitor itself.Right now there is duplicated code between the monitors. I'm thinking what to do about that and whether I want to deprecate the
basic
monitor in a future release and delete it, or simply refactor things further.