Feat: pubsub monitoring #400

hsanjuan · 2018-05-01T16:13:32Z

This adds a new pubsubmon component that implements PeerMonitor. Before doing so, it makes a number of changes to the PeerMonitor interface and to the basic monitor implementation:

Extracting common functionality into a util module: MetricsWindow and MetricsChecker.
Adding the PublishMetric method to the interface and moving publish functionality to the monitor itself.
Adding some tests and small improvements to the codebase.
Adding the new module.

Right now there is duplicated code between the monitors. I'm still deciding what to do about that: either deprecate the basic monitor in a future release and delete it, or simply refactor things further.

coveralls · 2018-05-01T16:40:25Z

Coverage decreased (-0.5%) to 67.115% when pulling 927434e on feat/pubsub-monitoring into a0a0898 on master.

The monitor component should be in charge of deciding how best to send metrics to other peers and what that means. This adds the PublishMetric() method to the component interface and moves that functionality from the Cluster main component to the basic monitor.

There is a behaviour change. Before, the metrics were sent only to the leader, and the leader was the only peer to broadcast them everywhere. Now, all peers broadcast all metrics everywhere. This is mostly because we should not rely on the consensus layer providing a Leader(), so we are taking the chance to remove this dependency. Note that, in any case, pubsub monitoring should replace the existing basic monitor. This is just paving the way.

Additionally, in order not to duplicate the multiRPC code in the monitor, I have moved that functionality to go-libp2p-gorpc and added an rpcutil library to cluster which includes useful methods to perform multiRPC requests (some of them existed in util.go, others are new and help with handling multiple contexts etc.).

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

lanzafame

Mostly LGTM, just a bit of concern about the metricsMux and util.Metrics in both the peer_monitors. I would like to see if it can get to the point where no lock has to be acquired in the Monitor struct before calling checker.CheckMetrics.

lanzafame · 2018-05-08T03:55:51Z

cluster.go

-// push metrics loops and pushes metrics to the leader's monitor
+// pushInformerMetrics loops and publishes informers metrics using the
+// cluster monitor. Metrics are pushed normally at a TTL/2 rate. If an error
+// occurs, they are pushed at a TTL/4 rate.
 func (c *Cluster) pushInformerMetrics() {


@hsanjuan thoughts on adding jitter to both the time between broadcasts and the time between retries on error? See https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ for an explanation of jitter.

Not sure but this may fit the bill: https://github.com/jpillora/backoff

I'm for it, but we can make it in a different PR

lanzafame · 2018-05-08T03:57:38Z

rpcutil/rpcutil.go

+
+// Multicancel calls all the provided CancelFuncs. It
+// is useful with "defer Multicancel()"
+func Multicancel(cancels []context.CancelFunc) {


s/Multicancel/MultiCancel

lanzafame · 2018-05-08T04:13:36Z

cluster_test.go

@@ -106,7 +106,7 @@ func testingCluster(t *testing.T) (*Cluster, *mockAPI, *mockConnector, *mapstate
 	monCfg.CheckInterval = 2 * time.Second

 	raftcon, _ := raft.NewConsensus(host, consensusCfg, st, false)
-	mon, _ := basic.NewMonitor(monCfg)
+	mon, _ := pubsubmon.New(host, monCfg)


The setupMonitor function should also be used here, so that we know the tests pass with both implementations going forward.

lanzafame · 2018-05-08T04:33:44Z

ipfscluster.go

@@ -152,18 +152,26 @@ type PinAllocator interface {
 	Allocate(c *cid.Cid, current, candidates, priority map[peer.ID]api.Metric) ([]peer.ID, error)
 }

-// PeerMonitor is a component in charge of monitoring the peers in the cluster
-// and providing candidates to the PinAllocator when a pin request arrives.
+// PeerMonitor is a component in charge of publishing peers metrics and


s/publishing peers metrics/publishing a peer's metrics

lanzafame · 2018-05-08T04:50:20Z

ipfscluster.go

+// use the metrics provided by the monitor as candidates for Pin allocations.
+//
+// The PeerMonitor component also provides an Alert channel which is signaled
+// when a metric is no longer received and the monitor judges identifies it


Either "judges" or "identifies" needs to be removed. Individually both work, but not together 😉

lanzafame · 2018-05-08T07:24:47Z

monitor/pubsubmon/pubsubmon.go

+	close(mon.rpcReady)
+
+	// not necessary as this just removes subscription
+	// mon.subscription.Cancel()


Do we need this commented code still here?

lanzafame · 2018-05-08T07:28:47Z

monitor/pubsubmon/pubsubmon.go

+
+// getPeers gets the current list of peers from the consensus component
+func (mon *Monitor) getPeers() ([]peer.ID, error) {
+	// Ger current list of peers


lanzafame · 2018-05-08T07:29:49Z

monitor/pubsubmon/pubsubmon.go

+}
+
+// LastMetrics returns last known VALID metrics of a given type. A metric
+// is only valid if it has not expired and belongs to a current cluster peers.


s/peers/peer

lanzafame · 2018-05-08T07:31:45Z

monitor/util/metrics_window.go

+type MetricsWindow struct {
+	last int
+
+	safe       bool


In what case do we want unsafe access?

In fact, always, since the metrics window is only written while the outer wrapper (MetricStore now) is locked, so there is no need for double locking.

If that is the case, is the safe field and all the if mw.safe { acquire lock } statements necessary?

lanzafame · 2018-05-08T07:38:12Z

monitor/pubsubmon/pubsubmon.go

+			continue
+		}
+		metrics = append(metrics, last)
+


nitpick: remove blank line please

hsanjuan · 2018-05-08T13:15:20Z

@lanzafame I think all comments are addressed or answered so this is ready for another round.

lanzafame

LGTM, just some comments.

lanzafame · 2018-05-09T02:14:50Z

monitor/pubsubmon/pubsubmon_test.go

+	}
+}
+
+func TestLogMetricConcurrent(t *testing.T) {


This test is really flaky when I run it locally (macOS).

hsanjuan self-assigned this May 1, 2018

ghost added the status/in-progress In progress label May 1, 2018

hsanjuan force-pushed the feat/pubsub-monitoring branch from ea31255 to d3d06ae Compare May 2, 2018 07:07

hsanjuan added 6 commits May 7, 2018 14:24

Feat pubsubmon: Extract MetricsWindow to utils module

1886782

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Fix publish cancelling contexts too early.

72e1d64

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Monitor: extract MetricsChecker to util module

8f8e76a

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Types: rename metric.SetTTLDuration to metric.SetTTL

a9d6fe3

GetTTL returns duration. SetTTL should take duration too, not seconds. This removes the original SetTTL method which used seconds. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Basic Monitor: test Publish()

73b962f

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

hsanjuan force-pushed the feat/pubsub-monitoring branch 2 times, most recently from e450b5c to 805356a Compare May 7, 2018 13:07

hsanjuan added 3 commits May 7, 2018 18:47

Add new pubsubmon: A monitor that uses pubsub to send and receive metrics

6f84b3b

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Enable pubsubmon in cluster e2e tests

bb8c20b

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Pubsubmon: enable by default when using ipfs-cluster-service

8c8487d

This makes pubsubmon the default. The basic monitor is still usable with a hidden --monitor basic flag. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

hsanjuan changed the title ~~Feat: pubsub monitoring (wip but reviewable)~~ Feat: pubsub monitoring May 7, 2018

hsanjuan force-pushed the feat/pubsub-monitoring branch from 805356a to 8c8487d Compare May 7, 2018 17:02

lanzafame suggested changes May 8, 2018

hsanjuan force-pushed the feat/pubsub-monitoring branch from 01ec64b to 90b166e Compare May 8, 2018 09:48

lanzafame reviewed May 9, 2018

lanzafame approved these changes May 9, 2018

hsanjuan added 5 commits May 9, 2018 11:01

Monitor: more refactoring. Rename util to metrics

954ede9

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Cluster: do not request metrics from leader on allocate()

6159a7f

The monitors now do broadcasting and we can get metrics from the local one. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Monitor: address comments

e4844ca

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Monitor: remove safe parameter for metrics.Window

69c47fe

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Monitor/tests: Allow to run tests using the basic monitor.

5ca8ca3

Do it in additional stage in Travis. Also, test fixes. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

hsanjuan force-pushed the feat/pubsub-monitoring branch from 0589d7c to 5ca8ca3 Compare May 9, 2018 09:39

hsanjuan merged commit 7b9aac9 into master May 9, 2018

ghost removed the status/in-progress In progress label May 9, 2018

hsanjuan deleted the feat/pubsub-monitoring branch May 9, 2018 11:21

hsanjuan mentioned this pull request May 14, 2018

Release 0.4.0 #372

Closed
