
Feat: pubsub monitoring #400

Merged
merged 14 commits into from
May 9, 2018

Conversation

hsanjuan
Collaborator

@hsanjuan hsanjuan commented May 1, 2018

This provides a new pubsubmon component that implements PeerMonitor. Before doing so, it makes a number of changes to the PeerMonitor interface and to the basic monitor implementation:

  • Extraction of common functionality into a util module: MetricsWindow and MetricsChecker.
  • Adding the PublishMetric method to the interface and moving publish functionality to the monitor itself.
  • Adding some tests and small improvements to the codebase.
  • Adding the new module.

Right now there is duplicated code between the monitors. I'm still deciding what to do about that: whether to deprecate the basic monitor in a future release and delete it, or simply refactor things further.

@hsanjuan hsanjuan self-assigned this May 1, 2018
@ghost ghost added the status/in-progress In progress label May 1, 2018
@coveralls

coveralls commented May 1, 2018

Coverage Status

Coverage decreased (-0.5%) to 67.115% when pulling 927434e on feat/pubsub-monitoring into a0a0898 on master.

@hsanjuan hsanjuan force-pushed the feat/pubsub-monitoring branch from ea31255 to d3d06ae Compare May 2, 2018 07:07
hsanjuan added 6 commits May 7, 2018 14:24
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
The monitor component should be in charge of deciding how it is
best to send metrics to other peers and what that means.

This adds the PublishMetric() method to the component interface
and moves that functionality from Cluster main component to the
basic monitor.

There is a behaviour change. Before, metrics were sent only to
the leader, and the leader was the only peer to broadcast them everywhere.
Now, all peers broadcast all metrics everywhere. This is mostly
because we should not rely on the consensus layer providing a Leader(), so
we are taking the chance to remove this dependency.

Note that, in any case, pubsub monitoring should replace the
existing basic monitor. This is just paving the ground.

Additionally, in order to not duplicate the multiRPC code
in the monitor, I have moved that functionality to go-libp2p-gorpc
and added an rpcutil library to cluster which includes useful
methods to perform multiRPC requests (some of them existed in
util.go, others are new and help handling multiple contexts etc).

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
GetTTL returns duration. SetTTL should take duration too, not seconds.
This removes the original SetTTL method which used seconds.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
@hsanjuan hsanjuan force-pushed the feat/pubsub-monitoring branch 2 times, most recently from e450b5c to 805356a Compare May 7, 2018 13:07
hsanjuan added 3 commits May 7, 2018 18:47
…rics

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
This makes pubsubmon the default. The basic monitor is still usable
with a hidden --monitor basic flag.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
@hsanjuan hsanjuan changed the title Feat: pubsub monitoring (wip but reviewable) Feat: pubsub monitoring May 7, 2018
@hsanjuan hsanjuan force-pushed the feat/pubsub-monitoring branch from 805356a to 8c8487d Compare May 7, 2018 17:02
Contributor

@lanzafame lanzafame left a comment


Mostly LGTM, just a bit of concern about the metricsMux and util.Metrics in both the peer_monitors. I would like to see if it can get to the point where no lock has to be acquired in the Monitor struct before calling checker.CheckMetrics.

// push metrics loops and pushes metrics to the leader's monitor
// pushInformerMetrics loops and publishes informers metrics using the
// cluster monitor. Metrics are pushed normally at a TTL/2 rate. If an error
// occurs, they are pushed at a TTL/4 rate.
func (c *Cluster) pushInformerMetrics() {
Contributor

@hsanjuan thoughts on adding jitter to both the time between broadcasts and the time between retries on error? See https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ for an explanation of jitter.

Contributor

Not sure but this may fit the bill: https://github.com/jpillora/backoff

Collaborator Author

I'm for it, but we can do it in a different PR.

Contributor

Sure thing


// Multicancel calls all the provided CancelFuncs. It
// is useful with "defer Multicancel()"
func Multicancel(cancels []context.CancelFunc) {
Contributor

s/Multicancel/MultiCancel

cluster_test.go Outdated
@@ -106,7 +106,7 @@ func testingCluster(t *testing.T) (*Cluster, *mockAPI, *mockConnector, *mapstate
monCfg.CheckInterval = 2 * time.Second

raftcon, _ := raft.NewConsensus(host, consensusCfg, st, false)
mon, _ := basic.NewMonitor(monCfg)
mon, _ := pubsubmon.New(host, monCfg)
Contributor

The setupMonitor function should also be used here so that we know the tests pass with both implementations moving forward.

ipfscluster.go Outdated
@@ -152,18 +152,26 @@ type PinAllocator interface {
Allocate(c *cid.Cid, current, candidates, priority map[peer.ID]api.Metric) ([]peer.ID, error)
}

// PeerMonitor is a component in charge of monitoring the peers in the cluster
// and providing candidates to the PinAllocator when a pin request arrives.
// PeerMonitor is a component in charge of publishing peers metrics and
Contributor

s/publishing peers metrics/publishing a peer's metrics

ipfscluster.go Outdated
// use the metrics provided by the monitor as candidates for Pin allocations.
//
// The PeerMonitor component also provides an Alert channel which is signaled
// when a metric is no longer received and the monitor judges identifies it
Contributor

Either judges or identifies needs to be removed, individually, both work but not together 😉

close(mon.rpcReady)

// not necessary as this just removes subscription
// mon.subscription.Cancel()
Contributor

Do we need this commented code still here?


// getPeers gets the current list of peers from the consensus component
func (mon *Monitor) getPeers() ([]peer.ID, error) {
// Ger current list of peers
Contributor

s/Ger/Get

}

// LastMetrics returns last known VALID metrics of a given type. A metric
// is only valid if it has not expired and belongs to a current cluster peers.
Contributor

s/peers/peer

type MetricsWindow struct {
last int

safe bool
Contributor

In what case do we want unsafe access?

Collaborator Author

In fact, always: the metrics window is only written while the outer wrappers (MetricStore now) are locked, so there is no need for double locking.

Contributor

If that is the case, is the safe field and all the if mw.safe { acquire lock } statements necessary?
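The pattern being discussed, a lock-free window guarded by a single outer mutex in the store, might look like the sketch below. All names are illustrative; the real MetricsWindow and MetricStore are richer.

```go
package main

import (
	"fmt"
	"sync"
)

// window is a lock-free ring of metric values; thread safety is the
// caller's responsibility.
type window struct {
	vals []string
	last int
}

func (w *window) add(v string) {
	w.vals[w.last%len(w.vals)] = v
	w.last++
}

func (w *window) latest() string {
	if w.last == 0 {
		return ""
	}
	return w.vals[(w.last-1)%len(w.vals)]
}

// store wraps windows with one outer mutex: the window itself never
// locks because every access goes through the store's lock.
type store struct {
	mux     sync.Mutex
	windows map[string]*window
}

func newStore() *store { return &store{windows: make(map[string]*window)} }

func (s *store) log(peer, v string) {
	s.mux.Lock()
	defer s.mux.Unlock()
	w, ok := s.windows[peer]
	if !ok {
		w = &window{vals: make([]string, 4)}
		s.windows[peer] = w
	}
	w.add(v)
}

func (s *store) latest(peer string) string {
	s.mux.Lock()
	defer s.mux.Unlock()
	if w, ok := s.windows[peer]; ok {
		return w.latest()
	}
	return ""
}

func main() {
	s := newStore()
	s.log("p1", "42")
	fmt.Println(s.latest("p1")) // 42
}
```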

continue
}
metrics = append(metrics, last)

Contributor

nitpick: remove blank line please

@hsanjuan hsanjuan force-pushed the feat/pubsub-monitoring branch from 01ec64b to 90b166e Compare May 8, 2018 09:48
@hsanjuan
Collaborator Author

hsanjuan commented May 8, 2018

@lanzafame I think all comments are addressed or answered so this is ready for another round.

Contributor

@lanzafame lanzafame left a comment

LGTM, just some comments.

type MetricsWindow struct {
last int

safe bool
Contributor

If that is the case, is the safe field and all the if mw.safe { acquire lock } statements necessary?

}
}

func TestLogMetricConcurrent(t *testing.T) {
Contributor

This test is really flaky when I run it locally (macOS).

hsanjuan added 5 commits May 9, 2018 11:01
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
The monitors now do broadcasting and we can get metrics from the
local one.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
Do it in an additional stage in Travis.

Also, test fixes.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
@hsanjuan hsanjuan force-pushed the feat/pubsub-monitoring branch from 0589d7c to 5ca8ca3 Compare May 9, 2018 09:39
@hsanjuan hsanjuan merged commit 7b9aac9 into master May 9, 2018
@ghost ghost removed the status/in-progress In progress label May 9, 2018
@hsanjuan hsanjuan deleted the feat/pubsub-monitoring branch May 9, 2018 11:21
@hsanjuan hsanjuan mentioned this pull request May 14, 2018
3 participants