
Simple Metrics Tracker #453

Closed
ericxtang opened this issue May 30, 2018 · 4 comments


ericxtang commented May 30, 2018

Is your feature request related to a problem? Please describe.
Currently we don't have a good way to track errors/metrics in the network.

Describe the solution you'd like
An opt-in service to track important errors/metrics in the network so we can identify issues early.

The proposed tool contains 2 components:

  1. Metrics/events reporting from running Livepeer nodes (by using an off-by-default flag -reportMetrics)
  2. Metrics/events gathering service at a hosted location (metrics.livepeer.org).

In the first iteration, two new events StreamCreated and StreamEnded are reported when a new stream is created or ended. Each event is reported to metrics.livepeer.org, and the hosted service will subsequently try to consume the video and see whether it is consumable.

The StreamCreated event contains:

  • ManifestID
  • VideoURL
  • Timestamp

The StreamEnded event contains:

  • ManifestID
  • Timestamp

The hosted service will attach a timestamp to the event and attempt to consume the video. If the video becomes un-consumable before the service receives StreamEnded, we assume something is wrong and record the timestamp of the incident. Otherwise, we record the duration of the stream when we receive StreamEnded.

This can help us measure and diagnose video broadcasting issues in livepeer.tv. It can also serve as an extensible infrastructure for tracking other errors related to video transcoding.

Describe alternatives you've considered
Local metrics tracking system - this requires node operators to track metrics themselves and report them manually.


j0sh commented May 30, 2018

With the new networking system, and especially with separate job creation, and delayed/restartable broadcasts, the notion of a "stream created" event becomes vague. I'm also a bit concerned about the resource overhead of having a metrics system actively consume streams.

Perhaps we could track the success rate of transcoded segments instead, and have the transcoder phone home its results to metrics.livepeer.org. This could be a list of segments (or one segment at a time), the status and any error descriptions, and whatever other data we can think of (eg, time spent waiting for a new segment, transfer rate, etc). This data would be both richer and more condensed than the binary success/fail result that we'd get by consuming the stream alone.

Relatedly, since broadcasters will also get immediate feedback on segment transcoding, it may also be good to gather similar information from the broadcaster, in order to catch both sides of any issue.


ericxtang commented May 30, 2018

@j0sh Yeah I think reporting transcoding metrics is definitely a good idea. And connecting that with broadcaster info would help create a more complete picture of the whole workflow.

The high-level metric I want to see currently is the "success rate" of broadcasts, which I think correlates with stability. It'll be nice to have this measured by consuming the video, because there are different steps in the broadcasting workflow that can affect the availability of the video. Assuming the scope for the Livepeer broadcasting system is RTMP->Ingest->Gateway, we should be able to use the video-consuming method described above to get some indication. The resource consumption is definitely a concern, but I think there are ways to optimize it later when we have more streams in the network (for example, always querying for the playlist and checking that it's incrementing, and only periodically downloading the actual video chunk).
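The playlist-polling optimization mentioned above could be sketched as follows, assuming standard HLS playlists with an #EXT-X-MEDIA-SEQUENCE tag (function names are illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// mediaSequence extracts the #EXT-X-MEDIA-SEQUENCE value from an HLS playlist.
func mediaSequence(playlist string) (int, bool) {
	sc := bufio.NewScanner(strings.NewReader(playlist))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#EXT-X-MEDIA-SEQUENCE:") {
			n, err := strconv.Atoi(strings.TrimPrefix(line, "#EXT-X-MEDIA-SEQUENCE:"))
			return n, err == nil
		}
	}
	return 0, false
}

// advancing reports whether the sequence number moved forward between polls,
// i.e. the stream is still live, without downloading any video chunks.
func advancing(prev, cur int) bool { return cur > prev }

func main() {
	p1 := "#EXTM3U\n#EXT-X-MEDIA-SEQUENCE:17\n#EXTINF:2.0,\nseg17.ts\n"
	p2 := "#EXTM3U\n#EXT-X-MEDIA-SEQUENCE:18\n#EXTINF:2.0,\nseg18.ts\n"
	a, _ := mediaSequence(p1)
	b, _ := mediaSequence(p2)
	fmt.Println("live:", advancing(a, b))
}
```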


j0sh commented May 30, 2018

Hm, downloading a stream doesn't give us any insight into a problem, other than the fact that there is a problem somewhere, based on the absence of expected data. And I suspect we'd still need to confirm that the data is indeed absent [1], which might involve some guesstimating outside the normal "does this download?" flow of video playback.

[1] For example, the manifest stops being updated, but all the segments within the manifest are available. This isn't necessarily indicative of an error.

At the end of the day, we'd still need to go hunting to track down root issues, so it'll help a lot to have more detailed reporting in various places along the flow. Not saying that we need to build that right now, but the architecture should accommodate it, and a video-client seems like a dead end beyond this one metric.

A more conventional method has a passive metrics server log the messages it gets (maybe after some light checking for validity), and we can run analytics on the log offline. This keeps things stateless for everybody, is more extensible, and we get richer data. We could still include things like manifest/segment URIs in certain messages and have a separate process actively tail the log. Other types of monitors could have their own ways of watching and handling events, which is a typical pattern with log-like data (kafka, logstash, etc).

Also note that actually pulling video from the broadcasters themselves introduces a few complexities, which would be solved by simply posting messages somewhere. (Not generally an issue for transcoders, which are expected to be better resourced.)

  • Broadcasters aren't expected to be publicly available, so we'd need to relay in most cases. This is another "moving part" that can go wrong, especially with the current libp2p relay system.

  • If a home user is exposing their public IP, they could still be upstream-limited, and we don't want to interfere by doubling their upstream by re-sending the video data to the metrics server.

While this would still be opt-in, the potential usefulness seems limited compared to a more conventional metrics collection system that could be run with a larger swath of users, and give us more detailed data in a less intrusive manner.

ericxtang commented

Deployed. You can check out the endpoint at http://metrics.livepeer.org/videos
