
Simple Metrics Tracker #453

Closed
ericxtang opened this issue May 30, 2018 · 4 comments


ericxtang commented May 30, 2018

Is your feature request related to a problem? Please describe.
Currently we don't have a good way to track errors/metrics in the network.

Describe the solution you'd like
An opt-in service to track important errors/metrics in the network so we can identify issues early.

The proposed tool contains 2 components:

  1. Metrics/events reporting from running Livepeer nodes (by using an off-by-default flag -reportMetrics)
  2. Metrics/events gathering service at a hosted location (metrics.livepeer.org).

In the first iteration, two new events StreamCreated and StreamEnded are reported when a new stream is created or ended. Each event is reported to metrics.livepeer.org, and the hosted service will subsequently try to consume the video and see whether it is consumable.

The StreamCreated event contains:

  • ManifestID
  • VideoURL
  • Timestamp

The StreamEnded event contains:

  • ManifestID
  • Timestamp

The hosted service will attach a timestamp to the event and attempt to consume the video. If the video becomes un-consumable before the service receives StreamEnded, we assume something is wrong and record the timestamp of the incident. Otherwise, we record the duration of the stream when we receive StreamEnded.

This can help us measure and diagnose video broadcasting issues in livepeer.tv. It can also serve as an extensible infrastructure for tracking other errors related to video transcoding.

Describe alternatives you've considered
Local metrics tracking system - this requires node operators to track metrics themselves and report them manually.


j0sh commented May 30, 2018

With the new networking system, and especially with separate job creation, and delayed/restartable broadcasts, the notion of a "stream created" event becomes vague. I'm also a bit concerned about the resource overhead of having a metrics system actively consume streams.

Perhaps we could track the success rate of transcoded segments instead, and have the transcoder phone home its results to metrics.livepeer.org. This could be a list of segments (or one segment at a time), the status and any error descriptions, and whatever other data we can think of (eg, time spent waiting for a new segment, transfer rate, etc). This data would be both richer and more condensed than the binary success/fail result that we'd get by consuming the stream alone.

Relatedly, since broadcasters will also get immediate feedback on segment transcoding, it may also be good to gather similar information from the broadcaster, in order to catch both sides of any issue.


ericxtang commented May 30, 2018

@j0sh Yeah I think reporting transcoding metrics is definitely a good idea. And connecting that with broadcaster info would help create a more complete picture of the whole workflow.

The high-level metric I want to see currently is the "success rate" of broadcasts, which I think correlates with stability. It'll be nice to have this measured by consuming the video, because there are different steps in the broadcasting workflow that can affect the availability of the video. Assuming the scope for the Livepeer broadcasting system is RTMP->Ingest->Gateway, we should be able to use the video-consuming method described above to get some indication. The resource consumption is definitely a concern, but I think there are ways to optimize it later when we have more streams in the network (for example, always querying for the playlist and checking that it's incrementing, and only periodically downloading the actual video chunk).
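The playlist-polling optimization mentioned above could be sketched as follows, assuming standard HLS playlists with an #EXT-X-MEDIA-SEQUENCE tag (function names are illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// mediaSequence extracts the #EXT-X-MEDIA-SEQUENCE value from an HLS playlist.
func mediaSequence(playlist string) (int, bool) {
	sc := bufio.NewScanner(strings.NewReader(playlist))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#EXT-X-MEDIA-SEQUENCE:") {
			n, err := strconv.Atoi(strings.TrimPrefix(line, "#EXT-X-MEDIA-SEQUENCE:"))
			return n, err == nil
		}
	}
	return 0, false
}

// advancing reports whether the sequence number moved forward between polls,
// i.e. the stream is still live, without downloading any video chunks.
func advancing(prev, cur int) bool { return cur > prev }

func main() {
	p1 := "#EXTM3U\n#EXT-X-MEDIA-SEQUENCE:17\n#EXTINF:2.0,\nseg17.ts\n"
	p2 := "#EXTM3U\n#EXT-X-MEDIA-SEQUENCE:18\n#EXTINF:2.0,\nseg18.ts\n"
	a, _ := mediaSequence(p1)
	b, _ := mediaSequence(p2)
	fmt.Println("live:", advancing(a, b))
}
```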


j0sh commented May 30, 2018

Hm, downloading a stream doesn't give us any insight into a problem, other than the fact that there is a problem somewhere, based on the absence of expected data. And I suspect we'd still need to confirm that the data is indeed absent [1], which might involve some guesstimating outside the normal "does this download?" flow of video playback.

[1] For example, the manifest stops being updated, but all the segments within the manifest are available. This isn't necessarily indicative of an error.

At the end of the day, we'd still need to go hunting to track down root issues, so it'll help a lot to have more detailed reporting in various places along the flow. Not saying that we need to build that right now, but the architecture should accommodate it, and a video-client seems like a dead end beyond this one metric.

A more conventional method has a passive metrics server log the messages it gets (maybe after some light checking for validity), and we can run analytics on the log offline. This keeps things stateless for everybody, is more extensible, and we get richer data. We could still include things like manifest/segment URIs in certain messages and have a separate process actively tail the log. Other types of monitors could have their own ways of watching and handling events, which is a typical pattern with log-like data (kafka, logstash, etc).

Also note that actually pulling video from the broadcasters themselves introduces a few complexities, which would be solved by simply posting messages somewhere. (Not generally an issue for transcoders, which are expected to be better resourced.)

  • Broadcasters aren't expected to be publicly available, so we'd need to relay in most cases. This is another "moving part" that can go wrong, especially with the current libp2p relay system.

  • If a home user is exposing their public IP, they could still be upstream-limited, and we don't want to interfere by doubling their upstream by re-sending the video data to the metrics server.

While this would still be opt-in, the potential usefulness seems limited compared to a more conventional metrics collection system that could be run with a larger swath of users, and give us more detailed data in a less intrusive manner.

ericxtang commented

Deployed. You can check out the endpoint at http://metrics.livepeer.org/videos
