Simple Metrics Tracker #453
Comments
With the new networking system, and especially with separate job creation and delayed/restartable broadcasts, the notion of a "stream created" event becomes vague. I'm also a bit concerned about the resource overhead of having a metrics system actively consume streams. Perhaps we could track the success rate of transcoded segments instead, and have the transcoder phone home its results to metrics.livepeer.org. This could be a list of segments (or one segment at a time), the status and any error descriptions, and whatever other data we can think of (e.g., time spent waiting for a new segment, transfer rate, etc). This data would be both richer and more condensed than the binary success/fail result that we'd get by consuming the stream alone. Jointly, since broadcasters will also get immediate feedback on segment transcoding, maybe it'd also be good to gather similar information from the broadcaster, in order to catch both sides of any issue.
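For illustration, here is a minimal sketch of what a per-segment "phone home" report could look like. The struct fields, JSON names, and the /segments endpoint path are assumptions made up for this example, not an existing Livepeer API.

```go
// Hypothetical per-segment transcode report, posted by the transcoder to a
// collection endpoint. Names and the endpoint path are illustrative only.
package metrics

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// SegmentReport captures the outcome of transcoding a single segment.
type SegmentReport struct {
	ManifestID   string        `json:"manifestId"`
	SegmentSeq   uint64        `json:"segmentSeq"`
	Success      bool          `json:"success"`
	Error        string        `json:"error,omitempty"`
	WaitTime     time.Duration `json:"waitTimeNs"`      // time spent waiting for a new segment
	TransferRate float64       `json:"transferRateBps"` // observed transfer rate
}

// Report sends one segment result; batching a list of segments would work the same way.
func Report(r SegmentReport) error {
	body, err := json.Marshal(r)
	if err != nil {
		return err
	}
	resp, err := http.Post("http://metrics.livepeer.org/segments", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```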
@j0sh Yeah, I think reporting transcoding metrics is definitely a good idea, and connecting that with broadcaster info would help create a more complete picture of the whole workflow. The high-level metric I currently want to see is the "success rate" of broadcasts, which I think correlates with stability. It would be nice to have this measured by consuming the video, because different steps in the broadcasting workflow can affect the availability of the video. Assuming the scope for the Livepeer broadcasting system as
Hm, downloading the stream doesn't give us any insight into a problem, other than the fact that there is a problem somewhere, inferred from the absence of expected data. And I suspect we'd still need to confirm that the data is indeed absent [1], which might involve some guesstimating outside the normal "does this download?" flow of video playback.

[1] For example, the manifest stops being updated, but all the segments within the manifest are available. This isn't necessarily indicative of an error.

At the end of the day, we'd still need to go hunting to track down root issues, so it'll help a lot to have more detailed reporting in various places along the flow. Not saying that we need to build that right now, but the architecture should accommodate it, and a video client seems like a dead end beyond this one metric.

A more conventional method has a passive metrics server log the messages it gets (maybe after some light checking for validity), and we can run analytics on the log offline. This keeps things stateless for everybody, is more extensible, and gives us richer data. We could still include things like manifest/segment URIs in certain messages and have a separate process actively tail the log. Other types of monitors could have their own ways of watching and handling events, which is a typical pattern with log-like data (Kafka, Logstash, etc.).

Also note that actually pulling video from the broadcasters themselves introduces a few complexities, which would be avoided by simply posting messages somewhere. (This is generally not an issue for transcoders, which are expected to be better resourced.)
While this would still be opt-in, the potential usefulness seems limited compared to a more conventional metrics collection system that could be run with a larger swath of users and give us more detailed data in a less intrusive manner.
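To make the "passive collector" idea concrete, here is a small sketch: accept posted JSON events, do only a light validity check, and append them to a log that offline analytics or a separate tailing process can consume. The /events path, port, and log file name are assumptions for illustration.

```go
// Minimal passive metrics collector sketch: log what arrives, analyze offline.
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
	"sync"
)

func main() {
	f, err := os.OpenFile("events.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var mu sync.Mutex
	http.HandleFunc("/events", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20)) // cap message size
		if err != nil || !json.Valid(body) {                   // light validity check only
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		mu.Lock()
		_, werr := f.Write(append(body, '\n')) // append-only log; other monitors can tail it
		mu.Unlock()
		if werr != nil {
			http.Error(w, "write failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The server itself stays stateless: all the state lives in the append-only log, which other processes (tailers, Kafka/Logstash-style pipelines) can consume in their own way.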
Deployed. You can check out the endpoint at http://metrics.livepeer.org/videos |
Is your feature request related to a problem? Please describe.
Currently we don't have a good way to track errors/metrics in the network.
Describe the solution you'd like
An opt-in service to track important errors/metrics in the network so we can identify issues early.
The proposed tool contains 2 components:
- A flag in the livepeer node (-reportMetrics) that turns on metrics reporting.
- A hosted collection service at metrics.livepeer.org.

In the first iteration, 2 new events, StreamCreated and StreamEnded, are reported when a new stream is created/ended. Each event is reported to metrics.livepeer.org, and the hosted service will subsequently try to consume the video and see whether it is consumable.

The StreamCreated event contains:

The StreamEnded event contains:

The hosted service will attach a timestamp to each event and attempt to consume the video. If the video becomes un-consumable before the service receives StreamEnded, we assume something is wrong and record the timestamp of the incident. Otherwise, we record the duration of the stream when we receive StreamEnded.

This can help us measure and diagnose video broadcasting issues in livepeer.tv. It can also serve as an extensible infrastructure for tracking other errors related to video transcoding.
Describe alternatives you've considered
Local metrics tracking system - this requires the node operators to track metrics themselves and report them manually.