[Observability] Homepage experience (Milestone 1) #68176

Closed
formgeist opened this issue Jun 3, 2020 · 16 comments · Fixed by #69141
Labels
Feature:Observability Landing - Milestone 1 Team:APM All issues that need APM UI Team support Team:Observability Team label for Observability Team (for things that are handled across all of observability) v7.9.0

@formgeist
Contributor

formgeist commented Jun 3, 2020

Summary

As a continuation of #66931 we're looking to add a new view that will serve as the overview page when users have existing data available for either Logs, Metrics, APM or Uptime.

Design proposal

▶️ Figma prototype

01 Observability - Overview - Full

Chart panels

The overview page will consist of a number of sections per area of Observability, each containing a number of chart visualizations that will be based on a high-level data query e.g. the number of log events by log source.

Logs

The proposed chart panel for logs will be a log rate histogram grouped by log source. We will look for available indices matching the default setup for the Logs app. The list of lookups will be expanded as we investigate which other indices would be interesting to auto-detect and visualize, based on third-party log vendors where we know we will have ECS-compatible data.

Data query

The log rate visualization already exists in the Log rate tab in the Logs app.

Screenshot 2020-06-04 at 09 51 34

We will use the configured log indices in the Log settings.

Screenshot 2020-06-04 at 09 53 18

TODO: Perhaps include an example ES query to get the same data
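
As a starting point for the TODO above, here is a minimal sketch of what such a query could look like, written as a TypeScript helper that builds the search body. It is not the query the Logs app actually runs. The helper name `buildLogRateQuery`, the `TimeRange` shape, and the choice of `event.dataset` as the "log source" field are assumptions for illustration; the target indices would be whatever is configured in the Log settings.

```typescript
// Hypothetical helper, not the Logs app implementation.
interface TimeRange {
  startTime: number; // epoch milliseconds
  endTime: number; // epoch milliseconds
  bucketSize: string; // interval string such as "1m", used as fixed_interval
}

function buildLogRateQuery({ startTime, endTime, bucketSize }: TimeRange) {
  return {
    size: 0, // only the aggregations are needed, not the matching documents
    query: {
      bool: {
        filter: [
          {
            range: {
              "@timestamp": { gte: startTime, lte: endTime, format: "epoch_millis" },
            },
          },
        ],
      },
    },
    aggs: {
      by_source: {
        // Assumption: event.dataset identifies the log source in ECS data.
        terms: { field: "event.dataset", size: 5 },
        aggs: {
          over_time: {
            date_histogram: { field: "@timestamp", fixed_interval: bucketSize },
          },
        },
      },
    },
  };
}

// Example: one hour of data in one-minute buckets.
const logRateBody = buildLogRateQuery({
  startTime: 1591000000000,
  endTime: 1591003600000,
  bucketSize: "1m",
});
console.log(JSON.stringify(logRateBody, null, 2));
```

Each `by_source` bucket then carries its own series of per-interval document counts, which is roughly the shape a stacked histogram needs.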

Metrics

Screenshot 2020-06-23 at 19 38 02

The metrics section will consist of a chart panel based on system metrics aggregated for hosts only. Kubernetes and container metrics will be looked at in future iterations.

The different aggregates will show:

  • Number of hosts
  • CPU usage (used vs. available)
  • Memory usage (used vs. available)
  • Disk used (used vs. allocated)
  • Inbound traffic MB/s
  • Outbound traffic MB/s

The progress bar visualization will indicate used vs. capacity.

Data query

TODO: Show an ES query example to get the aforementioned data
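
Until the real query is documented, the following is a rough sketch of how the host count and the used-percentage stats might be requested in one search. The field names are assumptions based on the Metricbeat system module (`system.cpu.total.norm.pct`, `system.memory.actual.used.pct`, `system.filesystem.used.pct`) and ECS (`host.name`); `buildHostMetricsQuery` is a hypothetical helper. Inbound and outbound traffic are left out, since they would likely need a rate calculation over counter fields rather than a plain average.

```typescript
// Hypothetical helper; field names are assumptions, see the note above.
function buildHostMetricsQuery(startTime: number, endTime: number) {
  return {
    size: 0, // aggregations only
    query: {
      bool: {
        filter: [
          {
            range: {
              "@timestamp": { gte: startTime, lte: endTime, format: "epoch_millis" },
            },
          },
        ],
      },
    },
    aggs: {
      // Number of hosts: approximate distinct count of host names.
      host_count: { cardinality: { field: "host.name" } },
      // Used vs. available, expressed as an average ratio between 0 and 1.
      cpu_used: { avg: { field: "system.cpu.total.norm.pct" } },
      memory_used: { avg: { field: "system.memory.actual.used.pct" } },
      disk_used: { avg: { field: "system.filesystem.used.pct" } },
    },
  };
}

const hostMetricsBody = buildHostMetricsQuery(1591000000000, 1591003600000);
console.log(Object.keys(hostMetricsBody.aggs).join(", "));
```

The averaged ratios map directly onto the progress bar visualization, where 1.0 would mean fully used.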

APM

The APM data panel will show the number of services, transactions and error rate.

Data query

  • Aggregate total number of services
  • Aggregate total number of processor.event: transaction
  • Aggregate error rate across aggregate of processor.event: transaction
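
The three aggregates above could be requested together along these lines. This is only a sketch: `buildApmOverviewQuery` is a hypothetical helper, `service.name` is assumed to be the field identifying a service, and counting `processor.event: error` documents is just one possible reading of "error rate", which this issue does not define precisely.

```typescript
// Hypothetical helper; one possible interpretation of the list above.
function buildApmOverviewQuery(startTime: number, endTime: number) {
  return {
    size: 0, // aggregations only
    query: {
      bool: {
        filter: [
          { terms: { "processor.event": ["transaction", "error"] } },
          {
            range: {
              "@timestamp": { gte: startTime, lte: endTime, format: "epoch_millis" },
            },
          },
        ],
      },
    },
    aggs: {
      // Total number of services (approximate distinct count).
      service_count: { cardinality: { field: "service.name" } },
      // doc_count of this bucket is the number of transaction events.
      transactions: { filter: { term: { "processor.event": "transaction" } } },
      // Assumption: error events are stored with processor.event "error".
      errors: { filter: { term: { "processor.event": "error" } } },
    },
  };
}

const apmOverviewBody = buildApmOverviewQuery(1591000000000, 1591003600000);
console.log(JSON.stringify(apmOverviewBody.aggs, null, 2));
```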

Uptime

The Uptime panel will show the number of pings over time grouped by up / down status. The stats will show the total number of monitors and how many of them are up and down.

Data query

The uptime monitors visualization already exists in Uptime.

Screenshot 2020-06-04 at 11 25 28

  • Aggregate number of pings grouped by up and down
  • Aggregate total number of monitors
  • Aggregate number of monitors reporting "up"
  • Aggregate number of monitors reporting "down"

TODO: Show example of ES query
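
As a placeholder for that TODO, here is a sketch in two parts: a search body for pings over time grouped by status, and a small function that derives the up / down monitor counts from the latest ping per monitor. `buildPingQuery`, `summarizeMonitors`, and the `Ping` shape are hypothetical, and `monitor.id` and `monitor.status` are assumed field names; the existing Uptime visualization may do this differently.

```typescript
// Hypothetical sketch; field and helper names are assumptions.
type MonitorStatus = "up" | "down";

interface Ping {
  monitorId: string; // assumed to map to monitor.id
  status: MonitorStatus; // assumed to map to monitor.status
  timestamp: number; // epoch milliseconds
}

function buildPingQuery(startTime: number, endTime: number, bucketSize: string) {
  return {
    size: 0, // aggregations only
    query: {
      bool: {
        filter: [
          {
            range: {
              "@timestamp": { gte: startTime, lte: endTime, format: "epoch_millis" },
            },
          },
        ],
      },
    },
    aggs: {
      monitor_count: { cardinality: { field: "monitor.id" } },
      pings_over_time: {
        date_histogram: { field: "@timestamp", fixed_interval: bucketSize },
        aggs: { by_status: { terms: { field: "monitor.status" } } },
      },
    },
  };
}

// Counts monitors by the status of their most recent ping.
function summarizeMonitors(pings: Ping[]) {
  const latest: { [id: string]: Ping | undefined } = {};
  for (const ping of pings) {
    const seen = latest[ping.monitorId];
    if (seen === undefined || ping.timestamp > seen.timestamp) {
      latest[ping.monitorId] = ping;
    }
  }
  let total = 0;
  let up = 0;
  for (const id of Object.keys(latest)) {
    const ping = latest[id];
    if (ping === undefined) continue;
    total += 1;
    if (ping.status === "up") up += 1;
  }
  return { total, up, down: total - up };
}

const monitorSummary = summarizeMonitors([
  { monitorId: "a", status: "up", timestamp: 1 },
  { monitorId: "a", status: "down", timestamp: 2 },
  { monitorId: "b", status: "up", timestamp: 1 },
]);
console.log(monitorSummary); // { total: 2, up: 1, down: 1 }
```

Deriving the counts from the latest ping means a monitor that recovered within the time range is counted as up, not as both.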

Alerts and alerts activity

Alerts chart

The alerts section will consist of two panels. The first, Alerts distribution, will show the number of alerts triggered, grouped by type.

Alerts activity

The second panel will focus on showing recent activity with direct links to alert detail views. Each alert will link to its detail page and show its tags and the total number of alert instances within the selected time range.

Resources

As another section of content, we will provide users with options to go straight to the documentation, discuss forum or training resources.

News feed

The news feed will consist of Observability related blog posts and other industry-related stories.

Kibana news feeds can be set up by providing a .yml feed in the Newsfeed repository and using the Kibana news feed services to show the content.

@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

formgeist changed the title from "[Observability] Landingpage - Milestone 2 - Overview with data" to "[Observability] Overview page with data" on Jun 3, 2020
formgeist changed the title from "[Observability] Overview page with data" to "[Observability] Homepage experience (Milestone 1)" on Jun 3, 2020
@formgeist
Contributor Author

formgeist commented Jun 9, 2020

Design update - 9 June 2020

We received feedback on some specific areas of the design, so I've updated the examples above. Here's a quick changelog:

view-in-app

  • Updated the Metrics data panels with a KPI style for the traffic metrics as well and added the number of hosts as per feedback from @sorantis and @cyrille-leclerc
  • Replaced the section "add data" links with a "view in app" option that will link to each individual app for further investigation.

I've also put together a quick responsive layout example of how we want to let the data panels grow while retaining a fixed width for the alert column. Allowing the primary data panels (logs, metrics etc.) to grow means inspecting the visualizations becomes easier on larger screens, whereas the alert visualization and activity feed don't necessarily benefit all that much from growing larger.

00 Overview - Responsive layout guideline

@sorantis

sorantis commented Jun 9, 2020

It's worth noting that the initial scope for Metrics is Hosts only. Kubernetes and containers will be considered in future iterations.

@formgeist
Contributor Author

It's worth noting that the initial scope for Metrics is Hosts only. Kubernetes and containers will be considered in future iterations.

Thanks @sorantis, I've made a note of it in the Metrics section in the description, along with more specifics around each metric. I also updated the traffic metrics to not show a progress bar visualization, because we'll simply show the aggregated traffic metrics (not a typical used vs. allocated value, which the progress bar indicated). I had copied over the same stat component as the others and forgot to remove it.

@sorantis

sorantis commented Jun 9, 2020

@formgeist what do you think about adding a tiny graph underneath the traffic metrics instead of a progress bar?

@formgeist
Contributor Author

@formgeist what do you think about adding a tiny graph underneath the traffic metrics instead of a progress bar?

We should be able to graph a time-series chart underneath, so here's an example of using a sparkline for the traffic metrics.

Metrics

Thoughts?

@felixbarny
Member

  • Aggregate total number of processor.event: transaction
  • Aggregate error rate across aggregate of processor.event: transaction

As we're also showing the rate of errors, maybe a more useful metric than the total number of transactions would be the rate of transactions per minute. This metric is less dependent on a secondary context, which is the selected time frame, and thus easier to understand as it stands on its own vs having to check what the time range is. This gets especially complicated if a time range in the past is used, where you'd have to do some arithmetic to know how many hours are in that time range.

The same considerations apply to the log rate widget.
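
To make the arithmetic concrete, here is a minimal sketch of turning a total count into a per-minute rate for the selected time range. `ratePerMinute` is a hypothetical helper and the numbers are made up for illustration.

```typescript
// Hypothetical helper: converts a total count into events per minute.
function ratePerMinute(totalCount: number, startTime: number, endTime: number): number {
  const minutes = (endTime - startTime) / 60000; // timestamps are epoch milliseconds
  if (minutes <= 0) {
    throw new RangeError("endTime must be after startTime");
  }
  return totalCount / minutes;
}

// 18,000 transactions over a 6 hour range (360 minutes).
const sixHoursMs = 6 * 60 * 60 * 1000;
console.log(ratePerMinute(18000, 0, sixHoursMs)); // 50
```

The same 50 per minute reads identically whether the range is 15 minutes or 6 hours, whereas the raw totals (750 vs. 18,000) only make sense next to the time range.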

@sorenlouv
Member

maybe a more useful metric than the total number of transactions would be the rate of transactions per minute

I agree, this would be easier to understand. This is also aligned with what we already show in APM.

@formgeist
Contributor Author

@felixbarny @sqren I think both suggestions are very reasonable - let's make sure to change the data contracts with the Logs team to be able to provide log rate per second/minute instead of the aggregate count. @cauemarcondes Will you open a new issue for this with the Logs UI team?

@afgomez
Contributor

afgomez commented Jul 1, 2020

maybe a more useful metric than the total number of transactions would be the rate of transactions per minute [...] The same considerations apply to the log rate widget.

The date histogram shows already a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

By using the log rate per minute instead of the count we will show two very similar metrics in two places. If we use the count, we show both total volume and rate (which, more is better, right? right?).

Is there a use case that I'm missing? Is the log rate per minute (vs per bucket size) such an interesting metric that deserves to exist on its own?

@formgeist
Contributor Author

The date histogram shows already a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

As I understand it, the visualization we've been referencing in the design is the Log entries visualization.

Screenshot 2020-07-01 at 15 40 46

The challenge I see is that the bucket size is not dynamic in the current logs visualization, it's fixed to 15 minute buckets. Not sure about the reasoning behind that decision? And if we add the Transaction rate for APM, which will be dynamic down to per minute, it'll be hard for the user to correlate the two charts if they want to. Maybe because I'm not all that familiar with the topic re: logs and rate.

@afgomez
Contributor

afgomez commented Jul 1, 2020

it's fixed to 15 minute buckets. Not sure about the reasoning behind that decision?

I think it's related to how the ML job processes the log entries, but don't quote me on that. @weltenwort can probably give you the right answer.

if we add the Transaction rate for APM, which will be dynamic down to per minute, it'll be hard for the user to correlate the two charts if they want to.

The way I'm querying the data for the dashboard will use whatever startTime, endTime and bucketSize are passed as parameters. I assume other plugins will use the same parameters, so the graphs should all be equivalent for the provided time range.
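
A small sketch of why shared parameters keep the charts comparable: if every plugin derives its buckets from the same `startTime`, `endTime` and `bucketSize`, the x-axes line up. The `FetchDataParams` interface and `bucketStarts` helper are hypothetical, and treating `bucketSize` as a number of milliseconds is an assumption; rounding the first bucket down to a multiple of `bucketSize` is a choice made for this sketch.

```typescript
// Hypothetical shared contract; bucketSize in milliseconds is an assumption.
interface FetchDataParams {
  startTime: number; // epoch milliseconds
  endTime: number; // epoch milliseconds
  bucketSize: number; // bucket width in milliseconds
}

// Returns the start timestamp of every bucket covering the time range.
function bucketStarts({ startTime, endTime, bucketSize }: FetchDataParams): number[] {
  if (bucketSize <= 0) {
    throw new RangeError("bucketSize must be positive");
  }
  const starts: number[] = [];
  // Align the first bucket to a multiple of bucketSize so that every
  // caller using the same parameters gets the same boundaries.
  const first = Math.floor(startTime / bucketSize) * bucketSize;
  for (let t = first; t < endTime; t += bucketSize) {
    starts.push(t);
  }
  return starts;
}

const sharedParams: FetchDataParams = { startTime: 0, endTime: 180000, bucketSize: 60000 };
console.log(bucketStarts(sharedParams)); // [ 0, 60000, 120000 ]
```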

Edit: Ongoing work for the query #70413

@jasonrhodes
Member

Yeah, what @afgomez said: you can't really use the existing chart as a reference because it's tied completely to ML, and we are building something for the overview page that doesn't use ML at all for this rate.

@sorenlouv
Member

sorenlouv commented Jul 1, 2020

I think it's related to how the ML job processes the log entries, but don't quote me on that. @weltenwort can probably give you the right answer.

Off-topic: We also ran into this for APM. We went a little overboard and interpolate the ML values when the buckets are smaller than 15 minutes so it fits with our APM data. I don't think this is necessary, but it's nice now that we have it.

it's fixed to 15 minute bucket

I also thought that was the case but turns out the bucket size is dynamic (in this case the bucket size is 5265 minutes):

86279206-f28f5b00-bbd9-11ea-9270-3482d9c199f0

So perhaps the text that says "Bucket span: 15 minutes" should be updated to avoid confusion?

@felixbarny
Member

The date histogram shows already a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

Especially if there's a lot of variability in the chart, it's not always that easy to know what the average is. If, for example, you'd want to compare the average log rate before vs after a release it will be really helpful to have that on the chart vs the user having to calculate that based on all data points in the chart.

Is there a use case that I'm missing? Is the log rate per minute (vs per bucket size) such an interesting metric that deserves to exist on its own?

I think it's even a benefit in terms of consistency if both the single metric and the date histogram chart refer to the exact same metric. I've seen this as a common practice in other dashboarding tools where you'd have certain metrics, like avg, min, max, in the legend for a graph next to the color and the label for the line. That's basically condensing all the values in the chart to a single value.

The challenge I see is that the bucket size is not dynamic in the current logs visualization, it's fixed to 15 minute buckets.

I think that ideally, the metric should be the same for the overall metric count and the metric shown in the date histogram chart. Maybe it's just me but I prefer to have normalized values that don't change as you change the date range. For example, instead of showing the number of total logs per bucket, we may normalize it to log rate per minute, no matter if the bucket size is 1m, 15m, or 5265m.
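
The normalization described above can be sketched as follows. `CountBucket` and `toRatePerMinute` are hypothetical names, and the document counts are invented for illustration.

```typescript
// Hypothetical sketch of normalizing histogram buckets to a per-minute rate.
interface CountBucket {
  key: number; // bucket start, epoch milliseconds
  docCount: number; // number of log entries in the bucket
}

function toRatePerMinute(buckets: CountBucket[], bucketSizeMinutes: number) {
  if (bucketSizeMinutes <= 0) {
    throw new RangeError("bucketSizeMinutes must be positive");
  }
  return buckets.map((bucket) => ({
    key: bucket.key,
    ratePerMinute: bucket.docCount / bucketSizeMinutes,
  }));
}

// The same underlying rate, seen through two different bucket sizes.
console.log(toRatePerMinute([{ key: 0, docCount: 600 }], 1)); // ratePerMinute: 600
console.log(toRatePerMinute([{ key: 0, docCount: 9000 }], 15)); // ratePerMinute: 600
```

With raw counts, the second chart would appear 15 times "busier" than the first even though the log rate is identical.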

@formgeist
Contributor Author

We had a Zoom call to discuss the above feedback and next steps. We decided to continue with showing the log rate at a fixed rate (per minute). @afgomez will handle the changes in #70413

8 participants