Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

service metrics: add numeric outcome #5243

Closed
dgieselaar opened this issue May 9, 2021 · 27 comments · Fixed by #9819
Closed

service metrics: add numeric outcome #5243

dgieselaar opened this issue May 9, 2021 · 27 comments · Fixed by #9819
Assignees
Milestone

Comments

@dgieselaar
Copy link
Member

dgieselaar commented May 9, 2021

Add a numeric outcome to

  • transaction events (mapped as byte)
  • transaction metrics (mapped as byte as it has an event.outcome dimension)
  • service metrics (mapped as aggregate_metric_double, as it has no event.outcome dimension).

All fields should be mapped as doc-value only fields.

The concrete field name needs to be defined. The working title is event.outcome_numeric.

This will help with reducing dimensions while keeping the structure of transaction metrics and service metrics similar.

Mapping of event.outcome to event.outcome_numeric

event.outcome event.outcome_numeric
unknown null
failure 0
success 1

For the aggregate_metric_double, the following sub-fields should be populated:

  • sum: the number of success events
  • value_count: the number of success + error events

We can then simply do an average aggregation over the event.outcome_numeric to get the success rate. We need to validate that the avg aggregation works correctly on both the byte and aggregate_metric_double mapped fields.


Original request:
Right now we use a terms aggregation or multiple filter aggregations on event.outcome to calculate transaction error rate. If we store failure as 0 and success in 1 in a single field (and not populate it for unknown), we can get the percentage by doing an avg aggregation on that single field. This would simplify our queries and the post processing that we are doing, will probably be a small performance improvement, and would make it easier for users to build custom dashboards.

@axw
Copy link
Member

axw commented May 10, 2021

event.outcome is an ECS field with a prescribed set of allowed values, so I don't think we should be recording its value as a number. We could provide a separate field.

@dgieselaar
Copy link
Member Author

@axw yes, sorry, that is the idea. I don't really have a good suggestion for the name of the field though 😀

@simitt
Copy link
Contributor

simitt commented Nov 23, 2022

@dgieselaar I updated the original description to what we have planned now for 8.7.

@simitt simitt added this to the 8.7 milestone Nov 23, 2022
@simitt simitt changed the title Store event.outcome as 0/1 service metrics: add numeric outcome Nov 23, 2022
@simitt simitt assigned kruskall and unassigned kruskall Nov 23, 2022
@axw
Copy link
Member

axw commented Nov 23, 2022

@felixbarny @dgieselaar would either of you be able to help clarify the definition a bit please? What values do we populate, and when?

It might help if we knew how the fields would be aggregated, for each of transaction events, transaction metrics, and service metrics.

@felixbarny
Copy link
Member

done, let me know if something is missing or unclear

@carsonip
Copy link
Member

carsonip commented Dec 7, 2022

For service metrics, are we using the same field event.outcome_numeric as transaction events and transaction metrics? Does it mean that the field will have multiple types or will we have a separate field for this aggregate_metric_double type field?

@felixbarny
Copy link
Member

For service metrics, are we using the same field event.outcome_numeric as transaction events and transaction metrics?

correct

Does it mean that the field will have multiple types

That's also correct. This lets us re-use the same query across all three. That works because the avg aggregation can work both on on numeric fields as well as aggregate metric fields.

But please validate that assumption by trying out the avg aggregation on event.outcome_numeric and contrast the results with the current way of calculating the error/success rate.

@carsonip
Copy link
Member

carsonip commented Dec 8, 2022

@axw mentioned that since transaction metrics and service metrics are sharing the same internal_metrics data stream, the field event.outcome_numeric cannot be of type byte and aggregate_metric_double at the same time. Therefore, this work will involve splitting the service metrics into a separate data stream to make it possible.

@carsonip
Copy link
Member

carsonip commented Dec 8, 2022

@felixbarny A question about the Count in aggregate_metric_double:

For the aggregate_metric_double, the following sub-fields should be populated:

sum: the number of success events
value_count: the number of success + error events

Should value_count be number of success + failure, or success + failure + unknown? I assume it is the former (which is what you wrote) but just wanted to confirm.

@felixbarny
Copy link
Member

Correct, it's the former. See https://github.com/elastic/apm/blob/main/specs/agents/tracing-transactions.md#error-rate for more details.

@dgieselaar
Copy link
Member Author

@carsonip @axw is there a reason why we are not also making event.outcome_numeric an aggregate_metric_double for transaction metrics? are there significant storage concerns around doing so? can we quantify that?

@axw
Copy link
Member

axw commented Dec 11, 2022

@dgieselaar IIANM the main reason for storing as a byte is that transaction metrics have an event.outcome field. So you can group by event.outcome.

I would strongly prefer that service metrics remain a subset of transaction metrics so that we could later derive them in Elasticsearch by just dropping dimensions. So I'm +1 on storing an aggregate_metric_double for transaction metrics too.

WDYT @felixbarny?

@felixbarny
Copy link
Member

I would strongly prefer that service metrics remain a subset of transaction metrics so that we could later derive them in Elasticsearch by just dropping dimensions.

​I definitely agree with this statement.​ I just wasn't sure if it's feasible to use aggregate_metric_double for transaction metrics because it also has a event.outcome dimension.

I played around with the mapping a bit and it seems like it is feasible and actually better than mapping it to byte. We can map it like this:

event.outcome event.outcome_numeric.sum event.outcome_numeric.value_count
unknown null null
failure 0 doc_count
success doc_count doc_count
Click to expand example
DELETE felixbarny-test-outcome
PUT felixbarny-test-outcome
{
  "mappings": {
    "properties": {
      "event.outcome": {
        "type": "keyword"
      },
      "event.outcome_numeric": {
        "type": "aggregate_metric_double",
        "metrics": ["value_count", "sum"],
        "default_metric": "sum"
      }
    }
  }
}

POST felixbarny-test-outcome/_doc?refresh
{
  "_doc_count": 10,
  "event.outcome": "success",
  "event.outcome_int": 1,
  "event.outcome_numeric": {
    "value_count": 10,
    "sum": 10
  }
}

POST felixbarny-test-outcome/_doc?refresh
{
  "_doc_count": 5,
  "event.outcome": "failure",
  "event.outcome_int": 0,
  "event.outcome_numeric": {
    "value_count": 5,
    "sum": 0
  }
}


POST felixbarny-test-outcome/_doc?refresh
{
  "_doc_count": 5,
  "event.outcome": "unknown"
}


POST felixbarny-test-outcome/_search?size=0
{
  "aggs": {
    "success_rate_aggregate_metric_double": {
      "avg": {
        "field": "event.outcome_numeric"
      }
    }
  }
}

As the metrics will be so similar in structure, we need to take care that the UI isn't accidentally querying both transaction metrics and service metrics at the same time. Did you already think about what's the best way to do that? Just a query for the metricset.name? Or should service metrics better be in a different data stream? Either way, we need to make sure the UI selects the correct set of metrics.

@axw
Copy link
Member

axw commented Dec 12, 2022

Thanks @felixbarny, I think that looks good.

The other thing that's been making me feel uncomfortable is the field name. It feels a bit too abstract, and implies that it is equivalent to event.outcome -- but that only holds for the event documents. So maybe that could be renamed too.

Seems similar to transaction.duration.summary, so maybe transaction.successes? The sum is the number of successful transactions, and the value_count is the total number of transactions with a known outcome.

event.outcome transaction.successes.sum transaction.successes.value_count
unknown null null
failure 0 doc_count
success doc_count doc_count

WDYT? Not my favourite name, but I think a little better...

As the metrics will be so similar in structure, we need to take care that the UI isn't accidentally querying both transaction metrics and service metrics at the same time. Did you already think about what's the best way to do that? Just a query for the metricset.name? Or should service metrics better be in a different data stream? Either way, we need to make sure the UI selects the correct set of metrics.

I would prefer to split them out, but I would query on metricset.name either way - so it's not a requirement.

@dgieselaar
Copy link
Member Author

@axw @felixbarny we will likely use metricset.name but it'd be great if that can be a constant keyword.

@axw
Copy link
Member

axw commented Dec 12, 2022

@felixbarny and I just had a chat about the metrics. We're thinking that we should skip adding them to transaction events and transaction metrics for now, and only add them to service metrics. That's because we would still need a fallback for old documents, and we don't yet have a solution for schema evolution. @dgieselaar does that sound reasonable?

Regarding the name:

  • the name should be the same across transaction events, transaction metrics, and service metrics
  • I identified part of the source of my discomfort with event.outcome_numeric: that name implies it is a simple mapping from event.outcome (success=1, failed=0), which only holds true for transaction events and not for the metrics. OTOH, if the field describes "successful transactions", then the meaning of the field doesn't change depending on whether it's an event or a metric.

So for a concrete suggestion, I'll amend the above from transaction.successes to transaction.success_count. This will represent the number of transactions that are successful. For events it'll always be 0 (failed), 1 (success), or omitted (unknown). For metrics, sum will be success and value_count will be all the transactions that were considered (success+failed).

@carsonip
Copy link
Member

@axw There's already a success_count (as well as a failure_count) under transaction which is an int: https://github.com/elastic/apm-server/blob/main/apmpackage/apm/data_stream/internal_metrics/fields/fields.yml#L132

Do we need a new name or a migration for the existing field? Am I missing anything here?

@axw
Copy link
Member

axw commented Dec 12, 2022

@carsonip that field is only used by the existing service metrics implementation, which is technical preview and off by default. We can make a breaking change to it if we decide on that name. (If we don't decide on that name, then we would just remove the field definition.)

@carsonip
Copy link
Member

Thanks for the explanation. I'll proceed to implement it with transaction.success_count. In that case, do we want to remove transaction.failure_count? It will be weird to see 2 similarly named fields with different types.

@axw
Copy link
Member

axw commented Dec 12, 2022

@carsonip yes, we would remove transaction.failure_count since the information will be captured in the value_count sub-field of transaction.success_count.

@felixbarny
Copy link
Member

@felixbarny and I just had a chat about the metrics. We're thinking that we should skip adding them to transaction events and transaction metrics for now, and only add them to service metrics. That's because we would still need a fallback for old documents, and we don't yet have a solution for schema evolution. @dgieselaar does that sound reasonable?

I just had a chat with @sqren about this and we concluded that it might be better to add the new fields consistently now, even if we don't have a good solution for schema evolution. The sooner we add the changes, the sooner we're able to deprecate the old schema. I think there is a potential path for schema evolution with runtime fields, however, aggregate_metric_double is currently not supported for runtime fields.

@carsonip
Copy link
Member

Do we have enough information to do the same for transaction events and transaction metrics now? Will we also use an aggregate_metric_double transaction.success_count for both transaction events and transaction metrics, or a byte event.outcome_numeric?

@simitt
Copy link
Contributor

simitt commented Dec 13, 2022

My understanding is that everybody was fine with using transaction.success_count with aggregate_metric_double.

@felixbarny
Copy link
Member

The name should be the same for transaction events, transaction metrics, and service metrics. So it would be transaction.success_count. Metrics use the type aggregate_metric_double and transaction events use the type byte.

I identified part of the source of my discomfort with event.outcome_numeric: that name implies it is a simple mapping from event.outcome (success=1, failed=0), which only holds true for transaction events and not for the metrics. OTOH, if the field describes "successful transactions", then the meaning of the field doesn't change depending on whether it's an event or a metric.

I think the name should be optimized for the usage on individual events. IMHO, it's fine if the name is a little weird in the context of a metric that represent multiple individual events. When you configure rollups in ES, it would also use the aggregate_metric_double field to summarize multiple events into a single doc, using the same field name.

Sorry for the back-and-forth with the naming, Carson. For now, I'd say stick to transaction.success_count but we might want to change the name between now and the release. This should hopefully not be a lot of work.

@carsonip
Copy link
Member

Thanks @simitt @felixbarny. No problem. It should be straightforward to implement given that the rest of it is similar to #9791.

@carsonip
Copy link
Member

As pointed out by @axw, index is not a property for aggregate_metric_double: https://www.elastic.co/guide/en/elasticsearch/reference/current/aggregate-metric-double.html#aggregate-metric-double-params
Will proceed with making the byte doc-value only but not aggregate_metric_double.

@carsonip
Copy link
Member

During APM server weekly, @felixbarny and the team made the decision that we will be using event.success_count for transaction events (as byte), transaction metrics (as aggregate_metric_double) and service metrics (as aggregate_metric_double) as opposed to the proposed names event.outcome_numeric and transaction.success_count. The following PR will rename the field changed in #9791. Please correct me if I missed anything.

simitt pushed a commit that referenced this issue Dec 27, 2022
Add `event.success_count` to transaction event and transaction metrics. Change `transaction.success_count` to `event.success_count` for service metrics.

closes #5243
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants