
Enable dim cap to Metrics API #1244

Closed
cijothomas opened this issue Oct 11, 2019 · 13 comments
Comments

@cijothomas
Contributor

Problem Statement
The Metrics API currently does not offer a setting to apply dimension capping. Instead, when the number of unique dimension values reaches the limit, TrackValue/TryGetDataSeries return false and the value is not tracked. The caller is expected to handle this situation.

Consider the following example: a metric "AnimalsSold" with 2 dimensions, "Dimension1" and "Dimension2". The config sets a limit of 2 values per dimension.

var metConfig = new MetricConfiguration(seriesCountLimit: 100, valuesPerDimensionLimit:2,
                new MetricSeriesConfigurationForMeasurement(restrictToUInt32Values: false));

Metric animalsSold = client.GetMetric("AnimalsSold", "Dimension1", "Dimension2", metConfig);

// Start tracking.
animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value1");
animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value2");

// The following call supplies a 3rd unique value for Dimension2, which is above the limit of 2.
animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value3");
// The above call does not track the metric, and returns false.

The only recommended action for clients when the cap is hit is to revisit the dimension values and ensure their cardinality stays within limits. This prevents future issues, but the tracked values are lost, affecting metric charts etc.
Because of this, the recommended approach to tracking a metric involves checking the return value and doing something (like sending telemetry) to indicate the cap was hit.


if (!animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value3"))
{
    telemetryClient.TrackTrace(
        "Metric value not tracked because the value of one of the dimensions exceeded the cap. Revisit the dimensions to ensure they are within the limits.",
        SeverityLevel.Error);
}

Instead of losing data when a cap is hit, some users may prefer to keep tracking the value, but with a special dimension value into which all dimension values encountered after the cap is hit are rolled. In the above case, the dimension value "Dim2Value3" could be replaced with something like "Other". This approach may sound like an easy solution, but it severely affects data quality, as the SDK Metric System is only aware of dimension values from its current execution. Thus, the values rolled into "Other" will vary across application instances, across service instances, or even across restarts of the same instance. In the above example, in one instance "Dim2Value3" may be rolled into "Other", but in another instance, "Dim2Value2" could be the one rolled into "Other".
When querying for this metric, it is impossible to determine what constitutes "Other". In short, if any dimension has "Other", then the metric value should not be filtered/split by that dimension.
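The instability described above can be illustrated with a short sketch. This is hypothetical Python pseudocode (not SDK code, and `cap_dimension` is an illustrative helper, not a real API): each instance admits the first N unique values it happens to see, so which values land in "Other" depends purely on arrival order.

```python
# Illustrative sketch: why "Other" varies across instances.
# An instance admits the first `limit` unique values it sees for a dimension
# and rolls every later unique value into "Other".

def cap_dimension(values_seen, limit, new_value):
    """Return the dimension value to record, admitting at most `limit` uniques."""
    if new_value in values_seen or len(values_seen) < limit:
        values_seen.add(new_value)
        return new_value
    return "Other"

# Instance A happens to see Dim2Value1 first; instance B sees Dim2Value3 first.
seen_a, seen_b = set(), set()
a = [cap_dimension(seen_a, 2, v) for v in ["Dim2Value1", "Dim2Value2", "Dim2Value3"]]
b = [cap_dimension(seen_b, 2, v) for v in ["Dim2Value3", "Dim2Value2", "Dim2Value1"]]
# a rolls Dim2Value3 into "Other"; b rolls Dim2Value1 into "Other".
```

Because both instances report the same metric, the backend receives "Other" buckets with different, unknowable contents.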

Proposal:
Even though rolling dimension values into a special value "Other" has its own limitations, this may be the only approach many users want to take, as it is simple to use and reason about. With proper usage of dimension values, the probability of ever hitting the cap is low.

Because of this, we are adding a new feature to the current metric API to do this dimension capping. In the first stage, this will be exposed as a boolean field on MetricConfiguration, named ApplyDimensionCapping, which will be false by default to keep the current behavior. If this flag is turned on, calls to TrackValue or TryGetDataSeries do not return false when any dimension exceeds its cap; they continue to track values to the metric. All values encountered after the cap is hit are rolled into a special value, "DIMENSION_CAPPED". When querying the metric, "DIMENSION_CAPPED" appears like every other dimension value, and it is up to the user to interpret it (and ignore it).

ApplyDimensionCapping should be used only as a 'graceful' safeguard for when some dimension accidentally goes over its limits. While the overall metric value will be accurate, values involving any dimension that has "DIMENSION_CAPPED" should be ignored (i.e., no filtering or splitting involving this dimension).

Example usage:

// Current API and usage
var metConfig = new MetricConfiguration(seriesCountLimit: 100, valuesPerDimensionLimit: 2,
    new MetricSeriesConfigurationForMeasurement(restrictToUInt32Values: false));
Metric animalsSold = client.GetMetric("AnimalsSold", "Dimension1", "Dimension2", metConfig);

animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value1");
animalsSold.TrackValue(150, "Dim1Value1", "Dim2Value1");

animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value2");
animalsSold.TrackValue(175, "Dim1Value1", "Dim2Value2");

animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value3"); // This call returns false and the value is not tracked.
animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value4"); // This call returns false and the value is not tracked.

// The above would result in animalsSold reporting the following 2 time series:
// "Dim1Value1", "Dim2Value1"  with a sum of 250
// "Dim1Value1", "Dim2Value2"  with a sum of 275
// i.e. the total animalsSold will be 250 + 275 = 525, even though the actual animals sold is 725.
// The last 2 calls are not tracked, hence users must inspect the return value and take action.
// New usage based on this proposal
var metConfig = new MetricConfiguration(seriesCountLimit: 100, valuesPerDimensionLimit: 2,
    new MetricSeriesConfigurationForMeasurement(restrictToUInt32Values: false));
metConfig.ApplyDimensionCapping = true;
Metric animalsSold = client.GetMetric("AnimalsSold", "Dimension1", "Dimension2", metConfig);

animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value1");
animalsSold.TrackValue(150, "Dim1Value1", "Dim2Value1");

animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value2");
animalsSold.TrackValue(175, "Dim1Value1", "Dim2Value2");

animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value3");
animalsSold.TrackValue(100, "Dim1Value1", "Dim2Value4");

// The above would result in animalsSold reporting the following 3 time series:
// "Dim1Value1", "Dim2Value1"        with a sum of 250
// "Dim1Value1", "Dim2Value2"        with a sum of 275
// "Dim1Value1", "DIMENSION_CAPPED"  with a sum of 200
// The total animalsSold will be 725.
// Since "Dimension2" has the value DIMENSION_CAPPED, queries like AnimalsSold with "Dimension2=value" should not
// be trusted. However, queries like AnimalsSold (total) or AnimalsSold with "Dimension1=value" represent true values.
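The aggregation behavior of the proposed flag can be sketched as follows. This is a hypothetical Python simulation of the semantics only (not SDK code; `track`, `series`, and `limits` are illustrative names): once a dimension exhausts its per-dimension limit, later unique values are rolled into the constant "DIMENSION_CAPPED" instead of the data point being dropped.

```python
# Illustrative simulation of the proposed ApplyDimensionCapping behavior:
# a value whose dimension is over its cap is still tracked, under DIMENSION_CAPPED.
from collections import defaultdict

DIMENSION_CAPPED = "DIMENSION_CAPPED"

def track(series, seen_per_dim, limits, value, *dims):
    capped = []
    for i, d in enumerate(dims):
        seen = seen_per_dim[i]
        if d in seen or len(seen) < limits[i]:
            seen.add(d)
            capped.append(d)
        else:
            capped.append(DIMENSION_CAPPED)  # cap hit: roll into the special value
    series[tuple(capped)] += value           # the value is never dropped

series = defaultdict(float)
seen = defaultdict(set)
limits = [2, 2]  # valuesPerDimensionLimit for each dimension

for value, d1, d2 in [
    (100, "Dim1Value1", "Dim2Value1"),
    (150, "Dim1Value1", "Dim2Value1"),
    (100, "Dim1Value1", "Dim2Value2"),
    (175, "Dim1Value1", "Dim2Value2"),
    (100, "Dim1Value1", "Dim2Value3"),  # Dimension2 cap hit
    (100, "Dim1Value1", "Dim2Value4"),  # Dimension2 cap hit
]:
    track(series, seen, limits, value, d1, d2)

# series now holds three time series and the overall total (725) is preserved.
```

The sketch shows the key trade-off: the total stays accurate, while per-value queries on the capped dimension become untrustworthy.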
@cijothomas
Contributor Author

This is a prerequisite for implementing #1233. The proposed APIs could be kept internal solely for the purpose of #1233, but regular users can also benefit from this proposal if it is made public.

@cijothomas
Contributor Author

@macrogreg @SergeyKanzhelev @Dmitry-Matveev @reyang @Dawgfan Please share comments/concerns.

@SergeyKanzhelev
Contributor

So basically you are adding two settings, seriesCountLimit: 100 and valuesPerDimensionLimit: 2, and then enforcing them in the implementation. Do you feel we need separate caps per dimension name, or will a single number fit every dimension?

@cijothomas
Contributor Author

@SergeyKanzhelev No. seriesCountLimit and valuesPerDimensionLimit already exist. An overload exists that takes separate caps per name.

The only new thing to be added per this proposal is ApplyDimensionCapping, a bool field: metConfig.ApplyDimensionCapping = true;
The draft implementation is here: https://github.com/microsoft/ApplicationInsights-dotnet/pull/1245/files

@SergeyKanzhelev
Contributor

After an offline discussion, I think this is a good solution for implementing #1233.

@macrogreg
Contributor

I believe you are underestimating the E2E-experience problem around the wrongness of "Other" data. It is a big mistake to offer this API as public.
Have customers asked for it? If so, how did they react when the challenge was explained?

Above you are saying that the proposed approach is our only option, but that is not correct.
We are ALREADY handling this by returning FALSE from all methods that hit dimension caps, so that the customer knows exactly when this happens and can react to it. The "use Other" solution, by contrast, leads to a pit of failure. Customers who hit this rarely may very well be served just fine by the existing system (return false from the API), and for customers who hit this frequently, the data will be wrong almost all of the time! In most cases they will want to log some custom message rather than silently log metrics that are incorrect.

None of our competitors that I know of use this kind of dim-capping approach in custom metrics APIs – for good reason.

I strongly urge you not to add any public APIs for this, and to solve the problem for standard metrics first, before committing to a back-compat burden that we are not sure about.

I have previously shared this PR for this: #1243

Please consider using it as is or simplifying it further.
The PR you have proposed has a number of detailed issues that I summarized here:
#1245 (comment)

@cijothomas
Contributor Author

@macrogreg
Why do you think the problem is underestimated or understated? I fully understand the dim-cap issue. (You explained it clearly, and I captured the same in this issue, with examples. The examples above compare both approaches as well.)

See the below statement copied from above:
"
could be replaced with something like "Other". This approach may sound an easy solution, but it severely affects data quality,
"

This proposal is:

  1. Not changing existing behavior. (If the dim limit is reached, continue to return false.)
  2. Providing a new option, `ApplyDimensionCapping` (default false, to retain current behavior). This is only for those users who prefer "Other" over losing the metric value. Customers who opt in understand that their TrackValue() never returns false when the dim cap is hit; instead, all values get combined into "Other", and they should ignore any dimension with "Other" when querying. They could set up alerts if any metric hits "Other".

Pasted below is the doc comment for the ApplyDimensionCapping field, which advises customers about the issue.

"   /// <summary>Gets or sets a value indicating whether dimension capping should be applied, when any indiviual dimension
        /// exceeds its limit. If this flag is set, calls to <c>TrackValue(..)</c>, <c>TryGetDataSeries(..)</c> and similar
        /// that would normally return false when cap is hit will return true, and the actual value of dimension will be replaced
        /// by a constant <c>DIMENSION_CAPPED</c>.
        /// The metric will continue to track the value, however, users should be beware that any metric filtering or
        /// splitting involving a dimension which has the value DIMENSION_CAPPED, should be ignored.
"
  3. Regarding the existing solution: "We are ALREADY handling this by returning FALSE from all methods that hit dimension caps so that the customer knows exactly when this happens and can react to it."

I am not proposing to change this. I am suggesting we add another option for those customers who want to accept the loss of a dimension and keep the overall metric value accurate. Again, this is opt-in behavior.

Returning false from the API is not okay for me, as by the time I know about the issue, I have already lost my metric value. The proposed API is for those customers who prefer to keep tracking the metric value while accepting the loss of a dimension. (There could be other dimensions which can still be used.)

  4. See my first comment on this post - I haven't decided whether to make this public or internal only. I opened the issue to get comments.

@macrogreg
Contributor

macrogreg commented Oct 17, 2019

Hey @cijothomas ,

The remark about the underestimation was just because I read the text as "yes, it's a problem, but we still wanna go ahead". So I was only referring to our opinions in regard to how critical that complication is; I have no doubt that we are all clear on what it is about. Sorry if I did not express that well enough :)

My point is mainly this: it is so bad that we should advise customers NOT to use this particular approach, rather than making it easy.

  • I am not aware of customers asking for this new API - why add it publicly?
  • We have a strong intuition that most of the time the API will be used incorrectly (but no hard data either way), so why add it unless there is an ask?
  • You mention setting up alerts on "Other", but it is currently NOT POSSIBLE for the alerting system to tell the difference between "No Data" and "Value == 0". And even if it were, all metric processing is about averages, min, max, etc., so a metric alert like "notify me when there is at least one data point with Dim=Name" is not on the books.

Please also note the challenge of taking a lock inside a loop in your particular approach.
If we wanted to make it possible for customers to do the right thing for dim caps, and we are OK with a lock in the loop, then - as a customer - I would much prefer to be guided to write code like this:

public void _Example_LogOperationDuration(string operationType, string operationName, string targetType, TimeSpan duration)
{
    Metric metric = (new TelemetryClient()).GetMetric("Operation Duration MSec", "Operation Type", "Operation Name", "Target Type");

    MetricDataSeriesRetrievalResult result = metric.GetDataSeries(true, operationType, operationName, targetType);

    if (result.ResultCode == MetricDataSeriesRetrievalResultCodes.Failure_SubdimensionsCountLimitReached)
    {
        switch(result.FailureCoordinateIndex)
        {
            case 0:
                // Operation type is expected to be very low cardinality. Handle abnormal condition by logging some kind of error and failing.
                return;

            case 1:
                // I will create another metric that has the name dimension collapsed. It will be sparse, but it will always be correct.
                // If I did not want it to be sparse, I would always emit this second metric.
                // At this point I (the customer) have REALLY internalized the trade-offs I am making here!!
                _Example_LogOperationDuration(operationType, targetType, duration);
                return;

            case 2:
                // Target type is expected to be very low cardinality. Handle abnormal condition by logging some kind of error and failing.
                return;
        }
    }

    if (result.IsSuccess)
    {
        result.MetricSeries.TrackValue(duration.TotalMilliseconds);
    }
}

private void _Example_LogOperationDuration(string operationType, string targetType, TimeSpan duration)
{
    Metric metric = (new TelemetryClient()).GetMetric("Operation Duration (no name) MSec", "Operation Type", "Target Type");

    MetricDataSeriesRetrievalResult result = metric.GetDataSeries(true, operationType, targetType);

    if (result.ResultCode == MetricDataSeriesRetrievalResultCodes.Failure_SubdimensionsCountLimitReached)
    {
        switch (result.FailureCoordinateIndex)
        {
            case 0:
                // Operation type is expected to be very low cardinality. Handle abnormal condition by logging some kind of error and failing.
                return;

            case 1:
                // Target type is expected to be very low cardinality. Handle abnormal condition by logging some kind of error and failing.
                return;
        }
    }

    if (result.IsSuccess)
    {
        result.MetricSeries.TrackValue(duration.TotalMilliseconds);
    }
}

For that, simply expose an overload of GetDataSeries that returns the details of the MultidimensionalPointResult, like in this simple PR:
#1253

This approach follows the Core Fx design guidelines:
If you want to enable a feature for experts that can be easily misused, do it in a way that requires users to think about the feature and its potential pitfalls. On the contrary, for features that always lead to the pit of success, expose convenience APIs.

@cijothomas
Contributor Author

cijothomas commented Oct 17, 2019

  1. I don't have a strong opinion on making this public vs. internal. I have mentioned clearly that, for the standard metric work, this can be internal. Sergey posted that he agrees with this approach for solving standard metrics. (I'd read it as he agrees with this approach if the new API change is internal vs. public.) Please don't jump to the conclusion that I am making this public. (I may, but at this point it's not decided. Maybe we'll keep it public but hide it from editor visibility. Or the public doc comment can link to a doc explaining the implications of using this.)

  2. I made a draft PR which implements the proposal. Locks/perf issues - I'll worry about them only when the PR is made ready for review. It's a draft PR as of now.

  3. Using the example of an OperationDuration metric with 3 dimensions, RoleName, Success, OperationName, with dimension limits of, say, 2, 2, and 25 respectively:
    I am tracking metrics.
    I don't expect RoleName or Success to ever hit the dimension limit.
    I don't normally expect OperationName to hit the limit of 25, but it can happen in some situations.
    If that unfortunate state occurs where OperationName exceeds its limit, I have 2 options now.

Option 1. Monitor the return value, and do something about it. There is no global guidance on what to do. I can track another metric (as in your example above). I could throw an error. I could log a trace message. I could collapse the operation name into "Other" myself. (This is not really possible now, as the metric has already hit the limit and won't take "Other" at that point. And I cannot pre-reserve "Other" either, as I don't know the values of the other dimensions ahead of time.) Or I could do a lot of other things. I may do something different from my peer team, so there is no consistency.

Option 2.
ApplyDimensionCapping = true. (Again, this is false by default. I am consciously making the decision to turn this ON.)

This will continue to track my metric. I lose visibility into the operation name; I understand that. But my metric queries like avg(Duration), avg(Duration) where (RoleName == "WebRoleA"), avg(Duration) where (Success == false) are still accurate. This is simple to reason about.

I fully understand that if avg(Duration) where (OperationName == "Other") exists for any time period, I should not use avg(Duration) where (OperationName == "Get values") etc., as this is incorrect data.

My overall metrics are not affected just because one dimension went rogue.
I don't want to deal with another metric.
Whenever I am consuming this metric, I only need to be aware of one thing - if any dimension has the "Other" value, that dimension should not be used. Other dimensions can be used just fine.

("Other" or "DimensionCapped" or "somethingelse" - it can be set by user themselves, as opposed to Metric system providing a constant. I am not against either.)

I prefer option 2. There may be customers who prefer option 1. This is precisely why we offer options and let customers pick what they feel comfortable with.

I myself hit this problem when fixing standard metrics. It can affect any customer. There are no public docs explaining the Metrics API, so I don't think we have many customers actively using it. But nonetheless, this is a problem.

@cijothomas
Contributor Author

Allow me to explain differently.

The following is my need:
I want to track a metric named "DependencyDuration".
I want 3 dimensions on it:
RoleName (1 value in most cases, more in rare cases; I am okay with up to 5)
Success (2 values only)
DependencyTarget (this is potentially high cardinality, but I want to accept only up to 25 values)

I want to ensure that my metric (DependencyDuration) is always accurate. I also want to ensure that my metric (DependencyDuration) with the Success dimension is always accurate,
because I have a dashboard with the above, and it should always be accurate. And I have alerts set up which I want to fire accurately.

I would like to split/filter by DependencyTarget in most cases. My application is not expected to go above 25 values, but it can during certain short periods. I am okay with losing visibility into DependencyTarget on those occasions.

Options I have:

  1. Use the current API as is, and don't check the return value from TrackValue(). Not acceptable, as TrackValue may return false, and my DependencyDuration dashboards/alerts become inaccurate.

  2. If TrackValue() returns false, use a new metric with the name DependencyDurationNoTarget. Not acceptable, as my DependencyDuration dashboards become inaccurate, and I don't want another metric in my dashboard. I also don't want another metric alert.

  3. Keep track of the unique values of DependencyTarget myself, and convert to "Other" before I hit the limit. This requires me to write custom logic to track the unique values encountered so far. I will need to do this inside a lock, to ensure I catch it before hitting the limit. Not acceptable, as I do not want to write this logic myself, and I don't know how taking locks affects my app's performance. Also, I may add another dimension tomorrow and will need to revisit this again. I expect the Application Insights SDK to provide an easier option.

  4. Instead of the above, use the GetDimensionValues(1).Count API so I don't need to track the count myself. However, I still need to do this and ensure I convert Target to "Other" just before hitting the cap, and again inside a lock. Not acceptable: I don't know the perf implications, and if I add more dimensions, I need to change this logic as well.

  5. Let SDK handle and implement dim cap for me. (This proposal)

In short, I want to track a single metric with multiple dimensions and ensure values are always accurate for certain important dimensions. Some dimension may go rogue, but I don't want that to affect my overall metric or the other dimensions. The proposal is to address this in the SDK itself, so everyone gets the same solution, accepting the consequences, which are explained.
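To make the burden in options 3 and 4 concrete, here is a rough Python sketch of the bookkeeping a user would have to write themselves. This is hypothetical, illustrative code (the `ManualDimCap` class and its `resolve` method are made-up names, not SDK APIs): the user must track unique values under a lock and reserve a slot for "Other" before the SDK's cap is reached.

```python
# Illustrative sketch of option 3: user-side, lock-guarded dimension capping
# that converts a dimension value to "Other" before the SDK limit is hit.
import threading

class ManualDimCap:
    def __init__(self, limit):
        self._limit = limit
        self._seen = set()
        self._lock = threading.Lock()

    def resolve(self, value):
        # The lock is needed so concurrent tracks can't each admit a "last"
        # unique value and overshoot the per-dimension limit.
        with self._lock:
            if value in self._seen:
                return value
            if len(self._seen) < self._limit - 1:  # reserve one slot for "Other"
                self._seen.add(value)
                return value
            return "Other"

target_cap = ManualDimCap(limit=25)
# Hypothetical usage, per call site:
# dependencyDuration.TrackValue(ms, role, success, target_cap.resolve(target))
```

Every metric with a risky dimension needs its own instance of this logic, and every added dimension forces a revisit, which is exactly the maintenance burden the proposal pushes into the SDK.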

@olegsych
Member

@cijothomas It's difficult for me to follow the low-level details in this thread.

At a high level, as a long-time Application Insights customer, I believe that the current behavior of TrackValue returning false and dropping the sample is wrong. To me this is a bug resulting in the loss of my data, and it should not be the behavior preserved by default. So, instead of making ApplyDimensionCapping false by default, I believe we should make it true and stop the data loss customers may not even realize they have. The only possible use I can imagine for this setting is to temporarily revert to the old lossy behavior while alerts based on a capped metric are adjusted to work correctly with full metric data.

I also think that TrackValue should continue to return false when dimension cap is exceeded. This will allow those customers who log capped dimensions to continue doing so.

@cijothomas cijothomas added this to the 2.12 milestone Oct 18, 2019
@cijothomas
Contributor Author

Thanks all for sharing your suggestions.
In the interest of unblocking #1233, we'll proceed with ApplyDimensionCapping as internal only. This leaves us the option to revisit the decision if needed.

@cijothomas
Contributor Author

Closing this, as DimensionCapping is enabled as an internal API and is used by the standard metric extractor. I will open a new issue for a public-facing API for the same.
