
Introduces a mechanism to batch-send telemetry events. #292

Merged
merged 1 commit on Feb 27, 2018

Conversation

changsi-an
Contributor

Batch-sending telemetry is necessary because some notifications from the debuggee might be too frequent for the telemetry infrastructure to handle. Take VS as an example: it throttles telemetry events if there are too many within a short period of time, and we will lose some important telemetry when that happens.

The BatchTelemetryReporter caches the received telemetry events and sends them out every 10 seconds.

This PR also refactors some code and migrates some general telemetry handling code from DebugSession.ts to utils.ts or telemetry.ts.
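For readers skimming this thread, here is a minimal sketch of the batching idea described above. It is not the exact code in this PR: the reportEvent/dispose signatures and the ITelemetryReporter shape are assumptions; events are simply cached per name and flushed to the wrapped reporter on a timer.

```typescript
// Minimal sketch of the batching idea (assumed shapes, not the PR's exact code).
interface ITelemetryReporter {
    reportEvent(name: string, data?: any): void;
}

class BatchTelemetryReporter {
    private _eventBuckets: { [eventName: string]: any[] } = {};
    private _timer: NodeJS.Timer;

    public constructor(
        private _telemetryReporter: ITelemetryReporter,
        private _cadenceInMilliseconds: number = 10000) {
        // Flush whatever has accumulated once per cadence.
        this._timer = setInterval(() => this.send(), this._cadenceInMilliseconds);
    }

    public reportEvent(name: string, data?: any): void {
        // Cache the event instead of forwarding it immediately.
        if (!this._eventBuckets[name]) {
            this._eventBuckets[name] = [];
        }
        this._eventBuckets[name].push(data);
    }

    public dispose(): void {
        clearInterval(this._timer);
        this.send(); // flush anything still pending
    }

    private send(): void {
        for (const eventName in this._eventBuckets) {
            // A real implementation aggregates the bucket into a single payload here.
            this._telemetryReporter.reportEvent(eventName, this._eventBuckets[eventName]);
        }
        this._eventBuckets = {};
    }
}
```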

@changsi-an
Contributor Author

@digeff PTAL

this.runAndMeasureProcessingTime('notification/onPaused', () => {
return this.onPaused(params);
});
});
this.chrome.Debugger.onResumed(() => this.onResumed());
this.chrome.Debugger.onScriptParsed(params => this.onScriptParsed(params));
Member

are we measuring onScriptParsed in another PR?

Contributor Author

Yes, all the other notifications will be taken care of in the next PR. I'd like this PR to introduce only the "BatchTelemetryReporter" (and give one usage example).

Some of the notification handlers need some small tweaks before they can be measured correctly. So I'd rather do another PR for those.

src/utils.ts Outdated
@@ -549,4 +551,23 @@ export function isNumber(num: number): boolean {

export function toVoidP(p: Promise<any>): Promise<void> {
return p.then(() => { });
}

export function calculateElapsedTime(startProcessingTime: [number, number]): number {
Member

Nit: Wouldn't it be better to type the tuple as "HighRes timer" or something similar?

Contributor Author

Sounds good. Added that.

src/utils.ts Outdated
const MillisecondsPerSecond = 1000;

const ellapsedTime = process.hrtime(startProcessingTime);
const ellapsedMilliseconds = ellapsedTime[0] * MillisecondsPerSecond + ellapsedTime[1] / NanoSecondsPerMillisecond;
Member

Isn't this easier to understand a different way: (ellapsedTime[0] * 1e9 + ellapsedTime[1]) / NanoSecondsPerMillisecond, where 1e9 is the number of nanoseconds in a second?

Contributor Author

Sounds good. Changed.
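For reference, a sketch of what the helper might look like after both pieces of feedback (a type alias for the hrtime tuple and the nanoseconds-first conversion). The alias name HighResTimer and the exact constant names are assumptions.

```typescript
// Sketch of the elapsed-time helper after the review feedback (assumed names).
export type HighResTimer = [number, number];

export function calculateElapsedTime(startProcessingTime: HighResTimer): number {
    const NanoSecondsPerMillisecond = 1000000;
    const NanoSecondsPerSecond = 1e9;

    // process.hrtime(start) returns [seconds, nanoseconds] elapsed since start.
    const elapsedTime = process.hrtime(startProcessingTime);
    return (elapsedTime[0] * NanoSecondsPerSecond + elapsedTime[1]) / NanoSecondsPerMillisecond;
}
```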

@changsi-an
Contributor Author

I am merging this PR now. If there are any other issues, please point them out and I will address them later.

@auchenberg
Contributor

@changsi-an Why does this need to be implemented in the DA if the limitation is in VS? Shouldn't this be implemented inside VS's telemetry service to make sure that data isn't lost?

@changsi-an
Contributor Author

@auchenberg Tweaking VS's telemetry service is a much bigger change than we have time for. Also, regardless of which underlying telemetry service is being used, overwhelming the service endpoint is not ideal. If the batching strategy works, it should be adopted as an optimization, since it diminishes the value of sending each telemetry event individually.

As a side note, we only batch the events that would potentially amass in a short period of time.

@@ -363,7 +367,11 @@ export abstract class ChromeDebugAdapter implements IDebugAdapter {
* Hook up all connection events
*/
protected hookConnectionEvents(): void {
this.chrome.Debugger.onPaused(params => this.onPaused(params));
this.chrome.Debugger.onPaused((params) => {
this.runAndMeasureProcessingTime('notification/onPaused', () => {
Contributor

I'd add crdp to the name, so it's clear that the notification comes from Chrome

Contributor Author

Will change it to crdpnotification/onPaused; does that sound fine?

Contributor

I'd put a separator so it's easier to identify the two words e.g.:
crdp/notification/onPaused
crdp-notification/onPaused
crdp_notification/onPaused
crdpNotification/onPaused

I think the only one that is allowed is: crdp/notification/onPaused

Contributor Author

Adopted crdp/notification/onPaused

startTime: startTime.toString()
};

return procedure().then(() => {
Contributor

I personally find async/await code easier to read and reason about than promise chains

Contributor Author

Agreed and changed
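For context, a rough async/await version of the wrapper being discussed, written as a free function for illustration. The successful, exceptionType and startTime properties mirror snippets quoted in this review; the timeTakenInMilliseconds name and the reportEvent callback shape are assumptions.

```typescript
// Hypothetical async/await sketch of the measurement wrapper (assumed names).
async function runAndMeasureProcessingTime(
    eventName: string,
    procedure: () => Promise<void>,
    reportEvent: (name: string, properties: { [key: string]: string }) => void): Promise<void> {

    const startTime = Date.now();
    const startHighResTime = process.hrtime();
    const properties: { [key: string]: string } = {
        startTime: startTime.toString()
    };

    try {
        await procedure();
        properties.successful = 'true';
    } catch (e) {
        properties.successful = 'false';
        properties.exceptionType = 'firstChance';
    }

    // Hypothetical property name; the PR records the elapsed time in some form.
    const elapsed = process.hrtime(startHighResTime);
    properties.timeTakenInMilliseconds = ((elapsed[0] * 1e9 + elapsed[1]) / 1e6).toString();

    reportEvent(eventName, properties);
}
```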

properties.successful = 'true';
}, (e) => {
properties.successful = 'false';
properties.exceptionType = "generic";
Contributor

Is there a more intuitive way to put this information? If I read generic in this field, I don't think I'll understand what it means

Contributor Author

How about not setting this field?

Contributor Author

Renamed to 'firstChance'

// There is an issue on some clients where reportEvent currently accepts only string properties,
// hence all the following properties must be strings.
successful?: "true" | "false";
exceptionType?: ExceptionType;
Contributor

Should this be called eventType or actionType rather than exceptionType?

Contributor Author

It is used to differentiate whether it's an uncaughtException or a promiseRejection in the exception case. Why should it be eventType or actionType?

Contributor

I think that people might expect to see the class name in a field called exceptionType.

Contributor Author

I think this word clearly conveys its meaning. Neither eventType nor actionType correctly captures the intention of this field. The case you mention might be valid, but I feel it comes from a limited group of people with a particular background, and it is not that convincing that exceptionType has to mean a class name. Users of the data should check its content before making any assumptions. So unless we have an absolutely perfect wording for this concept (other than actionType and eventType), I might leave it as-is.

src/telemetry.ts Outdated
@@ -36,4 +49,80 @@ export class NullTelemetryReporter implements ITelemetryReporter {

}

export const DefaultTelemetryInterval = 10000;
Contributor

I'd rename this to DefaultTelemetryIntervalInMilliseconds to make it clear what the unit is.

Contributor Author

Done.

src/telemetry.ts Outdated
private _eventBuckets: {[eventName: string]: any};
private _timer: NodeJS.Timer;

public constructor(private _telemetryReporter: TelemetryReporter, private _cadence: number = DefaultTelemetryInterval) {
Contributor

_cadenceInMilliseconds?

Contributor Author

Done.

src/telemetry.ts Outdated
}

public setupEventHandler(_sendEvent: (event: DebugProtocol.Event) => void): void {
// no-op
Contributor

It'll be surprising for customers of this class if this is a no-op. It might be clearer if this class doesn't implement ITelemetryReporter and doesn't have this method.

this._eventBuckets = {};
}

private send(): void {
Contributor

I'm finding it hard to follow what this method does. I think it'd be much easier to understand if we split this into smaller private methods with meaningful, self-documenting names. What do you think?

Contributor Author

It's been divided into several parts.

this.reset();
}
/**
* Transfrom the bucket of events data from the form:
Contributor

This is really useful :) 👍 👍 👍

let propertyNamesSet = {};
for (const entry of bucket) {
for (const key of Object.keys(entry)) {
propertyNamesSet[key] = true;
Contributor

What is the advantage of using a dictionary over an array for this?

Contributor Author

An array won't work; I am using the dictionary as a set. An array would just keep appending keys (unless we search with indexOf()), but a lot of entries will have the same property names.

Contributor

That makes sense. I completely missed that :)

Could you add a comment saying that we use the dictionary to eliminate duplicated values?

}

/**
* Get the property keys from all the entries of a event bucket:
Contributor

Awesome comment!

src/telemetry.ts Outdated
// Run through again, collect the values for each property name.
for (const entry of bucket) {
for (const key of allPropertyNamesInTheBucket) {
properties[`aggregated.${key}`].push(entry[key] || null);
Contributor

If entry[key] is "" or 0, then entry[key] || null will return null.

Contributor Author

Good catch. Fixed.
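To summarize the shape being reviewed here, a hypothetical standalone version of the aggregation with the falsy-value fix applied: the union of property names is collected via an object used as a set, and only genuinely missing keys become null so that "" and 0 survive. The final JSON.stringify step is an assumption, based on the comment that properties must be strings on some clients.

```typescript
// Sketch of the bucket aggregation discussed above (assumed function name).
function transformBucketData(bucket: { [key: string]: any }[]): { [key: string]: string } {
    // Use an object as a set to de-duplicate property names across entries.
    const propertyNamesSet: { [key: string]: boolean } = {};
    for (const entry of bucket) {
        for (const key of Object.keys(entry)) {
            propertyNamesSet[key] = true;
        }
    }
    const allPropertyNamesInTheBucket = Object.keys(propertyNamesSet);

    // One "aggregated.<key>" array per property, aligned across entries.
    const aggregated: { [key: string]: any[] } = {};
    for (const key of allPropertyNamesInTheBucket) {
        aggregated[`aggregated.${key}`] = [];
    }
    for (const entry of bucket) {
        for (const key of allPropertyNamesInTheBucket) {
            // Explicit "in" check instead of `entry[key] || null`, so "" and 0 are kept.
            aggregated[`aggregated.${key}`].push(key in entry ? entry[key] : null);
        }
    }

    // Assumed serialization step: some clients only accept string properties.
    const properties: { [key: string]: string } = {};
    for (const key of Object.keys(aggregated)) {
        properties[key] = JSON.stringify(aggregated[key]);
    }
    return properties;
}
```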

private send(): void {
for (const eventName in this._eventBuckets) {
const bucket = this._eventBuckets[eventName];
let properties = BatchTelemetryReporter.transfromBucketData(bucket);
Contributor

It'd be awesome if the name of the method would explain what kind of transformation the method is going to do (e.g. what the output is)

Contributor Author

Describing the transformation in the method name might be too verbose. The realistic approach is probably to refer to the comment on the method. The idea is: if the name is not sufficient to tell what it does, refer to its signature comment; if that is still not enough, refer to its implementation.

Or otherwise, if you have a better name in mind, please suggest it.

@auchenberg
Contributor

@changsi-an I'm still not sure I understand. We don't seem to have this problem today in VS Code, and this approach would require each Debug Adapter to throttle events to satisfy constraints on VS's telemetry endpoint.

Why not have the central VSTelemetry service implement this, so no DA has to deal with this VS-internal limitation?

@roblourens @dhanvikapila

@dhanvikapila
Member

@auchenberg we can chat if you need more context. The telemetry we are adding is going to result in a lot of data points being generated, so we need to be careful not to overwhelm upstream telemetry sinks (VS Code or VS) when they get this data. 10s was an arbitrary number we chose, which we can make either larger or smaller. There is no data loss as a result of this PR; data you got earlier you continue to get (along with a lot more new data). The only change is that the last data set will be sent 10s later for the telemetry sinks to process.

Can you please elaborate on your concerns with this PR? This is best practice when you have the potential to generate a lot of traffic for telemetry sinks. If we don't do this, every DA using this package could end up spamming its respective telemetry endpoint with lots of requests and, as a result, impacting the upstream VS Code or VS telemetry processors.

@auchenberg
Contributor

auchenberg commented Feb 23, 2018

@dhanvikapila Again I'm trying to understand why the throttling problem has to be solved inside a debugging core library that's used in 8 different debuggers across VS and VS Code, and not inside the telemetry service in VS.

In VS Code, to my knowledge, we don't implement telemetry throttling, but simply forward events to the AppInsights client (npm package), which might do some throttling when needed.

My concern is that we are adding VS-telemetry-specific logic inside a core debugging library to fix a symptom instead of fixing it centrally inside VS's telemetry abstraction/sink/service. It shouldn't be this library's concern that VS's telemetry service might drop data if the volume gets too high.

@dhanvikapila
Member

@auchenberg as I mentioned earlier, the amount of telemetry is going to be large. If we don't do this, all debug adapters, which will also be generating these data points, can potentially impact their telemetry endpoints.
Throttling makes sense when it happens on the server end and not the client (like VS Code), unless you want to lose data. Batching is the right thing to do at the client, since you don't want to drop data on the floor.
This telemetry is going into VS and VS Code; it is not specific to VS. Why do you think VS Code cannot use this telemetry data?

It might be better for us to jump on a call to sync. This is not about any limitation of the telemetry service. As a client of a service, we need to be sure the data we send is not dropped on the floor, resulting in missed data for us.

@isidorn
Contributor

isidorn commented Feb 23, 2018

Since @roblourens is on vacation this week and we are wrapping up our milestone (we have endgame next week), I suggest postponing this PR to the March milestone.
Then @roblourens can give this a nice review.

As far as I understand, we have always been sending telemetry like this, and there is no need to rush this in.

@dhanvikapila
Member

We need to get in a room so we can talk about this and close it out. Otherwise we are going to keep repeating each other and not get anywhere.

@roblourens
Member

@isidorn this is intended for VS, not vscode, so it has no impact on our release.

I have read the comments but still have the same question as Kenneth. I don't understand where the limitation is or where the data loss happens. Is the volume of telemetry really too high? You can include me on any call.

@changsi-an changsi-an changed the base branch from master to feature/telemetry February 27, 2018 23:22
@changsi-an
Contributor Author

@roblourens I had a meeting with @auchenberg and Kai today. We reached an agreement to batch-send telemetry for notifications from the debuggee. The reasoning is that we don't want telemetry messages to clog the communication channel between the DA and VS Code, so we can allow messages of higher priority to pass through first.

Based on that outcome, we would still like you to review this PR as-is, to clear out edge-case issues I have not noticed yet.

In order to catch up with the release schedule on our end, we will first merge this PR into the feature branch "feature/telemetry". I will work on getting it into master after your final code review.

@changsi-an changsi-an merged commit 8fe3d0a into microsoft:feature/telemetry Feb 27, 2018
@roblourens
Member

Sure, that's fine. I'll do an in-depth review later.

we don't want telemetry messages to clog the communication channel between the DA and VS Code, so we can allow messages of higher priority to pass through first

This part still confuses me: is there evidence that there are so many telemetry messages that it slows anything down? I'd be surprised if we actually send enough to cause an issue.

@changsi-an
Contributor Author

When we start reporting the onScriptParsed event as telemetry, it might happen. Some projects have hundreds or thousands of small script files, which might be loaded in a short period of time. What is reflected in the UI is the loadedScript event, and that's what benefits the customers. If, in the meantime, we also send telemetry (Output events) through, imagine that 100 telemetry events are sent out before the last loadedScript event, either alternating with loadedScript events or randomly interleaved with them: the last loadedScript event will be significantly delayed (compared to the case where there were no telemetry events). Telemetry events do not directly benefit the end user; they only benefit the developers or code owners, and hence have lower priority. We shouldn't let our design/mechanism compromise UX.

That being said, I don't have real evidence yet, because it is about the new telemetry events we are adding, which haven't shipped yet. But we do have the above theory, and it might be useful to prevent that (the 'evidence') from happening.

@roblourens
Member

Maybe we should investigate some more to determine what the capacity of the IPC channel is, and whether we need to do anything here. There are already many messages being exchanged on the websocket and adapter/client connections, many of them very large, and the telemetry seems to be a small part of that. I'm worried that this is a premature optimization: adding dozens of lines of code to do something that sounds right but that we don't have data to back up.

That being said, I don't have real evidence yet, because it is about the new telemetry events we are adding, which haven't shipped yet. But we do have the above theory, and it might be useful to prevent that (the 'evidence') from happening.

Given that, why was this so urgent?

@changsi-an
Contributor Author

Here is why: before this PR, the telemetry we have been sending does not have a problem. After this PR, we will be sending new telemetry events, which can amass within a short period of time, and we don't want to lose them if they are throttled. It might not be wise to wait to see the loss first, as the so-called 'evidence', and then decide what to do about it; at that point, there is no way to recover the valuable telemetry events we have lost.

On your first point, the issue we want to prevent is not whether the IPC channel is capable of sending through a large amount of data (and I believe it is). It is that we don't want the telemetry to delay the real UI interactions. Telemetry is not part of the UX, so it should be de-prioritized; any UI delays, no matter how small, should be avoided.

@roblourens
Member

Can you set up a call to work this out? I think there are two points:

  • VS's rate limit on telemetry events. We should know what that rate limit is exactly and, if we will hit it, whether we can do this aggregation from PineZorro or inside VS instead of in individual adapters.
  • Whether telemetry could actually cause a significant delay in the time for a user to see a loaded script. My intuition is that the time to send many events across will be milliseconds, and will be dwarfed by the time to do other things, like resolving sourcemaps for the script. I would be surprised if users were ever waiting for a loaded script to show up because of too much telemetry clogging the channel.

@roblourens roblourens added this to the March 2018 milestone Apr 3, 2018