-
Notifications
You must be signed in to change notification settings - Fork 119
Introduces a mechanism to batch-sending telemetry events. #292
Conversation
@digeff PTAL |
this.runAndMeasureProcessingTime('notification/onPaused', () => { | ||
return this.onPaused(params); | ||
}); | ||
}); | ||
this.chrome.Debugger.onResumed(() => this.onResumed()); | ||
this.chrome.Debugger.onScriptParsed(params => this.onScriptParsed(params)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we measuring onScriptParsed in another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all the other notifications will be taken care by the next PR. I hope to have this PR introduces the "BatchTelemetryReporter" only (and give one usage example).
Some of the notification handlers need some small tweaks before they can be measured correctly. So I'd rather do another PR for those.
src/utils.ts
Outdated
@@ -549,4 +551,23 @@ export function isNumber(num: number): boolean { | |||
|
|||
export function toVoidP(p: Promise<any>): Promise<void> { | |||
return p.then(() => { }); | |||
} | |||
|
|||
export function calculateElapsedTime(startProcessingTime: [number, number]): number { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Wouldn't it be better to type the tuple as "HighRes timer" or something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Added that.
src/utils.ts
Outdated
const MillisecondsPerSecond = 1000; | ||
|
||
const ellapsedTime = process.hrtime(startProcessingTime); | ||
const ellapsedMilliseconds = ellapsedTime[0] * MillisecondsPerSecond + ellapsedTime[1] / NanoSecondsPerMillisecond; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnt this easier to understand a different way => ( ellapsedTime[0]* 1e9 + ellapsedTime[1])/NanoSecondsPerMillisecond
?
where 1e9 is the number of nanoseconds in a second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Changed.
I am merging this PR now. If there is any other issues please kindly point it out and I will address later. |
@changsi-an Why is this needed to be implemented in the DA, if this is limited by VS? Shouldn't this be implemented inside VS's telemetry service to make sure that data isn't lost? |
@auchenberg Tweaking VS's telemetry service is a much bigger change than we have time for. Also regardless of what underlying telemetry service system is being used, overwhelming the service endpoint is not ideal. If the batching strategy works, it should be adopted as an optimization and diminishes the value of sending each telemetry individually. As a side note, we only batch the events that would potentially amass in a short period of time. |
src/chrome/chromeDebugAdapter.ts
Outdated
@@ -363,7 +367,11 @@ export abstract class ChromeDebugAdapter implements IDebugAdapter { | |||
* Hook up all connection events | |||
*/ | |||
protected hookConnectionEvents(): void { | |||
this.chrome.Debugger.onPaused(params => this.onPaused(params)); | |||
this.chrome.Debugger.onPaused((params) => { | |||
this.runAndMeasureProcessingTime('notification/onPaused', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add crdp to the name, so it's clear that the notification comes from Chrome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change to crdpnotification/onPaused
, sounds fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put a separator so it's easier to identify the two words e.g.:
crdp/notification/onPaused
crdp-notification/onPaused
crdp_notification/onPaused
crdpNotification/onPaused
I think the only one that is allowed is: crdp/notification/onPaused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adopted crdp/notification/onPaused
src/chrome/chromeDebugAdapter.ts
Outdated
startTime: startTime.toString() | ||
}; | ||
|
||
return procedure().then(() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find async/await code easier to read and reason than using promises
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed and changed
src/chrome/chromeDebugAdapter.ts
Outdated
properties.successful = 'true'; | ||
}, (e) => { | ||
properties.successful = 'false'; | ||
properties.exceptionType = "generic"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a more intuitive way to put this information? If I read generic in this field, I don't think I'll understand what it means
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about not setting this field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to 'firstChance'
// There is an issue on some clients and reportEvent only currently accept strings properties, | ||
// hence all the following properties must be strings. | ||
successful?: "true" | "false"; | ||
exceptionType?: ExceptionType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be called eventType or actionType rather than exceptionType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used to differentiate whether it's a uncaughtException or promiseRejection for exception case. Why should it be eventType or actionType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that people might expect to see the class name in a field called exceptionType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this word clearly conveys its meaning. Neither eventType
or actionType
is correct to capture the intention of this field. The case you mention might be valid, but I feel that it might come from a limited group of people who might come from a scoped background. It is not that convincing that exceptionType
has to mean a class name. Users of data should check the content of the data before make any assumptions. So unless we would have a absolutely perfect wording for this concept (other than actionType and eventType), I might leave it as-is.
src/telemetry.ts
Outdated
@@ -36,4 +49,80 @@ export class NullTelemetryReporter implements ITelemetryReporter { | |||
|
|||
} | |||
|
|||
export const DefaultTelemetryInterval = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename this to DefaultTelemetryIntervalInMilliseconds to make it clear what the unit is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/telemetry.ts
Outdated
private _eventBuckets: {[eventName: string]: any}; | ||
private _timer: NodeJS.Timer; | ||
|
||
public constructor(private _telemetryReporter: TelemetryReporter, private _cadence: number = DefaultTelemetryInterval) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_cadenceInMilliseconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/telemetry.ts
Outdated
} | ||
|
||
public setupEventHandler(_sendEvent: (event: DebugProtocol.Event) => void): void { | ||
// no-op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be surprising for customers of this class if this is a no-op. It might be clearer if this class doesn't implement ITelemetryReporter and doesn't have this method.
this._eventBuckets = {}; | ||
} | ||
|
||
private send(): void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding hard to follow what this method does. I think it'd be much easier to understand if we split this into smaller private methods with meaningful self-documenting names. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been divided into several parts.
this.reset(); | ||
} | ||
/** | ||
* Transfrom the bucket of events data from the form: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really useful :) 👍 👍 👍
let propertyNamesSet = {}; | ||
for (const entry of bucket) { | ||
for (const key of Object.keys(entry)) { | ||
propertyNamesSet[key] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the advantage of using a dictionary over an array for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array won't work. I am using dictionary as a set. Array can only append new keys (unless we search with indexof()
). But a lot of entries will have the same property names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I compleatly missed that :)
Could you add a comment saying that we use the dictionary to eliminate duplicated values?
} | ||
|
||
/** | ||
* Get the property keys from all the entries of a event bucket: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome comment!
src/telemetry.ts
Outdated
// Run through again, collect the values for each property name. | ||
for (const entry of bucket) { | ||
for (const key of allPropertyNamesInTheBucket) { | ||
properties[`aggregated.${key}`].push(entry[key] || null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If entry[key] is "" or 0 entry[key] || null will return null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
private send(): void { | ||
for (const eventName in this._eventBuckets) { | ||
const bucket = this._eventBuckets[eventName]; | ||
let properties = BatchTelemetryReporter.transfromBucketData(bucket); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be awesome if the name of the method would explain what kind of transformation the method is going to do (e.g. what the output is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describing the transformation on a method name might be too verbose. The realistic way is probably to refer the comment put on the method. The idea is that "if the name is not sufficient enough to tell what, refer to its signature comment, if still cannot, refer to its implementation".
Or otherwise, if you have a better name in mind, please suggest.
@changsi-an I'm still not sure I understand. We don't seem to have this problem today in VS Code, and this approach would require each Debug Adaptor to throttle events that satisfies constraints on VS's telemetry endpoint. Why not have the central VSTelemetry service implementing this, so no DA should deal with this VS internal limitation? |
@auchenberg we can chat if you need more context. The telemetry we are adding is going to result in lot of data points being generated and so we need to be careful not to overwhelm upstream telemetry sinks (VScode or VS) when they get this data. 10s was a arbitrary number we chose which we can either make larger or smaller. There is no data loss as a result of this PR. Data you got earlier you continue to get (along with lot more new data). The only change will be that the last data set will be sent 10s later for the telemetry sinks to process. Can you please elaborate on your concerns with this PR? This is best practice when you have potential to generate lot of traffic for telemetry sinks. If we don't do this, every DA using this package will end up potentially spamming their respective telemetry endpoints with lots of requests and as a result causing upstream (VSCode or VS telemetry processors to be) impacted. |
@dhanvikapila Again I'm trying to understand why the throttling problem has to be solved inside a debugging core library that's used in 8 different debuggers across VS and VS Code, and not inside the telemetry service in VS. In VS Code, to my knowledge, we don't implement telemetry throttling, but simply forwards them to the AppInsights client (npm package) that might be doing some throttling when needed. My concern is that we are adding VS telemetry specific logic inside a core debugging library to fix a symptom instead of fixing it centrally inside VS's telemetry abstraction/sink/service. It shouldn't be this libraries concern that VS's telemetry service might drop data if the volumes get's to high. |
@auchenberg as I mentioned earlier, the amount of telemetry is going to be large. If we don't do this, all debug adapters who will also be generating this twenty points can potentially impact their telemetry endpoints. It might be better for us to jump in a call to sync. This is not about any limitations on the telemetry service. As a client of a service the client would need to be sure the data they send is not dropped on the floor resulting in missed data for us. |
Since @roblourens is this week on vacation and we are wrapping up our milestone (we have endgame next week) I suggest to postpone this PR to the March milestone. As far as I understand we have been sending telemtry like this always and there is no need to rush this in. |
We need to get in a room so we can talk about this and close it out. Otherwise we are going to be repeating each other and not get anywhere. |
@isidorn this is intended for VS, not vscode, so it has no impact on our release. I have read the comments but still have the same question as Kenneth. I don't understand where the limitation is or where the data loss happens. Is the volume of telemetry really too high? You can include me on any call. |
@roblourens I had a meeting with @auchenberg and Kai today. We reached an agreement to batch-sending telemetry for notifications from a debugee. In that, we don't want telemetry messages clog the communication channel between DA and VS Code, so we could allow messages of higher priority pass through first. Based on that outcome, we would still want you to code review this PR as is, to clear out issues for edge cases which I have not noticed yet. In order to catch up release schedule on our end, we will first merge this PR to feature branch "feature/telemetry". I will work on getting it to master after your final code review. |
Sure, that's fine. I'll do an in depth review later.
This part still confuses me, is there evidence that there are so many telemetry messages that it slows anything down? I'd be surprised if we actually send enough to cause an issue. |
When we start reporting onScriptParsed event as telemetry, it might happen. Some project would have hundreds or thousands of small script files, which might be loaded in a short period of time. What reflects to the UI is the loadedScript event, and that's what benefits the customers. If in the meantime, we also send telemetry event (Output events) through. Imagine that there are 100 telemetry events sent out before the last loadedScript event, either alternatively with loadedScript event or randomly intertwined with loadedScript event, the last loadedScript event will be significantly delayed (comparing to the case where there were no telemetry event). Telemetry events do not directly benefit end user, it only benefit the developers or code owners, hence have lower priority. We shouldn't let our design/mechanism compromise UX. That being said, I don't have a real evidence yet, because it is about the new telemetry events we are adding which has not happened yet. But we do have the above theory and it might be useful to prevent that (an 'evidence') from happening. |
Maybe we should investigate some more to determine what the capacity of the IPC channel is, and whether we need to do anything here. There are already many messages being exchanged on the websocket and adapter/client connections, many of them very large, and the telemetry seems to be a small part of that. I'm worried that this is a premature optimization - adding dozens of lines of code to do something that sounds right but we don't have data to back it up.
Given that, why was this so urgent? |
Here is why: before this PR, the telemetry we have been sending so far does not have a problem. After this PR, we will be sending new telemetry events, which can amass within a short period of time, and we don't want to loose them in case they are throttled. It might not be wise to see the loss first, as the so-called 'evidence', then decide what to do with it. At that point, there is no way to recover the valuable telemetry events we have lost. For your first point, the issue we want to prevent is not whether the IPC channel is capable of sending through large amount of data (And I believe it is capable too). It is that we don't want the telemetry to cause delay of the real UI interactions. Telemetry is not part of the UX, it should be de-prioritized. Any UI delays, not matter how small, should be avoided. |
Can you set up a call to work this out? I think there are two points
|
Batch-sending telemetry is necessary because some notifications from debuggee might be too frequent for the telemetry infrastructure to handle. Take VS as an example, it throttles telemetry events if there are two many within a short period of time. We will lose some important telemetry when that happens.
The
BatchTelemetryReporter
will cache the received telemetry events and sends them out every 10 seconds.This PR also refactors some code and migrates some general telemetry handling code from DebugSession.ts to utils.ts or telemetry.ts.