
Introduces a mechanism to batch-send telemetry events. #292

Merged
merged 1 commit on Feb 27, 2018

Conversation

changsi-an
Contributor

Batch-sending telemetry is necessary because some notifications from the debuggee might be too frequent for the telemetry infrastructure to handle. Take VS as an example: it throttles telemetry events if there are too many within a short period of time, and we will lose some important telemetry when that happens.

The BatchTelemetryReporter caches the received telemetry events and sends them out every 10 seconds.

This PR also refactors some code and migrates some general telemetry handling code from DebugSession.ts to utils.ts or telemetry.ts.
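For readers skimming this thread, here is a minimal sketch of the batching idea described above. It is not the exact code in this PR: the reportEvent/dispose signatures and the ITelemetryReporter shape are assumptions; events are simply cached per name and flushed to the wrapped reporter on a timer.

```typescript
// Minimal sketch of the batching idea (assumed shapes, not the PR's exact code).
interface ITelemetryReporter {
    reportEvent(name: string, data?: any): void;
}

class BatchTelemetryReporter {
    private _eventBuckets: { [eventName: string]: any[] } = {};
    private _timer: NodeJS.Timer;

    public constructor(
        private _telemetryReporter: ITelemetryReporter,
        private _cadenceInMilliseconds: number = 10000) {
        // Flush whatever has accumulated once per cadence.
        this._timer = setInterval(() => this.send(), this._cadenceInMilliseconds);
    }

    public reportEvent(name: string, data?: any): void {
        // Cache the event instead of forwarding it immediately.
        if (!this._eventBuckets[name]) {
            this._eventBuckets[name] = [];
        }
        this._eventBuckets[name].push(data);
    }

    public dispose(): void {
        clearInterval(this._timer);
        this.send(); // flush anything still pending
    }

    private send(): void {
        for (const eventName in this._eventBuckets) {
            // A real implementation aggregates the bucket into a single payload here.
            this._telemetryReporter.reportEvent(eventName, this._eventBuckets[eventName]);
        }
        this._eventBuckets = {};
    }
}
```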

@changsi-an
Contributor Author

@digeff PTAL

this.runAndMeasureProcessingTime('notification/onPaused', () => {
return this.onPaused(params);
});
});
this.chrome.Debugger.onResumed(() => this.onResumed());
this.chrome.Debugger.onScriptParsed(params => this.onScriptParsed(params));
Member

are we measuring onScriptParsed in another PR?

Contributor Author

Yes, all the other notifications will be taken care of in the next PR. I'd like this PR to introduce only the "BatchTelemetryReporter" (and give one usage example).

Some of the notification handlers need some small tweaks before they can be measured correctly. So I'd rather do another PR for those.

src/utils.ts Outdated
@@ -549,4 +551,23 @@ export function isNumber(num: number): boolean {

export function toVoidP(p: Promise<any>): Promise<void> {
return p.then(() => { });
}

export function calculateElapsedTime(startProcessingTime: [number, number]): number {
Member

Nit: Wouldn't it be better to type the tuple as "HighRes timer" or something similar?

Contributor Author

Sounds good. Added that.

src/utils.ts Outdated
const MillisecondsPerSecond = 1000;

const ellapsedTime = process.hrtime(startProcessingTime);
const ellapsedMilliseconds = ellapsedTime[0] * MillisecondsPerSecond + ellapsedTime[1] / NanoSecondsPerMillisecond;
Member

Isn't this easier to understand a different way: (ellapsedTime[0] * 1e9 + ellapsedTime[1]) / NanoSecondsPerMillisecond, where 1e9 is the number of nanoseconds in a second?

Contributor Author

Sounds good. Changed.
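For reference, a sketch of what the helper might look like after both pieces of feedback (a type alias for the hrtime tuple and the nanoseconds-first conversion). The alias name HighResTimer and the exact constant names are assumptions.

```typescript
// Sketch of the elapsed-time helper after the review feedback (assumed names).
export type HighResTimer = [number, number];

export function calculateElapsedTime(startProcessingTime: HighResTimer): number {
    const NanoSecondsPerMillisecond = 1000000;
    const NanoSecondsPerSecond = 1e9;

    // process.hrtime(start) returns [seconds, nanoseconds] elapsed since start.
    const elapsedTime = process.hrtime(startProcessingTime);
    return (elapsedTime[0] * NanoSecondsPerSecond + elapsedTime[1]) / NanoSecondsPerMillisecond;
}
```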

@changsi-an
Contributor Author

I am merging this PR now. If there are any other issues, please point them out and I will address them later.

@auchenberg
Contributor

@changsi-an Why does this need to be implemented in the DA if the limitation is in VS? Shouldn't this be implemented inside VS's telemetry service to make sure that data isn't lost?

@changsi-an
Contributor Author

@auchenberg Tweaking VS's telemetry service is a much bigger change than we have time for. Also, regardless of which underlying telemetry service is being used, overwhelming the service endpoint is not ideal. If the batching strategy works, it should be adopted as an optimization, since it diminishes the value of sending each telemetry event individually.

As a side note, we only batch the events that would potentially amass in a short period of time.

@@ -363,7 +367,11 @@ export abstract class ChromeDebugAdapter implements IDebugAdapter {
* Hook up all connection events
*/
protected hookConnectionEvents(): void {
this.chrome.Debugger.onPaused(params => this.onPaused(params));
this.chrome.Debugger.onPaused((params) => {
this.runAndMeasureProcessingTime('notification/onPaused', () => {
Contributor

I'd add crdp to the name, so it's clear that the notification comes from Chrome

Contributor Author

Will change it to crdpnotification/onPaused; does that sound fine?

Contributor

I'd put a separator so it's easier to identify the two words e.g.:
crdp/notification/onPaused
crdp-notification/onPaused
crdp_notification/onPaused
crdpNotification/onPaused

I think the only one that is allowed is: crdp/notification/onPaused

Contributor Author

Adopted crdp/notification/onPaused

startTime: startTime.toString()
};

return procedure().then(() => {
Contributor

I personally find async/await code easier to read and reason about than promise chains

Contributor Author

Agreed and changed
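For context, a rough async/await version of the wrapper being discussed, written as a free function for illustration. The successful, exceptionType and startTime properties mirror snippets quoted in this review; the timeTakenInMilliseconds name and the reportEvent callback shape are assumptions.

```typescript
// Hypothetical async/await sketch of the measurement wrapper (assumed names).
async function runAndMeasureProcessingTime(
    eventName: string,
    procedure: () => Promise<void>,
    reportEvent: (name: string, properties: { [key: string]: string }) => void): Promise<void> {

    const startTime = Date.now();
    const startHighResTime = process.hrtime();
    const properties: { [key: string]: string } = {
        startTime: startTime.toString()
    };

    try {
        await procedure();
        properties.successful = 'true';
    } catch (e) {
        properties.successful = 'false';
        properties.exceptionType = 'firstChance';
    }

    // Hypothetical property name; the PR records the elapsed time in some form.
    const elapsed = process.hrtime(startHighResTime);
    properties.timeTakenInMilliseconds = ((elapsed[0] * 1e9 + elapsed[1]) / 1e6).toString();

    reportEvent(eventName, properties);
}
```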

properties.successful = 'true';
}, (e) => {
properties.successful = 'false';
properties.exceptionType = "generic";
Contributor

Is there a more intuitive way to put this information? If I read generic in this field, I don't think I'll understand what it means

Contributor Author

How about not setting this field?

Contributor Author

Renamed to 'firstChance'

// There is an issue on some clients where reportEvent currently accepts only string properties,
// hence all the following properties must be strings.
successful?: "true" | "false";
exceptionType?: ExceptionType;
Contributor

Should this be called eventType or actionType rather than exceptionType?

Contributor Author

It is used to differentiate whether it's an uncaughtException or a promiseRejection in the exception case. Why should it be eventType or actionType?

Contributor

I think that people might expect to see the class name in a field called exceptionType.

Contributor Author

I think this word clearly conveys its meaning. Neither eventType nor actionType correctly captures the intention of this field. The case you mention might be valid, but I feel it comes from a limited group of people with a particular background, and it is not that convincing that exceptionType has to mean a class name. Users of the data should check its content before making any assumptions. So unless we have an absolutely perfect wording for this concept (other than actionType and eventType), I might leave it as-is.

src/telemetry.ts Outdated
@@ -36,4 +49,80 @@ export class NullTelemetryReporter implements ITelemetryReporter {

}

export const DefaultTelemetryInterval = 10000;
Contributor

I'd rename this to DefaultTelemetryIntervalInMilliseconds to make it clear what the unit is.

Contributor Author

Done.

src/telemetry.ts Outdated
private _eventBuckets: {[eventName: string]: any};
private _timer: NodeJS.Timer;

public constructor(private _telemetryReporter: TelemetryReporter, private _cadence: number = DefaultTelemetryInterval) {
Contributor

_cadenceInMilliseconds?

Contributor Author

Done.

src/telemetry.ts Outdated
}

public setupEventHandler(_sendEvent: (event: DebugProtocol.Event) => void): void {
// no-op
Contributor

It'll be surprising for customers of this class if this is a no-op. It might be clearer if this class doesn't implement ITelemetryReporter and doesn't have this method.

this._eventBuckets = {};
}

private send(): void {
Contributor

I'm finding it hard to follow what this method does. I think it'd be much easier to understand if we split this into smaller private methods with meaningful, self-documenting names. What do you think?

Contributor Author

It's been divided into several parts.

this.reset();
}
/**
* Transfrom the bucket of events data from the form:
Contributor

This is really useful :) 👍 👍 👍

let propertyNamesSet = {};
for (const entry of bucket) {
for (const key of Object.keys(entry)) {
propertyNamesSet[key] = true;
Contributor

What is the advantage of using a dictionary over an array for this?

Contributor Author

An array won't work; I am using the dictionary as a set. An array would just keep appending keys (unless we search with indexOf()), but a lot of entries will have the same property names.

Contributor

That makes sense. I completely missed that :)

Could you add a comment saying that we use the dictionary to eliminate duplicated values?

}

/**
* Get the property keys from all the entries of a event bucket:
Contributor

Awesome comment!

src/telemetry.ts Outdated
// Run through again, collect the values for each property name.
for (const entry of bucket) {
for (const key of allPropertyNamesInTheBucket) {
properties[`aggregated.${key}`].push(entry[key] || null);
Contributor

If entry[key] is "" or 0, then entry[key] || null will return null.

Contributor Author

Good catch. Fixed.
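To summarize the shape being reviewed here, a hypothetical standalone version of the aggregation with the falsy-value fix applied: the union of property names is collected via an object used as a set, and only genuinely missing keys become null so that "" and 0 survive. The final JSON.stringify step is an assumption, based on the comment that properties must be strings on some clients.

```typescript
// Sketch of the bucket aggregation discussed above (assumed function name).
function transformBucketData(bucket: { [key: string]: any }[]): { [key: string]: string } {
    // Use an object as a set to de-duplicate property names across entries.
    const propertyNamesSet: { [key: string]: boolean } = {};
    for (const entry of bucket) {
        for (const key of Object.keys(entry)) {
            propertyNamesSet[key] = true;
        }
    }
    const allPropertyNamesInTheBucket = Object.keys(propertyNamesSet);

    // One "aggregated.<key>" array per property, aligned across entries.
    const aggregated: { [key: string]: any[] } = {};
    for (const key of allPropertyNamesInTheBucket) {
        aggregated[`aggregated.${key}`] = [];
    }
    for (const entry of bucket) {
        for (const key of allPropertyNamesInTheBucket) {
            // Explicit "in" check instead of `entry[key] || null`, so "" and 0 are kept.
            aggregated[`aggregated.${key}`].push(key in entry ? entry[key] : null);
        }
    }

    // Assumed serialization step: some clients only accept string properties.
    const properties: { [key: string]: string } = {};
    for (const key of Object.keys(aggregated)) {
        properties[key] = JSON.stringify(aggregated[key]);
    }
    return properties;
}
```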

private send(): void {
for (const eventName in this._eventBuckets) {
const bucket = this._eventBuckets[eventName];
let properties = BatchTelemetryReporter.transfromBucketData(bucket);
Contributor

It'd be awesome if the name of the method would explain what kind of transformation the method is going to do (e.g. what the output is)

Contributor Author

Describing the transformation in the method name might be too verbose. The realistic approach is probably to refer to the comment on the method. The idea is: if the name is not sufficient to tell what it does, refer to its signature comment; if that is still not enough, refer to its implementation.

Or otherwise, if you have a better name in mind, please suggest it.

@auchenberg
Contributor

@changsi-an I'm still not sure I understand. We don't seem to have this problem today in VS Code, and this approach would require each Debug Adapter to throttle events to satisfy constraints on VS's telemetry endpoint.

Why not have the central VSTelemetry service implement this, so no DA has to deal with this VS-internal limitation?

@roblourens @dhanvikapila

@dhanvikapila
Member

@auchenberg we can chat if you need more context. The telemetry we are adding is going to result in a lot of data points being generated, so we need to be careful not to overwhelm upstream telemetry sinks (VS Code or VS) when they get this data. 10s was an arbitrary number we chose, which we can make either larger or smaller. There is no data loss as a result of this PR; data you got earlier you continue to get (along with a lot more new data). The only change is that the last data set will be sent 10s later for the telemetry sinks to process.

Can you please elaborate on your concerns with this PR? This is best practice when you have the potential to generate a lot of traffic for telemetry sinks. If we don't do this, every DA using this package could end up spamming its respective telemetry endpoint with lots of requests and, as a result, impacting the upstream VS Code or VS telemetry processors.

@auchenberg
Contributor

auchenberg commented Feb 23, 2018

@dhanvikapila Again I'm trying to understand why the throttling problem has to be solved inside a debugging core library that's used in 8 different debuggers across VS and VS Code, and not inside the telemetry service in VS.

In VS Code, to my knowledge, we don't implement telemetry throttling, but simply forward events to the AppInsights client (npm package), which might do some throttling when needed.

My concern is that we are adding VS-telemetry-specific logic inside a core debugging library to fix a symptom instead of fixing it centrally inside VS's telemetry abstraction/sink/service. It shouldn't be this library's concern that VS's telemetry service might drop data if the volume gets too high.

@dhanvikapila
Member

@auchenberg as I mentioned earlier, the amount of telemetry is going to be large. If we don't do this, all debug adapters, which will also be generating these data points, can potentially impact their telemetry endpoints.
Throttling makes sense when it happens on the server end and not the client (like VS Code), unless you want to lose data. Batching is the right thing to do at the client, since you don't want to drop data on the floor.
This telemetry is going into VS and VS Code; it is not specific to VS. Why do you think VS Code cannot use this telemetry data?

It might be better for us to jump on a call to sync. This is not about any limitation of the telemetry service. As a client of a service, we need to be sure the data we send is not dropped on the floor, resulting in missed data for us.

@isidorn
Contributor

isidorn commented Feb 23, 2018

Since @roblourens is on vacation this week and we are wrapping up our milestone (we have endgame next week), I suggest postponing this PR to the March milestone.
Then @roblourens can give this a nice review.

As far as I understand, we have always been sending telemetry like this, and there is no need to rush this in.

@dhanvikapila
Member

We need to get in a room so we can talk about this and close it out. Otherwise we are going to keep repeating each other and not get anywhere.

@roblourens
Member

@isidorn this is intended for VS, not vscode, so it has no impact on our release.

I have read the comments but still have the same question as Kenneth. I don't understand where the limitation is or where the data loss happens. Is the volume of telemetry really too high? You can include me on any call.

@changsi-an changsi-an changed the base branch from master to feature/telemetry February 27, 2018 23:22
@changsi-an
Contributor Author

@roblourens I had a meeting with @auchenberg and Kai today. We reached an agreement to batch-send telemetry for notifications from the debuggee. The reasoning is that we don't want telemetry messages to clog the communication channel between the DA and VS Code, so we can allow messages of higher priority to pass through first.

Based on that outcome, we would still like you to review this PR as-is, to clear out edge-case issues I have not noticed yet.

In order to catch up with the release schedule on our end, we will first merge this PR into the feature branch "feature/telemetry". I will work on getting it into master after your final code review.

@changsi-an changsi-an merged commit 8fe3d0a into microsoft:feature/telemetry Feb 27, 2018
@roblourens
Member

Sure, that's fine. I'll do an in-depth review later.

we don't want telemetry messages to clog the communication channel between the DA and VS Code, so we can allow messages of higher priority to pass through first

This part still confuses me: is there evidence that there are so many telemetry messages that it slows anything down? I'd be surprised if we actually send enough to cause an issue.

@changsi-an
Contributor Author

When we start reporting the onScriptParsed event as telemetry, it might happen. Some projects have hundreds or thousands of small script files, which might be loaded in a short period of time. What is reflected in the UI is the loadedScript event, and that's what benefits the customers. If, in the meantime, we also send telemetry (Output events) through, imagine that 100 telemetry events are sent out before the last loadedScript event, either alternating with loadedScript events or randomly interleaved with them: the last loadedScript event will be significantly delayed (compared to the case where there were no telemetry events). Telemetry events do not directly benefit the end user; they only benefit the developers or code owners, and hence have lower priority. We shouldn't let our design/mechanism compromise UX.

That being said, I don't have real evidence yet, because it is about the new telemetry events we are adding, which haven't shipped yet. But we do have the above theory, and it might be useful to prevent that (the 'evidence') from happening.

@roblourens
Member

Maybe we should investigate some more to determine what the capacity of the IPC channel is, and whether we need to do anything here. There are already many messages being exchanged on the websocket and adapter/client connections, many of them very large, and the telemetry seems to be a small part of that. I'm worried that this is a premature optimization: adding dozens of lines of code to do something that sounds right but that we don't have data to back up.

That being said, I don't have real evidence yet, because it is about the new telemetry events we are adding, which haven't shipped yet. But we do have the above theory, and it might be useful to prevent that (the 'evidence') from happening.

Given that, why was this so urgent?

@changsi-an
Contributor Author

Here is why: before this PR, the telemetry we have been sending does not have a problem. After this PR, we will be sending new telemetry events, which can amass within a short period of time, and we don't want to lose them if they are throttled. It might not be wise to wait to see the loss first, as the so-called 'evidence', and then decide what to do about it; at that point, there is no way to recover the valuable telemetry events we have lost.

On your first point, the issue we want to prevent is not whether the IPC channel is capable of sending through a large amount of data (and I believe it is). It is that we don't want the telemetry to delay the real UI interactions. Telemetry is not part of the UX, so it should be de-prioritized; any UI delays, no matter how small, should be avoided.

@roblourens
Member

Can you set up a call to work this out? I think there are two points:

  • VS's rate limit on telemetry events. We should know what that rate limit is exactly and, if we will hit it, whether we can do this aggregation from PineZorro or inside VS instead of in individual adapters.
  • Whether telemetry could actually cause a significant delay in the time for a user to see a loaded script. My intuition is that the time to send many events across will be milliseconds, and will be dwarfed by the time to do other things, like resolving sourcemaps for the script. I would be surprised if users were ever waiting for a loaded script to show up because of too much telemetry clogging the channel.

@roblourens roblourens added this to the March 2018 milestone Apr 3, 2018