Add posthog tracking for UTDs in Element X #2300

VolkerJunginger · 2024-02-20T15:40:27Z

Your use case

Add posthog tracking for UTDs.

Event: Error
Crypto SDK: Rust
Domain E2E

I am not a 100% sure that this what is required for the tracking. Please align also w. the CryptoTeam.

Have you considered any alternatives?

No response

Additional context

No response

Tasks

Give feedback

BillCarsonFr · 2024-02-22T16:12:11Z

Every unable to decrypt should be reported.

The error are mapped to poshtog like that:

{
                    eventName: "Error",
                    domain: "E2EE",
                    name: "OlmKeysNotSentError"
                    timeToDecryptMillis: number (if late utd) | -1 (if final utd),
                    cryptoSDK: "Rust"
}

The error name should be on of those, but as for now the SDK is not reporting the error type, we hardcode to OlmKeysNotSentError

OlmKeysNotSentError: unknown session id
OlmIndexError: ratcheted key (session is known but ratcheted)
OlmUnspecifiedError: Other olm error (mac / signatures...)
UnknownError: anything else (unsupported algorithm?.., replay?)

~~context allows to get the raw client error (might differ depending on client)~~ Ignored for now as the context is not available yet

Decryption error are reported per event. So if 2 messages using the same sessionId are in UTD, it should report twice.

Clients should at least have some mecanism to not report several time the same event. Though non of the existing clients are persiting it, i.e after a restart the same event would be reported.

The cryptoSDK property should be also added to the $screen events, as we are doing ration of unique UTD unique Screen.

To filter out EIX ans EAX we use a App Name (Element X) and Library (posthog-ios or posthog-android)

manuroe · 2024-02-23T15:24:08Z

@BillCarsonFr how useful and feasible would be to have the UTD reports logic in the crypto crate?

I am asking because we have nothing on EX. The timeline is managed by the SDK. It would be great to have a single logic component at least in the SDK but maybe in the crypto crate as it could serve EWR too.

BillCarsonFr · 2024-02-26T10:15:06Z

@BillCarsonFr how useful and feasible would be to have the UTD reports logic in the crypto crate?

I am asking because we have nothing on EX. The timeline is managed by the SDK. It would be great to have a single logic component at least in the SDK but maybe in the crypto crate as it could serve EWR too.

The logic will be in the rust-sdk I believe.
Modulo can we know on sdk side when a message is actually visible on screen?

richvdh · 2024-03-07T22:42:55Z

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

VolkerJunginger · 2024-03-08T07:40:17Z

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count.
If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

BillCarsonFr · 2024-03-08T08:21:11Z

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count. If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

Unfortunalty we can't have exactly what we want. The current grace period is 4s and we are increasing it to stabilize the metrics, as it appears that there is a good amount of late keys on other clients (maybe less common with sliding sync).

When we will have ressources we will be able to create a better system, something more like time-to-decrypt in seconds. That would allow us in the analytics tool to move the threshold.

BillCarsonFr · 2024-03-08T08:22:44Z

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

No it's just that it was displayed on screen once. Why would we do a stare at it for 30s thing??

BillCarsonFr · 2024-03-08T08:32:23Z

One of the solution for implementation could be to use the existing rust-tracer and create a subscriber for analytics.
There is a crate for sentry https://crates.io/crates/sentry-tracing
not for posthog though

VolkerJunginger · 2024-03-08T08:49:05Z

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count. If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

Unfortunalty we can't have exactly what we want. The current grace period is 4s and we are increasing it to stabilize the metrics, as it appears that there is a good amount of late keys on other clients (maybe less common with sliding sync).

When we will have ressources we will be able to create a better system, something more like time-to-decrypt in seconds. That would allow us in the analytics tool to move the threshold.

Am I getting it right: You increase the grace period to get lower numbers? That's cheating if you ask me.

If we have the ambition to create a great user-experience and believe this comes with no user having to wait for more than 4 sec that their messages decrypt we should stay with the metric as they are.

neilisfragile · 2024-03-08T09:16:23Z

My understanding for the rationale in increasing the grace period is not to juice the numbers but to differentiate permanently failing messages from temporarily failing messages.

Both are bad, but permanently failing messages are much worse.

So by increasing the period we can filter out the temporary failures and can focus on the permanent ones, accepting that the temporary failures also need work further down the line.

The other option would be to report two windows, so a 4s (say) and 30 sec, but I think the team are trying to get the simplest thing possible to unblock them.

@BillCarsonFr can you just confirm my understanding is correct?

fkwp · 2024-03-08T10:40:49Z

Am I getting it right: You increase the grace period to get lower numbers? That's cheating if you ask me.

If we have the ambition to create a great user-experience and believe this comes with no user having to wait for more than 4 sec that their messages decrypt we should stay with the metric as they are.

tbh. this is just a pragmatic short cut. The proper "solution" would be to record the time needed (delay between message arrived and key arrived) and present it in a histogram. This way we could learn a lot about the process... However, adding the missing bits to gather this information is just to expensive.

richvdh · 2024-03-08T10:41:48Z

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

No it's just that it was displayed on screen once. Why would we do a stare at it for 30s thing??

It's fine, but if we go down that path, we need to make sure that if the key arrives after 20s, we attempt to decrypt the message again even if it is no longer on screen, so that we can exclude that message from the stats. Maybe that already happens, but the logs in https://github.com/element-hq/element-x-ios-rageshakes/issues/1377 suggest not. I'd like someone more familiar with the Element X codebase to explain what's going on there.

BillCarsonFr · 2024-03-08T12:34:46Z

It's fine, but if we go down that path, we need to make sure that if the key arrives after 20s, we attempt to decrypt the message again even if it is no longer on screen, so that we can exclude that message from the stats.

Yes we need to review all the existing code and ensure it's doing what we want.

I'd like someone more familiar with the Element X codebase to explain what's going on there.

AFAIK as I looked some times ago, I think this logic is in the Timeline object, there is a event_handler added that will call retry_decryption when a room_key/fowarded room_key is received.
So looks like if the Timeline is dropped, it will not retry? I don't know how this work for backup though.

Also, I was surprised that it was not using the signaling in the crypto crate (there is a stream when keys are added/udpated in store).

BillCarsonFr · 2024-03-08T12:46:08Z

My understanding for the rationale in increasing the grace period is not to juice the numbers but to differentiate permanently failing messages from temporarily failing messages.

Both are bad, but permanently failing messages are much worse.

So by increasing the period we can filter out the temporary failures and can focus on the permanent ones, accepting that the temporary failures also need work further down the line.

The other option would be to report two windows, so a 4s (say) and 30 sec, but I think the team are trying to get the simplest thing possible to unblock them.

@BillCarsonFr can you just confirm my understanding is correct?

Exactly, in the perfect world we would to the perfect thing.
As you said we want to filter out temporary failures for now

BillCarsonFr · 2024-03-08T16:36:00Z

Regarding Visibility on screen, we might just add a field in the captured event (visibleOnScreen or something like that). We will then be able to filter out on posthog if we want

BillCarsonFr · 2024-03-13T10:33:38Z

Quick Notes on the discussions:

The SDK API is not at the moment exposing the type of error that did occur (MissingRoomKey/Megolm), so we can't find the correct name to put in the Error event.

Decision: We don't block on that and hardcode OlmKeysNotSentError. We decided to use OlmKeysNotSentError and not UnknownError because on other platforms OlmKeysNotSentError is 90% of the reported errors, UnknownError is very low. So would be more accurate to default to OlmKeysNotSentError for now

This adds a new mechanism in the UI crate (since re-attempts to decrypt happen in the timeline, as of today — later that'll happen in the event cache) to notify whenever we run into a UTD (an event couldn't be decrypted) or a late-decryption event (after some time, a UTD could be decrypted). This new hook will deduplicate pings for the same event (identifying events on their event id), and also implements an optional grace period. If an event was a UTD: - if it's still a UTD after the grace period, then it's reported with a `None` `time_to_decrypt`, - if it's not a UTD anymore (i.e. it's been decrypted in the meanwhile), then it's reported with a `time_to_decrypt` set to the time it took to decrypt the event (approximate, since it starts counting from the time the timeline receives it, not the time the SDK fails to decrypt it at first). It's configurable as an optional hook on timeline builders. For the FFI, it's configurable at the sync service's level with a "delegate", and then the sync service will forward the hook to the timelines it creates, and the hook will forward the UTD info to the delegate. Part of element-hq/element-meta#2300. --- * ui: add a new module and trait for subscribing to unable-to-decrypt notifications and late decryptions (i.e. the key came in after the event that required it for decryption). * timeline: hook in (!) the unable-to-decrypt hook * timeline: prefix some test names with test_ * utd hook: delay reporting a UTDs * ffi: add support for configuring the utd hook * utd hook: switch strategy, have a single hook And have the data structure contain extra information about late-decryption events. * utd hook: rename `SmartUtdHook` to `UtdHookManager` * ffi: configure the UTD hook with a grace period of 60 seconds And ignore UTDs that have been late-decrypted in less than 4 seconds. * utd hook: update documentation and satisfy the clippy gods * ffi: introduce another UnableToDecryptInfo FFI struct that exposes simplified fields from the SDK's version * review: introduce type alias for pending utd reports * review: address other review comments

bnjbvr · 2024-03-14T13:14:25Z

The SDK initial implementation has just been landed, so unassigning myself — it's now in the hands of our app devs!

manuroe · 2024-03-18T07:58:15Z

First UTDs analytics will be available in EXA 0.4.6 and EXI 1.5.12. We should make those versions public in the store today.

VolkerJunginger added T-Enhancement App: ElementX Android App: ElementX iOS labels Feb 20, 2024

BillCarsonFr mentioned this issue Feb 23, 2024

Posthog | Increase decryption failure grace period #2303

Closed

richvdh changed the title ~~Ad posthog tracking for UTDs~~ Add posthog tracking for UTDs Feb 26, 2024

richvdh changed the title ~~Add posthog tracking for UTDs~~ Add posthog tracking for UTDs in Element X Mar 6, 2024

BillCarsonFr mentioned this issue Mar 11, 2024

Posthog: Add more properties to unable to decrypt errors to allow more filtering options #2332

Closed

manuroe assigned bnjbvr Mar 11, 2024

bnjbvr mentioned this issue Mar 11, 2024

ui/timeline: allow subscribing to UTDs and late-decrypt events matrix-org/matrix-rust-sdk#3206

Merged

bnjbvr removed their assignment Mar 14, 2024

manuroe closed this as completed Mar 18, 2024

BillCarsonFr mentioned this issue May 2, 2024

Element X: Posthog UTDs round off #2413

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add posthog tracking for UTDs in Element X #2300

Add posthog tracking for UTDs in Element X #2300

VolkerJunginger commented Feb 20, 2024 •

edited by manuroe

Loading

Tasks

BillCarsonFr commented Feb 22, 2024 •

edited

Loading

manuroe commented Feb 23, 2024

BillCarsonFr commented Feb 26, 2024

richvdh commented Mar 7, 2024

VolkerJunginger commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024 •

edited

Loading

BillCarsonFr commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024

VolkerJunginger commented Mar 8, 2024

neilisfragile commented Mar 8, 2024

fkwp commented Mar 8, 2024 •

edited

Loading

richvdh commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024 •

edited

Loading

BillCarsonFr commented Mar 13, 2024

bnjbvr commented Mar 14, 2024

manuroe commented Mar 18, 2024

Add posthog tracking for UTDs in Element X #2300

Add posthog tracking for UTDs in Element X #2300

Comments

VolkerJunginger commented Feb 20, 2024 • edited by manuroe Loading

Your use case

Have you considered any alternatives?

Additional context

Tasks

BillCarsonFr commented Feb 22, 2024 • edited Loading

manuroe commented Feb 23, 2024

BillCarsonFr commented Feb 26, 2024

richvdh commented Mar 7, 2024

VolkerJunginger commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024 • edited Loading

BillCarsonFr commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024

VolkerJunginger commented Mar 8, 2024

neilisfragile commented Mar 8, 2024

fkwp commented Mar 8, 2024 • edited Loading

richvdh commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024

BillCarsonFr commented Mar 8, 2024 • edited Loading

BillCarsonFr commented Mar 13, 2024

bnjbvr commented Mar 14, 2024

manuroe commented Mar 18, 2024

VolkerJunginger commented Feb 20, 2024 •

edited by manuroe

Loading

BillCarsonFr commented Feb 22, 2024 •

edited

Loading

BillCarsonFr commented Mar 8, 2024 •

edited

Loading

fkwp commented Mar 8, 2024 •

edited

Loading

BillCarsonFr commented Mar 8, 2024 •

edited

Loading