Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add posthog tracking for UTDs in Element X #2300

Closed
3 tasks done
VolkerJunginger opened this issue Feb 20, 2024 · 18 comments
Closed
3 tasks done

Add posthog tracking for UTDs in Element X #2300

VolkerJunginger opened this issue Feb 20, 2024 · 18 comments

Comments

@VolkerJunginger
Copy link
Contributor

VolkerJunginger commented Feb 20, 2024

Your use case

Add posthog tracking for UTDs.

Event: Error
Crypto SDK: Rust
Domain E2E

I am not a 100% sure that this what is required for the tracking. Please align also w. the CryptoTeam.

Have you considered any alternatives?

No response

Additional context

No response

Tasks

Preview Give feedback
@BillCarsonFr
Copy link
Member

BillCarsonFr commented Feb 22, 2024

Every unable to decrypt should be reported.

The error are mapped to poshtog like that:

{
                    eventName: "Error",
                    domain: "E2EE",
                    name: "OlmKeysNotSentError"
                    timeToDecryptMillis: number (if late utd) | -1 (if final utd),
                    cryptoSDK: "Rust"
}

The error name should be on of those, but as for now the SDK is not reporting the error type, we hardcode to OlmKeysNotSentError

  • OlmKeysNotSentError: unknown session id
  • OlmIndexError: ratcheted key (session is known but ratcheted)
  • OlmUnspecifiedError: Other olm error (mac / signatures...)
  • UnknownError: anything else (unsupported algorithm?.., replay?)

context allows to get the raw client error (might differ depending on client) Ignored for now as the context is not available yet

Decryption error are reported per event. So if 2 messages using the same sessionId are in UTD, it should report twice.

Clients should at least have some mecanism to not report several time the same event. Though non of the existing clients are persiting it, i.e after a restart the same event would be reported.

The cryptoSDK property should be also added to the $screen events, as we are doing ration of unique UTD unique Screen.

To filter out EIX ans EAX we use a App Name (Element X) and Library (posthog-ios or posthog-android)

@manuroe
Copy link
Member

manuroe commented Feb 23, 2024

@BillCarsonFr how useful and feasible would be to have the UTD reports logic in the crypto crate?

I am asking because we have nothing on EX. The timeline is managed by the SDK. It would be great to have a single logic component at least in the SDK but maybe in the crypto crate as it could serve EWR too.

@BillCarsonFr
Copy link
Member

@BillCarsonFr how useful and feasible would be to have the UTD reports logic in the crypto crate?

I am asking because we have nothing on EX. The timeline is managed by the SDK. It would be great to have a single logic component at least in the SDK but maybe in the crypto crate as it could serve EWR too.

The logic will be in the rust-sdk I believe.
Modulo can we know on sdk side when a message is actually visible on screen?

@richvdh richvdh changed the title Ad posthog tracking for UTDs Add posthog tracking for UTDs Feb 26, 2024
@richvdh richvdh changed the title Add posthog tracking for UTDs Add posthog tracking for UTDs in Element X Mar 6, 2024
@richvdh
Copy link
Member

richvdh commented Mar 7, 2024

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

@VolkerJunginger
Copy link
Contributor Author

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count.
If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

@BillCarsonFr
Copy link
Member

BillCarsonFr commented Mar 8, 2024

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count. If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

Unfortunalty we can't have exactly what we want. The current grace period is 4s and we are increasing it to stabilize the metrics, as it appears that there is a good amount of late keys on other clients (maybe less common with sliding sync).

When we will have ressources we will be able to create a better system, something more like time-to-decrypt in seconds. That would allow us in the analytics tool to move the threshold.

@BillCarsonFr
Copy link
Member

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

No it's just that it was displayed on screen once. Why would we do a stare at it for 30s thing??

@BillCarsonFr
Copy link
Member

One of the solution for implementation could be to use the existing rust-tracer and create a subscriber for analytics.
There is a crate for sentry https://crates.io/crates/sentry-tracing
not for posthog though

@VolkerJunginger
Copy link
Contributor Author

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count. If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

Unfortunalty we can't have exactly what we want. The current grace period is 4s and we are increasing it to stabilize the metrics, as it appears that there is a good amount of late keys on other clients (maybe less common with sliding sync).

When we will have ressources we will be able to create a better system, something more like time-to-decrypt in seconds. That would allow us in the analytics tool to move the threshold.

Am I getting it right: You increase the grace period to get lower numbers? That's cheating if you ask me.

If we have the ambition to create a great user-experience and believe this comes with no user having to wait for more than 4 sec that their messages decrypt we should stay with the metric as they are.

@neilisfragile
Copy link
Member

My understanding for the rationale in increasing the grace period is not to juice the numbers but to differentiate permanently failing messages from temporarily failing messages.

Both are bad, but permanently failing messages are much worse.

So by increasing the period we can filter out the temporary failures and can focus on the permanent ones, accepting that the temporary failures also need work further down the line.

The other option would be to report two windows, so a 4s (say) and 30 sec, but I think the team are trying to get the simplest thing possible to unblock them.

@BillCarsonFr can you just confirm my understanding is correct?

@fkwp
Copy link

fkwp commented Mar 8, 2024

Am I getting it right: You increase the grace period to get lower numbers? That's cheating if you ask me.

If we have the ambition to create a great user-experience and believe this comes with no user having to wait for more than 4 sec that their messages decrypt we should stay with the metric as they are.

tbh. this is just a pragmatic short cut. The proper "solution" would be to record the time needed (delay between message arrived and key arrived) and present it in a histogram. This way we could learn a lot about the process... However, adding the missing bits to gather this information is just to expensive.

@richvdh
Copy link
Member

richvdh commented Mar 8, 2024

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

No it's just that it was displayed on screen once. Why would we do a stare at it for 30s thing??

It's fine, but if we go down that path, we need to make sure that if the key arrives after 20s, we attempt to decrypt the message again even if it is no longer on screen, so that we can exclude that message from the stats. Maybe that already happens, but the logs in https://github.com/element-hq/element-x-ios-rageshakes/issues/1377 suggest not. I'd like someone more familiar with the Element X codebase to explain what's going on there.

@BillCarsonFr
Copy link
Member

It's fine, but if we go down that path, we need to make sure that if the key arrives after 20s, we attempt to decrypt the message again even if it is no longer on screen, so that we can exclude that message from the stats.

Yes we need to review all the existing code and ensure it's doing what we want.

I'd like someone more familiar with the Element X codebase to explain what's going on there.

AFAIK as I looked some times ago, I think this logic is in the Timeline object, there is a event_handler added that will call retry_decryption when a room_key/fowarded room_key is received.
So looks like if the Timeline is dropped, it will not retry? I don't know how this work for backup though.

Also, I was surprised that it was not using the signaling in the crypto crate (there is a stream when keys are added/udpated in store).

@BillCarsonFr
Copy link
Member

My understanding for the rationale in increasing the grace period is not to juice the numbers but to differentiate permanently failing messages from temporarily failing messages.

Both are bad, but permanently failing messages are much worse.

So by increasing the period we can filter out the temporary failures and can focus on the permanent ones, accepting that the temporary failures also need work further down the line.

The other option would be to report two windows, so a 4s (say) and 30 sec, but I think the team are trying to get the simplest thing possible to unblock them.

@BillCarsonFr can you just confirm my understanding is correct?

Exactly, in the perfect world we would to the perfect thing.
As you said we want to filter out temporary failures for now

@BillCarsonFr
Copy link
Member

BillCarsonFr commented Mar 8, 2024

Regarding Visibility on screen, we might just add a field in the captured event (visibleOnScreen or something like that). We will then be able to filter out on posthog if we want

@BillCarsonFr
Copy link
Member

Quick Notes on the discussions:

The SDK API is not at the moment exposing the type of error that did occur (MissingRoomKey/Megolm), so we can't find the correct name to put in the Error event.

Decision: We don't block on that and hardcode OlmKeysNotSentError. We decided to use OlmKeysNotSentError and not UnknownError because on other platforms OlmKeysNotSentError is 90% of the reported errors, UnknownError is very low. So would be more accurate to default to OlmKeysNotSentError for now

bnjbvr added a commit to matrix-org/matrix-rust-sdk that referenced this issue Mar 14, 2024
This adds a new mechanism in the UI crate (since re-attempts to decrypt happen in the timeline, as of today — later that'll happen in the event cache) to notify whenever we run into a UTD (an event couldn't be decrypted) or a late-decryption event (after some time, a UTD could be decrypted).

This new hook will deduplicate pings for the same event (identifying events on their event id), and also implements an optional grace period. If an event was a UTD:

- if it's still a UTD after the grace period, then it's reported with a `None` `time_to_decrypt`,
- if it's not a UTD anymore (i.e. it's been decrypted in the meanwhile), then it's reported with a `time_to_decrypt` set to the time it took to decrypt the event (approximate, since it starts counting from the time the timeline receives it, not the time the SDK fails to decrypt it at first).

It's configurable as an optional hook on timeline builders. For the FFI, it's configurable at the sync service's level with a "delegate", and then the sync service will forward the hook to the timelines it creates, and the hook will forward the UTD info to the delegate.

Part of element-hq/element-meta#2300.

---

* ui: add a new module and trait for subscribing to unable-to-decrypt notifications

and late decryptions (i.e. the key came in after the event that required it for decryption).

* timeline: hook in (!) the unable-to-decrypt hook

* timeline: prefix some test names with test_

* utd hook: delay reporting a UTDs

* ffi: add support for configuring the utd hook

* utd hook: switch strategy, have a single hook

And have the data structure contain extra information about late-decryption events.

* utd hook: rename `SmartUtdHook` to `UtdHookManager`

* ffi: configure the UTD hook with a grace period of 60 seconds

And ignore UTDs that have been late-decrypted in less than 4 seconds.

* utd hook: update documentation and satisfy the clippy gods

* ffi: introduce another UnableToDecryptInfo FFI struct that exposes simplified fields from the SDK's version

* review: introduce type alias for pending utd reports

* review: address other review comments
@bnjbvr bnjbvr removed their assignment Mar 14, 2024
@bnjbvr
Copy link
Member

bnjbvr commented Mar 14, 2024

The SDK initial implementation has just been landed, so unassigning myself — it's now in the hands of our app devs!

@manuroe
Copy link
Member

manuroe commented Mar 18, 2024

First UTDs analytics will be available in EXA 0.4.6 and EXI 1.5.12. We should make those versions public in the store today.

@manuroe manuroe closed this as completed Mar 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants