RUMM-1744 Collect Kronos telemetry in Internal Monitoring #709

@ncreated ncreated commented Jan 7, 2022

What and why?

📦🔬 This PR adds KronosMonitor for collecting internal telemetry on Kronos execution (NTP sync). This is to gather more data and have more observability on #647.

Note: this additional telemetry will only be collected in apps that enable our Internal Monitoring feature, which means our own dogfood 🐶 projects. It won't be collected from customer apps.

How?

KronosMonitor traces 3 phases of Kronos execution:

  • overall clock sync,
  • DNS resolution,
  • UDP connection to each resolved IP.

internal protocol KronosMonitor {
    // MARK: - Clock sync
    func notifySyncStart(from pool: String)
    func notifySyncEnd(serverOffset: TimeInterval?)

    // MARK: - DNS resolution
    func notifyResolveDNS(to addresses: [KronosInternetAddress])

    // MARK: - IP querying
    func notifyStartQuerying(ip address: KronosInternetAddress, numberOfSamples: Int)
    func notifyReceivePacket(from address: KronosInternetAddress, isValidSample: Bool)
    func notifyEndQuerying(ip address: KronosInternetAddress)
}
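
As a rough illustration of how a conforming type might fold these callbacks into one aggregated result, here is a minimal sketch. It is not the implementation from this PR: `KronosInternetAddress` is replaced by a simplified stand-in, and `RecordingKronosMonitor`, `SyncReport`, and `IPReport` are hypothetical names introduced only for this example.

```swift
import Foundation

// Stand-in for the SDK's address type (assumption: the real type differs).
struct KronosInternetAddress: Hashable {
    let host: String
}

// The protocol as quoted in the PR description.
protocol KronosMonitor {
    func notifySyncStart(from pool: String)
    func notifySyncEnd(serverOffset: TimeInterval?)
    func notifyResolveDNS(to addresses: [KronosInternetAddress])
    func notifyStartQuerying(ip address: KronosInternetAddress, numberOfSamples: Int)
    func notifyReceivePacket(from address: KronosInternetAddress, isValidSample: Bool)
    func notifyEndQuerying(ip address: KronosInternetAddress)
}

// Hypothetical per-IP summary.
struct IPReport: Equatable {
    var expectedSamples = 0
    var validPackets = 0
    var invalidPackets = 0
    var finished = false
}

// Hypothetical aggregate covering all three phases.
struct SyncReport {
    var pool: String?
    var serverOffset: TimeInterval?
    var resolvedAddresses: [KronosInternetAddress] = []
    var perIP: [KronosInternetAddress: IPReport] = [:]
    var syncEnded = false
}

// Records every callback into a single report, guarded by a serial queue
// because callbacks may arrive from different threads.
final class RecordingKronosMonitor: KronosMonitor {
    private let queue = DispatchQueue(label: "kronos-monitor-sketch")
    private var report = SyncReport()

    /// Thread-safe copy of everything recorded so far.
    var snapshot: SyncReport { queue.sync { report } }

    func notifySyncStart(from pool: String) {
        queue.sync {
            report = SyncReport()
            report.pool = pool
        }
    }

    func notifySyncEnd(serverOffset: TimeInterval?) {
        queue.sync {
            report.serverOffset = serverOffset
            report.syncEnded = true
        }
    }

    func notifyResolveDNS(to addresses: [KronosInternetAddress]) {
        queue.sync { report.resolvedAddresses = addresses }
    }

    func notifyStartQuerying(ip address: KronosInternetAddress, numberOfSamples: Int) {
        queue.sync { report.perIP[address, default: IPReport()].expectedSamples = numberOfSamples }
    }

    func notifyReceivePacket(from address: KronosInternetAddress, isValidSample: Bool) {
        queue.sync {
            if isValidSample {
                report.perIP[address, default: IPReport()].validPackets += 1
            } else {
                report.perIP[address, default: IPReport()].invalidPackets += 1
            }
        }
    }

    func notifyEndQuerying(ip address: KronosInternetAddress) {
        queue.sync { report.perIP[address, default: IPReport()].finished = true }
    }
}
```

Once `notifySyncEnd` has been called, a single `snapshot` read would give everything needed to build one log event.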

Additionally, it uses NWConnection to check UDP reachability of each resolved IP on port 123. It then aggregates all results (the 3 Kronos phases plus the connection checks for all IPs) and reports a single telemetry event (Log), e.g.:

[Screenshot 2022-01-07 at 16:19:49: example of the aggregated telemetry log]
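
For readers unfamiliar with the Network framework, a UDP check of this kind could look roughly like the sketch below. This is an assumption-laden illustration, not the code from this PR: `UDPProbeResult`, `probeResult(for:)`, and `probeUDP` are hypothetical names. Note that UDP is connectionless, so reaching `.ready` only suggests a usable route to the address; it does not prove that the NTP server replied.

```swift
import Foundation
import Network

/// Outcome of one UDP probe (hypothetical type for this sketch).
enum UDPProbeResult: Equatable {
    case ready
    case failed(String)
    case timedOut
}

/// Maps a connection state to a final result, or nil while the probe is still in progress.
func probeResult(for state: NWConnection.State) -> UDPProbeResult? {
    switch state {
    case .ready:
        return .ready
    case .failed(let error):
        return .failed("\(error)")
    case .setup, .preparing, .waiting, .cancelled:
        return nil // not final; the timeout below resolves these
    @unknown default:
        return nil
    }
}

/// Starts a UDP connection to `host` (port 123 by default) and reports the outcome exactly once.
func probeUDP(
    host: String,
    port: NWEndpoint.Port = 123,
    timeout: TimeInterval = 5,
    completion: @escaping (UDPProbeResult) -> Void
) {
    // All state changes and the timeout run on this serial queue, so `finished` needs no lock.
    let queue = DispatchQueue(label: "udp-probe-sketch")
    let connection = NWConnection(host: NWEndpoint.Host(host), port: port, using: .udp)
    var finished = false

    func finish(_ result: UDPProbeResult) {
        guard !finished else { return }
        finished = true
        connection.stateUpdateHandler = nil // breaks the handler -> connection reference cycle
        connection.cancel()
        completion(result)
    }

    connection.stateUpdateHandler = { state in
        if let result = probeResult(for: state) {
            finish(result)
        }
    }
    queue.asyncAfter(deadline: .now() + timeout) { finish(.timedOut) }
    connection.start(queue: queue)
}
```

Calling `probeUDP(host:completion:)` once per resolved IP and collecting the results would give the per-address part of the aggregated event.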

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference

@@ -7,22 +7,45 @@
@testable import Datadog

class KronosE2ETests: E2ETests {

@ncreated ncreated Jan 7, 2022

This is so we also collect this telemetry from our E2E tests and run the full KronosMonitor logic in the CI environment.

@buranmert buranmert left a comment

LGTM, but I left a comment regarding the usage of DispatchGroup.

@ncreated ncreated requested a review from buranmert January 10, 2022 14:58
@sergiocampama

I know this is not in the scope of this PR, but could there also be a way to disable clock sync in the API/build settings? This might help mitigate this issue in the meantime for our users. The alternative for me would be to fork the repo and do this on my own, but I'd much prefer to stay on the official Cocoapod dependency.

ncreated commented Jan 11, 2022

> I know this is not the scope of this PR, but could there also be a way to disable clock sync in the API/build settings? This might help mitigate this issue in the mean time for our users.

Hello @sergiocampama 🙂👋. Unfortunately, while such an API or build setting might work for you, it won't work for all our users. If an app uses Datadog Distributed Tracing or tracks 1st party RUM resources, disabling NTP will immediately misalign client and server spans, causing serious data consistency issues. We totally understand that some apps might not be collecting distributed telemetry, but the set of APIs and configuration options we provide must be equally functional for everyone.

This is why we're not considering it a viable solution to mitigate the problem. Instead, the latest series of actions we took in dd-sdk-ios should move us closer to finding the root cause of #647 and fixing it. In the meantime, we remain open to any help from the OSS community in reproducing the issue.

> The alternative for me would be to fork the repo and do this on my own, but I'd much prefer to stay on the official Cocoapod dependency.

I wish I had a better answer for you, and I clearly understand why this is important for your users. Unfortunately, a tailor-made fork is currently the only option to disable NTP in dd-sdk-ios. It can be easily achieved by not invoking KronosClock.sync {} in our ServerDateProvider.
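
To make the trade-off concrete, here is a minimal sketch of what skipping the sync amounts to in the abstract: with no server offset available, device timestamps are used uncorrected. `DateCorrector` and its members are hypothetical and exist only for this illustration; they are not the SDK's actual `ServerDateProvider` API.

```swift
import Foundation

/// Hypothetical helper: applies an NTP-derived offset to device time.
struct DateCorrector {
    /// Seconds to add to device time to obtain server time.
    /// `nil` models the case where no sync was ever performed.
    var serverOffset: TimeInterval?

    /// Returns the corrected date, or the device date unchanged when no offset is known.
    func serverDate(for deviceDate: Date) -> Date {
        guard let offset = serverOffset else {
            return deviceDate // no NTP sync: device clock drift goes straight into the data
        }
        return deviceDate.addingTimeInterval(offset)
    }
}
```

With `serverOffset` left as `nil`, a device whose clock is off by 30 seconds would timestamp its spans 30 seconds away from the matching server spans, which is the misalignment described above.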

@ncreated ncreated merged commit 79c63be into master Jan 11, 2022
@ncreated ncreated deleted the ncreated/RUMM-1744-collect-Kronos-telemetry-in-internal-monitoring branch January 11, 2022 11:04
@ncreated ncreated mentioned this pull request Jan 11, 2022