Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUMM-1890 Fix tests flakiness #711

Merged
merged 7 commits into from
Jan 11, 2022

Conversation

ncreated
Copy link
Member

@ncreated ncreated commented Jan 10, 2022

What and why?

🔬 This PR addresses tests flakiness from last month - collected from 189K test runs on the main branch:

Screenshot 2022-01-10 at 13 41 11

  • green - reproduced and fixed;
  • orange - cannot reproduce, added more verbosity;
  • other - too little data;

How?

Each case explained in PR comments. There's no general conclusion.

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference

notably: `testWhenDataIsBeingUploaded_itPrintsUploadProgressInformationAndSendsErrorsThroughInternalMonitoring`
The flakiness was caused by `userLogger` reference leaked in some other tests which use `DateCorrector` (e.g. all tests
in `DatadogTests`). NTP sync completion block was executed no matter of `self` existence, making it send logs
to current (global) `userLogger` arbitrarily. This was causing some other tests asserting `userLogger` output to receive
false data, coming not from their execution.
by running each measure to fixed number of samples, instead of using
time-based condition.
precisely in `testWhenCurrentValueIsObtainedFromNetworkConnectionInfoProvider_thenCrashContextProviderNotifiesNewContext`.
by ensuring autorelease VC deallocation with `autoreleasepool {}`
…ntsAreSent`

by increasing number of samples to 400 (with 200 it was failing 4/500 repetitions,
with 300 1/500, with 400 it's 100% success).
@ncreated ncreated self-assigned this Jan 10, 2022
Comment on lines 41 to 45
completion: { offset in
completion: { [weak self] offset in
guard let _ = self else {
return
}

Copy link
Member Author

@ncreated ncreated Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is to address flaky execution of

func testWhenDataIsBeingUploaded_itPrintsUploadProgressInformationAndSendsErrorsThroughInternalMonitoring() {
let previousUserLogger = userLogger
defer { userLogger = previousUserLogger }
let mockUserLoggerOutput = LogOutputMock()
userLogger = .mockWith(logOutput: mockUserLoggerOutput)

Problem was that DateCorrector is expected to perform real NTP sync in many high-level tests (e.g. most tests in DatadogTests). This closure might leak from test execution and send NTP sync log to global userLogger, causing bad value being recorded in other tests that depend on global userLogger mock, e.g.:

Screenshot 2022-01-10 at 10 18 26

Screenshot 2022-01-10 at 10 19 07

The fix is to not leak this closure - and return it early when SDK was deallocated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work?

completion: { [weak userLogger] offset in

Not a change request, I'm not sure about lifecycle of the logger vs. DateCorrector 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DateCorrector is instantiated only once for a given SDK instance (from Datadog.initialize()).

I think more correct solution is to avoid doing any work in this DateCorrector closure if self == nil. From self == nil we know that DateCorrector was deallocated → so the parent Datadog object was deallocated → so no-one is interested in result of this work.

Capturing weak reference to userLogger might work too but it will be ignoring only a portion of the completion work related to userLogger. As no-one will use the result of completion work, I find this odd.

How does it sound?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the root cause is that Kronos.Clock is static as well as userLogger. I think it's best to fix it in ServerDateProvider and call the completion only if self exist, WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best to fix it in ServerDateProvider and call the completion only if self exist, WDYT?

Indeed! It depends on self already and clearly it runs its completion inconsitently when self is nil. Will do 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ✅

Comment on lines -191 to +193
XCTAssertEqual(initialContext.lastNetworkConnectionInfo, initialNetworkConnectionInfo)
XCTAssertEqual(updatedContext?.lastNetworkConnectionInfo, currentNetworkConnectionInfo)
let updatedNetworkConnectionInfo = try XCTUnwrap(updatedContext?.lastNetworkConnectionInfo)
XCTAssertEqual(initialContext.lastNetworkConnectionInfo, initialNetworkConnectionInfo, "It must store initial network info")
XCTAssertEqual(updatedNetworkConnectionInfo, currentNetworkConnectionInfo, "It must store updated network info")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It failed 2 times over 1 month, but no clue on what's happening wrong in this test. I'm just adding more verbosity to these assertions, so we can better understand it.

let simulatedSessionsCount = 200
let simulatedSessionsCount = 400
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using tests repetition in local:

  • with 200 samples - 4 failures in 500 runs
  • with 300 samples - 1 failures in 500 runs
  • with 400 samples - 0 failures in 500 runs

With 400 samples it runs in ~180ms.

Comment on lines -94 to 101
let identity = try XCTUnwrap(vc?.asRUMViewIdentity())
try autoreleasepool {
var vc: UIViewController? = UIViewController()
identity = try XCTUnwrap(vc?.asRUMViewIdentity())
XCTAssertNotNil(identity.identifiable, "Reference should be available while `vc` is alive.")
vc = nil
}

XCTAssertNotNil(identity.identifiable, "Reference should be available while `vc` is alive.")
vc = nil
XCTAssertNil(identity.identifiable, "Reference should not be available after `vc` was deallocated.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 failures in last 1 month. It was failing on:

XCTAssertNotNil(identity.identifiable, "Reference should be available while `vc` is alive.")

I assume that autoreleasepool {} should help as VC is autorelease Objective-C object and now we clean it up in different scope than the assertion.

func testHighAndLowRefreshRates() {
func testWhenMainThreadOverheadGoesUp_itMeasuresLowerRefreshRate() throws {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 failures in last 1 month. This is quite old issue, I changed the approach for this test:

  • Instead of observing main thread for certain amount of time...
  • ... now it observes it until VitalRefreshRateReader records certain number of samples.

It passes 500x repetition in local. From my observation, time-based constraint was very flaky (sometimes recording just 1 sample, sometimes many). Current approach of recording 30 samples seems stable and still executes under 2s in local (like before).

Comment on lines +152 to +158
if #available(iOS 13.0, *) {
encodedValue1 = try prettyEncoder.encode(value1)
encodedValue2 = try prettyEncoder.encode(value2)
} else {
encodedValue1 = try prettyEncoder.encode(EncodingContainer(value1))
encodedValue2 = try prettyEncoder.encode(EncodingContainer(value2))
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the biggest problem in last 1 month - 14 failures in:

func testGivenContextWithLastRUMSessionStateSet_whenItGetsEncoded_thenTheValueIsPreservedAfterDecoding() throws {
let randomRUMSessionState: RUMSessionState? = Bool.random() ? .mockRandom() : nil
// Given
var context: CrashContext = .mockRandom()
context.lastRUMSessionState = randomRUMSessionState
// When
let serializedContext = try encoder.encode(context)
// Then
let deserializedContext = try decoder.decode(CrashContext.self, from: serializedContext)
try AssertEncodedRepresentationsEqual(
value1: deserializedContext.lastRUMSessionState,
value2: randomRUMSessionState
)
}

When fuzzy input is resolved to nil, serialization fails below iOS 13. To fix it I'm using the helper we added exactly for this case:

/// Prior to `iOS13.0`, the `JSONEncoder` supports only object or array as the root type.
/// Hence we can't test encoding `Encodable` values directly and we need to wrap it inside this `EncodingContainer` container.
///
/// Reference: https://bugs.swift.org/browse/SR-6163
struct EncodingContainer<Value: Encodable>: Encodable {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@ncreated ncreated marked this pull request as ready for review January 10, 2022 13:40
@ncreated ncreated requested a review from a team as a code owner January 10, 2022 13:40
Comment on lines +152 to +158
if #available(iOS 13.0, *) {
encodedValue1 = try prettyEncoder.encode(value1)
encodedValue2 = try prettyEncoder.encode(value2)
} else {
encodedValue1 = try prettyEncoder.encode(EncodingContainer(value1))
encodedValue2 = try prettyEncoder.encode(EncodingContainer(value2))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@ncreated ncreated merged commit 10b4943 into master Jan 11, 2022
@ncreated ncreated deleted the ncreated/RUMM-1890-fix-issues-from-nightly-tests branch January 11, 2022 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants