Create new device for tests that deploy configs #2753

Merged: 25 commits merged into Azure:master from the e2e-device-per-test branch on Apr 10, 2020

Conversation

@damonbarry damonbarry (Member) commented Mar 30, 2020

The new end-to-end tests have become increasingly flaky as more tests (and more complicated tests) have been added. This is largely because most of the tests share the same edge device identity, which leads to at least two problems:

  • As more configurations are deployed, edge hub has to work increasingly hard to route traffic, often to devices/modules from previous tests that have already been removed. Tests that run near the end of the suite end up failing randomly, simply because we aren't waiting long enough for edge hub to work as designed under rapid configuration churn (nor can we afford to wait; the tests need to run reasonably fast).
  • The currently running test has no way to know when edge hub has changed its routes to match the current deployment rather than the deployment from the previous test. This means that when a test's modules start sending messages, those first messages might be delivered to the wrong destination or not delivered at all. Many tests carefully count received messages, so they're sensitive to this problem.

With this change, we create a new edge device identity for any test/fixture that deploys its own configuration. This lets edge hub start up fresh each time, and tests can more readily verify the exact scenario they're testing. And we still get the speed benefit of installing iotedged only once per run (unlike the old end-to-end framework, which installs everything from scratch for every test--reliable, but slow).

Included changes:

  • Add a DeviceId class that can generate unique IDs from a common base. The base ID is used as the prefix for all logs in a single test run (see the first sketch after this list).
  • Remove DeviceId property from the Context class. Early on, it was nice to be able to pass in a device ID instead of having the tests generate one, but it's not really that useful anymore, and the new approach is much simpler.
  • Promote the DeviceId property on EdgeRuntime to public.
  • In code locations that formerly referred to Context.Current.DeviceId, use DeviceId.Generate() to get a new device ID when needed; otherwise, use EdgeRuntime.DeviceId to get the parent edge device's ID.
  • Remove the IdentityLimits class. The DeviceId class now centralizes and encapsulates the creation of device IDs so callers don't have to remember to check length limits after creating their own IDs.
  • Push SAS- and X509-specific code from ManualProvisioningFixture into the appropriate derived classes. ManualProvisioningFixture could probably go away entirely, but that would require more careful thought, so I'm holding off on it for now.
  • In SasManualProvisioningFixture, change the [OneTimeSetUp] method to [SetUp] so that each test under this fixture gets a new edge device identity. The exception is tests under the derived CustomCertificatesFixture (the TransparentGateway tests), which share a single device: the fixture overrides [SetUp] with an empty method and calls the parent logic from [OneTimeSetUp] instead (see the second sketch after this list).
  • In X509ManualProvisioningFixture, remove some unused variables.
  • In the ValidateMetrics test, we were deploying a vanilla agent image first so that the agent would get updated when we later deployed it with special config (see Improve ValidateMetrics test on Windows #2583). Now that we configure iotedged with a fresh device before starting the test, it automatically starts out with the vanilla agent, so I removed the workaround code from the test.
  • In the Module test class, several tests use the temp sensor module, so the TempSensorModule helper class was created to hand out unique module names and avoid edge hub routing conflicts. Now that we restart iotedged with a fresh device ID for each test, the helper class is no longer needed, so I removed it.
  • Refactor some test code to follow style guidelines, e.g., don't explicitly call default base constructor, don't explicitly decorate class members with private, etc.
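For illustration, here is a rough sketch of the DeviceId idea. The class name comes from this PR, but the signature, limit, and internals below are assumptions, not the actual implementation:

```csharp
using System;
using System.Threading;

// Illustrative sketch only; the real DeviceId class in this PR may differ.
public class DeviceId
{
    // IoT Hub caps device IDs at 128 characters. Encapsulating the check here
    // is what lets the PR delete the separate IdentityLimits class.
    const int MaxLength = 128;

    static int counter;

    // Common base shared by every ID (and every log prefix) in a test run.
    readonly string baseId;

    public DeviceId(string baseId) => this.baseId = baseId;

    // Hypothetical signature: yields unique IDs like "<base>-TempSensor-1",
    // truncated so callers never have to check the length limit themselves.
    public string Generate(string testName)
    {
        string id = $"{this.baseId}-{testName}-{Interlocked.Increment(ref counter)}";
        return id.Length <= MaxLength ? id : id.Substring(0, MaxLength);
    }
}
```

Under this sketch, a test that deploys its own configuration would call Generate(...) in its per-test setup, while anything operating on the parent edge device would read EdgeRuntime.DeviceId instead.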
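And a minimal NUnit sketch of the [SetUp]/[OneTimeSetUp] arrangement described above; the method names and bodies are stand-ins for the real provisioning logic:

```csharp
using System.Threading.Tasks;
using NUnit.Framework;

public class SasManualProvisioningFixture
{
    // [SetUp] (rather than [OneTimeSetUp]) runs before *each* test, so every
    // test under this fixture gets its own freshly provisioned edge device.
    [SetUp]
    public virtual Task ProvisionEdgeAsync() => this.CreateFreshDeviceAsync();

    // Stand-in for the real provisioning logic.
    protected Task CreateFreshDeviceAsync() => Task.CompletedTask;
}

public class CustomCertificatesFixture : SasManualProvisioningFixture
{
    // Override the inherited [SetUp] with a no-op so NUnit doesn't provision
    // a new device before each test...
    [SetUp]
    public override Task ProvisionEdgeAsync() => Task.CompletedTask;

    // ...and run the parent logic once instead, so all TransparentGateway
    // tests share a single device.
    [OneTimeSetUp]
    public Task ProvisionSharedEdgeAsync() => this.CreateFreshDeviceAsync();
}
```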

@damonbarry damonbarry changed the title from "E2e device per test" to "Create new device for tests that deploy configs" on Mar 30, 2020
kodiakhq bot pushed a commit that referenced this pull request Apr 2, 2020
Docker images take a _long_ time to download on Raspberry Pi. Since we recently started deleting all images before running the tests (#2736), the first test to run (currently it's `QuickstartCerts`) has to download images from scratch. Also, the `TempFilterFunc` test has to download a rather large image that has nothing in common with our other images. Both of these tests tend to time out before they can download everything, even after the recent timeout increase to 10 min (#2761).

This change expands the Linux arm32v7-specific build setup to:
1. Run `git clean` as sudo - we need to clean up the source code tree before running the tests, especially the previous run's test logs and certificates (the latter are owned by root, so the `checkout` task in `e2e-setup.yaml` can't delete them).
2. Parse `context.json` to get the list of images required for this test run (see the sketch below).
3. `docker pull` the required images _before any old images are removed_, to take advantage of layer caching which speeds up downloads significantly.
4. Remove old containers, images, docker networks, and volumes.

With these changes in place, tests on the Pi are completing in 30 seconds to 2.5 minutes. As a result, this change also reduces the test timeout to 6 minutes so that failed tests don't slow the run down as much as the previous 10-minute timeout did.
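As a rough C# illustration of step 2, one could collect the image references from `context.json` like this, assuming (purely for this sketch, not a documented schema) that images appear as top-level string properties whose names end in `Image`:

```csharp
using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

class ImageList
{
    static void Main()
    {
        // Assumed layout: top-level string properties like "edgeAgentImage"
        // hold the image references needed for the run.
        JObject context = JObject.Parse(File.ReadAllText("context.json"));
        var images = context.Properties()
            .Where(p => p.Name.EndsWith("Image", StringComparison.OrdinalIgnoreCase))
            .Select(p => (string)p.Value);

        // Print one image per line so a script can `docker pull` each image
        // before the old images are removed, preserving layer caching.
        foreach (string image in images)
        {
            Console.WriteLine(image);
        }
    }
}
```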

A few tests that were failing before are still failing and need to be investigated separately:
- `TransparentGateway` for MQTT + CA/self-signed certs
- `TempFilterFunc` and `ValidateMetrics` (see #2753)
- `PriorityQueueModuleToModuleMessages` (see #2762)
@damonbarry damonbarry marked this pull request as ready for review April 10, 2020 16:13
@dylanbronson dylanbronson (Contributor) left a comment

LGTM - Linux went all green with this PR. Windows still has issues, but they seem unrelated to this PR.

@kodiakhq kodiakhq bot merged commit 4744cf6 into Azure:master Apr 10, 2020
@damonbarry damonbarry deleted the e2e-device-per-test branch April 10, 2020 22:58
kodiakhq bot pushed a commit that referenced this pull request Jul 27, 2020
Given recent improvements to the end-to-end test framework (#2635, #2753), the logic that splits deployments into multiple stages (#2248) is no longer needed. This change removes that logic.

In verifying the changes, I also ran into occasional failures in the `ValidateMetrics` test, because the stringized JSON labels recently added to Edge Agent containers failed the comparison. The expected and actual twin reported properties always had the same JSON properties, but sometimes in a different order:

```
systemModules.edgeAgent.settings.createOptions: {"Labels":{"net.azure-devices.edge.create-options":"{}","net.azure-devices.edge.env":"{\"RuntimeLogLevel\":{\"value\":\"debug\"},\"PerformanceMetricsUpdateFrequency\":{\"value\":\"00:00:20\"}}","net.azure-devices.edge.owner":"Microsoft.Azure.Devices.Edge.Agent"}
systemModules.edgeAgent.settings.createOptions: {"Labels":{"net.azure-devices.edge.create-options":"{}","net.azure-devices.edge.env":"{\"PerformanceMetricsUpdateFrequency\":{\"value\":\"00:00:20\"},\"RuntimeLogLevel\":{\"value\":\"debug\"}}","net.azure-devices.edge.owner":"Microsoft.Azure.Devices.Edge.Agent"}
```

The logic that compares these properties is trying to confirm that the configuration has been deployed, and the agent's labels aren't related to the deployment, so I just removed them from the comparison (a sketch of the underlying ordering problem follows).
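To illustrate the ordering problem (this is not the PR's actual comparison code, just a minimal sketch using the `env` label values from the output above): comparing the stringized labels as raw strings fails when property order differs, while parsing them and comparing with Json.NET's order-insensitive JToken.DeepEquals succeeds:

```csharp
using System;
using Newtonsoft.Json.Linq;

class LabelOrdering
{
    static void Main()
    {
        // Same properties, different order - exactly the failure mode seen in
        // the ValidateMetrics test.
        string expected = "{\"RuntimeLogLevel\":{\"value\":\"debug\"},\"PerformanceMetricsUpdateFrequency\":{\"value\":\"00:00:20\"}}";
        string actual   = "{\"PerformanceMetricsUpdateFrequency\":{\"value\":\"00:00:20\"},\"RuntimeLogLevel\":{\"value\":\"debug\"}}";

        // A raw string comparison is order-sensitive, so it fails:
        Console.WriteLine(expected == actual); // False

        // Comparing parsed JSON ignores property order, so it passes:
        Console.WriteLine(JToken.DeepEquals(
            JObject.Parse(expected), JObject.Parse(actual))); // True
    }
}
```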
damonbarry added a commit to damonbarry/iotedge that referenced this pull request Aug 5, 2020
ggjjj pushed a commit to ggjjj/iotedge that referenced this pull request Aug 20, 2020