Use total tests expected for Interop score aggregation #242

Open · wants to merge 1 commit into main
Conversation

@DanielRyanSmith (Contributor) commented Feb 4, 2025

This change fixes an issue in which past Interop scores (displayed in the graph on the dashboard) could be inaccurate. The Interop scores were being calculated using the total number of labeled tests found within the set of runs, rather than the number of labeled tests that are expected to exist. As a result, each past Interop score reflected the score as of that run's date, rather than what the score should be today.

This bug only affects focus areas whose labeled tests changed during the Interop year. It also does not affect the calculations for past browser scores, which is why the Interop score was occasionally displayed as higher than the individual browser scores.

Within the scoring script is an explanation of why we want to aggregate the scores this way:

      // We always normalize against the number of tests we are looking for,
      // rather than the total number of tests we found. The trade-off is all
      // about new tests being added to the set.
      //
      // If a large chunk of tests are introduced at date X, and they fail in
      // some browser, then runs after date X look worse if you're only
      // counting total tests found - even though the tests would have failed
      // before date X as well.
      //
      // Conversely, if a large chunk of tests are introduced at date X, and
      // they pass in some browser, then runs after date X would get an
      // artificial boost in pass-rate due to this - even if the tests would
      // have passed before date X as well.
      //
      // We consider the former case worse than the latter, so optimize for it
      // by always comparing against the full test list. This does mean that
      // when tests are added to the set, previously generated data is no
      // longer valid and this script should be re-run for all dates.

Graph views after change

[Screenshots of the updated score graphs for the nesting and IndexedDB focus areas]
