
KOL-3467 - Thresholded #249

Merged: 24 commits merged into trunk from KOL-3467 on Nov 6, 2023

Conversation

diegokolena (Contributor)

Linked issue(s):

KOL-3467

What change does this PR introduce and why?

  • We introduce support for Dict on MetricsTestSample. This allows modeling thresholded metrics like this:
@dataclass(frozen=True)
class TestSampleMetric(DataObject):
    a: float

@dataclass(frozen=True)
class CustomMetric(MetricsTestSample):
    t1: Dict[float, TestSampleMetric]

Please check if the PR fulfills these requirements

  • Include reference to internal ticket and/or GitHub issue "Fixes #NNNN" (if applicable)
  • Relevant tests for the changes have been added
  • Relevant docs have been added / updated

codecov bot commented Oct 3, 2023

Codecov Report

Merging #249 (206ef07) into trunk (9cb317a) will decrease coverage by 11.04%.
Report is 2 commits behind head on trunk.
The diff coverage is 96.00%.

@@             Coverage Diff             @@
##            trunk     #249       +/-   ##
===========================================
- Coverage   93.69%   82.66%   -11.04%     
===========================================
  Files         113      114        +1     
  Lines        7361     7384       +23     
  Branches      878      884        +6     
===========================================
- Hits         6897     6104      -793     
- Misses        358     1198      +840     
+ Partials      106       82       -24     
Flag Coverage Δ
integration 72.81% <96.00%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
kolena/workflow/__init__.py 100.00% <100.00%> (ø)
kolena/workflow/_validators.py 92.94% <100.00%> (+0.08%) ⬆️
kolena/workflow/evaluator.py 90.90% <100.00%> (ø)
kolena/workflow/thresholded.py 95.23% <95.23%> (ø)

... and 35 files with indirect coverage changes

@munkyshi (Contributor) left a comment:

Might be jumping ahead with the review here, but sharing my thoughts:

I'm a bit confused at the direction here. It looks like we're extending the base data object to allow arbitrary dicts, but then restricting that to test cases only.

  • What about test samples? The original intent was to support this at the test sample level
  • How would the webapp know how to handle these? Does it automatically parse any dict as if it were a thresholded object?

kolena/workflow/_validators.py (outdated comment, resolved)
@diegokolena (Contributor Author)

Might be jumping ahead with the review here, but sharing my thoughts:

I'm a bit confused at the direction here. It looks like we're extending the base data object to allow arbitrary dicts, but then restricting that to test cases only.

  • What about test samples? The original intent was to support this at the test sample level

I've now added it for test samples too. I was planning to split this into two PRs.

  • How would the webapp know how to handle these? Does it automatically parse any dict as if it were a thresholded object?

Initially, I considered introducing a hidden "__type" parameter to help the webapp differentiate between standard and thresholded objects. However, after reviewing the existing codebase, I observed that this approach would be inconsistent with our current practices. Instead, I'm leaning towards a more seamless integration. The webapp could dynamically adapt its behavior when it detects a MetricsTestSample object containing a dict. What are your thoughts on this?
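For illustration only (not code from this PR), a minimal sketch of the kind of detection being described; the helper name is hypothetical:

from dataclasses import fields, is_dataclass

def has_dict_valued_field(metrics: object) -> bool:
    # Hypothetical helper: report whether any field on a metrics dataclass holds a dict,
    # which is the signal the webapp would key off of in the approach described above.
    if not is_dataclass(metrics):
        return False
    return any(isinstance(getattr(metrics, f.name), dict) for f in fields(metrics))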

@diegokolena diegokolena marked this pull request as ready for review October 5, 2023 13:32
@diegokolena diegokolena requested review from a team as code owners October 5, 2023 13:32
@@ -93,6 +93,7 @@ class AggregateMetrics(MetricsTestCase):
macro_F1: float
mAP: float
PerClass: List[PerClassMetrics]
ThresholdMetrics: Dict[float, PerClassMetrics]
@yifanwu-kolena (Contributor) commented Oct 5, 2023

Just want to double-check that using float as the key is intentional, because floating-point numbers have precision issues, for instance:

>>> 0.1+0.2
0.30000000000000004

>>> x = {1.12381902839021830912830981209381092380912830982190380192839081230982190: 42}
>>> x
{1.1238190283902183: 42}

Contributor Author

I've considered the precision issues associated with using floats as keys. However, it's important to note that access to these dictionary values will be managed exclusively by us through the webapp. We plan to control this access via a slider, which will operate on a restricted set of values. This approach mitigates the risks associated with float imprecision.

@munkyshi @yifanwu-kolena Is there any other use case I'm missing where we could run into issues?

Contributor

I would consider storing the keys as strings so we can deterministically match them. I don't see any big downside to storing the thresholds as strings, and it also means we won't have potential precision issues in JS once these objects come across the network.

Contributor

I would avoid using float as a key, considering we are using JSON for data serialization/deserialization. Handling float precision is already something we need to be cautious of, and using floats as keys, which is not valid in JSON, could add more landmines.
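To illustrate the JSON point (an aside, not part of the PR): float keys are coerced to strings on serialization, so a round-trip no longer yields float keys at all:

>>> import json
>>> json.loads(json.dumps({0.1 + 0.2: "tp"}))
{'0.30000000000000004': 'tp'}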

Contributor

Side note: is this the right place to update? This PR is about test-sample-level metrics, and the threshold discussion is scoped to test samples only (so far), right?

@munkyshi (Contributor) commented Oct 9, 2023

@diegokolena I'm wondering if the dictionary based approach is the right change for our client API.

Concerns:

  • We have typically forbidden nesting dictionaries as data object values (unless specifically wrapped with a different explicitly declared data object, such as annotations)
  • We're running into pain points for explicitly using numeric keys
  • The List[DataObject] pattern is pretty common in our client API already (and is less friction for users to adapt to), in places like List[BoundingBox] or List[NestedTestCaseMetric]

Rather than pushing forward with this approach, I wonder if it's worth revisiting the original List[Thresholded] approach?

@diegokolena (Contributor Author) commented Oct 10, 2023

@diegokolena I'm wondering if the dictionary based approach is the right change for our client API.

Concerns:

  • We have typically forbidden nesting dictionaries as data object values (unless specifically wrapped with a different explicitly declared data object, such as annotations)
  • We're running into pain points for explicitly using numeric keys
  • The List[DataObject] pattern is pretty common in our client API already (and is less friction for users to adapt to), in places like List[BoundingBox] or List[NestedTestCaseMetric]

Rather than pushing forward with this approach, I wonder if it's worth revisiting the original List[Thresholded] approach?

In this approach, each dictionary inside "Thresholded" would include the threshold parameter as one of its key/value pairs, as follows:

car_values = Thresholded([
    {"threshold": 0.01, "tp": 100, "fp": 50, "fn": 0},
    {"threshold": 0.02, "tp": 99, "fp": 50, "fn": 1},
    {"threshold": 0.03, "tp": 98, "fp": 49, "fn": 2},
    ...
    {"threshold": 0.99, "tp": 1, "fp": 1, "fn": 99},
])

How will we determine which key/value pair represents the threshold? Should we use a naming convention?
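(Aside, for context: one alternative to a key-naming convention, sketched hypothetically here, is reserving threshold as an explicit field on a base data object, which is roughly the shape the PR eventually adopts with ThresholdedMetrics.)

from dataclasses import dataclass
from typing import List

# Hypothetical sketch: a reserved "threshold" field on a base class instead of a dict key convention.
@dataclass(frozen=True)
class Thresholded:
    threshold: float

@dataclass(frozen=True)
class ClassThresholded(Thresholded):
    tp: int
    fp: int
    fn: int

car_values: List[ClassThresholded] = [
    ClassThresholded(threshold=0.01, tp=100, fp=50, fn=0),
    ClassThresholded(threshold=0.02, tp=99, fp=50, fn=1),
]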

@diegokolena diegokolena reopened this Oct 10, 2023
@diegokolena diegokolena marked this pull request as draft October 13, 2023 14:29
@gordonhart (Member)

Let me know when this is ready for review, @diegokolena!

@diegokolena diegokolena marked this pull request as ready for review October 19, 2023 13:03
a: List[str]

with pytest.raises(TypeError):
MyThresholdedMetrics(a=List["1"])
Contributor

Including List changes the intended type, correct?

Suggested change
MyThresholdedMetrics(a=List["1"])
MyThresholdedMetrics(a=["1"])

Contributor Author

Updated.

a: Dict[str, str]

with pytest.raises(TypeError):
MyThresholdedMetrics(a=Dict["1", "1"])
Contributor

Should this include threshold, so that the test doesn't fail simply because threshold is missing?

Suggested change
MyThresholdedMetrics(a=Dict["1", "1"])
MyThresholdedMetrics(threshold=1, a={"key": "value"})

Contributor Author

Updated.

kolena/workflow/evaluator.py (two outdated comments, resolved)
validate_scalar_data_object_type(metrics_test_case_type, supported_list_types=[MetricsTestCase])
validate_scalar_data_object_type(
metrics_test_case_type,
supported_list_types=[MetricsTestCase, ThresholdedMetrics],
Contributor

Do we need to validate that ThresholdedMetrics is only used as a list type?

In other words, it doesn't make sense to have something like:

class ObjectDetectionThresholded(ThresholdedMetrics):
    tp: List[BoundingBox]
    fp: List[BoundingBox]
    fn: List[BoundingBox]
    precision: float
    recall: float

# makes sense
class MyMetrics(MetricsTestSample):
    car_detections: List[ObjectDetectionThresholded]

# doesn't make sense
class MyMetrics2(MetricsTestSample):
    car_detections: ObjectDetectionThresholded
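(Aside: a sketch only, not kolena's actual validator, whose real signature is only hinted at in the snippet above. A check of that shape could walk the class annotations and reject bare thresholded fields:)

from typing import get_type_hints

def assert_thresholded_only_in_lists(metrics_cls: type, thresholded_base: type) -> None:
    # Hypothetical check: a thresholded type may only appear wrapped in List[...],
    # so any field annotated with the bare class is rejected.
    for name, hint in get_type_hints(metrics_cls).items():
        if isinstance(hint, type) and issubclass(hint, thresholded_base):
            raise TypeError(f"field '{name}' should be List[{hint.__name__}], not bare {hint.__name__}")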

tests/unit/workflow/test_validators.py (comment resolved)
@munkyshi (Contributor) left a comment:

Approving with one final request



@dataclass(frozen=True)
class ThresholdedMetrics(TypedDataObject, metaclass=PreventThresholdOverrideMeta):
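(Aside, for context: the PreventThresholdOverrideMeta implementation is not visible in this view; a hypothetical sketch of what a guard with that name might do, assuming it only blocks subclasses from re-declaring the reserved threshold field:)

class PreventThresholdOverrideMeta(type):
    # Hypothetical sketch: refuse class bodies that re-declare a "threshold" field
    # already declared by one of their bases.
    def __new__(mcs, name, bases, namespace):
        redeclared = "threshold" in namespace.get("__annotations__", {}) and any(
            "threshold" in getattr(base, "__annotations__", {}) for base in bases
        )
        if redeclared:
            raise TypeError(f"{name} may not override the reserved 'threshold' field")
        return super().__new__(mcs, name, bases, namespace)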
Contributor

Sorry, last change here -- since this will be public to our users, let's add a docstring about usage.

Contributor Author

Added

Contributor

Would prefer to have this a bit more detailed / user actionable (example: nested test case metrics).

I'd structure this similarly:

    A data class representing metrics associated with a specified threshold.

    `List[ThresholdedMetrics]` is intended for use as a field type on a [`MetricsTestSample`][kolena.workflow.MetricsTestSample] or [`MetricsTestCase`][kolena.workflow.MetricsTestCase]. The list is expected to contain threshold-specific metric values for a discrete set of thresholds that are consistent across [`TestSample`][kolena.workflow.TestSample]s within a test run.

    `ThresholdedMetrics` does not allow dictionary objects as field values and ensures
    that the threshold cannot be overridden.
    It should be subclassed with the relevant fields for a given workflow.

    Example usage:

    ```python
    from dataclasses import dataclass
    from typing import List

    from kolena.workflow import MetricsTestSample
    from kolena.workflow import ThresholdedMetrics

    @dataclass(frozen=True)
    class ClassThresholdedMetrics(ThresholdedMetrics):
        precision: float
        recall: float
        f1: float

    @dataclass(frozen=True)
    class TestSampleMetrics(MetricsTestSample):
        car: List[ClassThresholdedMetrics]
        pedestrian: List[ClassThresholdedMetrics]

    # metrics object construction
    metric = TestSampleMetrics(
        car=[
            ClassThresholdedMetrics(
                threshold=0.3,
                precision=0.5,
                recall=0.8,
                f1=0.615,
            ),
            ClassThresholdedMetrics(
                threshold=0.4,
                precision=0.6,
                recall=0.6,
                f1=0.6,
            ),
            ClassThresholdedMetrics(
                threshold=0.5,
                precision=0.8,
                recall=0.4,
                f1=0.533,
            ),
            ...
        ],
        pedestrian=[
            ClassThresholdedMetrics(
                threshold=0.3,
                precision=0.6,
                recall=0.9,
                f1=0.72,
            ),
            ClassThresholdedMetrics(
                threshold=0.4,
                precision=0.7,
                recall=0.7,
                f1=0.7,
            ),
            ClassThresholdedMetrics(
                threshold=0.5,
                precision=0.8,
                recall=0.6,
                f1=0.686,
            ),
            ...
        ],
    )
    ```

    Raises:
        TypeError: If any field value is a dictionary.

Contributor

Should this be non-private, since this is meant to be used by our users?

Contributor Author

My rationale for using _thresholded.py was that it essentially serves as an extension of TypedDataObject, which is found in _datatypes.py.

But if you think it makes more sense to rename it to thresholded.py, that's fine with me.

Contributor

I think it makes sense as thresholded.py -- it's analogous to the other things we expect our users to see and extend (such as MetricsTestSample in evaluator.py). We don't expect our users to know of and extend TypedDataObject directly, so that's still private.

@diegokolena diegokolena merged commit c85ef45 into trunk Nov 6, 2023
6 checks passed
@diegokolena diegokolena deleted the KOL-3467 branch November 6, 2023 21:40
munkyshi added a commit that referenced this pull request Nov 7, 2023
munkyshi added a commit that referenced this pull request Nov 7, 2023
* Revert "Attach docs for ThresholdedMetrics (#306)"

This reverts commit 5b111ab.

* Revert "KOL-3467 - Thresholded (#249)"

This reverts commit c85ef45.