KOL-3467 - Thresholded #249
Conversation
Codecov Report
Coverage Diff:

| | trunk | #249 | +/- |
| --- | --- | --- | --- |
| Coverage | 93.69% | 82.66% | -11.04% |
| Files | 113 | 114 | +1 |
| Lines | 7361 | 7384 | +23 |
| Branches | 878 | 884 | +6 |
| Hits | 6897 | 6104 | -793 |
| Misses | 358 | 1198 | +840 |
| Partials | 106 | 82 | -24 |
Flags with carried forward coverage won't be shown.
- Update evaluator and validators to allow dicts.
- Added a new unit test to verify the new use cases.
Might be jumping ahead with the review here, but sharing my thoughts:
I'm a bit confused about the direction here. It looks like we're extending the base data object to allow arbitrary dicts, but then restricting that to test cases only.
- What about test samples? The original intent was to support this at the test-sample level.
- How would the webapp know how to handle these? Does it automatically parse any dict as if it were a thresholded object?
I've now added it for test samples too; I was going to split this into two PRs.
Initially, I considered introducing a hidden "__type" parameter to help the webapp differentiate between standard and thresholded objects. However, after reviewing the existing codebase, I observed that this approach would be inconsistent with our current practices. Instead, I'm leaning towards a more seamless integration: the webapp could dynamically adapt its behavior when it detects a MetricsTestSample object containing a dict. What are your thoughts on this?
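(A minimal sketch of the dict-based direction described above -- the class and field names here are hypothetical and not taken from the PR diff:)

```python
from dataclasses import dataclass
from typing import Dict

from kolena.workflow import MetricsTestSample

# Hypothetical sketch: a test-sample metrics class carrying a plain dict keyed by
# threshold, which the webapp would detect and render as thresholded metrics.
@dataclass(frozen=True)
class SampleMetrics(MetricsTestSample):
    f1_by_threshold: Dict[float, float]

metric = SampleMetrics(f1_by_threshold={0.3: 0.615, 0.4: 0.6, 0.5: 0.533})
```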
@@ -93,6 +93,7 @@ class AggregateMetrics(MetricsTestCase):
     macro_F1: float
     mAP: float
     PerClass: List[PerClassMetrics]
+    ThresholdMetrics: Dict[float, PerClassMetrics]
Just want to double-check that using float as the key is intentional, because floating-point numbers have precision issues. For instance:
>>> 0.1+0.2
0.30000000000000004
>>> x = {1.12381902839021830912830981209381092380912830982190380192839081230982190: 42}
>>> x
{1.1238190283902183: 42}
I've considered the precision issues associated with using floats as keys. However, it's important to note that access to these dictionary values will be managed exclusively by us through the webapp. We plan to control this access via a slider, which will operate on a restricted set of values. This approach mitigates the risks associated with float imprecision.
@munkyshi @yifanwu-kolena Is there any other use case I'm missing where we could run into issues?
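(A hypothetical illustration of the "restricted set of values" idea above -- the grid values and helper below are made up for this sketch:)

```python
# Hypothetical: the webapp slider only emits thresholds from a fixed grid, so
# dictionary lookups never depend on arbitrary float arithmetic.
THRESHOLD_GRID = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

def snap_to_grid(value: float) -> float:
    # pick the closest grid threshold instead of using the raw slider value
    return min(THRESHOLD_GRID, key=lambda t: abs(t - value))
```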
I would consider storing the keys as strings so we can deterministically match them. I don't see any big downside to storing the thresholds as strings, and it means we won't have potential precision issues in JS either once these objects come across the network.
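(A small sketch of what string keys could look like -- the two-decimal formatting convention is just an example, not something decided in this thread:)

```python
# Format thresholds as fixed-precision strings so keys match exactly on the
# backend, over JSON, and in the webapp.
thresholds = [0.3, 0.4, 0.5]
metrics_by_threshold = {f"{t:.2f}": {"precision": 0.8, "recall": 0.4} for t in thresholds}
# keys are "0.30", "0.40", "0.50" -- deterministic string comparison everywhere
```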
I would avoid using float as a key considering we are using JSON for data serialization/deserialization. Handling float precision is already something we need to be cautious of, and using it as a key, which is not valid in JSON, could add more landmines.
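(For illustration -- JSON object keys must be strings, so Python's json module silently coerces float keys and a round trip changes the key type:)

```python
import json

payload = {0.1 + 0.2: "metrics"}   # float key: 0.30000000000000004
encoded = json.dumps(payload)      # '{"0.30000000000000004": "metrics"}'
decoded = json.loads(encoded)      # keys come back as strings, not floats
assert 0.3 not in decoded          # the original float key no longer matches
```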
Side note: is this the right place to update? This PR is about test-sample-level metrics, and thresholds in our discussion are scoped to test samples only (so far), right?
@diegokolena I'm wondering if the dictionary-based approach is the right change for our client API. Concerns:
Rather than pushing forward with this approach, I wonder if it's worth revisiting the original
In this approach, the "Thresholded" dictionary will include the threshold parameter as one of its key/value pairs, as follows:
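(The referenced snippet isn't reproduced in this thread; as a purely hypothetical illustration, such a dictionary might look like:)

```python
# Hypothetical: the threshold travels inside the dict as just another key/value pair.
thresholded = {"threshold": 0.5, "precision": 0.8, "recall": 0.4, "f1": 0.533}
```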
How will we determine which key/value pair represents the threshold? Should we use a naming convention?
…ors and evaluator to support this new field.
Let me know when this is ready for review, @diegokolena!
a: List[str]

with pytest.raises(TypeError):
    MyThresholdedMetrics(a=List["1"])
Including `List` changes the intended type, correct? Suggested change: `MyThresholdedMetrics(a=List["1"])` → `MyThresholdedMetrics(a=["1"])`
Updated.
a: Dict[str, str]

with pytest.raises(TypeError):
    MyThresholdedMetrics(a=Dict["1", "1"])
Should this include `threshold` to ensure that this doesn't fail for excluding it? Suggested change: `MyThresholdedMetrics(a=Dict["1", "1"])` → `MyThresholdedMetrics(threshold=1, a={"key": "value"})`
Updated.
kolena/workflow/evaluator.py (outdated)
validate_scalar_data_object_type(metrics_test_case_type, supported_list_types=[MetricsTestCase])
validate_scalar_data_object_type(
    metrics_test_case_type,
    supported_list_types=[MetricsTestCase, ThresholdedMetrics],
Do we need to validate that `ThresholdedMetrics` is only used as a list type? In other words, it doesn't make sense to have something like:

class ObjectDetectionThresholded(ThresholdedMetrics):
    tp: List[BoundingBox]
    fp: List[BoundingBox]
    fn: List[BoundingBox]
    precision: float
    recall: float

# makes sense
class MyMetrics(MetricsTestSample):
    car_detections: List[ObjectDetectionThresholded]

# doesn't make sense
class MyMetrics2(MetricsTestSample):
    car_detections: ObjectDetectionThresholded
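(A rough sketch of what such a check could look like -- this is not kolena's actual validator, and it assumes `ThresholdedMetrics` ends up importable from `kolena.workflow` as discussed later in this thread:)

```python
from typing import get_type_hints

from kolena.workflow import ThresholdedMetrics  # assumed public import path

def validate_no_bare_thresholded_fields(metrics_type: type) -> None:
    # Reject fields typed directly as a ThresholdedMetrics subclass; only
    # List[...] of ThresholdedMetrics should be allowed.
    for name, hint in get_type_hints(metrics_type).items():
        if isinstance(hint, type) and issubclass(hint, ThresholdedMetrics):
            raise ValueError(f"field '{name}' must use List[ThresholdedMetrics], not a bare subclass")
```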
Approving with one final request
kolena/workflow/_thresholded.py (outdated)

@dataclass(frozen=True)
class ThresholdedMetrics(TypedDataObject, metaclass=PreventThresholdOverrideMeta):
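(The implementation of PreventThresholdOverrideMeta is not shown in this thread; a minimal hypothetical sketch of the idea -- rejecting subclasses that redeclare the reserved threshold field -- might look like this:)

```python
class PreventThresholdOverrideMeta(type):
    # Hypothetical sketch, not the actual kolena implementation.
    def __new__(mcs, name, bases, namespace):
        declared = namespace.get("__annotations__", {})
        inherited = {
            field
            for base in bases
            for klass in base.__mro__
            for field in getattr(klass, "__annotations__", {})
        }
        if "threshold" in declared and "threshold" in inherited:
            raise TypeError(f"{name} may not override the reserved 'threshold' field")
        return super().__new__(mcs, name, bases, namespace)
```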
Sorry, last change here -- since this will be public to our users, let's add a docstring about usage.
Added
Would prefer to have this a bit more detailed / user-actionable (example: nested test-case metrics).
I'd structure it similarly to:

A data class representing metrics associated with a specified threshold.

`List[ThresholdedMetrics]` is intended for usage as a field type for a [`MetricsTestSample`][kolena.workflow.MetricsTestSample] or [`MetricsTestCase`][kolena.workflow.MetricsTestCase]. This list is expected to contain threshold-specific metric values for a discrete set of thresholds that are consistent across [`TestSample`][kolena.workflow.TestSample]s within a test run.

`ThresholdedMetrics` does not allow dictionary objects as field values and ensures that the threshold cannot be overridden. It should be subclassed with the relevant fields for a given workflow.
Example usage:
```python
from dataclasses import dataclass
from typing import List

from kolena.workflow import MetricsTestSample
from kolena.workflow import ThresholdedMetrics
from kolena.workflow.annotations import BoundingBox


@dataclass(frozen=True)
class ClassThresholdedMetrics(ThresholdedMetrics):
    precision: float
    recall: float
    f1: float


@dataclass(frozen=True)
class TestSampleMetrics(MetricsTestSample):
    car: List[ClassThresholdedMetrics]
    pedestrian: List[ClassThresholdedMetrics]


# metrics object construction
metric = TestSampleMetrics(
    car=[
        ClassThresholdedMetrics(
            threshold=0.3,
            precision=0.5,
            recall=0.8,
            f1=0.615,
        ),
        ClassThresholdedMetrics(
            threshold=0.4,
            precision=0.6,
            recall=0.6,
            f1=0.6,
        ),
        ClassThresholdedMetrics(
            threshold=0.5,
            precision=0.8,
            recall=0.4,
            f1=0.533,
        ),
        ...
    ],
    pedestrian=[
        ClassThresholdedMetrics(
            threshold=0.3,
            precision=0.6,
            recall=0.9,
            f1=0.72,
        ),
        ClassThresholdedMetrics(
            threshold=0.4,
            precision=0.7,
            recall=0.7,
            f1=0.7,
        ),
        ClassThresholdedMetrics(
            threshold=0.5,
            precision=0.8,
            recall=0.6,
            f1=0.686,
        ),
        ...
    ],
)
```
Raises:
TypeError: If any field value is a dictionary.
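(As a hypothetical illustration of the Raises note, reusing the ClassThresholdedMetrics class from the example above:)

```python
import pytest

# A dict field value is rejected by ThresholdedMetrics validation.
with pytest.raises(TypeError):
    ClassThresholdedMetrics(threshold=0.5, precision={"bad": 1.0}, recall=0.8, f1=0.615)
```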
kolena/workflow/_thresholded.py (outdated)
Should this be non-private, since this is meant to be used by our users?
My rationale for using _thresholded.py was that it essentially serves as an extension of TypedDataObject, which is found in _datatypes.py.
But if you think it makes more sense to rename it to thresholded.py, that's fine with me.
I think it makes sense as `thresholded.py` -- it's analogous to the other things we expect our users to see and extend (such as `MetricsTestSample` in `evaluator.py`). We don't expect our users to know of and extend `TypedDataObject` directly, so that's still private.
Linked issue(s): KOL-3467