MDS Data Redaction
DRAFT: A page about data redaction principles used in the Provider Reports and Metrics areas of MDS.
The MDS 1.1.0 release introduces two new features, Provider Reports and Metrics, that redact low aggregated counts of data.
For aggregated Reports and Metrics data, what “low count” values should be removed to protect rider privacy?
It is common practice to remove small counts of individuals from aggregated datasets, e.g., census and health department data.
The k-value is the threshold below which a count is not shared, e.g., 5, 7, or 10. “K” just means a variable you can set, like ‘x’ in algebra.
During the learning phase, lean towards lower risk and greater anonymization.
Low k-values mean more information but higher risk. High k-values mean less information but lower risk. We have an idea of the risk, but it changes greatly based on the scenario.
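As a rough illustration of this trade-off, the sketch below counts how many aggregated cells would be withheld at a few candidate k-values. The function name `suppressed_fraction` and the sample counts are hypothetical and made up for this example; they are not part of MDS.

```python
from typing import List


def suppressed_fraction(counts: List[int], k: int) -> float:
    """Return the fraction of aggregated cells whose count falls below k."""
    if not counts:
        return 0.0
    return sum(1 for c in counts if c < k) / len(counts)


# Hypothetical trip counts per aggregation cell (made-up numbers).
sample_counts = [0, 3, 6, 8, 12, 25, 40, 9, 15, 4]

# A higher k withholds more cells: less information, lower risk.
for k in (5, 7, 10):
    print(k, suppressed_fraction(sample_counts, k))
```

With this sample, raising k from 5 to 10 doubles the share of withheld cells, which is the information cost paid for the lower risk.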
Some factors that affect both risk exposure and requirements for more granular data:
- Geography size (parking, no ride, equity zone, operating areas)
- Population density (dense, sparse, residential, commercial)
- Time frame (month, week, day, hour)
- Data consumer/audience (internal, research, public)
- Policy reason (enforcement, equity, operations)
- Special groups data (all riders, low income)
Based on scenario variability and the dynamic nature of how Reports and Metrics work with MDS data, we recommend a lower-risk k-value of 10 during the learning beta period until we get real-world feedback.
Questions
The k-value is currently set to 10.
- Should it be a range instead? No basis for ranges has been defined, and 10 is a low-risk value.
- Should there be different values for different scenarios? We are not sure how to define a basis for all scenario combinations.
- Should we show 0 count values separately? Some other anonymity methods do this, but for our case 0 and <10 should not be distinguishable.
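A minimal sketch of the redaction rule discussed above, assuming a k-value of 10 and treating zero the same as any other count below k. The names `K_VALUE`, `REDACTED`, and `redact_counts` are hypothetical, and the `-1` sentinel is an assumption chosen for illustration; check the Reports and Metrics specifications for the actual representation.

```python
from typing import Dict

K_VALUE = 10   # recommended k-value during the beta period
REDACTED = -1  # assumed sentinel for a withheld count (illustrative only)


def redact_counts(counts: Dict[str, int], k: int = K_VALUE) -> Dict[str, int]:
    """Replace every count below k, including 0, with the same sentinel."""
    return {cell: (n if n >= k else REDACTED) for cell, n in counts.items()}


# Hypothetical aggregated counts keyed by cell name.
raw = {"zone_a": 0, "zone_b": 9, "zone_c": 10, "zone_d": 25}
print(redact_counts(raw))
```

Because 0 and 9 map to the same sentinel, a data consumer cannot tell an empty cell from a low-count one.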