MDS Data Redaction

Michael Schnuerle edited this page Mar 1, 2021 · 14 revisions

DRAFT.

MDS 1.1.0 introduces two new beta features, Provider Reports and Metrics. The OMF is using specific data redaction principles, explained in this document, to remove low counts of data for privacy.

Summary

Some uses of Provider Reports and Metrics may return a small count of trips for a certain geographic area or time frame. Low aggregate counts like these could increase the risk of re-identification when combined with other data sources or outside knowledge. To correct for that, these features do not return data below a certain count of aggregate results. This is called k-anonymity, and the threshold we have set is a k-value of 10.

Solution Details

If the query returns fewer than 10 trips in an aggregate count, then that row's count value is returned as "-1". Note that 0 values are also returned as "-1" to account for privacy risk in some edge case scenarios.
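The rule above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the `trip_count` and `geography` field names, the helper names, and the sample rows are all assumptions made for the example.

```python
K_VALUE = 10   # k-value threshold described in this document
REDACTED = -1  # value returned in place of a low count


def redact_count(count, k=K_VALUE):
    """Return the count unchanged if it is at least k, otherwise -1.

    Zero is below k, so it is redacted like any other low count.
    """
    return count if count >= k else REDACTED


def redact_rows(rows, count_field="trip_count", k=K_VALUE):
    """Return copies of the rows with low counts replaced by -1."""
    return [{**row, count_field: redact_count(row[count_field], k)} for row in rows]


# Hypothetical aggregate rows, for illustration only.
rows = [
    {"geography": "zone_a", "trip_count": 0},
    {"geography": "zone_b", "trip_count": 9},
    {"geography": "zone_c", "trip_count": 10},
    {"geography": "zone_d", "trip_count": 42},
]
redacted = redact_rows(rows)
```

Here the counts 0 and 9 both come back as -1, while 10 and 42 are returned as-is, since only counts strictly below the k-value are redacted.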

Both Provider Reports and Metrics have a "Data Redaction" section that summarizes the information in this document and links here for more details, context, and guidance. See the Data Redaction sections here within Provider Reports and Metrics.

Feedback Welcome

As these new features are in beta, the k-value of 10 may be adjusted up or down in future releases and/or may become dynamic to account for specific categories of use cases. To improve the specification and to inform future guidance, beta users are encouraged to share their feedback and questions about k-values on this discussion thread and at our weekly Working Group meetings.

Additional Risk

Using this k-anonymity methodology will reduce, but not necessarily eliminate, the risk that an individual could be re-identified in a dataset. Redacting low counts using k-anonymity is just one part of good privacy protection principles, which you can read more about in our MDS Privacy Guide for Cities.

Risk Scenarios

Higher k-values have lower re-identification risk, but may result in less complete data depending on the duration of time periods and size of geographic areas for which the reports are calculated. Some use cases (such as sharing results with trusted parties who already have access to disaggregated trip data) may not require k-anonymization, while others (such as sharing with less trusted partners or extracts for the public) may require substantial k-anonymization. While reports with any k-value are substantially less sensitive than disaggregated trip records, they should still be treated as potentially sensitive unless a more detailed risk analysis is performed by the hosting organization.

Because of scenario variability and the dynamic nature of how Provider Reports and Metrics work with subsets of MDS data, we recommend a lower-risk k-value of 10 during the beta learning period until we get real-world feedback and incorporate changes.

Methodology

It is common practice to remove small counts of individuals from aggregated datasets, e.g., census areas or health department maps. In many of these cases a k-value of 5 is sufficient to protect the privacy of individuals. However, the OMF community has decided that during the learning phase, as cities and companies test out these features in the real world and receive feedback, we should use a value of 10, as it leans towards lower risk and greater data anonymization.

Factors in Scenario Variability

Low k-values mean more information, but higher risk. High k-values mean less information, but lower risk. We have an idea of the risk, but it changes greatly based on scenarios and audience.
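To make the information side of this trade-off concrete, the sketch below measures what share of rows would be redacted at a few k-values. The sample counts are made up for illustration and the `redacted_share` helper is hypothetical; real results depend on the geography, time frame, and other factors listed below.

```python
def redacted_share(counts, k):
    """Fraction of aggregate counts that fall below k and would be redacted."""
    if not counts:
        return 0.0
    return sum(1 for c in counts if c < k) / len(counts)


# Made-up trip counts per area, for illustration only.
sample_counts = [0, 3, 7, 12, 25, 4, 60, 9, 15, 2]

# Share of rows lost at each candidate k-value.
shares = {k: redacted_share(sample_counts, k) for k in (5, 10, 20)}
for k, share in shares.items():
    print(f"k={k}: {share:.0%} of rows redacted")
```

With this sample, raising k from 5 to 10 to 20 redacts 40%, 60%, and 80% of the rows respectively, which is the loss of information that a lower re-identification risk is traded against.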

Some factors that affect both risk exposure and the need for more granular data are:

  1. Geography size (parking, no ride, equity zone, operating areas)
  2. Population density (dense, sparse, residential, commercial)
  3. Time frame (month, week, day, hour)
  4. Data consumer/audience (internal, research, public)
  5. Policy reason (enforcement, equity, operations)
  6. Special groups data (all riders, low income)

K-value Risk Variability Chart

For now we are using a higher, one-size-fits-all k-value of 10 since it provides the right balance of low risk and adequate data for most policy scenarios.

Open Questions

The OMF acknowledges that there are many ways to use k-anonymity, and we have chosen the method presented here as a low-risk option until we receive more on-the-ground feedback.

  1. The k-value is currently set as a flat value of 10. Should it be a range instead? We don't yet have much basis to define ranges for every scenario.
  2. Should there be different k-values for different scenarios? We are not yet sure how to define a basis for all scenario combinations until we get some real-world feedback.
  3. Should we show 0 count values separately? Some other k-anonymity methods do this, but for our case we believe 0 and 1-9 should not be distinguishable.
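One consequence of the third point is that anyone consuming the data can only bound a total that includes redacted rows, since each "-1" stands for some value from 0 up to k - 1. A minimal sketch, assuming a k-value of 10 and a hypothetical `total_bounds` helper:

```python
def total_bounds(counts, k=10, redacted=-1):
    """Return (lower, upper) bounds for the true total of redacted counts.

    Each redacted value hides a true count between 0 and k - 1 inclusive.
    """
    known = sum(c for c in counts if c != redacted)
    hidden = sum(1 for c in counts if c == redacted)
    return known, known + hidden * (k - 1)


# Hypothetical counts as returned after redaction.
lower, upper = total_bounds([-1, 12, -1, 30])
print(lower, upper)
```

For these four rows the true total lies somewhere between 42 and 60. Summing the raw "-1" values directly would give a misleading total of 40, so redacted rows should be filtered or handled explicitly before aggregating.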

For more questions and to leave your thoughts, see our public discussion area.

Definitions

  • k-anonymity - Removing low counts of aggregated data to reduce individual re-identification risk.
  • k-value - The threshold at which you redact data. The "k" here is just a variable you can set, like "x" in algebra. Counts below the k-value are redacted.