Correlatability of usage patterns #12

Closed
msporny opened this issue Nov 28, 2016 · 21 comments
Labels
privacy-tracker Group bringing to attention of Privacy, or tracked by the Privacy Group but not needing response.

Comments

@msporny
Member

msporny commented Nov 28, 2016

When and how verifiable claims are used is important for correlation purposes. It is important that implementers are aware of correlation patterns and can warn their customers about them.

@msporny added the editorial and privacy-tracker labels Nov 28, 2016
@msporny removed the editorial label Dec 7, 2016
@jandrieu
Contributor

I'd like to step out a few layers to relate this to "Identity" because, in effect, we are talking about how verifiable claims will be used, intentionally or otherwise, to correlate individuals' identities from one context to another.

It turns out that "identity" as used in technical discussions means two things, sometimes conflated in a subset as PII or "Personally Identifiable Information".

Verifiable claims will be used for both of these and there are different nuances for each in terms of correlation.

First, there are identifiers and attributes used explicitly for correlating individuals across contexts. Sometimes these identifiers are opaque, e.g., "anonymous" GUIDs used for tracking in third party cookies. Sometimes they are grounded in real-world identity systems such as legal name, driver's license #, etc.

Second, attributes are used for customizing services, such as a zip code for movie listings or using a name in a greeting on a page, separate from whether or not that name or zip code is used to correlate the individual with anything beyond the presentation of features.

Privacy issues occur when

  1. people are correlated across contexts in ways they did not expect or desire, either by the original steward of such information or by third parties
  2. attributes are shared with parties in unexpected or undesired ways

The two biggest gotchas, in my experience, are

  1. Thinking "anonymous" identifiers resolve privacy issues. In fact, they are the root of third party cookie privacy concerns. They do help with minimizing undesired attribute sharing, but since correlation can be used to fuse identities from different data sets, there remains the likelihood of even anonymous identifiers leaking attributes through affiliation.

  2. Imagining that there is a subset of attributes that, properly managed, address privacy issues. So-called "Personally Identifiable Information" has been used as a framework for privacy, but it leaves open several core questions. https://en.wikipedia.org/wiki/Personally_identifiable_information There is no definitive agreement as to what is or isn't PII. In some cases this question has been addressed by the courts, but as discussed in the Wikipedia article, there are also statutory and standards-based definitions which demonstrate a variety of possible interpretations of that term. More critically, it has been repeatedly shown that even when using "anonymized" data sets with no innate PII, personal information and even real-world identities can be discovered. The AOL search data leak is perhaps the most famous of these examples: https://en.wikipedia.org/wiki/AOL_search_data_leak
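To make the first gotcha concrete, here is a minimal sketch of how an "anonymous" identifier can leak attributes through affiliation. All data, field names, and the `fuse` helper are hypothetical and exist only for illustration: two independent data sets share nothing but an opaque tracking ID, yet joining on that ID attaches a real name to browsing topics.

```python
# Hypothetical data: two independent data sets that share only an opaque ID.
ad_network_log = [
    {"tracker_id": "a1f3", "site": "clinic-finder.example", "topic": "diabetes"},
    {"tracker_id": "a1f3", "site": "news.example", "topic": "local politics"},
    {"tracker_id": "9c2e", "site": "recipes.example", "topic": "baking"},
]
retailer_accounts = [
    {"tracker_id": "a1f3", "name": "Alice Example", "zip": "37203"},
]


def fuse(log, accounts):
    """Join the two data sets on the shared opaque identifier."""
    by_id = {row["tracker_id"]: row for row in accounts}
    fused = []
    for entry in log:
        account = by_id.get(entry["tracker_id"])
        if account is not None:
            # The "anonymous" ID now carries a name and a browsing topic.
            fused.append({"name": account["name"], "topic": entry["topic"]})
    return fused


profile = fuse(ad_network_log, retailer_accounts)
for row in profile:
    print(row)
```

Neither data set contains both a name and a sensitive topic, but the shared identifier is enough to combine them.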

Sometimes, the notion of "identity" and PII is taken to include any and all attributes related to an individual, regardless of how those attributes are used. This helps deal with the reality that even seemingly innocuous data can be used for deanonymizing and correlating individuals, but it turns your "identity" into everything which is almost useless as input to engineering a good system.

Rather than discussing this issue in terms of the title language "verifiable claims are used are important for correlation purposes", I would suggest that verifiable claims will be used to both

  • correlate individuals across contexts (in the case of common identifiers and shared attributes) and
  • minimize or prevent correlation (in the case of anonymous or tokenized identifiers).

Implementers should understand and communicate to their users how their particular identity systems correlate individuals, e.g., from session to session and with third party services, and how they actively prevent or minimize undesired and unexpected correlation.
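One common way to get "tokenized" identifiers that still allow intentional correlation within a single relationship is to derive a different identifier for each verifier. The sketch below is an assumption for illustration, not something specified in this thread or in the Verifiable Claims data model: the `pairwise_id` function, the secret, and the verifier names are all hypothetical.

```python
import hashlib
import hmac


def pairwise_id(holder_secret: bytes, verifier: str) -> str:
    """Derive a stable, verifier-specific identifier from a holder-held secret."""
    digest = hmac.new(holder_secret, verifier.encode("utf-8"), hashlib.sha256)
    # Truncated only to keep the example output short.
    return digest.hexdigest()[:16]


secret = b"hypothetical-holder-secret"
id_pharmacy_a = pairwise_id(secret, "pharmacy-a.example")
id_pharmacy_b = pairwise_id(secret, "pharmacy-b.example")

# Same verifier, same identifier: session-to-session correlation still works.
print(id_pharmacy_a == pairwise_id(secret, "pharmacy-a.example"))
# Different verifiers, different identifiers: the identifier alone does not link them.
print(id_pharmacy_a != id_pharmacy_b)
```

This only addresses correlation by identifier. Any attributes disclosed alongside it (name, zip code, and so on) can still be used to correlate the individual across verifiers.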

The shift I'm going for here is that, contrary to a general push in the crypto world, correlation isn't inherently bad. In fact, correlation is the direct result of identification and "identity" is useful. While anti-correlation features of various technologies are great features, that focus has obscured the fact that there are times when we want and need to be correctly correlated with our rights and privileges. In a privacy respecting system, individuals would have maximum control over correlation, enabling intentional correlation where desired while preventing undesired correlations. The appropriate limits of that control are still up for debate, but every system using verifiable claims will necessitate design choices that impact how individuals are correlated across contexts.

@agropper

Privacy engineering can help sort through the different aspects of privacy in a use case. I’d like to relate this issue to the specific prescription use case that I maintain.

  1. The highest privacy goal of this use-case is to maintain the self-sovereignty of the physician-patient relationship in the face of regulatory requirements. Specifically, we seek to enable a transaction system for a prescription that does not depend on institutional trust as an identity provider for the physician and the patient together. The privacy benefit is the ability for the physician and the patient to interact without having that interaction monitored by an intermediary institution such as a hospital. This opportunity for an un-mediated patient-physician relationship was common with paper prescriptions and has been lost in the transition to electronic health records.

  2. In order to achieve this privacy goal, the role of the hospital intermediary as a combined root of trust for patient ID, physician ID, physician attribute, and transaction auditability needs to be distributed among various substitutable actors with a minimum of correlation risk across the actors.

  3. Working backward, the physician claim must be verified against a directory operated by an issuer that has no role relative to the issuance of the physician ID or any role whatsoever relative to patient ID. The reason for this is that the directory operator does not want any responsibility for security breaches of any patient information and has no interest in sharing an identity provider used by the patient. For the physician ID, the directory provider does not want to bear responsibility for identity proofing the physician. Cost-effective operation of directories requires they trust identity providers. The directory operator is merely a relying party, using whatever identity the physician chooses. The federation implied by the physician’s IDP is responsible for identity proofing to a level adequate for prescribing controlled substances per DEA and access to patient records under state and federal privacy mandates.

  4. The physician ID used to maintain the physician claim must be non-repudiable and able to:

  • Sign updates to the physician directory operated by the issuer prescription in an auditable way (e.g.: associated with a blockchain timestamp).

  • Sign a prescription for a specific patient in an auditable way.

  5. The patient identity (but not necessarily the patient ID) captured in the prescription and presented to the pharmacy must be correlated in a non-repudiable way. The pharmacy transaction must be auditable.

  6. The pharmacy must be able to verify the physician claim in a way that does not allow the pharmacy to correlate other transactions by the same physician. (The sale of physician prescribing info has been challenged in high courts and is allowed as free speech by the pharmacy. This causes a lot of privacy problems for the physician and the patient. It is the primary source of the huge data broker market in healthcare.)

  7. The physician must be able to report the transaction to a (law-enforcement) registry that can track patient identity across different physician-patient relationships and physicians must be able to query this registry prior to issuing a prescription. The registry itself, as a law enforcement function, can have access to the identity of the physician and the patient. (These state-operated registries are called prescription drug monitoring programs.) The pharmacy must be able to verify that the prescription was reported to the registry (to keep the doctors honest). The pharmacy may have its own law enforcement registry reporting requirements but these are outside the physician-patient relationship privacy engineering issue.

  8. Note that in most, maybe all, states, the pharmacy can deliver a prescription to the physician for the physician to distribute to the patient. In this somewhat inconvenient way, the privacy of the patient relative to the pharmacy can be absolute.

@jandrieu
Contributor

Thanks for the example, @agropper. It's a solid example of where Verifiable Claims can help with privacy.

To help us move towards a more rigorous lexicon, I'd like to call this a "use domain" instead of a "use case." I'm hoping to establish a specific semantics for use cases:

A use case defines a specific value-creating transaction between an individual and the system.

A use domain defines a related set of use cases.

I'm still working through the best alternative language, but "use domain" or "domain of use" seems like a good way to describe this example, which includes several transactions, as well as domain-specific non-functional requirements, such as both the correlatability and non-correlatability you outline.

From what I read, I tease out a few different transactions:

  1. Issue prescription
  2. Verify prescription
  3. Present prescription
  4. Audit pharmacy
  5. Register prescription

There may also be transactions related to the credential that enables a doctor to prescribe as well as recording pharmacy interactions: requesting fulfilment of a prescription, fulfilling a prescription, etc., so we can understand the needs of the audit. As with many of these kinds of use, the trick is defining the coherent boundary so we can focus on the new and interesting bits. For example, one could discuss how all of the entities in the domain provision their credentials: the monitoring agencies, the pharmacies, the pharmacists, the insurance companies (surprisingly missing from your example). Clearly, taking some of these entities (and their credentialing) as a given greatly simplifies the documentation.

To try and tease out the correlatability:

Intended correlations:

  1. The live person redeeming a prescription needs to be correlatable to the patient for whom the prescription was given, by the pharmacist, prior to distribution so that the medicine is given to the actual patient.
  2. The patient needs to be correlatable to a singular legal person by the prescription drug monitoring program for the purposes of assuring that individuals are not getting multiple prescriptions by visiting multiple doctors. The physician needs to be able to query the program prior to issuing a prescription.
  3. The physician and patients need to be correlatable across multiple prescriptions for physician audits.
  4. A given prescription must be resolvable to a delivery address while preventing the pharmacy from correlating the doctor to the prescription. This resolution must be non-repudiable.
  5. Upon delivery, a prescription must be correlatable by the issuing doctor to the patient.

Blocked correlations:

  1. The pharmacy must not be able to correlate the physician's prescriptions across different patients.
  2. Someone who is not the intended patient must not be able to redeem a prescription (must not be falsely correlated as the patient).
  3. The prescription may be redeemed at any pharmacy. There is no innate correlation between a given prescription and the pharmacy that fulfills it.
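As a rough sketch of how these two lists could feed a strawman architecture, the correlations can be written down as data and checked with a default-deny rule. Everything here is a hypothetical encoding: the party names, the link labels, and the `is_allowed` helper are assumptions, only a subset of the lists is covered, and "auditor" is a placeholder because who performs the audit is still an open question.

```python
from typing import NamedTuple


class Correlation(NamedTuple):
    party: str  # who performs the correlation
    link: str   # what is being linked to what


# Hypothetical encoding of a subset of the intended correlations.
INTENDED = {
    Correlation("pharmacist", "redeemer->patient"),
    Correlation("monitoring_program", "patient->legal_person"),
    Correlation("auditor", "physician->prescriptions"),
    Correlation("issuing_physician", "delivered_prescription->patient"),
}

# Hypothetical encoding of blocked correlation 1.
BLOCKED = {
    Correlation("pharmacy", "physician->prescriptions"),
}


def is_allowed(party: str, link: str) -> bool:
    """Default-deny: a correlation is allowed only if explicitly intended."""
    candidate = Correlation(party, link)
    if candidate in BLOCKED:
        return False
    return candidate in INTENDED


print(is_allowed("auditor", "physician->prescriptions"))   # same link, allowed party
print(is_allowed("pharmacy", "physician->prescriptions"))  # same link, blocked party
```

The same link (physician to prescriptions) appears in both sets with different parties, which is exactly the tension raised in the questions below: the audit must be possible without giving the pharmacy the same capability.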

Do these transactions and correlations seem correct?

Questions:

  1. If I assume for the sake of discussion that all of this information is stored in an effectively public repository--this assumption addresses both public ledgers and compromised data stores--then can we assume that certain information is encrypted for the intended recipient? For example is the prescription delivery address encrypted for a specific delivery service?

  2. For correlation 3, who is doing the audit? How do we allow an audit without allowing the pharmacy to perform the same correlation? Are there baked in assumptions about where "auditable" data is stored that can be trusted to be secure from the Pharmacy? I don't think we care about pharmacies that are bad actors willing to hack a physician's database. For this use domain, it might be valuable to identify a strawman architecture that distinguishes who holds what data. I'm assuming that the monitoring program and the physician both have private data stores for audit purposes, while the rest of the data could be stored in a self-sovereign public ledger. (If insurance is involved, the pharmacy will probably need its own data store as well.) Or... is there a way that all of this information could be in a public data store?

  3. For correlation 2, are we trusting the monitoring program to operate a secure, live system? Or is the goal to have that monitoring (and the doctor's query) based in a public ledger? In other words, along with question 2, can we clarify where, for this use domain, we need to trust a system (and its operator) with certain information and which systems we choose not to trust with certain information?

  4. What about insurance companies? Are they an important part of the privacy engineering?

  5. Does the pharmacy need to verify that the prescription has been registered with the monitoring program?

@msporny
Member Author

msporny commented Jan 20, 2017

@agropper @jandrieu Should we move @agropper's use case description to the use cases repository? This repository is about data model and specifically the privacy/correlatability section. We may want to split this discussion into two aspects: 1) The use case itself (put it in the use cases repo issue tracker), and 2) How this use case impacts the correlatability subsection in the privacy considerations section.

@jandrieu
Contributor

Sounds good. I'll move the use domain over there for its own refinement.

One thing that became clear to me in working through Adrian's use domain is the need for a strawman architecture for these kinds of use domains so that we can evaluate the privacy impact. For example, the main page for Tahoe-LAFS has a simple diagram distinguishing which parts of the architecture must be trusted implicitly and which rely instead on cryptographic trust.

I'm reminded of Eben Moglen's testimony to Congress in 2010:

These [Facebook] “privacy settings” merely determine what one user can see of another user’s private data. The grave, indeed fatal, design error in social networking services like Facebook isn’t that Johnny can see Billy’s data. It’s that the service operator has uncontrolled access to everybody’s data, regardless of the so-called “privacy settings.”

So, in order to understand how Verifiable Claims addresses privacy issues, I believe we will need to consider how they would operate within the context of various systems, each of which will have distinct trust boundaries and differing needs for information access.

Once we understand that, we can evaluate what the data model needs to support those use cases, and in particular, how verifiable claims improve privacy when used correctly.

@burnburn
Contributor

So, in order to understand how Verifiable Claims addresses privacy issues, I believe we will need to consider how they would operate within the context of various systems, each of which will have distinct trust boundaries and differing needs for information access.

@jandrieu So, how do you suggest we proceed? In particular, is there anything you think might be productive to discuss in next week's call?

@agropper

Thanks @jandrieu. You're structuring this in a useful way. The correlations and transactions seem correct.

Questions:
1 - Good question. I'm not sure what to say about encryption but I suspect security design will be evident as we go forward.

2 - I agree with your framing. I don't know the legal answer to who does routine audits. I would allow for a separate registry no matter what. The pharmacy is handling controlled substances and subject to audit by the DEA. I'm skeptical of storing anything other than timestamps in a public data store.

3 - Good point. I think we need to do both. Keep in mind that some states will require the querying physician to have a relationship with the patient and others will simply require they be a licensed practitioner.

4 - The insurance may need to be consulted for decision support and/or costs before the prescription is finalized by the physician. The pharmacy also needs insurance access, unless the patient pays cash - which is allowed by law. Once we create the "use domain" representation, we would do well to add insurance.

5 - Maybe. The monitoring programs are run at state level and can include the pharmacy. Some states also mandate that physicians check the registry before prescribing controlled substances and we could imagine transactions that warn the physician or regulators if this is not done.

@jandrieu
Contributor

Created new issue in use case repo: w3c/vc-use-cases#38

@msporny
Member Author

msporny commented Jan 24, 2017

The next step is to do a privacy analysis on Adrian Gropper's use case listed in this issue. The people that volunteered are: @jandrieu @agropper @jonnycrunch @msporny @amigus

@burnburn
Contributor

Discussed in 24 Jan 2017 telecon (link to minutes when available)

@stonematt
Contributor

@jandrieu @agropper @jonnycrunch @msporny @amigus -- looking for an update on this issue.

@jandrieu
Contributor

jandrieu commented Feb 6, 2017

Note this has moved to the use cases repository: w3c/vc-use-cases#38

@msporny
Member Author

msporny commented Feb 6, 2017

This issue was originally about writing 2-3 paragraphs for the Data Model specification under the Privacy section related to Correlatability. I think we're going a bit overkill here - the analysis that @jandrieu and @agropper are doing is useful, but we probably don't need to wait on that to write a section for the specification. We just need a general idea for what sort of things you could put in a Verifiable Claim that would correlate you and to what degree.

What we need to resolve this issue is 2-3 paragraphs that describe why/how Verifiable Claims can lead to correlation based on usage patterns.

@amigus

amigus commented Feb 7, 2017

@stonematt I'm going to lead the effort to put the use case @agropper defined through the NIST Privacy Risk Assessment Methodology based on NIST.IR.8062. I expect I'll need help from the other volunteers in the coming weeks.

@jandrieu jandrieu self-assigned this Feb 7, 2017
@jonnycrunch
Contributor

Regarding the original issue of Correlatability: I see the issue as maintaining control over the selective disclosure of verifiable claims that may lead to correlatability, and understanding and accepting that risk. The Rx use case is interesting but much too complex because, face it, when we are sick and/or dying we will give up our souls for the chance to live longer. Perhaps this thread should continue with the more general discussion of correlatability, and more specifically the use of correlation or K-means clustering of certain attributes for potential re-identification. The classic example was the re-identification of Gov. Weld from 'de-identified' patient discharge information in Massachusetts.

In medicine, with regard to the potential re-identification of "de-identified" Personally Identifiable Information, we often use k-anonymity methods to preserve privacy, whereby certain attributes are translated to more general ranges. For example, "female, zip code 37203, age 29" is translated to "female, area 372**, 20 < age <= 30", with the ranges calculated depending on the data set.

I didn't find NIST IR 8062 very useful, as it only described the typical governmental mantra of "Monitor -- Assess -- Respond" and policy responses. The worksheet would be helpful to create the strawman discussion. Although the verbiage and concepts were helpful, the privacy risk equation wasn't terribly helpful. Rather, I think we should model the risk of re-identification given the attributes and have a cohesive methodology that we use for each scenario or use case (for instance, the 29-year-old female in 37203 who is buying alcohol or filling an Rx). For context, about 30k people live in zip code 37203 and fall in that age range. So rather than focus on the characteristics of a secure system as suggested in the PRAM, I think we should focus on the methods to calculate the risk of re-identification given the context of each selective disclosure of attributes.
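A minimal sketch of the kind of calculation suggested here, under stated assumptions: the records are made up, `generalize` and `k_for` are hypothetical helper names, and 1/k is only a naive estimate that treats every member of the matching group as equally likely to be the target.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: keep 3 zip digits, bucket age into (lo, lo + 10]."""
    zip_prefix = record["zip"][:3] + "**"
    lo = ((record["age"] - 1) // 10) * 10
    return (record["sex"], zip_prefix, f"{lo} < age <= {lo + 10}")


def k_for(records, target):
    """Count how many records share the target's generalized attributes."""
    groups = Counter(generalize(r) for r in records)
    return groups[generalize(target)]


# Hypothetical population; a real estimate would use census-scale counts.
population = [
    {"sex": "F", "zip": "37203", "age": 29},
    {"sex": "F", "zip": "37215", "age": 24},
    {"sex": "F", "zip": "37201", "age": 30},
    {"sex": "M", "zip": "37203", "age": 29},
    {"sex": "F", "zip": "37203", "age": 41},
]
target = {"sex": "F", "zip": "37203", "age": 29}

k = k_for(population, target)
naive_risk = 1 / k  # chance of picking the target from her group at random
print(generalize(target), k, round(naive_risk, 3))
```

With a group of about 30k people, the same naive estimate would be roughly 1 in 30,000 for these three attributes alone. Each additional attribute disclosed in the same context shrinks the group and raises the risk.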

@agropper

agropper commented Feb 15, 2017 via email

@jandrieu
Contributor

I just made a pull request with specific language.

One challenge I had was distinguishing the privacy of the holder from that of the subject. It's easier to discuss and reason about when the holder is presumed to be the subject, but that presumption is knowably false when dealing with delegated or guardian holders, e.g., claims about a child. Unfortunately, it also should probably never be presumed that the holder of a claim is guaranteed to be the subject of the claim. Maybe I'm missing something here, but I don't think we've teased out the issues of identity assurance between

  1. the holder of a claim (the digital holder, who has the JSON-LD or other serialization),
  2. the presenter of a claim (the party actually asserting the claim to an inspector)
  3. the subject of the claim

My understanding is that, for example, a parent could present a DID and assert that the DID applies to an individual who they claim is their child. In this case, all three of the above listed parties are different. I haven't seen any language addressing how we deal with the presenter's assertion that the subject of the claim is any particular individual, including themselves.
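To illustrate why the three roles need to be kept separate, here is a small sketch. The `Presentation` type, its method names, and the party labels are hypothetical and are not part of any Verifiable Claims specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Presentation:
    holder: str     # party that has the serialized claim
    presenter: str  # party asserting the claim to an inspector
    subject: str    # party the claim is about

    def distinct_roles(self) -> int:
        """Number of different parties filling the three roles."""
        return len({self.holder, self.presenter, self.subject})

    def needs_subject_assurance(self) -> bool:
        """True when the inspector cannot assume the presenter is the subject."""
        return self.presenter != self.subject


# Hypothetical parties for the parent/child example.
guardian_case = Presentation(holder="wallet-service", presenter="parent", subject="child")
self_case = Presentation(holder="alice", presenter="alice", subject="alice")

print(guardian_case.distinct_roles(), guardian_case.needs_subject_assurance())
print(self_case.distinct_roles(), self_case.needs_subject_assurance())
```

Even in the self-presented case, the presenter's assertion that they are the subject is still only an assertion. The sketch just makes the roles explicit so that the assumption is visible rather than implied.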

I realize some of this is protocol related and potentially out of scope, but I found my own language challenging to reconcile with the ambiguous relationship between the holder, the presenter, and the subject.

@agropper

agropper commented Feb 22, 2017 via email

@prototypo

@agropper It would be really helpful to align terms, such as "patient" with equivalent (or new, if necessary) Verifiable Claims terms (e.g. "entity").

It would also be helpful to surface all the implicit relationships in the prescription use case (e.g. the verifying employee at the pharmacy is authorised by the pharmacy to certify a prescription as being valid).

@agropper

agropper commented Mar 13, 2017 via email

msporny pushed a commit that referenced this issue Apr 14, 2017
@msporny
Member Author

msporny commented Oct 25, 2017

@jandrieu did a PR for this issue and it was accepted into the spec. Closing the issue.

@msporny msporny closed this as completed Oct 25, 2017

8 participants