-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Semantic conventions vs GDPR #128
Comments
What I suggest is:
We could also consider adding some env var like |
url.full
and db.statement
attributes vs GDPR
I see you had already contributed in that area in the past: open-telemetry/opentelemetry-specification#1502 |
@Oberon00 In open-telemetry/opentelemetry-specification#1502 I was mostly concerned about leaking login+password (I find it a lot more critical). I felt that mentioning GDPR/sensitive data would not bring a lot of value 2 years ago as semantic conventions were in a very early stage. I was afraid that it would cause more confusion and paralysis. Right now I see a lot of contributions in semantic conventions. I think that it is a good moment to start doing something with privacy/GDPR/data protection. |
I want to add one point - exception callstacks are commonly considered as privacy data. |
Today, I learned that for one of our customer uses client (mobile) instrumentation (Android, iOS). The PII data emitted via HTTP Client instrumentation is very problematic for them. Because the instrumentation works on "end-user" devices therefore it is not easy to get rid of them. Ideally they would prefer to not emit them from the devices at all (via network). |
I think it usually leaks "internal details" of the instrumented system (most telemetry is causing leakage of some internal details). I do not think it would be easy that they leak "personal data" via exception callstack UNLESS they also contain parameter values. I am not aware of any language/ecosystem which dumps parameter values in exception callstacks. |
One extra point - login/password leak can happen anywhere. I'll give an example from Microsoft Azure - in many places we allow users to put "tags" (very generic, arbitrary strings that can be associated with some entities) which can be used to group or search things. And we know that some users could put sensitive information there (e.g. it could be their emails or even passwords), so we consider all of these privacy data and put lots of efforts on redaction, classification, isolation and access control. Another typical issue is that people might put something wrong by mistake when they instrument, in a large system we do see developers making mistakes (e.g. putting user email address in an attribute named "ResourceType"). I feel schema cannot solve these problems, a centralized scanning/redaction system can provide consistent/reliable guarantee (which comes with perf cost for sure). |
For sure it cannot solve it. It is more about Defense in Depth and making the hardening more straightforward and adding protection on more layers. |
I really like the idea of env. variable or something similar to do best effort of anonymizing PII data. I think it is crucial to align on the best effort of not sending this data outside of client devices. |
We could also consider adding some configuration/feature to the SDK that would allow attribute retraction (e.g. via exporter decorator). This could help user's were retraction using OTel Collector's attributesprocessor is problematic (like here). But without additional hints the user may still have trouble to find telemetry which has "not low" probability to contain PII or other sensitive data. |
I had a conversation with @trask, and here is the summary: Our current focus lies on the HTTP semantic convention as it's planned to be stable and can serve as an example for future semantic conventions. During our discussion, we identified the following attributes as problematic:
Regarding However, for A) Apply the same approach as for B) Implement an opt-in functionality to scrub/retract the path and query parts from the URL. However, this leads that most of the data is redundant (except for the missing URL scheme). Personally, I lean towards option A for since I don't consider URL scheme to be critical telemetry. Moreover, we can always add On the other hand, we are both not sure if such attribute retraction should be at the instrumentation level, we both think that maybe the SDK should offer a more general way to suppress specific attributes. Maybe SpanProcessors (rather impossible) for spans and Views (it only contains "allow-list", no "deny-list") for metrics are the way to go? EDIT: I got a feedback from @MrAlias that doing filtering on higher level would lead the resource consumption overhead that may not be acceptable. Not collecting unwanted attributes on instrumentation level would be more efficient. |
I'd like to have a broader OTEL-wide discussion on handling GDPR concerns. Specifically:
I'm going to raise the discussion in the next TC meeting to see if we can get alignment on a direction here and what thoughts exist from others. I suggest this is worth some good brainstorming. |
Related issue open-telemetry/opentelemetry-go-contrib#3895 |
@jsuereth Any update? |
I just spotted
|
related, we have had PII data reported a couple of times now in |
I think, for o11y, it's impossible to NOT collect PII. At this point our approach should be the following:
The guidance to not generate PII by default is temporary, pending better enable/disable features in the SDK for full user control. |
SIG meetings notes. It appears that the general agreement on how to handle the problem is as follows:
|
After some thoughts, I find
are good enough for solving the issue. It is also inline with my original proposal: #128 (comment) |
I wanted to add a paper I found pretty useful for folks evaluating solutions here. This paper covers the background material, a bit of info in terms of legal qualifications and mathematical principles for evaluating anonymity. It may be a good foundation for starting a taxonomy for semconv. |
Per https://github.com/open-telemetry/semantic-conventions/blob/9b455310519ec511656f91d1db0e30f5e32acd2a/specification/trace/semantic_conventions/http.md#http-client
url.full
is currently Required.However, the URL can contain sensitive data e.g. personal data (PII) in GDPR terminology (e.g. login, ID).
GDPR adds many rights to the subjects a lot of rights that may be very problematic (e.g. https://www.digitalguardian.com/blog/google-fined-57m-data-protection-watchdog-over-gdpr-violations).
Maybe it should be Recommended similarly to
device.id
which also has the following notice:semantic-conventions/specification/resource/semantic_conventions/device.md
Line 17 in 9b45531
Maybe should add some notice that the URL can contain sensitive/personal data and one may consider to delete it using OTel Collector's
attributesprocessor
?Maybe the collection of this attribute should be configurable?
The same concerns apply to
db.statement
.PS. I am sorry that the issue has questions than answers.
The text was updated successfully, but these errors were encountered: