-
All of the constraints related to input and structure seem reasonable, but I think it's crucial that keys be included in the analysis. Keys and values are part of the document structure, and IMHO much of the value of analyzing structured data comes from being able to analyze those relationships, not only the discrete values. In our case, we're trying to recognize pseudo-identifiers, and many of them have values that are inherently hard to detect with a practically useful level of confidence, such as age or US ZIP code. But if we can also apply patterns to the key and use those results in combination to validate, invalidate, or enhance the value results, we can perform detection with a much higher level of confidence. We would need this capability to support the majority of our use case. I like the idea of keys and values being used together, similar to text and recognizer_context_words in
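The idea of combining key and value evidence can be sketched in plain Python. This is a minimal illustration and not Presidio's API: `VALUE_PATTERNS`, `KEY_HINTS`, `KEY_BOOST`, and `score_field` are hypothetical names, and the scores are arbitrary numbers chosen for the example.

```python
import re
from typing import List, Tuple

# Hypothetical weak value patterns: (entity, regex, base score).
VALUE_PATTERNS = [
    ("US_ZIP_CODE", re.compile(r"^\d{5}(?:-\d{4})?$"), 0.05),
    ("AGE", re.compile(r"^\d{1,3}$"), 0.05),
]

# Hypothetical key patterns that support each entity.
KEY_HINTS = {
    "US_ZIP_CODE": re.compile(r"zip|postal", re.IGNORECASE),
    "AGE": re.compile(r"(?:^|_)age(?:$|_)", re.IGNORECASE),
}

KEY_BOOST = 0.6  # assumed boost applied when the key supports the value match


def score_field(key: str, value: str) -> List[Tuple[str, float]]:
    """Return (entity, score) pairs for a single key/value pair."""
    results = []
    for entity, value_re, base_score in VALUE_PATTERNS:
        if not value_re.match(value):
            continue  # the value does not even weakly look like this entity
        score = base_score
        if KEY_HINTS[entity].search(key):
            # The key agrees with the value pattern, so raise the confidence.
            score = min(1.0, base_score + KEY_BOOST)
        results.append((entity, score))
    return results


print(score_field("zip_code", "90210"))  # key supports the weak value match
print(score_field("order_id", "90210"))  # same value, no support from the key
```

The same five digits score very differently depending on the key, which is the behavior described above: the key result validates or fails to validate the value result.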
-
Please see an initial sample here: https://github.com/microsoft/presidio/blob/main/docs/samples/python/batch_processing.ipynb
-
Supporting context from external sources
With the latest release of Presidio (2.2.25+), users are able to pass context words from external sources. This could be especially useful in structured/semi-structured settings. For example, the column name (in a data frame / SQL table) or the key (in a JSON/XML file) could be passed as context, and would be compared with each recognizer's context words. A short example:
# Imports
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry, Pattern

# Define a simple ZIP regex
regex = r"(\b\d{5}(?:\-\d{4})?\b)"  # very weak regex pattern
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)

# Define the recognizer with the defined pattern and context words that are relevant for zip codes
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],
)

# Create recognizer registry and add new recognizer
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)

# Create analyzer engine
analyzer = AnalyzerEngine(registry=registry)

# Run with external context
external_context = ["zip"]  # e.g. the column name
result = analyzer.analyze(text="My code is 90210", language="en", context=external_context)
print(result)
Result:
-
@willsthompson @RaviThej0803 @ashishmgofficial @Onenguyen not sure if this is still relevant, but there's a draft PR (#878) on structured/semi-structured processing. If you'd like to influence how this is implemented, please review it and provide your feedback. Thanks!
-
Hello, I wanted to check in to see if any developments have been made on future use case #2 "Allow users to define which keys can be useful for detection as additional context, i.e. use an entity in one column (e.g. customer name) as a hint for detection in other columns (e.g. free text entered by the same customer)." I refer to this as "entity integrity" or "entity preservation" throughout anonymization. I was able to easily write a new recognizer and feed it a customer name to distinguish the customer from other persons within semi-structured data. I'm now struggling to scale. I've tried experimenting with the batch analyzer, but I can't seem to get anything going. I'm currently working with a row-wise apply of the analyzer, but again, that is not a scalable solution. IMO Presidio's batch processing with an entity preservation feature would be an invaluable tool for ML engineers and data augmentation - providing the ability to preserve the semantic context of unstructured data for NLP tasks.
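One way to picture the "entity preservation" idea at the row level, without any Presidio calls, is to treat the value in a trusted column as a per-row deny list for the free-text column. This is only a sketch under assumptions: the column names `customer_name` and `notes`, the `<CUSTOMER>` placeholder, and the `mask_customer` helper are hypothetical, and real names would likely need fuzzier matching than an exact regex.

```python
import re
from typing import Dict


def mask_customer(
    row: Dict[str, str],
    name_key: str = "customer_name",  # hypothetical column holding the known entity
    text_key: str = "notes",          # hypothetical free-text column
    placeholder: str = "<CUSTOMER>",
) -> Dict[str, str]:
    """Replace mentions of this row's customer with a dedicated placeholder."""
    name = row.get(name_key, "").strip()
    if not name:
        return dict(row)  # nothing to use as a hint for this row
    # Match the full name or any single part of it; longest alternatives first
    # so that the full name wins over its parts.
    parts = sorted(set([name] + name.split()), key=lambda p: (-len(p), p))
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in parts) + r")\b",
        re.IGNORECASE,
    )
    masked = dict(row)
    masked[text_key] = pattern.sub(placeholder, row.get(text_key, ""))
    masked[name_key] = placeholder
    return masked


rows = [
    {"customer_name": "Jane Doe", "notes": "Jane called about Bob Smith's invoice."},
    {"customer_name": "Bob Smith", "notes": "Mr. Smith asked Jane Doe for a refund."},
]
masked_rows = [mask_customer(r) for r in rows]
for r in masked_rows:
    print(r["notes"])
```

Other persons in the text are left untouched here, so a regular PERSON detection pass could still replace them with a generic placeholder afterwards, keeping the customer distinguishable from everyone else.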
-
Hi, please let us know if this is helpful, and if anything specific is missing.
-
Hi, a little late to the party, but I was having a similar use case and thought I'd share my 2 cents. My use case was mainly around point 3:
I was mainly concerned with text in nested
Code:
import json
import sys
from json.decoder import py_scanstring
from json.scanner import py_make_scanner

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Prefer the C string scanner when it is available
try:
    from _json import scanstring as c_scanstring
except ImportError:
    c_scanstring = None

scanstring = c_scanstring or py_scanstring


class AnonDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._analyzer = AnalyzerEngine()
        self._anonymizer = AnonymizerEngine()

        def parse_str(*args, **kwargs):
            # Decode the string as usual, then anonymize it
            text, end = scanstring(*args, **kwargs)
            anon_text = self._anonymize(text)
            return anon_text, end

        # Rebuild the pure-Python scanner so it picks up the custom string parser
        self.parse_string = parse_str
        self.scan_once = py_make_scanner(self)

    def _anonymize(self, text):
        analyzer_results = self._analyzer.analyze(
            text=text,
            language="en",
        )
        anonymized_results = self._anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
        )
        return anonymized_results.text


data = """{
    "key_a": {"key_a1": "My phone number is 212-121-1424"},
    "key_b": ["www.abc.com"],
    "key_c": 3,
    "names": ["James Bond", "Clark Kent", "Hakeem Olajuwon", "No name here!"],
    "key_d": [
        {
            "key2": "Peter Parker"
        },
        {
            "key3": "My phone number is 414-121-1424"
        }
    ]
}"""

json.dump(json.loads(data, cls=AnonDecoder), fp=sys.stdout, indent=4)
Result:
{
    "key_a": {
        "key_a1": "My phone number is <PHONE_NUMBER>"
    },
    "key_b": [
        "<URL>"
    ],
    "key_c": 3,
    "names": [
        "<PERSON>",
        "<PERSON>",
        "<PERSON>",
        "No name here!"
    ],
    "key_d": [
        {
            "key2": "<PERSON>"
        },
        {
            "key3": "My phone number is <PHONE_NUMBER>"
        }
    ]
}
@omri374 Would love to hear your thoughts. Thank you for the awesome library ❤️
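An alternative to subclassing the decoder is to walk the already-parsed structure and apply a function to every string value, which also makes the key path available as context. The sketch below is a minimal version of that idea using only the standard library; `transform_strings` is a hypothetical helper, and the `redact_digits` callable only stands in for a real analyze-and-anonymize step.

```python
import json
import re
from typing import Any, Callable, Tuple

Path = Tuple[Any, ...]  # keys and list indexes leading to a value


def transform_strings(node: Any, fn: Callable[[str, Path], str], path: Path = ()) -> Any:
    """Return a copy of a parsed JSON structure with fn applied to every string value."""
    if isinstance(node, dict):
        return {k: transform_strings(v, fn, path + (k,)) for k, v in node.items()}
    if isinstance(node, list):
        return [transform_strings(v, fn, path + (i,)) for i, v in enumerate(node)]
    if isinstance(node, str):
        return fn(node, path)
    return node  # numbers, booleans and null pass through unchanged


def redact_digits(text: str, path: Path) -> str:
    # Stand-in for a real anonymizer; `path` could be used as external context.
    return re.sub(r"\d", "#", text)


doc = json.loads(
    '{"key_a": {"key_a1": "Call 212-121-1424"}, "key_c": 3, "key_d": [{"key3": "Ext 42"}]}'
)
result = transform_strings(doc, redact_digits)
print(json.dumps(result, indent=4))
```

As with the decoder approach, keys are left as they are; only string values are passed to the callable.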
-
Presidio now has more in this space: https://microsoft.github.io/presidio/structured/
Note that there are currently two ways to use Presidio with structured data:
Main differences are:
Note that in the future we plan to have a more seamless integration of the batch analysis / anonymization into presidio-structured.
-
Handling structured / semi-structured data with Presidio
Context / Problem Statement
Presidio's main strength lies in the ability to detect PII entities in unstructured text. However, many use cases involve either fully structured or semi-structured data, where Presidio cannot be used out of the box. The scenarios for structured/semi-structured data de-identification are mainly around tabular data, which may or may not contain free-text columns, and json/xml files with specific fields containing PII.
Use cases we would like to support initially
Use cases we might want to consider in the future
Considered Options
1. Key/value pairs:
   a. Provide a new analyzer API for reading a flat dict/json document, and iterate over keys and values.
   b. Provide a new anonymizer API for recreating the anonymized json/dict.
   This will support both structured data in tables (assuming each row is treated as key:value) and flat json documents. This option will allow users to scan and de-identify tabular data (sql, pandas) as well as shallow dictionaries (i.e. Dict[str, str]).
2. Key/list of values: In addition to (1), provide the ability to run on dicts/json containing lists, such as this one:
   This will extend (1) with the ability to run on dictionaries of type Dict[str, List[str]] as well as on entire columns of tabular data (assuming the tabular data is represented as a dictionary, where every column has a key and a list of values).
3. Dictionaries/json of arbitrary structure: In addition to (1) and (2), provide the ability to parse deeper json/dict structures. A naive approach would be to flatten the dict, run the logic in (1) or (2) and unflatten it back (potentially with something like this). Note that this library does not support the flattening of lists of objects, so it would not support all use cases (note: a thorough analysis of options is required).
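The flatten / process / unflatten approach from option 3 can be sketched with the standard library alone. This is a minimal illustration and not the library referred to above: `flatten` and `unflatten` are hypothetical helpers, paths are tuples of keys and list indexes, and the sketch assumes JSON-like input, where dictionary keys are always strings, so an integer path component always denotes a list index. Unlike the limitation noted above, lists of objects are handled.

```python
from typing import Any, Dict, Tuple

Path = Tuple[Any, ...]


def flatten(node: Any, path: Path = ()) -> Dict[Path, Any]:
    """Map every leaf of a nested dict/list structure to its path."""
    if isinstance(node, dict) and node:
        items = node.items()
    elif isinstance(node, list) and node:
        items = enumerate(node)
    else:
        return {path: node}  # scalars and empty containers are leaves
    flat: Dict[Path, Any] = {}
    for key, value in items:
        flat.update(flatten(value, path + (key,)))
    return flat


def _listify(node: Any) -> Any:
    """Turn dicts keyed only by integers back into lists (JSON-like data assumed)."""
    if not isinstance(node, dict):
        return node
    if node and all(isinstance(k, int) for k in node):
        return [_listify(node[i]) for i in sorted(node)]
    return {k: _listify(v) for k, v in node.items()}


def unflatten(flat: Dict[Path, Any]) -> Any:
    """Rebuild the nested structure produced by flatten()."""
    if () in flat:
        return flat[()]  # the whole document was a single leaf
    tree: Dict[Any, Any] = {}
    for path, value in flat.items():
        cursor = tree
        for part in path[:-1]:
            cursor = cursor.setdefault(part, {})
        cursor[path[-1]] = value
    return _listify(tree)


doc = {
    "user": {"name": "James Bond", "phones": ["212-121-1424", "414-121-1424"]},
    "orders": [{"id": 1, "note": "call Clark Kent"}],
}
flat = flatten(doc)
# Stand-in for the flat key/value logic of option 1: redact every string leaf.
masked = unflatten({p: "<REDACTED>" if isinstance(v, str) else v for p, v in flat.items()})
print(masked)
```

Keeping the path as a tuple avoids having to pick a separator character that might also appear inside a key.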
Code examples
High level code illustration of option 1
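A rough sketch of what the two APIs in option 1 might look like for a flat Dict[str, str] follows. It is an assumption-laden illustration, not the API proposed by the Presidio team: `analyze_dict`, `anonymize_dict`, and `toy_detect` are hypothetical names, and the regex detector only stands in for the analyzer.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]                  # (start, end, entity)
Detector = Callable[[str, str], List[Span]]  # (key, value) -> spans found in value


def analyze_dict(record: Dict[str, str], detect: Detector) -> Dict[str, List[Span]]:
    """Run the detector on every value, passing the key along as context."""
    return {key: detect(key, value) for key, value in record.items()}


def anonymize_dict(record: Dict[str, str], findings: Dict[str, List[Span]]) -> Dict[str, str]:
    """Recreate the dict with every detected span replaced by <ENTITY>."""
    out = {}
    for key, value in record.items():
        # Replace from the rightmost span so earlier offsets stay valid.
        for start, end, entity in sorted(findings.get(key, []), reverse=True):
            value = value[:start] + f"<{entity}>" + value[end:]
        out[key] = value
    return out


PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def toy_detect(key: str, value: str) -> List[Span]:
    # Stand-in detector; a real one could also use `key` as a context word.
    return [(m.start(), m.end(), "PHONE_NUMBER") for m in PHONE.finditer(value)]


record = {"name": "n/a", "comment": "My phone number is 212-121-1424"}
findings = analyze_dict(record, toy_detect)
print(anonymize_dict(record, findings))
```

Separating analysis from anonymization mirrors the split between items (a) and (b) of option 1, so the findings can be inspected or filtered before anything is replaced.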
High level code illustration of option 2
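For the Dict[str, List[str]] shape in option 2, one possible sketch is a column-wise pass that counts which entity each value looks like, so a decision can be made per column. The names `analyze_columns` and `toy_classify` are hypothetical, and the email regex is only a placeholder for a real recognizer.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Optional

Classifier = Callable[[str, str], Optional[str]]  # (column, value) -> entity or None


def analyze_columns(table: Dict[str, List[str]], classify: Classifier) -> Dict[str, Counter]:
    """Count, per column, how many values were classified as each entity."""
    return {
        column: Counter(classify(column, value) for value in values)
        for column, values in table.items()
    }


EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def toy_classify(column: str, value: str) -> Optional[str]:
    # Stand-in classifier; None means "no entity detected".
    return "EMAIL_ADDRESS" if EMAIL.match(value) else None


table = {
    "contact": ["a@example.com", "b@example.org", "n/a"],
    "city": ["Paris", "Lima", "Oslo"],
}
summary = analyze_columns(table, toy_classify)
for column, counts in summary.items():
    dominant, hits = counts.most_common(1)[0]
    print(column, dominant, hits)
```

A per-column summary like this could then drive a column-level choice, for example anonymizing a whole column once most of its values match the same entity.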
Consequences
Next steps
Dear community members, we would like to hear your thoughts on this. Does this approach support your use case? Please share your feedback in the comments. Contributions from the community to Presidio are also welcome!
For specific questions, feel free to reach out to the team at presidio@microsoft.com.