Describe the bug
I have a user journey where I have to mask PII for one use case and then unmask the PII of the same data in another use case. It is like a river that flows under the soil at one village and over the soil at the next. The problem arises when two different entities are detected on portions of a sentence and those portions partially overlap. When I unmask the text, I get back a modified version of the original text, because the conflict between the overlapping entities is not handled properly.
Sample Code
from presidio_analyzer import RecognizerResult, AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
text = """Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."""
anonymizer_engine = AnonymizerEngine()
analyzer_engine = AnalyzerEngine()
result = analyzer_engine.analyze(text, language="en")
# The "keep" operator leaves each detected entity unchanged, so the output should equal the input.
deIdentifiedText = anonymizer_engine.anonymize(
    text=text, analyzer_results=result, operators={"DEFAULT": OperatorConfig("keep")}
)
print("De-identified text")
print(deIdentifiedText.text)
Results Generated
De-identified text
Fake card number 4151 3217 6243 34483448.com that overlaps with nonexisting URL.
-> The detected entities are: [type: CREDIT_CARD, start: 17, end: 36, score: 1.0, type: URL, start: 32, end: 40, score: 0.5].
-> Note the start and end positions of the two entities: (17, 36) and (32, 40). This edge case breaks the round trip back to the original text, because an extra 3448 is inserted in the example above. The extra text comes from the portion shared by both entities, (32, 36), and it makes the output differ from the original text.
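To illustrate how a shared range like (32, 36) can end up duplicated, here is a minimal sketch of one plausible mechanism. It assumes spans are replaced from right to left and that each span's end is clamped to the start of the previously replaced span. The helper `keep_right_to_left` is hypothetical and is not Presidio code; it only reproduces the symptom on plain strings.

```python
def keep_right_to_left(text, spans):
    """Re-insert each span's own text ("keep"), processing the rightmost span first.

    Assumption for illustration: the end of each span is clamped to the start
    of the span handled just before it, so an overlapping range is written twice.
    """
    out = text
    last_start = len(text)
    for start, end in sorted(spans, key=lambda s: s[0], reverse=True):
        original = text[start:end]          # text covered by this entity
        clamped_end = min(end, last_start)  # do not cut into the previous span
        out = out[:start] + original + out[clamped_end:]
        last_start = start
    return out


text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."

# Overlapping spans (17, 36) and (32, 40): "3448" is emitted twice.
print(keep_right_to_left(text, [(17, 36), (32, 40)]))

# Non-overlapping spans (17, 36) and (36, 40): the text is unchanged.
print(keep_right_to_left(text, [(17, 36), (36, 40)]))
```

With non-overlapping spans the same routine returns the input unchanged, which is why adjusting the second entity's start to 36 is enough to restore the round trip.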
Expected behavior
De-identified text
Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL.
-> It should return the original text as it is, without modifying it, and the entities list should look like: [type: CREDIT_CARD, start: 17, end: 36, score: 1.0, type: URL, start: 36, end: 40, score: 0.5]. This lets us identify both entities without losing either of them and without modifying the original text.
Additional context
This can be solved by improving the conflict-handling function in AnonymizerEngine. I have a fix for this issue and will raise a PR for it.
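As a rough sketch of the kind of conflict handling meant here (not the actual PR and not Presidio's implementation), partial overlaps could be resolved by letting the higher-scoring entity keep its full range and trimming the lower-scoring one. The `Span` class and `trim_partial_overlaps` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


def trim_partial_overlaps(spans):
    """Return spans with overlaps removed; higher scores keep their full range."""
    resolved = []
    # Visit higher-scoring spans first so they are never trimmed by weaker ones.
    for span in sorted(spans, key=lambda s: (-s.score, s.start)):
        start, end = span.start, span.end
        for kept in resolved:
            if not (start < kept.end and kept.start < end):
                continue  # no overlap with this kept span
            if start >= kept.start and end <= kept.end:
                start = end  # fully covered by a stronger span: drop it
                break
            if start >= kept.start:
                start = kept.end  # overlap at the left edge: move start right
            else:
                # Overlap at the right edge, or the span surrounds `kept`:
                # keep only the part to the left (a fuller version could split).
                end = kept.start
        if start < end:
            resolved.append(Span(span.entity_type, start, end, span.score))
    return sorted(resolved, key=lambda s: s.start)


detected = [Span("CREDIT_CARD", 17, 36, 1.0), Span("URL", 32, 40, 0.5)]
for span in trim_partial_overlaps(detected):
    print(span)
```

For the example in this issue, the URL span is trimmed from (32, 40) to (36, 40) while the CREDIT_CARD span stays at (17, 36), matching the expected entities list above.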