Effective handling of intersecting entities in such a way that it won't change the original text. #1195

Closed
VMD7 opened this issue Oct 28, 2023 · 0 comments

Comments

@VMD7
Contributor

VMD7 commented Oct 28, 2023

Describe the bug
I have a user journey where I mask PII for one use case and then have to unmask the PII in the same data for another use case. It's like a river that has to flow underground at one village and above ground at the next. The problem appears whenever two different entities are detected in a sentence and their spans partially overlap. When I unmask the text to get the data back, the result differs from the original text, because the conflict between the overlapping entities is not handled properly.

Sample Code

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# The card number's last group ("3448") is also the start of the URL "3448.com".
text = """Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."""

anonymizerEngine = AnonymizerEngine()
analyzerEngine = AnalyzerEngine()

# Detect PII entities in the text.
result = analyzerEngine.analyze(text, language="en")

# The "keep" operator leaves every detected entity unchanged,
# so the output is expected to equal the input.
deIdentifiedText = anonymizerEngine.anonymize(
    text=text, analyzer_results=result, operators={"DEFAULT": OperatorConfig("keep")}
)

print("De-identified text")
print(deIdentifiedText.text)

Results Generated
De-identified text
Fake card number 4151 3217 6243 34483448.com that overlaps with nonexisting URL.
-> Here is the list of detected entities: [type: CREDIT_CARD, start: 17, end: 36, score: 1.0, type: URL, start: 32, end: 40, score: 0.5].
-> The two entities span (17, 36) and (32, 40), so they overlap on (32, 36). This edge case prevents recovering the original text: the overlapping portion is emitted twice, which inserts an extra 3448 in the example above and changes the original text.
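
To make the duplication concrete, here is a minimal pure-Python model of span-by-span replacement. It is only a sketch under my own assumptions, not Presidio's actual implementation: `apply_keep` is a hypothetical helper that copies the gaps between spans and then emits each span's own slice, which is enough to reproduce the output reported above.

```python
text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."

# (start, end) pairs as reported by the analyzer: CREDIT_CARD and URL.
spans = [(17, 36), (32, 40)]


def apply_keep(text, spans):
    """Hypothetical model: emit the gaps between spans plus each span's own slice."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # gap before the span (empty when spans overlap)
        pieces.append(text[start:end])     # "keep": the span's text is emitted unchanged
        cursor = max(cursor, end)
    pieces.append(text[cursor:])           # remainder after the last span
    return "".join(pieces)


print(text[32:36])              # the shared portion of the two spans
print(apply_keep(text, spans))  # the shared portion appears twice in the result
```

Because each span is emitted independently, the characters in (32, 36) are written once as the tail of the card number and once as the head of the URL.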

Expected behavior
De-identified text
Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL.
-> It should return the original text as it is, without modifying it, and the entities list should look like: [type: CREDIT_CARD, start: 17, end: 36, score: 1.0, type: URL, start: 36, end: 40, score: 0.5]. This lets us identify both entities without losing either of them and without modifying the original text.
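
One possible way to arrive at that entities list is to trim the lower-scoring entity so that it starts where the higher-scoring one ends. The sketch below illustrates the idea only; `Span` and `trim_overlaps` are hypothetical names, not part of Presidio, and the tie-breaking rules (higher score wins, fully covered spans are dropped) are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


def trim_overlaps(spans):
    """Shorten or drop lower-scoring spans so that no two spans share characters."""
    accepted = []
    # Higher scores are processed first and keep their original bounds.
    for span in sorted(spans, key=lambda s: (-s.score, s.start)):
        start, end = span.start, span.end
        for kept in accepted:
            if start >= end:
                break  # nothing left of this span
            if end <= kept.start or start >= kept.end:
                continue  # no overlap with this accepted span
            if start >= kept.start:
                start = kept.end  # overlap on the right side of `kept`
            else:
                end = kept.start  # overlap on the left side; only the left part is retained
        if start < end:
            accepted.append(replace(span, start=start, end=end))
    return sorted(accepted, key=lambda s: s.start)


detected = [
    Span("CREDIT_CARD", 17, 36, 1.0),
    Span("URL", 32, 40, 0.5),
]
for span in trim_overlaps(detected):
    print(span)
```

With the two entities from this report, the CREDIT_CARD span is left at (17, 36) and the URL span is trimmed to (36, 40), so both entities survive and no character belongs to more than one of them.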

Additional context
This can be solved by improving the conflict-handling function in AnonymizerEngine. I have solved this issue and will raise a PR for it.
