Feature/presidio-structured #1192

Merged
merged 65 commits into from
Jan 14, 2024
Changes from 3 commits
8c6be26
presidio-structured
Jakob-98 Sep 13, 2023
87e4d18
Add unit tests
Jakob-98 Oct 31, 2023
99a5b5d
Merge branch 'main' into feature/presidio-tabular
omri374 Oct 31, 2023
f9ec126
rename engine, add buildfile
Jakob-98 Nov 9, 2023
e4622fd
Update setup.py
Jakob-98 Nov 9, 2023
d4f37de
Merge branch 'main' into feature/presidio-tabular
omri374 Nov 10, 2023
1097a62
Merge branch 'main' into feature/presidio-tabular
SharonHart Nov 16, 2023
1427528
lint-build-test
Jakob-98 Nov 17, 2023
7a971ac
Merge branch 'feature/presidio-tabular' of https://github.com/Jakob-9…
Jakob-98 Nov 17, 2023
463beba
Update lint-build-test.yml
Jakob-98 Nov 17, 2023
5f36b40
Add packages to setup.py
Nov 22, 2023
6693817
Update presidio-structured to alpha version
Nov 22, 2023
25e961e
Update Presidio structured README.md
Nov 22, 2023
c356dd2
Add logging configuration to presidio-structured
Nov 22, 2023
3d9bf2f
Refactor AnalysisBuilder constructor to accept an
Nov 22, 2023
fe0750f
Fix entity mapping in JsonAnalysisBuilder
Nov 22, 2023
48a0cd6
Drop type in docstring in analysis builder classes
Nov 22, 2023
7a6ed72
Refactor TabularAnalysisBuilder to use
Nov 22, 2023
fff9a36
Update data_reader.py with type hints for file
Nov 22, 2023
0915d9f
Update data_reader.py to include additional
Nov 22, 2023
d0db1c3
Update Transformer to Processor term in
Nov 22, 2023
3931558
Add PandasDataProcessor as default to StructuredEngine
Nov 22, 2023
5977230
Move structured sample files to the docs
Nov 22, 2023
1770112
Add Presidio Structured Notebook to samples index
Nov 22, 2023
c202f0c
Remove unnecessary imports in structured sample
Nov 22, 2023
91f9f6b
Update to processors in structured __init__ files
Nov 22, 2023
d71ff88
Add explanation for structured table sample
Nov 22, 2023
15e03c3
Delete unnecessary __init__s in structured test
Nov 22, 2023
354e223
Fix bug in JsonAnalysisBuilder entity mapping
Nov 23, 2023
f637f34
Merge pull request #1 from ebotiab/feature/presidio-tabular
Jakob-98 Nov 24, 2023
db1f3d8
pr comments, nits, minor tests
Jakob-98 Nov 24, 2023
29f7f8a
README
Jakob-98 Nov 27, 2023
33182bb
Add TabularAnalysisBuilder
Jakob-98 Nov 27, 2023
43c39d8
Some basic logging
Jakob-98 Nov 27, 2023
411f1bd
linting
Jakob-98 Nov 27, 2023
e31ff12
Fix typo in logger variable name
Nov 27, 2023
bdd7e20
Refactor analysis builder to include score
Nov 27, 2023
15b756f
Linting, continued
Jakob-98 Nov 27, 2023
d4e317c
Update Pipfile
Jakob-98 Nov 27, 2023
78f0c01
Merge remote-tracking branch 'upstream/feature/presidio-tabular' into…
Nov 27, 2023
6513668
Refactor JsonAnalysisBuilder to support language
Nov 27, 2023
df2a4e0
Fix not camel case in TabularAnalysisBuilder
Nov 27, 2023
75da36a
Add score_threshold parameter to AnalysisBuilder
Nov 27, 2023
54fb99c
Refactor JSON analysis builder to gain consistency
Nov 27, 2023
7fe314a
Remove low score results in JsonAnalysisBuilder
Nov 27, 2023
c25d82f
Add tests to json analysis with score threshold
Nov 27, 2023
0f3364d
Fix bug in JSON analysis to update map with
Nov 27, 2023
0d6ebfc
Fix bug in JSON analysis to take only entity types
Nov 27, 2023
5f60ee5
Fix typos in test anl json names and assert values
Nov 27, 2023
b942513
Update build-structured.yml
Jakob-98 Nov 28, 2023
f042ffe
Create __init__.py
omri374 Nov 29, 2023
22ee87d
Type hint fix python <3.10, loggger typo
Jakob-98 Nov 29, 2023
c60e727
Merge branch 'feature/presidio-tabular' of https://github.com/Jakob-9…
Jakob-98 Nov 29, 2023
575498f
Update setup.py
Jakob-98 Nov 29, 2023
0499de0
Merge branch 'feature/presidio-tabular' into analysis-builder-improve…
Nov 30, 2023
8246d22
Merge branch 'main' into feature/presidio-tabular
omri374 Dec 3, 2023
6977c5d
Merge branch 'main' into feature/presidio-tabular
omri374 Dec 11, 2023
b522b88
Merge branch 'main' into feature/presidio-tabular
SharonHart Dec 24, 2023
38beac7
Merge pull request #3 from ebotiab/analysis-builder-improvements
Jakob-98 Jan 9, 2024
4e2bea4
PR comments variety
Jakob-98 Jan 9, 2024
0388c89
further pr comments
Jakob-98 Jan 9, 2024
cdc8923
readme, refactor score, refactor tabular analysis
Jakob-98 Jan 9, 2024
6985aa7
Update test_analysis_builder.py
Jakob-98 Jan 9, 2024
0a87783
lint
Jakob-98 Jan 10, 2024
1ed6e6c
Merge branch 'main' into feature/presidio-tabular
omri374 Jan 11, 2024
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,11 @@

All notable changes to this project will be documented in this file.

## [Unreleased]
### Added
#### Structured
* Added V1 of presidio-structured, a library that reuses logic from existing Presidio components to anonymize (semi-)structured data.

## [2.2.34] - Oct. 30th 2023

### Added
@@ -51,7 +56,6 @@ All notable changes to this project will be documented in this file.
* Changed the ACR instance (#1089)
* Updated to Cred Scan V3 (#1154)


## [2.2.33] - June 1st 2023
### Added
#### Anonymizer
18 changes: 18 additions & 0 deletions presidio-structured/README.md
@@ -0,0 +1,18 @@
# Presidio structured

## Status

### TODO

For TODOs, see draft PR.

## Description

The Presidio structured is..

## Deploy Presidio analyzer to Azure

## Simple usage example

## Documentation

9 changes: 9 additions & 0 deletions presidio-structured/__init__.py
@@ -0,0 +1,9 @@
"""Presidio structured root module."""
import logging

# Set up default logging (with NullHandler)


# logging.getLogger("presidio-str").addHandler(logging.NullHandler())

# __all__ = ["AnonymizerEngine", "DeanonymizeEngine", "BatchAnonymizerEngine"]
15 changes: 15 additions & 0 deletions presidio-structured/presidio_structured/__init__.py
@@ -0,0 +1,15 @@
from .analysis_builder import JsonAnalysisBuilder, TabularAnalysisBuilder
from .config import StructuredAnalysis
from .data import CsvReader, JsonDataTransformer, JsonReader, PandasDataTransformer
from .tabular_engine import TabularEngine

__all__ = [
    "TabularEngine",
    "JsonAnalysisBuilder",
    "TabularAnalysisBuilder",
    "StructuredAnalysis",
    "CsvReader",
    "JsonReader",
    "PandasDataTransformer",
    "JsonDataTransformer",
]
163 changes: 163 additions & 0 deletions presidio-structured/presidio_structured/analysis_builder.py
@@ -0,0 +1,163 @@
from abc import ABC, abstractmethod
from collections import Counter
from collections.abc import Iterable
from typing import Dict, Iterator, Union

from pandas import DataFrame
from presidio_analyzer import (
    AnalyzerEngine,
    BatchAnalyzerEngine,
    DictAnalyzerResult,
    RecognizerResult,
)

from presidio_structured.config import StructuredAnalysis


class AnalysisBuilder(ABC):
    """Abstract base class for a configuration generator."""

    def __init__(self):
        """Initialize the configuration generator."""
        self.analyzer = AnalyzerEngine()

    @abstractmethod
    def generate_analysis(self, data: Union[Dict, DataFrame]) -> StructuredAnalysis:
        """
        Abstract method to generate a configuration from the given data.

        :param data: The input data. Can be a dictionary or DataFrame instance.
        :type data: Union[Dict, DataFrame]
        :return: The generated configuration.
        :rtype: StructuredAnalysis
        """
        pass


class JsonAnalysisBuilder(AnalysisBuilder):
    """Concrete configuration generator for JSON data."""

    def generate_analysis(self, data: Dict) -> StructuredAnalysis:
        """
        Generate a configuration from the given JSON data.

        :param data: The input JSON data.
        :type data: Dict
        :return: The generated configuration.
        :rtype: StructuredAnalysis
        """
        batch_analyzer = BatchAnalyzerEngine(analyzer_engine=self.analyzer)
        analyzer_results = batch_analyzer.analyze_dict(input_dict=data, language="en")
        return self._generate_analysis_from_results_json(analyzer_results)

    def _generate_analysis_from_results_json(
        self, analyzer_results: Iterator[DictAnalyzerResult], prefix: str = ""
    ) -> StructuredAnalysis:
        """
        Generate a configuration from the given analyzer results.

        :param analyzer_results: The analyzer results.
        :type analyzer_results: Iterator[DictAnalyzerResult]
        :param prefix: The prefix for the configuration keys.
        :type prefix: str
        :return: The generated configuration.
        :rtype: StructuredAnalysis
        """
        mappings = {}

        if not isinstance(analyzer_results, Iterable):
            # Always return a StructuredAnalysis so callers can read
            # .entity_mapping, including the recursive call below.
            return StructuredAnalysis(entity_mapping=mappings)

        for result in analyzer_results:
            current_key = prefix + result.key

            if isinstance(result.value, dict):
                # Nested object: recurse and collect keys as "parent.child".
                nested_mappings = self._generate_analysis_from_results_json(
                    result.recognizer_results, prefix=current_key + "."
                )
                mappings.update(nested_mappings.entity_mapping)
                continue

            # Iterate directly instead of counting first: counting would
            # exhaust recognizer_results if it is a one-shot iterator.
            # If several entities are found, the last one wins.
            for recognizer_result in result.recognizer_results:
                mappings[current_key] = recognizer_result.entity_type

        return StructuredAnalysis(entity_mapping=mappings)


class TabularAnalysisBuilder(AnalysisBuilder):
    """Concrete configuration generator for tabular data."""

    def generate_analysis(
        self, df: DataFrame, n: int = 100, language: str = "en"
    ) -> StructuredAnalysis:
        """
        Generate a configuration from the given tabular data.

        :param df: The input tabular data (dataframe).
        :type df: DataFrame
        :param n: The number of samples to be taken from the dataframe.
        :type n: int
        :param language: The language to be used for analysis.
        :type language: str
        :return: The generated configuration.
        :rtype: StructuredAnalysis
        """
        # Never ask for more rows than the dataframe holds.
        if n > len(df):
            n = len(df)

        df = df.sample(n)

        key_recognizer_result_map = self._find_most_common_entity(df, language)

        # Columns without any detected entity are left out of the mapping.
        key_entity_map = {
            key: result.entity_type
            for key, result in key_recognizer_result_map.items()
            if result.entity_type != "NON_PII"
        }

        return StructuredAnalysis(entity_mapping=key_entity_map)

    def _find_most_common_entity(
        self, df: DataFrame, language: str
    ) -> Dict[str, RecognizerResult]:
        """
        Find the most common entity in each dataframe column.

        :param df: The dataframe where entities will be searched.
        :type df: DataFrame
        :param language: Language to be used in the analysis engine.
        :type language: str
        :return: A dictionary mapping column names to the most common RecognizerResult.
        :rtype: Dict[str, RecognizerResult]
        """
        key_recognizer_result_map = {}
        # One batch analyzer is enough for all columns.
        batch_analyzer = BatchAnalyzerEngine(analyzer_engine=self.analyzer)

        for column in df.columns:
            analyzer_results = batch_analyzer.analyze_iterator(
                list(df[column]), language=language
            )

            if all(len(res) == 0 for res in analyzer_results):
                key_recognizer_result_map[column] = RecognizerResult(
                    entity_type="NON_PII", start=0, end=1, score=1.0
                )
                continue
            # Grabbing most common type (only the first result of each row counts).
            types_list = [
                res[0].entity_type for res in analyzer_results if len(res) > 0
            ]
            type_counter = Counter(types_list)
            most_common_type = type_counter.most_common(1)[0][0]
            # Grabbing the average confidence score for the most common type.
            scores = [
                res[0].score
                for res in analyzer_results
                if len(res) > 0 and res[0].entity_type == most_common_type
            ]
            average_score = sum(scores) / len(scores) if scores else 0.0
            key_recognizer_result_map[column] = RecognizerResult(
                entity_type=most_common_type, start=0, end=1, score=average_score
            )
        return key_recognizer_result_map
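The per-column selection in `_find_most_common_entity` can be tried without Presidio installed. The snippet below is only a minimal sketch of that logic, not the PR's implementation: `Detection` and `most_common_entity` are hypothetical names, and each row is assumed to be a list of detections of which only the first is considered, mirroring `res[0]` above.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class Detection(NamedTuple):
    """Hypothetical stand-in for a recognizer result (entity type plus score)."""

    entity_type: str
    score: float


def most_common_entity(
    rows: List[List[Detection]],
) -> Optional[Tuple[str, float]]:
    """Return the most frequent first-detected entity type and its mean score.

    Returns None when no row has any detection (the "NON_PII" case above).
    """
    # Only the first detection of each non-empty row is considered.
    firsts = [row[0] for row in rows if row]
    if not firsts:
        return None

    top_type = Counter(d.entity_type for d in firsts).most_common(1)[0][0]
    scores = [d.score for d in firsts if d.entity_type == top_type]
    return top_type, sum(scores) / len(scores)


# Illustrative data only: four "rows" of one column.
column = [
    [Detection("PERSON", 0.5)],
    [Detection("PERSON", 1.0), Detection("LOCATION", 0.4)],
    [],
    [Detection("LOCATION", 0.9)],
]
print(most_common_entity(column))
```

When two entity types tie, `Counter.most_common` returns the one encountered first, so the outcome depends on row order, and therefore on the random sample drawn in `generate_analysis`.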
5 changes: 5 additions & 0 deletions presidio-structured/presidio_structured/config/__init__.py
@@ -0,0 +1,5 @@
from .structured_analysis import StructuredAnalysis

__all__ = [
    "StructuredAnalysis",
]
@@ -0,0 +1,13 @@
""" Structured Analysis module. """

from dataclasses import dataclass
from typing import Dict


@dataclass
class StructuredAnalysis:
    """Dataclass containing entity analysis from structured data.

    Currently only contains entity mapping.
    """

    # NOTE ideally Literal[...] with allowed EntityTypes, but cannot unpack in Literal.
    entity_mapping: Dict[str, str]
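For nested JSON, `JsonAnalysisBuilder` in this PR builds the `entity_mapping` keys by joining nested keys with a dot. A minimal sketch of that key scheme follows. The dataclass is redeclared so the snippet runs on its own; `flatten_entity_keys`, the `detect` callback, and `toy_detect` are hypothetical stand-ins for the analyzer and are not part of the PR.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class StructuredAnalysis:
    """Local copy of the dataclass above, so the sketch is self-contained."""

    entity_mapping: Dict[str, str]


def flatten_entity_keys(
    data: Dict[str, Any],
    detect: Callable[[Any], Optional[str]],
    prefix: str = "",
) -> Dict[str, str]:
    """Map dotted key paths to entity types, recursing into nested dicts."""
    mapping: Dict[str, str] = {}
    for key, value in data.items():
        path = prefix + key
        if isinstance(value, dict):
            # Nested object: child keys become "parent.child".
            mapping.update(flatten_entity_keys(value, detect, prefix=path + "."))
            continue
        entity_type = detect(value)  # hypothetical detector; None means no PII
        if entity_type is not None:
            mapping[path] = entity_type
    return mapping


def toy_detect(value: Any) -> Optional[str]:
    """Toy detector for illustration only: flags strings containing '@'."""
    if isinstance(value, str) and "@" in value:
        return "EMAIL_ADDRESS"
    return None


record = {"id": 1, "contact": {"email": "jane@example.com", "city": "Oslo"}}
analysis = StructuredAnalysis(entity_mapping=flatten_entity_keys(record, toy_detect))
print(analysis.entity_mapping)
```

Keys with no detected entity are simply absent from the mapping, which matches how the tabular builder drops `NON_PII` columns.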
9 changes: 9 additions & 0 deletions presidio-structured/presidio_structured/data/__init__.py
@@ -0,0 +1,9 @@
from .data_reader import CsvReader, JsonReader
from .data_transformers import JsonDataTransformer, PandasDataTransformer

__all__ = [
    "CsvReader",
    "JsonReader",
    "PandasDataTransformer",
    "JsonDataTransformer",
]
69 changes: 69 additions & 0 deletions presidio-structured/presidio_structured/data/data_reader.py
@@ -0,0 +1,69 @@
""" Helper data classes, mostly simple wrappers to ensure consistent user interface. """

import json
from abc import ABC, abstractmethod
from typing import Any, Dict

import pandas as pd


class ReaderBase(ABC):
    """
    Base class for data readers.

    This class should not be instantiated directly. Instead use or define a reader subclass.
    """

    @abstractmethod
    def read(self, path: str) -> Any:
        """
        Extract data from file located at path.

        :param path: String defining the location of the file to read.
        :return: The data read from the file.
        """
        pass


class CsvReader(ReaderBase):
    """
    Reader for reading csv files.

    Usage::

        reader = CsvReader()
        data = reader.read(path="filepath.csv")

    """

    def read(self, path: str) -> pd.DataFrame:
        """
        Read csv file to pandas dataframe.

        :param path: String defining the location of the csv file to read.
        :return: Pandas DataFrame with the data read from the csv file.
        """
        return pd.read_csv(path)


class JsonReader(ReaderBase):
    """
    Reader for reading json files.

    Usage::

        reader = JsonReader()
        data = reader.read(path="filepath.json")

    """

    def read(self, path: str) -> Dict[str, Any]:
        """
        Read json file to dict.

        :param path: String defining the location of the json file to read.
        :return: dictionary with the data read from the json file.
        """
        with open(path) as f:
            data = json.load(f)
        return data
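The `ReaderBase` docstring suggests defining a subclass for other formats. As a rough illustration of what that could look like, the sketch below adds a reader for JSON Lines files. `JsonLinesReader` is hypothetical and not part of this PR, and `ReaderBase` is redeclared in minimal form so the snippet runs on its own using only the standard library.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ReaderBase(ABC):
    """Minimal copy of the base class above, so the sketch runs on its own."""

    @abstractmethod
    def read(self, path: str) -> Any:
        """Extract data from the file located at path."""


class JsonLinesReader(ReaderBase):
    """Hypothetical reader (not in the PR): one JSON object per line."""

    def read(self, path: str) -> List[Dict[str, Any]]:
        with open(path, encoding="utf-8") as f:
            # Skip blank lines; parse every other line as a JSON document.
            return [json.loads(line) for line in f if line.strip()]


# Write a small temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "people.jsonl")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write('{"name": "Jane"}\n\n{"name": "John"}\n')
    records = JsonLinesReader().read(path=file_path)

print(records)
```

Because `read` is abstract, instantiating `ReaderBase` directly raises `TypeError`, which enforces the "use or define a reader subclass" guidance in the docstring.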