Feature/presidio-structured (#1192)
* presidio-structured

changelog

Static analysis

docstrings, types

preliminary tests engine

static analysis

isort

Minor refactorings

Update README.md

Fix late binding issues and example

removal of old samples

Refactoring, adding example

pre-clean-break-commit

broken commit, fixing TabularConfigBuilder

Rename TabularConfig

pre-breaking replace commit

removal of some old experimental files

rename tabular to structured

restructuring presidio tabular - pre del commit

Add project TODOs

testing dump presidio tabular

* Add unit tests

* rename engine, add buildfile

* Update setup.py

* lint-build-test

* Update lint-build-test.yml

* Add packages to setup.py

* Update presidio-structured to alpha version

* Update Presidio structured README.md

* Add logging configuration to presidio-structured
module

* Refactor AnalysisBuilder constructor to accept an
optional AnalyzerEngine parameter

* Fix entity mapping in JsonAnalysisBuilder

* Drop type in docstring in analysis builder classes

* Refactor TabularAnalysisBuilder to use
BatchAnalyzerEngine for all columns

* Update data_reader.py with type hints for file
paths

* Update data_reader.py to include additional
keyword arguments in read() method

* Update Transformer to Processor term in
StructuredEngine

* Add PandasDataProcessor as default to StructuredEngine
init

* Move structured sample files to the docs

* Add Presidio Structured Notebook to samples index

* Remove unnecessary imports in structured sample

* Update to processors in structured __init__ files

* Add explanation for structured table sample

* Delete unnecessary __init__s in structured test

* Fix bug in JsonAnalysisBuilder entity mapping

* pr comments, nits, minor tests

* README

* Add TabularAnalysisBuilder

* Some basic logging

* linting

* Fix typo in logger variable name

* Refactor analysis builder to include score
threshold

* Linting, continued

* Update Pipfile

* Refactor JsonAnalysisBuilder to support language
parameter

* Fix not camel case in TabularAnalysisBuilder

* Add score_threshold parameter to AnalysisBuilder

* Refactor JSON analysis builder to gain consistency

* Remove low score results in JsonAnalysisBuilder

* Add tests to json analysis with score threshold

* Fix bug in JSON analysis to update map with
nested_mappings

* Fix bug in JSON analysis to take only entity types

* Fix typos in test anl json names and assert values

* Update build-structured.yml

* Create __init__.py

* Type hint fix python <3.10, loggger typo

* Update setup.py

* PR comments variety

* further pr comments

* readme, refactor score, refactor tabular analysis

* Update test_analysis_builder.py

* lint

---------

Co-authored-by: Omri Mendels <omri374@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>
Co-authored-by: enrique.botia <enrique.botia@netzima.com>
4 people authored Jan 14, 2024
1 parent 2a8d3ec commit 966d17a
Showing 26 changed files with 1,915 additions and 1 deletion.
27 changes: 27 additions & 0 deletions .pipelines/templates/build-structured.yml
@@ -0,0 +1,27 @@
steps:
  - task: Bash@3
    displayName: 'Setup pipenv'
    inputs:
      targetType: 'inline'
      script: |
        set -eux # fail on error
        python -m pip install --upgrade pip
        python -m pip install pipenv
        pipenv --python 3
  - task: Bash@3
    displayName: 'Install deps'
    inputs:
      targetType: 'inline'
      workingDirectory: 'presidio-structured'
      script: |
        set -eux # fail on error
        pipenv install --deploy --dev
        pipenv run pip install -e ../presidio-analyzer/. # Use the existing analyzer and not the one in PyPI
        pipenv run pip install -e ../presidio-anonymizer/. # Use the existing anonymizer and not the one in PyPI
  - template: ./build-python.yml
    parameters:
      SERVICE: 'Structured'
      WORKING_FOLDER: 'presidio-structured'

23 changes: 23 additions & 0 deletions .pipelines/templates/lint-build-test.yml
@@ -76,6 +76,7 @@ stages:
versionSpec: '$(python.version)'
displayName: 'Use Python $(python.version)'
- template: ./build-image-redactor.yml

- job: TestCli
displayName: Test Cli
pool:
@@ -97,3 +98,25 @@ stages:
versionSpec: '$(python.version)'
displayName: 'Use Python $(python.version)'
- template: ./build-cli.yml

- job: TestStructured
displayName: Test Presidio Structured
pool:
vmImage: 'ubuntu-latest'
strategy:
matrix:
Python38:
python.version: '3.8'
Python39:
python.version: '3.9'
Python310:
python.version: '3.10'
Python311:
python.version: '3.11'

steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '$(python.version)'
displayName: 'Use Python $(python.version)'
- template: ./build-structured.yml
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,12 @@

All notable changes to this project will be documented in this file.


## [Unreleased]
### Added
#### Structured
* Added an alpha version of presidio-structured, a library that reuses logic from existing Presidio components to anonymize (semi-)structured data.

## [2.2.351] - Nov. 6th 2024
### Changed
#### Analyzer
@@ -17,6 +23,7 @@ All notable changes to this project will be documented in this file.
#### Analyzer
* Put org in ignore as it has many FPs (#1200)


## [2.2.34] - Oct. 30th 2024

### Added
@@ -66,7 +73,6 @@ All notable changes to this project will be documented in this file.
* Changed the ACR instance (#1089)
* Updated to Cred Scan V3 (#1154)


## [2.2.33] - June 1st 2023
### Added
#### Anonymizer
1 change: 1 addition & 0 deletions docs/samples/index.md
@@ -14,6 +14,7 @@
| Usage | Images | Python Notebook | [Plot custom bounding boxes](https://github.com/microsoft/presidio/blob/main/docs/samples/python/plot_custom_bboxes.ipynb)
| Usage | Text | Python Notebook | [Integrating with external services](https://github.com/microsoft/presidio/blob/main/docs/samples/python/integrating_with_external_services.ipynb) |
| Usage | Text | Python file | [Remote Recognizer](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py) |
| Usage | Structured | Python Notebook | [Presidio Structured Basic Usage Notebook](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_structured.ipynb) |
| Usage | Text | Python file | [Azure AI Language as a Remote Recognizer](python/text_analytics/index.md) |
| Usage | CSV | Python file | [Analyze and Anonymize CSV file](https://github.com/microsoft/presidio/blob/main/docs/samples/python/process_csv_file.py) |
| Usage | Text | Python | [Using Flair as an external PII model](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py)|
4 changes: 4 additions & 0 deletions docs/samples/python/csv_sample_data/test_structured.csv
@@ -0,0 +1,4 @@
id,name,email,street,city,state,postal_code
1,John Doe,john.doe@example.com,123 Main St,Anytown,CA,12345
2,Jane Smith,jane.smith@example.com,456 Elm St,Somewhere,TX,67890
3,Alice Johnson,alice.johnson@example.com,789 Pine St,Elsewhere,NY,11223
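
The sample CSV above is the kind of input presidio-structured targets: each column holds one kind of value, so a per-column entity mapping is enough to drive anonymization. The following is a minimal standard-library sketch of that idea only; it is not the presidio-structured API. The `COLUMN_ENTITIES` mapping and the `anonymize_rows` helper are hypothetical names introduced for illustration, and the hand-written mapping stands in for what an analyzer would detect.

```python
import csv
import io

# Sample rows copied from test_structured.csv.
SAMPLE_CSV = """id,name,email,street,city,state,postal_code
1,John Doe,john.doe@example.com,123 Main St,Anytown,CA,12345
2,Jane Smith,jane.smith@example.com,456 Elm St,Somewhere,TX,67890
3,Alice Johnson,alice.johnson@example.com,789 Pine St,Elsewhere,NY,11223
"""

# Hypothetical column-to-entity mapping (assumption: hand-written here;
# a real pipeline would detect entity types rather than hard-code them).
COLUMN_ENTITIES = {"name": "PERSON", "email": "EMAIL_ADDRESS"}


def anonymize_rows(text, column_entities):
    """Replace every value in a mapped column with an <ENTITY> placeholder."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        for column, entity in column_entities.items():
            if column in row:
                row[column] = f"<{entity}>"
        rows.append(row)
    return rows


anonymized = anonymize_rows(SAMPLE_CSV, COLUMN_ENTITIES)
for row in anonymized:
    print(row["id"], row["name"], row["email"], row["city"])
```

Columns that are absent from the mapping (here `id`, `street`, `city`, `state`, `postal_code`) pass through unchanged, which is why the mapping, rather than the replacement step, decides what counts as sensitive.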