From 854b09546e97ab829eb86943f3fb3f9657c8e966 Mon Sep 17 00:00:00 2001
From: Omri Mendels
Date: Wed, 11 Oct 2023 14:56:22 +0000
Subject: [PATCH] Revert "fork of new docs"

This reverts commit 2b7c3133172d824e90d6f6ab28e805a5c33947e1.
---
 docs/analyzer/customizing_nlp_models.md       |  43 ++-----
 docs/analyzer/developing_recognizers.md       |  38 +++---
 docs/analyzer/index.md                        |  37 +++++-
 docs/analyzer/languages-config.yml            |  22 +---
 docs/analyzer/languages.md                    |   3 +-
 docs/analyzer/nlp_engines/spacy_stanza.md     |  17 +--
 docs/analyzer/nlp_engines/transformers.md     | 116 ++----------------
 docs/anonymizer/index.md                      |  35 +++++-
 docs/api/analyzer_python.md                   |   4 +-
 docs/api/anonymizer_python.md                 |   2 +-
 docs/api/image_redactor_python.md             |  13 +-
 docs/faq.md                                   |  42 +++----
 docs/getting_started.md                       |  55 +--------
 docs/index.md                                 |   6 +-
 docs/installation.md                          |  55 +++------
 .../image_redaction_allow_list_approach.ipynb |   2 +-
 .../python/transformers_recognizer/index.md   |  22 ++--
 docs/text_anonymization.md                    |   5 +-
 docs/tutorial/04_external_services.md         |   2 +-
 docs/tutorial/05_languages.md                 |   4 +-
 docs/tutorial/index.md                        |   2 +-
 21 files changed, 192 insertions(+), 333 deletions(-)

diff --git a/docs/analyzer/customizing_nlp_models.md b/docs/analyzer/customizing_nlp_models.md
index 5969e86b19..3e67934c4e 100644
--- a/docs/analyzer/customizing_nlp_models.md
+++ b/docs/analyzer/customizing_nlp_models.md
@@ -1,11 +1,11 @@
-# Customizing the NLP engine in Presidio Analyzer
-
-Presidio uses NLP engines for two main tasks: NER based PII identification,
-and feature extraction for downstream rule based logic (such as leveraging context words for improved detection).
-While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy),
-additional NLP models and frameworks could be plugged in, either public or proprietary.
-These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models), -[Stanza](https://github.com/stanfordnlp/stanza) and +# Customizing the NLP models in Presidio Analyzer + +Presidio uses NLP engines for two main tasks: NER based PII identification, +and feature extraction for custom rule based logic (such as leveraging context words for improved detection). +While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy), +it can be customized by leveraging other NLP models, either public or proprietary. +These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models), +[Stanza](https://github.com/stanfordnlp/stanza) and [transformers](https://github.com/huggingface/transformers). In addition, other types of NLP frameworks [can be integrated into Presidio](developing_recognizers.md#machine-learning-ml-based-or-rule-based). @@ -63,30 +63,9 @@ Configuration can be done in two ways: - lang_code: es model_name: es_core_news_md - ner_model_configuration: - labels_to_ignore: - - O - model_to_presidio_entity_mapping: - PER: PERSON - LOC: LOCATION - ORG: ORGANIZATION - AGE: AGE - ID: ID - DATE: DATE_TIME - low_confidence_score_multiplier: 0.4 - low_score_entity_names: - - ID - - ORG ``` - The `ner_model_configuration` section contains the following parameters: - - - `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning. - - `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types. - - `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence. - - `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to. 
- - The [default conf file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/default.yaml) is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`: + The default conf file is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`: ```python from presidio_analyzer import AnalyzerEngine, RecognizerRegistry @@ -118,14 +97,12 @@ Configuration can be done in two ways: c. pass requests in each of these languages. !!! note "Note" - Presidio can currently use one NER model per language via the `NlpEngine`. If multiple are required, - consider wrapping NER models as additional recognizers ([see sample here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py)). + Presidio can currently use one NLP model per language. ## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection In addition to the built-in spaCy/Stanza/transformers capabitilies, it is possible to create new recognizers which serve as interfaces to other models. For more information: - - [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb). - [Flair recognizer example](../samples/python/flair_recognizer.py) diff --git a/docs/analyzer/developing_recognizers.md b/docs/analyzer/developing_recognizers.md index 3772867cec..546c0ce35b 100644 --- a/docs/analyzer/developing_recognizers.md +++ b/docs/analyzer/developing_recognizers.md @@ -7,8 +7,7 @@ Recognizers define the logic for detection, as well as the confidence a predicti ### Accuracy -Each recognizer, regardless of its complexity, could have false positives and false negatives. 
When adding new recognizers, we try to balance the effect of each recognizer on the entire system. -A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is be best to note how the recognizer's accuracy was tested, and on which datasets. +Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is be best to note how the recognizer's accuracy was tested, and on which datasets. For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research Github repository](https://github.com/microsoft/presidio-research). !!! note "Note" @@ -23,8 +22,7 @@ Make sure your recognizer doesn't take too long to process text. Anything above ### Environment -When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. -In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint. +When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. 
In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint. In addition, make sure the license on the 3rd party dependency allows you to use it for any purpose. ## Recognizer Types @@ -34,7 +32,7 @@ Generally speaking, there are three types of recognizers: A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]` to detect a "Title" entity.) -See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input. +See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input. ### Pattern Based @@ -49,26 +47,36 @@ See some examples here: ### Machine Learning (ML) Based or Rule-Based Many PII entities are undetectable using naive approaches like deny-lists or regular expressions. -In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. +In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. 
There are four options for adding ML and rule based recognizers: -#### ML: Utilize SpaCy, Stanza or Transformers +#### Utilize SpaCy or Stanza -Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) and [huggingface transformers](https://huggingface.co/docs/transformers/index) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy`, `stanza` or `transformers` over other tools if possible. +Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy` or `stanza` over other tools if possible. `spaCy` provides descent results compared to state-of-the-art NER models, but with much better computational performance. -`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned. +`spaCy` and `stanza` models could be trained from scratch, used in combination with pre-trained embeddings, or retrained to detect new entities. +When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. -In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created. -See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py). +#### Utilize Scikit-learn or Similar + +`Scikit-learn` models tend to be fast, but usually have lower accuracy than deep learning methods. However, for well defined problems with well defined features, they can provide very good results. 
+When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. #### Apply Custom Logic -In some cases, rule-based logic provides reasonable ways for detecting entities. -The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic. -When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. +In some cases, rule-based logic provides the best way of detecting entities. +The Presidio `EntityRecognizer` API allows you to use `spaCy`/`stanza` extracted features like lemmas, part of speech, dependencies and more to create your logic. When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. + +#### Deep Learning Based Methods + +Deep learning methods offer excellent detection rates for NER. +They are however more complex to train, deploy and tend to be slower than traditional approaches. +When creating a DL based method for PII detection, there are two main alternatives for integrating it with Presidio: + +1. Create an external endpoint (either local or remote) which is isolated from the `presidio-analyzer` process. On the `presidio-analyzer` side, one would extend the [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) class and implement the network interface between `presidio-analyzer` and the endpoint of the model's container. +2. 
Integrate the model as an additional [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) within the `presidio-analyzer` flow. !!! attention "Considerations for selecting one option over another" - - Accuracy. - Ease of integration. - Runtime considerations (For example if the new model requires a GPU). - 3rd party dependencies of the new model vs. the existing `presidio-analyzer` package. diff --git a/docs/analyzer/index.md b/docs/analyzer/index.md index 6412834ade..3a98f8cb20 100644 --- a/docs/analyzer/index.md +++ b/docs/analyzer/index.md @@ -14,7 +14,42 @@ Named Entity Recognition and other types of logic to detect PII in unstructured ## Installation -see [Installing Presidio](../installation.md). +=== "Using pip" + + !!! note "Note" + Consider installing the Presidio python packages on a virtual environment like venv or conda. + + To get started with Presidio-analyzer, + download the package and the `en_core_web_lg` spaCy model: + + ```sh + pip install presidio-analyzer + python -m spacy download en_core_web_lg + ``` + +=== "Using Docker" + + !!! note "Note" + This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/). + + ```sh + # Download image from Dockerhub + docker pull mcr.microsoft.com/presidio-analyzer + + # Run the container with the default port + docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest + ``` + +=== "From source" + + First, clone the Presidio repo. [See here for instructions](../installation.md#install-from-source). + + Then, build the presidio-analyzer container: + + ```sh + cd presidio-analyzer + docker build . 
-t presidio/presidio-analyzer + ``` ## Getting started diff --git a/docs/analyzer/languages-config.yml b/docs/analyzer/languages-config.yml index fbd71caaa6..16c0e383df 100644 --- a/docs/analyzer/languages-config.yml +++ b/docs/analyzer/languages-config.yml @@ -3,26 +3,6 @@ models: - lang_code: en model_name: en_core_web_lg - - - lang_code: de - model_name: de_core_news_md - lang_code: es - model_name: es_core_news_md -ner_model_configuration: - - model_to_presidio_entity_mapping: - PER: PERSON - PERSON: PERSON - LOC: LOCATION - LOCATION: LOCATION - GPE: LOCATION - ORG: ORGANIZATION - DATE: DATE_TIME - TIME: DATE_TIME - NORP: NRP - - - low_confidence_score_multiplier: 0.4 - - low_score_entity_names: - - ORGANIZATION - - ORG - - default_score: 0.85 + model_name: es_core_news_md \ No newline at end of file diff --git a/docs/analyzer/languages.md b/docs/analyzer/languages.md index 7d51dcd17c..aee03bcec1 100644 --- a/docs/analyzer/languages.md +++ b/docs/analyzer/languages.md @@ -64,7 +64,6 @@ analyzer = AnalyzerEngine( analyzer.analyze(text="My name is David", language="en") ``` -Link to LANGUAGES_CONFIG_FILE=[languages-config.yml](https://github.com/microsoft/presidio/blob/main/docs/analyzer/languages-config.yml) ### Automatically install NLP models into the Docker container @@ -74,4 +73,4 @@ update the [conf/default.yaml](https://github.com/microsoft/presidio/blob/main/p the `docker build` phase and the models defined in it are installed automatically. For `transformers` based models, the configuration [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/transformers.yaml). -A docker file supporting transformers models [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/Dockerfile.transformers). +In addition, make sure the Docker file contains the relevant packages for `transformers`, which are not loaded automatically with Presidio. 
diff --git a/docs/analyzer/nlp_engines/spacy_stanza.md b/docs/analyzer/nlp_engines/spacy_stanza.md index d0372570f4..c7e6e9fc8e 100644 --- a/docs/analyzer/nlp_engines/spacy_stanza.md +++ b/docs/analyzer/nlp_engines/spacy_stanza.md @@ -30,26 +30,11 @@ For the available models, follow these links: [spaCy](https://spacy.io/usage/mod !!! tip "Tip" For Person, Location and Organization detection, it could be useful to try out the transformers based models (e.g. `en_core_web_trf`) which uses a more modern deep-learning architecture, but is generally slower than the default `en_core_web_lg` model. + ### Configure Presidio to use the pre-trained model Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information. -## How NER results flow within Presidio -This diagram describes the flow of NER results within Presidio, and the relationship between the `SpacyNlpEngine` component and the `SpacyRecognizer` component: -```mermaid -sequenceDiagram - AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text)
to get model results - SpacyNlpEngine->>spaCy: Call spaCy pipeline - spaCy->>SpacyNlpEngine: return entities and other attributes - Note over SpacyNlpEngine: Map entity names to Presidio's,
update scores,
remove unwanted entities
based on NerModelConfiguration - SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts
(Entities, lemmas, tokens, scores etc.) - Note over AnalyzerEngine: Call all recognizers - AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts - Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts - SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult] - -``` - ## Training your own model !!! note "Note" diff --git a/docs/analyzer/nlp_engines/transformers.md b/docs/analyzer/nlp_engines/transformers.md index bee44ea89e..89a8a9f37c 100644 --- a/docs/analyzer/nlp_engines/transformers.md +++ b/docs/analyzer/nlp_engines/transformers.md @@ -4,26 +4,11 @@ Presidio's `TransformersNlpEngine` consists of a spaCy pipeline which encapsulat ![image](../../assets/spacy-transformers-ner.png) -Presidio leverages other types of information from spaCy such as tokens, lemmas and part-of-speech. +Presidio leverages other types of information from spaCy such as tokens, lemmas and part-of-speech. Therefore the pipeline returns both the NER model results as well as results from other pipeline components. -## How NER results flow within Presidio -This diagram describes the flow of NER results within Presidio, and the relationship between the `TransformersNlpEngine` component and the `TransformersRecognizer` component: -```mermaid -sequenceDiagram - AnalyzerEngine->>TransformersNlpEngine: Call engine.process_text(text)
to get model results - TransformersNlpEngine->>spaCy: Call spaCy pipeline - spaCy->>transformers: call NER model - transformers->>spaCy: get entities - spaCy->>TransformersNlpEngine: return transformers entities
+ spaCy attributes - Note over TransformersNlpEngine: Map entity names to Presidio's,
update scores,
remove unwanted entities
based on NerModelConfiguration - TransformersNlpEngine->>AnalyzerEngine: Pass NlpArtifacts
(Entities, lemmas, tokens, scores etc.) - Note over AnalyzerEngine: Call all recognizers - AnalyzerEngine->>TransformersRecognizer: Pass NlpArtifacts - Note over TransformersRecognizer: Extract PII entities out of NlpArtifacts - TransformersRecognizer->>AnalyzerEngine: Return List[RecognizerResult] - -``` +!!! warning "Warning" + spaCy and transformers use a different tokenization approach. Therefore, it could be that there is no alignment between the spans identified by a transformers model and the spans created by spaCy. In this cases, there could be cases where the output of the transformers model is different from the output of Presidio's `TransformersNlpEngine` ## Adding a new model @@ -32,7 +17,6 @@ As the underlying transformers model, you can choose from either a public pretra ### Using a public pre-trained transformers model #### Downloading a pre-trained model - To download the desired NER model from HuggingFace: ```python @@ -50,99 +34,30 @@ AutoModelForTokenClassification.from_pretrained(transformers_model) ``` Then, also download a spaCy pipeline/model: - ```sh python -m spacy download en_core_web_sm ``` #### Creating a configuration file - -Once the models are downloaded, one option to configure them is to create a YAML configuration file. -Note that the configuration needs to contain both a `spaCy` pipeline name and a transformers model name. -In addition, different configurations for parsing the results of the transformers model can be added. - -Example configuration (in YAML): +Once the models are downloaded, the easiest option would be to create a YAML configuration file. 
+Note that this file needs to contain both a `spaCy` pipeline name and a transformers model name: ```yaml nlp_engine_name: transformers models: - - - lang_code: en - model_name: - spacy: en_core_web_sm - transformers: StanfordAIMI/stanford-deidentifier-base - -ner_model_configuration: - labels_to_ignore: - - O - aggregation_strategy: simple # "simple", "first", "average", "max" - stride: 16 - alignment_mode: strict # "strict", "contract", "expand" - model_to_presidio_entity_mapping: - PER: PERSON - LOC: LOCATION - ORG: ORGANIZATION - AGE: AGE - ID: ID - EMAIL: EMAIL - PATIENT: PERSON - STAFF: PERSON - HOSP: ORGANIZATION - PATORG: ORGANIZATION - DATE: DATE_TIME - PHONE: PHONE_NUMBER - HCW: PERSON - HOSPITAL: ORGANIZATION - - low_confidence_score_multiplier: 0.4 - low_score_entity_names: - - ID +- +lang_code: en +model_name: + spacy: + transformers: ``` - + Where: +- `` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`. +- The `` is the full path for a huggingface model. Models can be found on [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2` -- `model_name.spacy` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`. -- The `model_name.transformers` is the full path for a huggingface model. Models can be found on [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2` - -The `ner_model_configuration` section contains the following parameters: - -- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning. -- `aggregation_strategy`: The strategy to use when aggregating the results of the transformers model. -- `stride`: The value is the length of the window overlap in transformer tokenizer tokens. 
-- `alignment_mode`: The strategy to use when aligning the results of the transformers model to the original text. -- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types. -- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence. -- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to. - -See more information on parameters on the [spacy-huggingface-pipelines Github repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification). - Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information. -#### Calling the new model - -Once the configuration file is created, it can be used to create a new `TransformersNlpEngine`: - -```python - from presidio_analyzer import AnalyzerEngine, RecognizerRegistry - from presidio_analyzer.nlp_engine import NlpEngineProvider - - # Create configuration containing engine name and models - conf_file = PATH_TO_CONF_FILE - - # Create NLP engine based on configuration - provider = NlpEngineProvider(conf_file=conf_file) - nlp_engine = provider.create_engine() - - # Pass the created NLP engine and supported_languages to the AnalyzerEngine - analyzer = AnalyzerEngine( - nlp_engine=nlp_engine, - supported_languages=["en"] - ) - - results_english = analyzer.analyze(text="My name is Morris", language="en") - print(results_english) -``` - ### Training your own model !!! note "Note" @@ -151,8 +66,3 @@ Once the configuration file is created, it can be used to create a new `Transfor For more information on model training and evaluation for Presidio, see the [Presidio-Research Github repository](https://github.com/microsoft/presidio-research). To train your own model, see this tutorial: [Train your own transformers model](https://huggingface.co/docs/transformers/training). 
- -### Using a transformers model as an `EntityRecognizer` - -In addition to the approach described in this document, one can decide to integrate a transformers model as a recognizer. -We allow these two options, as a user might want to have multiple NER models running in parallel. In this case, one can create multiple `EntityRecognizer` instances, each serving a different model, instead of one model used in an `NlpEngine`. [See this sample](../../samples/python/transformers_recognizer/index.md) for more info on integrating a transformers model as a Presidio recognizer and not as a Presidio `NLPEngine`. diff --git a/docs/anonymizer/index.md b/docs/anonymizer/index.md index b0c272a346..78a0145084 100644 --- a/docs/anonymizer/index.md +++ b/docs/anonymizer/index.md @@ -17,7 +17,40 @@ with some other value by applying a certain operator (e.g. replace, mask, redact ## Installation -see [Installing Presidio](../installation.md). +=== "Using pip" + + !!! note "Note" + Consider installing the Presidio python packages on a virtual environment like venv or conda. + + To install Presidio Anonymizer, run: + + ```sh + pip install presidio-anonymizer + ``` + +=== "Using Docker" + + !!! note "Note" + This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/). + + ```sh + # Download image from Dockerhub + docker pull mcr.microsoft.com/presidio-anonymizer + + # Run the container with the default port + docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest + ``` + +=== "From source" + + First, clone the Presidio repo. [See here for instructions](../installation.md#install-from-source). + + Then, build the presidio-anonymizer container: + + ```sh + cd presidio-anonymizer + docker build . 
-t presidio/presidio-anonymizer + ``` ## Getting started diff --git a/docs/api/analyzer_python.md b/docs/api/analyzer_python.md index 9e0665a220..4267add111 100644 --- a/docs/api/analyzer_python.md +++ b/docs/api/analyzer_python.md @@ -2,5 +2,5 @@ ::: presidio_analyzer handler: python - options: - docstring_style: sphinx + selection: + docstring_style: sphinx \ No newline at end of file diff --git a/docs/api/anonymizer_python.md b/docs/api/anonymizer_python.md index f59ee12554..bf0b428321 100644 --- a/docs/api/anonymizer_python.md +++ b/docs/api/anonymizer_python.md @@ -2,5 +2,5 @@ ::: presidio_anonymizer handler: python - options: + selection: docstring_style: sphinx diff --git a/docs/api/image_redactor_python.md b/docs/api/image_redactor_python.md index 33aa583ada..2eb5290b61 100644 --- a/docs/api/image_redactor_python.md +++ b/docs/api/image_redactor_python.md @@ -1,6 +1,15 @@ # Presidio Image Redactor API Reference -::: presidio_image_redactor +## ImageRedactorEngine class + +::: presidio_image_redactor.ImageRedactorEngine + handler: python + selection: + docstring_style: sphinx + +## ImageAnalyzerEngine class + +::: presidio_image_redactor.ImageAnalyzerEngine handler: python - options: + selection: docstring_style: sphinx diff --git a/docs/faq.md b/docs/faq.md index 113230ad4c..37e3afafe0 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -1,27 +1,26 @@ # Frequently Asked Questions (FAQ) - [General](#general) - - [What is Presidio?](#what-is-presidio) - - [Why did Microsoft create Presidio?](#why-did-microsoft-create-presidio) - - [Is Microsoft Presidio an official Microsoft product?](#is-microsoft-presidio-an-official-microsoft-product) - - [What is the difference between Presidio and different PII detection services like Azure Text Analytics and Amazon Comprehend?](#what-is-the-difference-between-presidio-and-different-pii-detection-services-like-azure-text-analytics-and-amazon-comprehend) + - [What is Presidio?](#what-is-presidio) + - [Why did Microsoft create 
Presidio?](#why-did-microsoft-create-presidio) + - [Is Microsoft Presidio an official Microsoft product?](#is-microsoft-presidio-an-official-microsoft-product) + - [What is the difference between Presidio and different PII detection services like Azure Text Analytics and Amazon Comprehend?](#what-is-the-difference-between-presidio-and-different-pii-detection-services-like-azure-text-analytics-and-amazon-comprehend) - [Using Presidio](#using-presidio) - - [How can I start using Presidio?](#how-can-i-start-using-presidio) - - [What are the main building blocks in Presidio?](#what-are-the-main-building-blocks-in-presidio) + - [How can I start using Presidio?](#how-can-i-start-using-presidio) + - [What are the main building blocks in Presidio?](#what-are-the-main-building-blocks-in-presidio) - [Customizing Presidio](#customizing-presidio) - - [How can Presidio be customized to my needs?](#how-can-presidio-be-customized-to-my-needs) - - [What NLP frameworks does Presidio support?](#what-nlp-frameworks-does-presidio-support) - - [Can Presidio be used for Pseudonymization?](#can-presidio-be-used-for-pseudonymization) - - [Does Presidio work on structured/tabular data?](#does-presidio-work-on-structuredtabular-data) + - [How can Presidio be customized to my needs?](#how-can-presidio-be-customized-to-my-needs) + - [What NLP frameworks does Presidio support?](#what-nlp-frameworks-does-presidio-support) + - [Can Presidio be used for Pseudonymization?](#can-presidio-be-used-for-pseudonymization) + - [Does Presidio work on structured/tabular data?](#does-presidio-work-on-structuredtabular-data) - [Improving detection accuracy](#improving-detection-accuracy) - - [What can I do if Presidio does not detect some of the PII entities in my data (False Negatives)?](#what-can-i-do-if-presidio-does-not-detect-some-of-the-pii-entities-in-my-data-false-negatives) - - [What can I do if Presidio falsely detects text as PII entities (False 
Positives)?](#what-can-i-do-if-presidio-falsely-detects-text-as-pii-entities-false-positives) - - [How can I evaluate the performance of my Presidio instance?](#how-can-i-evaluate-the-performance-of-my-presidio-instance) + - [What can I do if Presidio does not detect some of the PII entities in my data (False Negatives)?](#what-can-i-do-if-presidio-does-not-detect-some-of-the-pii-entities-in-my-data-false-negatives) + - [What can I do if Presidio falsely detects text as PII entities (False Positives)?](#what-can-i-do-if-presidio-falsely-detects-text-as-pii-entities-false-positives) + - [How can I evaluate the performance of my Presidio instance?](#how-can-i-evaluate-the-performance-of-my-presidio-instance) - [Deployment](#deployment) - - [How can I deploy Presidio into my environment?](#how-can-i-deploy-presidio-into-my-environment) + - [How can I deploy Presidio into my environment?](#how-can-i-deploy-presidio-into-my-environment) - [Contributing](#contributing) - - [How can I contribute to Presidio?](#how-can-i-contribute-to-presidio) - - [How can I report security vulnerabilities?](#how-can-i-report-security-vulnerabilities) + - [How can I contribute to Presidio?](#how-can-i-contribute-to-presidio) ## General @@ -45,7 +44,7 @@ By developing Presidio, our goals are: ### Is Microsoft Presidio an official Microsoft product? -The authors and maintainers of Presidio come from the [Industry Solutions Engineering](https://microsoft.github.io/code-with-engineering-playbook) team. We work with customers on various engineering problems, and have found the proper handling of private and sensitive data a recurring challenge across many customers and industries. +The authors and maintainers of Presidio come from the [Commercial Software Engineering]([https://microsoft/github.io/code-with-engineering-playbook/cse](https://microsoft.github.io/code-with-engineering-playbook/CSE/)) team. 
We work with customers on various engineering problems, and have found the proper handling of private and sensitive data a recurring challenge across many customers and industries. !!! note "Note" Microsoft Presidio is not an official Microsoft product. Usage terms are defined in the [repository's license](https://github.com/microsoft/presidio/blob/main/LICENSE). @@ -95,11 +94,11 @@ For more information, see the [docs](https://microsoft.github.io/presidio/analyz ### Can Presidio be used for Pseudonymization? -Pseudonymization is a de-identification technique in which the real data is replaced with fake data in a reversible way. Since there are various ways and approaches for this, we provide a simple [sample](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py) which can be extended for more sophisticated usage. If you have a question or a request on this topic, please open an issue on the repo. +Pseudonymization is a de-identification technique in which the real data is replaced with fake data. Since there are various ways and approaches for this, we provide a simple [sample](https://microsoft.github.io/presidio/samples/python/example_custom_lambda_anonymizer/) which can be extended for more sophisticated usage. If you have a question or a request on this topic, please open an issue on the repo. ### Does Presidio work on structured/tabular data? -This is an area we are actively looking into. We have an [example implementation](https://microsoft.github.io/presidio/samples/python/batch_processing/) of using Presidio on structured/semi-structured data. Also see the different discussions on this topic on the [Discussions](https://github.com/microsoft/presidio/discussions) section. If you have a question, suggestion, or a contribution in this area, please reach out by opening an issue, starting a discussion or reaching us directly at +This is an area we are actively looking into. 
We have an [example implementation](https://microsoft.github.io/presidio/samples/python/batch_processing/) of using Presidio on structured/semi-structured data. Also see the different discussions on this topic on the [Discussions](https://github.com/microsoft/presidio/discussions) section. If you have a question, suggestion, or a contribution in this area, please reach out by opening an issue, starting a discussion or reaching us directly at presidio@microsoft.com ## Improving detection accuracy @@ -134,8 +133,7 @@ The main Presidio modules (analyzer, anonymizer, image-redactor) can be used bot ### How can I contribute to Presidio? -First, review the [contribution guidelines](https://github.com/microsoft/presidio/blob/main/CONTRIBUTING.md), and feel free to reach out by opening an issue, posting a discussion or emailing us at - -### How can I report security vulnerabilities? +First, review the [contribution guidelines](https://github.com/microsoft/presidio/blob/main/CONTRIBUTING.md), and feel free to reach out by opening an issue, posting a discussion or emailing us at presidio@microsoft.com +### How can I report security vulnerabilities? Please see the [security information](https://github.com/microsoft/presidio/blob/main/SECURITY.md). diff --git a/docs/getting_started.md b/docs/getting_started.md index 2339bc79cc..49def7a26c 100644 --- a/docs/getting_started.md +++ b/docs/getting_started.md @@ -2,10 +2,9 @@ ## Simple flow -Using Presidio's modules as Python packages to get started: - -===+ "Anonymize PII in text (Default spaCy model)" +Using Presidio's modules as Python packages to get started +=== "Anonymize PII in text" 1. Install Presidio @@ -42,56 +41,6 @@ Using Presidio's modules as Python packages to get started: print(anonymized_text) ``` -=== "Anonymize PII in text (transformers)" - - 1. Install Presidio - - ```sh - pip install "presidio-analyzer[transformers]" - pip install presidio-anonymizer - python -m spacy download en_core_web_sm - ``` - - 2. 
Analyze + Anonymize - - ```py - from presidio_analyzer import AnalyzerEngine - from presidio_analyzer.nlp_engine import TransformersNlpEngine - from presidio_anonymizer import AnonymizerEngine - - text = "My name is Don and my phone number is 212-555-5555" - - # Define which transformers model to use - model_config = [{"lang_code": "en", "model_name": { - "spacy": "en_core_web_sm", # use a small spaCy model for lemmas, tokens etc. - "transformers": "dslim/bert-base-NER" - } - }] - - nlp_engine = TransformersNlpEngine(models=model_config) - - # Set up the engine, loads the NLP module (spaCy model by default) - # and other PII recognizers - analyzer = AnalyzerEngine(nlp_engine=nlp_engine) - - # Call analyzer to get results - results = analyzer.analyze(text=text, language='en') - print(results) - - # Analyzer results are passed to the AnonymizerEngine for anonymization - - anonymizer = AnonymizerEngine() - - anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results) - - print(anonymized_text) - - ``` - !!! tip "Tip: Downloading models" - If not available, the transformers model and the spacy model would be downloaded on the first call to the `AnalyzerEngine`. To pre-download, see [this doc](./analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model). - -## Simple flow: Images - === "Anonymize PII in images" 1. Install presidio-image-redactor diff --git a/docs/index.md b/docs/index.md index 3c7c1ae1c4..50a098a4ae 100644 --- a/docs/index.md +++ b/docs/index.md @@ -46,12 +46,12 @@ bitcoin wallets, US phone numbers, financial data and more. ## Running Presidio -1. [Samples for running Presidio via code](samples/index.md) +1. [Running Presidio via code](samples/python/index.md) 2. [Running Presidio as an HTTP service](samples/docker/index.md) 3. [Setting up a development environment](development.md) 4. [Perform PII identification using presidio-analyzer](analyzer/index.md) -5. 
[Perform PII de-identification using presidio-anonymizer](anonymizer/index.md) -6. [Perform PII identification and redaction in images using presidio-image-redactor](image-redactor/index.md) +5. [Perform PII anonymization using presidio-anonymizer](anonymizer/index.md) +6. [Perform PII identification and anonymization in images using presidio-image-redactor](image-redactor/index.md) 7. [Example deployments](samples/deployments/index.md) --- diff --git a/docs/installation.md b/docs/installation.md index 8a0b6fb014..dcaf66b83a 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -2,16 +2,17 @@ ## Description -This document describes the installation of the entire +This document describes how to download and install the Presidio services locally. +As Presidio is comprised of several packages/services, +this document describes the installation of the entire Presidio suite using `pip` (as Python packages) or using `Docker` (As containerized services). ## Using pip !!! note "Note" - - Consider installing the Presidio python packages - in a virtual environment like [venv](https://docs.python.org/3/tutorial/venv.html) - or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html). + Consider installing the Presidio python packages + on a virtual environment like [venv](https://docs.python.org/3/tutorial/venv.html) + or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html). 
### Supported Python Versions @@ -25,42 +26,20 @@ Presidio is supported for the following python versions: ### PII anonymization on text -For PII anonymization on text, install the `presidio-analyzer` and `presidio-anonymizer` packages -with at least one NLP engine (`spaCy`, `transformers` or `stanza`): - -===+ "spaCy (default)" - - ``` - pip install presidio_analyzer - pip install presidio_anonymizer - python -m spacy download en_core_web_lg - ``` - -=== "Transformers" - - ``` - pip install "presidio_analyzer[transformers]" - pip install presidio_anonymizer - python -m spacy download en_core_web_sm - ``` +For PII anonymization on text, install the `presidio-analyzer` and `presidio-anonymizer` packages: - !!! note "Note" - - When using a transformers NLP engine, Presidio would still use spaCy for other capabilities, - therefore a small spaCy model (such as en_core_web_sm) is required. - Transformers models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model) - -=== "Stanza" +```sh +pip install presidio_analyzer +pip install presidio_anonymizer - ``` - pip install "presidio_analyzer[stanza]" - pip install presidio_anonymizer - ``` +# Presidio analyzer requires a spaCy language model. +python -m spacy download en_core_web_lg +``` +For a more detailed installation of each package, refer to the specific documentation: - !!! note "Note" - - Stanza models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/spacy_stanza.md#download-the-pre-trained-model). +* [presidio-analyzer](analyzer/index.md). +* [presidio-anonymizer](anonymizer/index.md). ### PII redaction in images @@ -74,6 +53,8 @@ pip install presidio_image_redactor python -m spacy download en_core_web_lg ``` +[Click here](image-redactor/index.md) for more information on the presidio-image-redactor package. 
+ ## Using Docker Presidio can expose REST endpoints for each service using Flask and Docker. diff --git a/docs/samples/python/image_redaction_allow_list_approach.ipynb b/docs/samples/python/image_redaction_allow_list_approach.ipynb index 91ed0f2f6e..fc7b381664 100644 --- a/docs/samples/python/image_redaction_allow_list_approach.ipynb +++ b/docs/samples/python/image_redaction_allow_list_approach.ipynb @@ -146,7 +146,7 @@ "metadata": {}, "source": [ "### 1.2 DICOM medical image\n", - "For more information on DICOM image redaction, please see [example_dicom_image_redactor.ipynb](./example_dicom_image_redactor.ipynb) and the [Image redactor module documentation](../../image-redactor/index.md)." + "For more information on DICOM image redaction, please see [example_dicom_image_redactor.ipynb](./example_dicom_image_redactor.ipynb) and the [Image redactor module documentation](../../../image-redactor/index.md)." ] }, { diff --git a/docs/samples/python/transformers_recognizer/index.md b/docs/samples/python/transformers_recognizer/index.md index 7c31b446d7..bd9e466796 100644 --- a/docs/samples/python/transformers_recognizer/index.md +++ b/docs/samples/python/transformers_recognizer/index.md @@ -1,30 +1,24 @@ -# Add a Transformers model based EntityRecognizer - -!!! note "Note" - - This example demonstrates how to create a **Presidio Recognizer**. - To integrate a transformers model as a **Presidio NLP Engine**, see [this documentation](../../../analyzer/nlp_engines/transformers.md). - - We allow these two options, as a user might want to have multiple NER models running in parallel. In this case, one can create multiple `EntityRecognizer` instances, each serving a different model. 
If you only plan to use one NER model, consider creating a [`TransformersNlpEngine`](../../../analyzer/nlp_engines/transformers.md) instead of the [`TransformersRecognizer`](https://github.com/microsoft/presidio/blob/main/docs/samples/python/transformers_recognizer/transformer_recognizer.py) described in this document. +# Run Presidio With Transformers Models +This example demonstrates how to extract PII entities using transformers models. When initializing the `TransformersRecognizer`, choose from the following options: - -1. A string referencing an uploaded model to HuggingFace. See the different available options for models [here](https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads>). +1. A string referencing an uploaded model to HuggingFace. Use this url to access all TokenClassification models - https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads 2. Initialize your own `TokenClassificationPipeline` instance using your custom transformers model and use it for inference. 3. Provide the path to your own local custom trained model. !!! note "Note" - For each combination of model & dataset, it is recommended to create a configuration object which includes setting necessary parameters for getting the correct results. Please reference this [configuraion.py](https://github.cim/microsoft/presidio/blob/miN/configuration.py) file for examples. +For each combination of model & dataset, it is recommended to create a configuration object which includes setting necessary parameters for getting the correct results. Please reference this [configuraion.py](configuration.py) file for examples. -## Example Code + + + +### Example Code This example code uses a `TransformersRecognizer` for NER, and removes the default `SpacyRecognizer`. In order to be able to use spaCy features such as lemmas, we introduce the small (and faster) `en_core_web_sm` model. 
-[link to full TransformersRecognizer code](https://github.com/microsoft/presidio/blob/main/docs/samples/python/transformers_recognizer/transformer_recognizer.py) - ```python from presidio_analyzer import AnalyzerEngine, RecognizerRegistry from presidio_analyzer.nlp_engine import NlpEngineProvider diff --git a/docs/text_anonymization.md b/docs/text_anonymization.md index 1f989b2b0d..a73a48cf30 100644 --- a/docs/text_anonymization.md +++ b/docs/text_anonymization.md @@ -2,8 +2,8 @@ Presidio's features two main modules for anonymization PII in text: -- [Presidio analyzer](analyzer/index.md): Identification of PII in text -- [Presidio anonymizer](anonymizer/index.md): De-identify detected PII entities using different operators +- [Presidio analyzer](analyzer/index.md): Identification PII in text +- [Presidio anonymizer](anonymizer/index.md): Anonymize detected PII entities using different operators In most cases, we would run the Presidio analyzer to detect where PII entities exist, and then the Presidio anonymizer to remove those using specific operators (such as redact, replace, hash or encrypt) @@ -14,3 +14,4 @@ This figure presents the overall flow in high level: - The [Presidio Analyzer](analyzer/index.md) holds multiple recognizers, each one capable of detecting specific PII entities. These recognizers leverage regular expressions, deny lists, checksum, rule based logic, Named Entity Recognition ML models and context from surrounding words. - The [Presidio Anonymizer](anonymizer/index.md) holds multiple operators, each one can be used to anonymize the PII entity in a different way. 
Additionally, it can be used to de-anonymize an already anonymized entity (For example, decrypt an encrypted entity) + diff --git a/docs/tutorial/04_external_services.md b/docs/tutorial/04_external_services.md index 47d231c08f..92990e0c4b 100644 --- a/docs/tutorial/04_external_services.md +++ b/docs/tutorial/04_external_services.md @@ -13,5 +13,5 @@ In a similar way to example 3, we can write logic to call external services for ## Calling a model in a different framework -- [This example](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py) shows a Presidio wrapper for a Flair model. +- [This example](../samples/python/flair_recognizer.py) shows a Presidio wrapper for a Flair model. - Using a similar approach, we could create wrappers for HuggingFace models, Conditional Random Fields or any other framework. diff --git a/docs/tutorial/05_languages.md b/docs/tutorial/05_languages.md index 4cc2fd06af..453a4b2cb0 100644 --- a/docs/tutorial/05_languages.md +++ b/docs/tutorial/05_languages.md @@ -47,10 +47,10 @@ print("Results from English request:") print(results_english) ``` -[See this documentation](https://microsoft.github.io/presidio/analyzer/languages/) for more details on setting up additional NLP models and languages. +[See this documentation](https://microsoft.github.io/presidio/analyzer/languages/) for more details on how to configure Presidio support additional NLP models and languages. ## Using external models/frameworks -Some languages are not supported by spaCy/Stanza/huggingface, or have very limited support in those. In this case, other frameworks could be leveraged. (see [example 4](04_external_services.md) for more information). +Some languages are not supported by spaCy/Stanza, or have very limited support in those. In this case, other frameworks could be leveraged. (see [example 4](04_external_services.md) for more information). 
Since Presidio requires a spaCy model to be passed, we propose to use a simple spaCy pipeline such as `en_core_web_sm` as the NLP engine's model, and a recognizer calling an external framework/service as the Named Entity Recognition (NER) model. diff --git a/docs/tutorial/index.md b/docs/tutorial/index.md index 63af3e4d03..6d2ea90454 100644 --- a/docs/tutorial/index.md +++ b/docs/tutorial/index.md @@ -16,7 +16,7 @@ This tutorials covers different customization use cases to: - [Supporting new models and languages](05_languages.md) - [Calling an external service for PII detection](04_external_services.md) - [Using context words](06_context.md) -- [Tracing the decision process](07_decision_process.md) +- [Tracing the decision process](07_decision_process) - [Loading recognizers from file](08_no_code.md) - [Ad-Hoc recognizers](09_ad_hoc.md) - [Simple anonymization](10_simple_anonymization.md)