Revert "fork of new docs"

This reverts commit 2b7c313.
microsoft · Oct 11, 2023 · 854b095
1 parent 7879432
commit 854b095

Showing 21 changed files with 192 additions and 333 deletions.
diff --git a/docs/analyzer/customizing_nlp_models.md b/docs/analyzer/customizing_nlp_models.md
@@ -1,11 +1,11 @@
-# Customizing the NLP engine in Presidio Analyzer
-
-Presidio uses NLP engines for two main tasks: NER based PII identification,
-and feature extraction for downstream rule based logic (such as leveraging context words for improved detection).
-While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy),
-additional NLP models and frameworks could be plugged in, either public or proprietary.
-These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models),
-[Stanza](https://github.com/stanfordnlp/stanza) and
+# Customizing the NLP models in Presidio Analyzer
+
+Presidio uses NLP engines for two main tasks: NER based PII identification, 
+and feature extraction for custom rule based logic (such as leveraging context words for improved detection).
+While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy), 
+it can be customized by leveraging other NLP models, either public or proprietary.
+These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models), 
+[Stanza](https://github.com/stanfordnlp/stanza) and 
 [transformers](https://github.com/huggingface/transformers).
 
 In addition, other types of NLP frameworks [can be integrated into Presidio](developing_recognizers.md#machine-learning-ml-based-or-rule-based).
@@ -63,30 +63,9 @@ Configuration can be done in two ways:
         -
         lang_code: es
         model_name: es_core_news_md 
-    ner_model_configuration:
-    labels_to_ignore:
-    - O
-    model_to_presidio_entity_mapping:
-        PER: PERSON
-        LOC: LOCATION
-        ORG: ORGANIZATION
-        AGE: AGE
-        ID: ID
-        DATE: DATE_TIME
-    low_confidence_score_multiplier: 0.4
-    low_score_entity_names:
-    - ID
-    - ORG
     ```
 
-    The `ner_model_configuration` section contains the following parameters:
-
-  - `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning.
-  - `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types.
-  - `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
-  - `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.
-
-    The [default conf file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/default.yaml) is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:
+    The default conf file is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:
 
     ```python
     from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
@@ -118,14 +97,12 @@ Configuration can be done in two ways:
         c. pass requests in each of these languages.
 
     !!! note "Note"
-        Presidio can currently use one NER model per language via the `NlpEngine`. If multiple are required,
-        consider wrapping NER models as additional recognizers ([see sample here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py)).
+        Presidio can currently use one NLP model per language.
 
 ## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection
 
 In addition to the built-in spaCy/Stanza/transformers capabilities, it is possible to create new recognizers which serve as interfaces to other models.
 For more information:
-
 - [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb).
 - [Flair recognizer example](../samples/python/flair_recognizer.py)
 

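The `ner_model_configuration` parameters described in the removed lines of the hunk above (`labels_to_ignore`, `model_to_presidio_entity_mapping`, `low_confidence_score_multiplier`, `low_score_entity_names`) can be read as a small post-processing step over raw NER labels. The following is only an illustrative sketch of that reading, not Presidio's implementation: the `NerConfig` class and the `postprocess` function are hypothetical names, and it assumes that `low_score_entity_names` may refer either to the model label or to the mapped entity name.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class NerConfig:
    """Hypothetical container mirroring the YAML keys shown in the diff."""

    labels_to_ignore: Set[str]
    model_to_presidio_entity_mapping: Dict[str, str]
    low_confidence_score_multiplier: float
    low_score_entity_names: Set[str]


# Values copied from the YAML example in the removed lines.
EXAMPLE_CONFIG = NerConfig(
    labels_to_ignore={"O"},
    model_to_presidio_entity_mapping={
        "PER": "PERSON",
        "LOC": "LOCATION",
        "ORG": "ORGANIZATION",
        "AGE": "AGE",
        "ID": "ID",
        "DATE": "DATE_TIME",
    },
    low_confidence_score_multiplier=0.4,
    low_score_entity_names={"ID", "ORG"},
)


def postprocess(
    label: str, score: float, config: NerConfig
) -> Optional[Tuple[str, float]]:
    """Drop ignored labels, map the label, and down-weight low-score entities."""
    if label in config.labels_to_ignore:
        return None
    # Labels without a mapping are passed through unchanged (an assumption).
    entity = config.model_to_presidio_entity_mapping.get(label, label)
    # Assumption: the low-score list may name the model label or the mapped entity.
    if label in config.low_score_entity_names or entity in config.low_score_entity_names:
        score *= config.low_confidence_score_multiplier
    return entity, score
```

Under these assumptions, an `ORG` prediction would be renamed to `ORGANIZATION` with its score scaled by 0.4, while an `O` label would be discarded.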
diff --git a/docs/analyzer/developing_recognizers.md b/docs/analyzer/developing_recognizers.md
@@ -7,8 +7,7 @@ Recognizers define the logic for detection, as well as the confidence a predicti
 
 ### Accuracy
 
-Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. 
-A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
+Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
 For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research Github repository](https://github.com/microsoft/presidio-research).
 
 !!! note "Note"
@@ -23,8 +22,7 @@ Make sure your recognizer doesn't take too long to process text. Anything above
 
 ### Environment
 
-When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. 
-In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint.
+When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint. In addition, make sure the license on the 3rd party dependency allows you to use it for any purpose.
 
 ## Recognizer Types
 
@@ -34,7 +32,7 @@ Generally speaking, there are three types of recognizers:
 
 A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]`) to detect a "Title" entity.
 
-See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.
+See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.
 
 ### Pattern Based
 
@@ -49,26 +47,36 @@ See some examples here:
 ### Machine Learning (ML) Based or Rule-Based
 
 Many PII entities are undetectable using naive approaches like deny-lists or regular expressions.
-In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer.
+In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. There are four options for adding ML and rule based recognizers:
 
-#### ML: Utilize SpaCy, Stanza or Transformers
+#### Utilize SpaCy or Stanza
 
-Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) and [huggingface transformers](https://huggingface.co/docs/transformers/index) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy`, `stanza` or `transformers` over other tools if possible.
+Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy` or `stanza` over other tools if possible.
 `spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance.
-`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned.
+`spaCy` and `stanza` models could be trained from scratch, used in combination with pre-trained embeddings, or retrained to detect new entities.
+When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
 
-In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created. 
-See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py).
+#### Utilize Scikit-learn or Similar
+
+`Scikit-learn` models tend to be fast, but usually have lower accuracy than deep learning methods. However, for well defined problems with well defined features, they can provide very good results.
+When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
 
 #### Apply Custom Logic
 
-In some cases, rule-based logic provides reasonable ways for detecting entities.
-The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic. 
-When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
+In some cases, rule-based logic provides the best way of detecting entities.
+The Presidio `EntityRecognizer` API allows you to use `spaCy`/`stanza` extracted features like lemmas, part of speech, dependencies and more to create your logic. When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
+
+#### Deep Learning Based Methods
+
+Deep learning methods offer excellent detection rates for NER.
+They are, however, more complex to train and deploy, and tend to be slower than traditional approaches.
+When creating a DL based method for PII detection, there are two main alternatives for integrating it with Presidio:
+
+1. Create an external endpoint (either local or remote) which is isolated from the `presidio-analyzer` process. On the `presidio-analyzer` side, one would extend the [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) class and implement the network interface between `presidio-analyzer` and the endpoint of the model's container.
+2. Integrate the model as an additional [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) within the `presidio-analyzer` flow.
 
 !!! attention "Considerations for selecting one option over another"
 
-    - Accuracy.
     - Ease of integration.
     - Runtime considerations (For example if the new model requires a GPU).
     - 3rd party dependencies of the new model vs. the existing `presidio-analyzer` package.
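To make the deny-list recognizer type from this file's diff more concrete, here is a minimal standard-library sketch of the idea: turn a list of terms into one regular expression and report the matching spans. It is not the `PatternRecognizer` implementation; `build_deny_list_pattern` and `find_deny_list_matches` are hypothetical helper names, and the word-boundary handling is an assumption chosen for this example.

```python
import re
from typing import List, Tuple


def build_deny_list_pattern(deny_list: List[str]) -> "re.Pattern[str]":
    """Compile a single regex that matches any term in the deny list."""
    # Escape each term so characters such as "." are matched literally,
    # and try longer terms first so they win over shorter alternatives.
    escaped = sorted((re.escape(term) for term in deny_list), key=len, reverse=True)
    # The lookarounds reject matches that are glued to a neighbouring word character.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")


def find_deny_list_matches(text: str, deny_list: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every deny-list hit in the text."""
    pattern = build_deny_list_pattern(deny_list)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


# The title list used as the example in the documentation text.
TITLES = ["Mr.", "Mrs.", "Ms.", "Dr."]
```

Calling `find_deny_list_matches("Dr. Smith met Mrs. Jones", TITLES)` would yield the spans for `Dr.` and `Mrs.`; in Presidio itself, the same list would instead be passed to the `PatternRecognizer` class as its deny-list input.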
diff --git a/docs/analyzer/index.md b/docs/analyzer/index.md
@@ -14,7 +14,42 @@ Named Entity Recognition and other types of logic to detect PII in unstructured
 
 ## Installation
 
-see [Installing Presidio](../installation.md).
+=== "Using pip"
+
+    !!! note "Note"
+        Consider installing the Presidio python packages on a virtual environment like venv or conda.
+
+    To get started with Presidio-analyzer,
+    download the package and the `en_core_web_lg` spaCy model:
+
+    ```sh
+    pip install presidio-analyzer
+    python -m spacy download en_core_web_lg
+    ```
+
+=== "Using Docker"
+
+    !!! note "Note"
+        This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).
+
+    ```sh
+    # Download image from Dockerhub
+    docker pull mcr.microsoft.com/presidio-analyzer
+
+    # Run the container with the default port
+    docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
+    ```
+
+=== "From source"
+
+    First, clone the Presidio repo. [See here for instructions](../installation.md#install-from-source).
+
+    Then, build the presidio-analyzer container:
+
+    ```sh
+    cd presidio-analyzer
+    docker build . -t presidio/presidio-analyzer
+    ```
 
 ## Getting started
 

diff --git a/docs/analyzer/languages-config.yml b/docs/analyzer/languages-config.yml
@@ -3,26 +3,6 @@ models:
   -
     lang_code: en
     model_name: en_core_web_lg
-  -
-    lang_code: de
-    model_name: de_core_news_md
   -
     lang_code: es
-    model_name: es_core_news_md
-ner_model_configuration:
-  - model_to_presidio_entity_mapping:
-    PER: PERSON
-    PERSON: PERSON
-    LOC: LOCATION
-    LOCATION: LOCATION
-    GPE: LOCATION
-    ORG: ORGANIZATION
-    DATE: DATE_TIME
-    TIME: DATE_TIME
-    NORP: NRP
-
-  - low_confidence_score_multiplier: 0.4
-  - low_score_entity_names:
-    - ORGANIZATION
-    - ORG
-  - default_score: 0.85
+    model_name: es_core_news_md
diff --git a/docs/analyzer/languages.md b/docs/analyzer/languages.md
@@ -64,7 +64,6 @@ analyzer = AnalyzerEngine(
 
 analyzer.analyze(text="My name is David", language="en")
 ```
-Link to LANGUAGES_CONFIG_FILE=[languages-config.yml](https://github.com/microsoft/presidio/blob/main/docs/analyzer/languages-config.yml)
 
 ### Automatically install NLP models into the Docker container
 
@@ -74,4 +73,4 @@ update the [conf/default.yaml](https://github.com/microsoft/presidio/blob/main/p
 the `docker build` phase and the models defined in it are installed automatically.
 
 For `transformers` based models, the configuration [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/transformers.yaml). 
-A docker file supporting transformers models [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/Dockerfile.transformers).
+In addition, make sure the Docker file contains the relevant packages for `transformers`, which are not loaded automatically with Presidio.
diff --git a/docs/analyzer/nlp_engines/spacy_stanza.md b/docs/analyzer/nlp_engines/spacy_stanza.md
@@ -30,26 +30,11 @@ For the available models, follow these links: [spaCy](https://spacy.io/usage/mod
 !!! tip "Tip"
     For Person, Location and Organization detection, it could be useful to try out the transformers based models (e.g. `en_core_web_trf`), which use a more modern deep-learning architecture but are generally slower than the default `en_core_web_lg` model.
 
+
 ### Configure Presidio to use the pre-trained model
 
 Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information.
 
-## How NER results flow within Presidio
-This diagram describes the flow of NER results within Presidio, and the relationship between the `SpacyNlpEngine` component and the `SpacyRecognizer` component:
-```mermaid
-sequenceDiagram
-    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
-    SpacyNlpEngine->>spaCy: Call spaCy pipeline
-    spaCy->>SpacyNlpEngine: return entities and other attributes
-    Note over SpacyNlpEngine: Map entity names to Presidio's, <BR>update scores, <BR>remove unwanted entities <BR> based on NerModelConfiguration
-    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens, scores etc.)
-    Note over AnalyzerEngine: Call all recognizers
-    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
-    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
-    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]
-
-```
-
 ## Training your own model
 
 !!! note "Note"