Revert "fork of new docs"
This reverts commit 2b7c313.
omri374 committed Oct 11, 2023
1 parent 7879432 commit 854b095
Showing 21 changed files with 192 additions and 333 deletions.
43 changes: 10 additions & 33 deletions docs/analyzer/customizing_nlp_models.md
@@ -1,11 +1,11 @@
# Customizing the NLP engine in Presidio Analyzer

Presidio uses NLP engines for two main tasks: NER based PII identification,
and feature extraction for downstream rule based logic (such as leveraging context words for improved detection).
While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy),
additional NLP models and frameworks could be plugged in, either public or proprietary.
These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models),
[Stanza](https://github.com/stanfordnlp/stanza) and
# Customizing the NLP models in Presidio Analyzer

Presidio uses NLP engines for two main tasks: NER based PII identification,
and feature extraction for custom rule based logic (such as leveraging context words for improved detection).
While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy),
it can be customized by leveraging other NLP models, either public or proprietary.
These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models),
[Stanza](https://github.com/stanfordnlp/stanza) and
[transformers](https://github.com/huggingface/transformers).

In addition, other types of NLP frameworks [can be integrated into Presidio](developing_recognizers.md#machine-learning-ml-based-or-rule-based).
@@ -63,30 +63,9 @@ Configuration can be done in two ways:
-
lang_code: es
model_name: es_core_news_md
ner_model_configuration:
labels_to_ignore:
- O
model_to_presidio_entity_mapping:
PER: PERSON
LOC: LOCATION
ORG: ORGANIZATION
AGE: AGE
ID: ID
DATE: DATE_TIME
low_confidence_score_multiplier: 0.4
low_score_entity_names:
- ID
- ORG
```

The `ner_model_configuration` section contains the following parameters:

- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning.
- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types.
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.
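
To make the interplay of these four parameters concrete, here is a minimal pure-Python sketch that uses the values from the YAML example above. It is an illustration only, not Presidio's implementation: the `Prediction` container and the `postprocess` function are hypothetical names, and the sketch assumes that `low_score_entity_names` may refer to either the model label or the mapped Presidio entity name.

```python
from dataclasses import dataclass

# Values copied from the YAML example above.
LABELS_TO_IGNORE = {"O"}
MODEL_TO_PRESIDIO_ENTITY_MAPPING = {
    "PER": "PERSON",
    "LOC": "LOCATION",
    "ORG": "ORGANIZATION",
    "AGE": "AGE",
    "ID": "ID",
    "DATE": "DATE_TIME",
}
LOW_CONFIDENCE_SCORE_MULTIPLIER = 0.4
LOW_SCORE_ENTITY_NAMES = {"ID", "ORG"}


@dataclass
class Prediction:
    """Hypothetical container for a single NER model prediction."""

    label: str
    score: float


def postprocess(predictions):
    """Drop ignored labels, scale low-confidence entities, map labels to Presidio names."""
    results = []
    for pred in predictions:
        # labels_to_ignore: skip labels that should never be returned.
        if pred.label in LABELS_TO_IGNORE:
            continue
        # model_to_presidio_entity_mapping: unknown labels are passed through unchanged.
        entity = MODEL_TO_PRESIDIO_ENTITY_MAPPING.get(pred.label, pred.label)
        score = pred.score
        # low_score_entity_names: assumption, match on either the model label
        # or the mapped Presidio entity name.
        if pred.label in LOW_SCORE_ENTITY_NAMES or entity in LOW_SCORE_ENTITY_NAMES:
            score *= LOW_CONFIDENCE_SCORE_MULTIPLIER
        results.append(Prediction(label=entity, score=score))
    return results


raw = [Prediction("PER", 0.9), Prediction("O", 1.0), Prediction("ORG", 0.5)]
processed = postprocess(raw)
```

With this input, the `O` prediction is dropped, `PER` is returned as `PERSON` with its score unchanged, and `ORG` is returned as `ORGANIZATION` with its score multiplied by 0.4.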

The [default conf file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/default.yaml) is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:
The default conf file is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
@@ -118,14 +97,12 @@ Configuration can be done in two ways:
c. pass requests in each of these languages.

!!! note "Note"
Presidio can currently use one NER model per language via the `NlpEngine`. If multiple are required,
consider wrapping NER models as additional recognizers ([see sample here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py)).
Presidio can currently use one NLP model per language.

## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection

In addition to the built-in spaCy/Stanza/transformers capabilities, it is possible to create new recognizers which serve as interfaces to other models.
For more information:

- [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb).
- [Flair recognizer example](../samples/python/flair_recognizer.py)

38 changes: 23 additions & 15 deletions docs/analyzer/developing_recognizers.md
@@ -7,8 +7,7 @@ Recognizers define the logic for detection, as well as the confidence a predicti

### Accuracy

Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system.
A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research GitHub repository](https://github.com/microsoft/presidio-research).

!!! note "Note"
@@ -23,8 +22,7 @@ Make sure your recognizer doesn't take too long to process text. Anything above

### Environment

When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies.
In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint.
When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint. In addition, make sure the license on the 3rd party dependency allows you to use it for any purpose.

## Recognizer Types

@@ -34,7 +32,7 @@ Generally speaking, there are three types of recognizers:

A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]`) to detect a "Title" entity.

See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.
See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.
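
As a rough illustration of what deny-list matching involves, the sketch below turns the title list from the example above into a single regular expression and reports the matching spans. It uses only the Python standard library and is not the `PatternRecognizer` implementation; the helper names `build_deny_list_pattern` and `find_deny_list_matches` are hypothetical.

```python
import re


def build_deny_list_pattern(deny_list):
    """Compile one regex that matches any deny-list term as a standalone token."""
    # Escape each term so characters such as "." are matched literally,
    # and try longer terms first so shorter ones cannot shadow them.
    escaped = sorted((re.escape(term) for term in deny_list), key=len, reverse=True)
    # The lookarounds prevent matches inside longer words (e.g. "Dr" in "Drama").
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")


def find_deny_list_matches(text, deny_list):
    """Return (start, end, matched_text) tuples for every deny-list hit in text."""
    pattern = build_deny_list_pattern(deny_list)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


titles = ["Mr.", "Mrs.", "Ms.", "Dr."]
matches = find_deny_list_matches("Mr. Smith met Dr. Jones", titles)
```

Each returned tuple carries the character offsets of a hit, which is the kind of span information a recognizer needs in order to report an entity such as "Title".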

### Pattern Based

@@ -49,26 +47,36 @@ See some examples here:
### Machine Learning (ML) Based or Rule-Based

Many PII entities are undetectable using naive approaches like deny-lists or regular expressions.
In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer.
In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. There are four options for adding ML and rule based recognizers:

#### ML: Utilize SpaCy, Stanza or Transformers
#### Utilize SpaCy or Stanza

Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) and [huggingface transformers](https://huggingface.co/docs/transformers/index) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy`, `stanza` or `transformers` over other tools if possible.
Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy` or `stanza` over other tools if possible.
`spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance.
`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned.
`spaCy` and `stanza` models could be trained from scratch, used in combination with pre-trained embeddings, or retrained to detect new entities.
When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.

In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created.
See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py).
#### Utilize Scikit-learn or Similar

`Scikit-learn` models tend to be fast, but usually have lower accuracy than deep learning methods. However, for well defined problems with well defined features, they can provide very good results.
When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.

#### Apply Custom Logic

In some cases, rule-based logic provides reasonable ways for detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic.
When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
In some cases, rule-based logic provides the best way of detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy`/`stanza` extracted features like lemmas, part of speech, dependencies and more to create your logic. When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.

#### Deep Learning Based Methods

Deep learning methods offer excellent detection rates for NER.
They are, however, more complex to train and deploy, and tend to be slower than traditional approaches.
When creating a DL based method for PII detection, there are two main alternatives for integrating it with Presidio:

1. Create an external endpoint (either local or remote) which is isolated from the `presidio-analyzer` process. On the `presidio-analyzer` side, one would extend the [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) class and implement the network interface between `presidio-analyzer` and the endpoint of the model's container.
2. Integrate the model as an additional [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) within the `presidio-analyzer` flow.

!!! attention "Considerations for selecting one option over another"

- Accuracy.
- Ease of integration.
- Runtime considerations (For example if the new model requires a GPU).
- 3rd party dependencies of the new model vs. the existing `presidio-analyzer` package.
37 changes: 36 additions & 1 deletion docs/analyzer/index.md
@@ -14,7 +14,42 @@ Named Entity Recognition and other types of logic to detect PII in unstructured

## Installation

see [Installing Presidio](../installation.md).
=== "Using pip"

!!! note "Note"
Consider installing the Presidio python packages on a virtual environment like venv or conda.

To get started with Presidio-analyzer,
download the package and the `en_core_web_lg` spaCy model:

```sh
pip install presidio-analyzer
python -m spacy download en_core_web_lg
```

=== "Using Docker"

!!! note "Note"
This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).

```sh
# Download image from Dockerhub
docker pull mcr.microsoft.com/presidio-analyzer

# Run the container with the default port
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
```

=== "From source"

First, clone the Presidio repo. [See here for instructions](../installation.md#install-from-source).

Then, build the presidio-analyzer container:

```sh
cd presidio-analyzer
docker build . -t presidio/presidio-analyzer
```
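
Once a container is running as in the Docker tab above (host port 5002 mapped to the container), it can be queried over HTTP. The following standard-library sketch builds and sends such a request. It assumes the container exposes Presidio's `/analyze` REST endpoint and accepts a JSON body with `text` and `language` fields; the helper names `build_analyze_request` and `analyze` are hypothetical.

```python
import json
import urllib.request

# Assumption: the container was started with `-p 5002:3000` as shown above.
ANALYZER_URL = "http://localhost:5002/analyze"


def build_analyze_request(text, language="en", url=ANALYZER_URL):
    """Build (but do not send) a POST request for the analyzer service."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, language="en"):
    """Send the request and return the decoded JSON response."""
    request = build_analyze_request(text, language)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_analyze_request("My name is David")
```

Calling `analyze("My name is David")` requires the container to be up; building the request separately makes it possible to inspect the payload without any network access.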

## Getting started

22 changes: 1 addition & 21 deletions docs/analyzer/languages-config.yml
@@ -3,26 +3,6 @@ models:
-
lang_code: en
model_name: en_core_web_lg
-
lang_code: de
model_name: de_core_news_md
-
lang_code: es
model_name: es_core_news_md
ner_model_configuration:
- model_to_presidio_entity_mapping:
PER: PERSON
PERSON: PERSON
LOC: LOCATION
LOCATION: LOCATION
GPE: LOCATION
ORG: ORGANIZATION
DATE: DATE_TIME
TIME: DATE_TIME
NORP: NRP

- low_confidence_score_multiplier: 0.4
- low_score_entity_names:
- ORGANIZATION
- ORG
- default_score: 0.85
model_name: es_core_news_md
3 changes: 1 addition & 2 deletions docs/analyzer/languages.md
@@ -64,7 +64,6 @@ analyzer = AnalyzerEngine(

analyzer.analyze(text="My name is David", language="en")
```
Link to LANGUAGES_CONFIG_FILE=[languages-config.yml](https://github.com/microsoft/presidio/blob/main/docs/analyzer/languages-config.yml)

### Automatically install NLP models into the Docker container

@@ -74,4 +73,4 @@ update the [conf/default.yaml](https://github.com/microsoft/presidio/blob/main/p
the `docker build` phase and the models defined in it are installed automatically.

For `transformers` based models, the configuration [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/transformers.yaml).
A docker file supporting transformers models [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/Dockerfile.transformers).
In addition, make sure the Docker file contains the relevant packages for `transformers`, which are not loaded automatically with Presidio.
17 changes: 1 addition & 16 deletions docs/analyzer/nlp_engines/spacy_stanza.md
@@ -30,26 +30,11 @@ For the available models, follow these links: [spaCy](https://spacy.io/usage/mod
!!! tip "Tip"
For Person, Location and Organization detection, it could be useful to try out the transformers based models (e.g. `en_core_web_trf`), which use a more modern deep-learning architecture but are generally slower than the default `en_core_web_lg` model.


### Configure Presidio to use the pre-trained model

Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information.

## How NER results flow within Presidio
This diagram describes the flow of NER results within Presidio, and the relationship between the `SpacyNlpEngine` component and the `SpacyRecognizer` component:
```mermaid
sequenceDiagram
AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
SpacyNlpEngine->>spaCy: Call spaCy pipeline
spaCy->>SpacyNlpEngine: return entities and other attributes
Note over SpacyNlpEngine: Map entity names to Presidio's, <BR>update scores, <BR>remove unwanted entities <BR> based on NerModelConfiguration
SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens, scores etc.)
Note over AnalyzerEngine: Call all recognizers
AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]
```
## Training your own model

!!! note "Note"