Commit
0 parents
commit 304160c
Showing 24 changed files with 6,952 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
*.ipynb linguist-detectable=false |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,161 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
*.db | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
share/python-wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
MANIFEST | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.nox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
cover/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
db.sqlite3 | ||
db.sqlite3-journal | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
.pybuilder/ | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# IPython | ||
profile_default/ | ||
ipython_config.py | ||
|
||
# pyenv | ||
# For a library or package, you might want to ignore these files since the code is | ||
# intended to run in multiple environments; otherwise, check them in: | ||
# .python-version | ||
|
||
# pipenv | ||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. | ||
# However, in case of collaboration, if having platform-specific dependencies or dependencies | ||
# having no cross-platform support, pipenv may install dependencies that don't work, or not | ||
# install all needed dependencies. | ||
#Pipfile.lock | ||
|
||
# poetry | ||
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. | ||
# This is especially recommended for binary packages to ensure reproducibility, and is more | ||
# commonly ignored for libraries. | ||
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control | ||
#poetry.lock | ||
|
||
# pdm | ||
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. | ||
#pdm.lock | ||
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it | ||
# in version control. | ||
# https://pdm.fming.dev/#use-with-ide | ||
.pdm.toml | ||
|
||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm | ||
__pypackages__/ | ||
|
||
# Celery stuff | ||
celerybeat-schedule | ||
celerybeat.pid | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
.dmypy.json | ||
dmypy.json | ||
|
||
# Pyre type checker | ||
.pyre/ | ||
|
||
# pytype static type analyzer | ||
.pytype/ | ||
|
||
# Cython debug symbols | ||
cython_debug/ | ||
|
||
# PyCharm | ||
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can | ||
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore | ||
# and can be added to the global gitignore or merged into this file. For a more nuclear | ||
# option (not recommended) you can uncomment the following to ignore the entire idea folder. | ||
.idea/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
repos: | ||
  - repo: https://github.com/pre-commit/pre-commit-hooks | ||
    rev: v4.4.0 | ||
    hooks: | ||
      - id: check-yaml | ||
        exclude: docs | ||
      - id: end-of-file-fixer | ||
        types: [ python ] | ||
      - id: trailing-whitespace | ||
      - id: pretty-format-json | ||
        args: [--autofix, --no-sort-keys] | ||
  - repo: local | ||
    hooks: | ||
      - id: ruff | ||
        name: ruff | ||
        language: system | ||
        entry: poetry run ruff format | ||
        minimum_pre_commit_version: 2.9.2 | ||
        require_serial: true | ||
        types_or: [ python, pyi ] | ||
  - repo: https://github.com/myint/autoflake | ||
    rev: v2.1.1 | ||
    hooks: | ||
      - id: autoflake | ||
        args: | ||
          - --expand-star-imports | ||
          - --ignore-init-module-imports | ||
          - --in-place | ||
          - --remove-all-unused-imports | ||
          - --remove-duplicate-keys | ||
          - --remove-unused-variables | ||
  - repo: local | ||
    hooks: | ||
      - id: flake8 | ||
        name: flake8 | ||
        language: system | ||
        entry: poetry run ruff check | ||
        types: [ python ] | ||
  - repo: https://github.com/pycqa/isort | ||
    rev: 5.12.0 | ||
    hooks: | ||
      - id: isort | ||
        name: isort | ||
        args: [ "--profile", "black", "--skip", "__init__.py", "--filter-files" ] | ||
  - repo: local | ||
    hooks: | ||
      - id: mypy | ||
        name: mypy | ||
        language: system | ||
        verbose: true | ||
        entry: bash -c 'make mypy || true' -- | ||
        files: domain_matcher | ||
        pass_filenames: false |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
LINT_FILES := domain_matcher tests | ||
|
||
.PHONY: format | ||
format: | ||
	if [ -n "${POETRY_ACTIVE}" ]; then make _format $(LINT_FILES); else poetry run make _format $(LINT_FILES); fi | ||
|
||
.PHONY: _format | ||
_format: | ||
	ruff format $(LINT_FILES) | ||
	nb-clean clean notebooks --remove-empty-cells --preserve-cell-metadata | ||
|
||
	$(MAKE) lint | ||
|
||
test: lint mypy unit-test | ||
|
||
.PHONY: lint | ||
lint: | ||
	@# calling make _lint within poetry makes it so that we only init poetry once | ||
	if [ -n "${POETRY_ACTIVE}" ]; then make _lint $(LINT_FILES); else poetry run make _lint $(LINT_FILES); fi | ||
|
||
.PHONY: _lint | ||
_lint: | ||
	ruff check $(LINT_FILES) | ||
# nb-clean check notebooks --remove-empty-cells --preserve-cell-metadata | ||
|
||
.PHONY: mypy | ||
mypy: | ||
	poetry run mypy domain_matcher | ||
|
||
.PHONY: unit-test | ||
unit-test: | ||
	poetry run pytest tests --durations 5 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Domain Matcher | ||
|
||
[![](https://img.shields.io/badge/Read_our_Blog-blue?logo=readdotcv)](https://dref360.github.io/domainmatching/) | ||
|
||
Domain Matcher is a library that matches your input data against pre-defined domains. | ||
Inputs that match no domain are deemed unimportant and can be safely filtered out. | ||
|
||
> Domain Matching performs very cheap OoD detection using topic modeling and keyword extraction. | ||
|
||
`pip install domain-matcher` | ||
|
||
## Usage | ||
|
||
```python | ||
from datasets import load_dataset | ||
from domain_matcher.core import DomainMatcher, DMConfig | ||
|
||
# Custom version of `clinc-oos` where non-banking classes are assigned to oos. | ||
ds = load_dataset("GlowstickAI/banking-clinc-oos", "plus") | ||
config = DMConfig(text_column='text', label_column='intent', oos_class='oos') | ||
dmatcher = DomainMatcher(config) | ||
# Fit DM on your training data; see our blog for what happens under the hood. | ||
dmatcher.fit(ds['train']) | ||
|
||
# Predict: You can predict on a string, List[str] or Dataset | ||
dmatcher.transform("Can you cancel my credit card?")['in_domain'] | ||
# >>> True | ||
dmatcher.transform("Can you cancel my reservation at Giorgi's?")['in_domain'] | ||
# >>> False | ||
``` | ||
|
||
### Troubleshooting | ||
|
||
For troubleshooting, please see our [wiki](https://github.com/GlowstickAI/domain-matcher/wiki) or [submit an issue](https://github.com/GlowstickAI/domain-matcher/issues) if you can't find what you're looking for. | ||
|
||
## Development | ||
|
||
* Install Pyenv | ||
    * `curl https://pyenv.run | bash` | ||
    * `pyenv install 3.9.13 && pyenv global 3.9.13` | ||
* [Install Poetry](https://python-poetry.org/docs/master/#installing-with-the-official-installer) | ||
    * `poetry install` | ||
* Install the pre-commit hooks | ||
    * `poetry run pre-commit install` | ||
|
||
### Tooling | ||
|
||
* `make format`: format the code with Ruff, clean the notebooks with nb-clean, then lint. | ||
* `make test`: run the linter, mypy and the unit tests. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
# Domain Matching to filter out-of-distribution observations | ||
|
||
|
||
More than ever, protecting models from out-of-distribution samples is of tremendous importance. | ||
|
||
While your model can accurately predict in-domain data, users will often throw out-of-distribution data at it. It only takes a couple of wrong interactions to discourage users from using your service ever again. | ||
|
||
Here's how domain matching can quickly and efficiently save you a lot of headaches. | ||
|
||
|
||
|
||
|
||
|
||
| ![](overview.png) | | ||
|:--------------------------------------------------------------------:| | ||
| Domain Matcher overview | | ||
|
||
|
||
**References** | ||
|
||
To my knowledge, Kawahara et al. (1) were the first to propose this idea. The main difference between the two approaches is that Kawahara et al. train an SVM to detect "in-domain" topics, whereas this implementation automatically finds topics based on similarity and is thus compatible with online topic modeling. | ||
|
||
### Problem setting | ||
|
||
You have an intent classification model $$f(x) \rightarrow y$$ where $$y\in\{y_0,...,y_N,oos\}$$. *oos* is the "out-of-scope" class, meaning that the input is out of distribution. | ||
|
||
We can imagine that the *oos* class has a lot of support in our dataset. Due to the nature of this class, we need a wide range of examples to accurately predict it. | ||
|
||
For this example, let's imagine that our model is a Banking chatbot guiding users to the relevant resources. Users can ask about canceling their credit card, making a payment or asking for a loan. | ||
|
||
Now, what happens if someone asks "I want to cancel my phone plan"? If this particular example wasn't in our dataset, our model might predict "cancel_credit_card" instead of the "oos" class. | ||
|
||
### How do we solve it? | ||
|
||
Domain Matching works by using topic modeling and semantic embeddings. Using the excellent BERTopic (2) and Sentence Transformers (3), we can train a topic model on our training dataset. Then, we define our domains. In our case, we have a Banking domain made of keywords related to banking: credit card, loan, payment, money, etc. | ||
|
||
Using our topic model, we assign topics to each keyword creating a mapping between topics and our domain. | ||
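As a rough illustration of this mapping step, here is a minimal, dependency-free sketch. It is not the Domain Matcher or BERTopic API: the bag-of-words `embed` function is a toy stand-in for a sentence encoder, and `TOPICS`, `assign_topic`, `topics_for_domain` and the `min_similarity` threshold are hypothetical names invented for this example.

```python
from collections import Counter
from math import sqrt

# Hypothetical topics, each summarised by a few representative words.
# A real topic model would learn these from the training data.
TOPICS = {
    0: "credit card freeze replace card",
    1: "loan mortgage interest payment",
    2: "flight hotel reservation travel",
}


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def assign_topic(text, min_similarity=0.1):
    """Return the id of the most similar topic, or None if none is close enough."""
    query = embed(text)
    best_id, best_score = None, min_similarity
    for topic_id, words in TOPICS.items():
        score = cosine(query, embed(words))
        if score > best_score:
            best_id, best_score = topic_id, score
    return best_id


def topics_for_domain(keywords):
    """Collect the topics that the domain keywords fall into."""
    assigned = (assign_topic(keyword) for keyword in keywords)
    return {topic_id for topic_id in assigned if topic_id is not None}


banking_topics = topics_for_domain(["credit card", "loan", "payment", "money"])
print(banking_topics)
```

In this toy setup, keywords that resemble no topic (here "money") simply contribute nothing to the domain, which is one reason to check the resulting mapping by hand.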
|
||
|
||
|
||
|
||
|
||
| ![](topic_assignment.png) | | ||
|:----------------------------------------------------------------------------:| | ||
| How topics get assigned to a domain | | ||
|
||
|
||
|
||
At inference time, we assign a topic to the utterance. If the topic is part of our domains, we let the model predict; otherwise, we return *oos*, as shown below. | ||
|
||
As you see, it is quite a simple technique that is not expensive and is easy to set up. | ||
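The inference-time rule can be sketched in a few lines. This is an illustrative sketch rather than the library's implementation: `guarded_predict`, `toy_assign_topic` and `toy_classifier` are hypothetical names, and the keyword lookup stands in for a real topic model.

```python
OOS = "oos"


def guarded_predict(utterance, assign_topic, domain_topics, classifier):
    """Only let the classifier answer when the utterance's topic is in-domain."""
    topic = assign_topic(utterance)
    if topic is None or topic not in domain_topics:
        return OOS
    return classifier(utterance)


def toy_assign_topic(utterance):
    """Hypothetical topic assignment based on keyword lookup."""
    words = set(utterance.lower().split())
    if words & {"credit", "loan", "payment"}:
        return 0  # banking-like topic
    if words & {"phone", "plan"}:
        return 1  # telecom-like topic
    return None  # no topic found


def toy_classifier(utterance):
    """Hypothetical intent model that always answers with a banking intent."""
    return "cancel_credit_card"


banking_topics = {0}
in_domain = guarded_predict(
    "I want to cancel my credit card", toy_assign_topic, banking_topics, toy_classifier
)
out_of_domain = guarded_predict(
    "I want to cancel my phone plan", toy_assign_topic, banking_topics, toy_classifier
)
print(in_domain)      # cancel_credit_card
print(out_of_domain)  # oos
```

The classifier is never called for out-of-domain utterances, so the filter adds only the cost of one topic assignment per request.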
|
||
### Examples | ||
|
||
I trained a topic model on Banking CLINC, a reduced version of CLINC-151 focusing on banking intents. In the table below, I show each snippet, its predicted domain and the model's prediction. | ||
|
||
| Snippet | Domain | Prediction | | ||
|-----------------------------------------|------------|------------------| | ||
| Freeze my credit Card | Banking ✅ | Damage Card ✅ | | ||
| I need a Christmas card for my children | OOS ✅ | Replace Card ❌ | | ||
| Upgrade my RuneScape account | OOS ✅ | Freeze Account ❌ | | ||
|
||
By using Domain Matching, we were able to protect our model from oos examples without adding more examples to our dataset. | ||
|
||
This solution is not foolproof and one should verify that the topics are matched correctly. | ||
|
||
**Domain Matching at Glowstick** | ||
|
||
We've used Domain Matching extensively at [Glowstick](https://glowstick.ai). Our goal was to detect sales opportunities from transcripts. If we have two chatbot companies, what matters to one can be useless to the other. | ||
|
||
In order to deliver maximum value, we've created "Domains" for what our customers care about and would filter out *oos* items. | ||
|
||
**Conclusion** | ||
|
||
We've covered a simple, yet effective out-of-distribution detection approach, Domain Matching. By using great libraries like BERTopic and SBERT, we were able to reduce the number of out-of-scope utterances that could reach our model. | ||
|
||
More work remains to be done: the intent classification community lacks complex datasets, which are much needed to showcase the benefits of these approaches. | ||
|
||
|
||
### Resources | ||
|
||
1. Kawahara et al. [Topic classification and verification modeling for out-of-domain utterance detection](https://www.isca-archive.org/interspeech_2004/kawahara04c_interspeech.html) | ||
2. The most excellent [BERTopic](https://maartengr.github.io/BERTopic/) library | ||
3. [Sentence Transformers](https://sbert.net/), now a staple in all my projects |