First commit
Dref360 committed Jul 22, 2024
0 parents commit 304160c
Showing 24 changed files with 6,952 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.ipynb linguist-detectable=false
161 changes: 161 additions & 0 deletions .gitignore
@@ -0,0 +1,161 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
*.db

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
53 changes: 53 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,53 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-yaml
        exclude: docs
      - id: end-of-file-fixer
        types: [ python ]
      - id: trailing-whitespace
      - id: pretty-format-json
        args: [--autofix, --no-sort-keys]
  - repo: local
    hooks:
      - id: ruff
        name: ruff
        language: system
        entry: poetry run ruff format
        minimum_pre_commit_version: 2.9.2
        require_serial: true
        types_or: [ python, pyi ]
  - repo: https://github.com/myint/autoflake
    rev: v2.1.1
    hooks:
      - id: autoflake
        args:
          - --expand-star-imports
          - --ignore-init-module-imports
          - --in-place
          - --remove-all-unused-imports
          - --remove-duplicate-keys
          - --remove-unused-variables
  - repo: local
    hooks:
      - id: flake8
        name: flake8
        language: system
        entry: poetry run ruff check
        types: [ python ]
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        name: isort
        args: [ "--profile", "black", "--skip", "__init__.py", "--filter-files" ]
  - repo: local
    hooks:
      - id: mypy
        name: mypy
        language: system
        verbose: true
        entry: bash -c 'make mypy || true' --
        files: domain_matcher
        pass_filenames: false
32 changes: 32 additions & 0 deletions Makefile
@@ -0,0 +1,32 @@
LINT_FILES := domain_matcher tests

.PHONY: format
format:
	if [ -n "${POETRY_ACTIVE}" ]; then make _format $(LINT_FILES); else poetry run make _format $(LINT_FILES); fi

.PHONY: _format
_format:
	ruff format $(LINT_FILES)
	nb-clean clean notebooks --remove-empty-cells --preserve-cell-metadata

	$(MAKE) lint

test: lint mypy unit-test

.PHONY: lint
lint:
	@# calling make _lint within poetry make it so that we only init poetry once
	if [ -n "${POETRY_ACTIVE}" ]; then make _lint $(LINT_FILES); else poetry run make _lint $(LINT_FILES); fi

.PHONY: _lint
_lint:
	ruff check $(LINT_FILES)
	# nb-clean check notebooks --remove-empty-cells --preserve-cell-metadata

.PHONY: mypy
mypy:
	poetry run mypy domain_matcher

.PHONY: unit-test
unit-test:
	poetry run pytest tests --durations 5
49 changes: 49 additions & 0 deletions README.md
@@ -0,0 +1,49 @@
# Domain Matcher

[![](https://img.shields.io/badge/Read_our_Blog-blue?logo=readdotcv)](https://dref360.github.io/domainmatching/)

Domain Matcher is a library that matches your input data against a pre-defined domain.
Inputs that match no domain are deemed unimportant and can be safely filtered out.

> Domain Matching performs very cheap OoD detection using topic modeling and keyword extraction.

`pip install domain-matcher`

## Usage

```python
from datasets import load_dataset
from domain_matcher.core import DomainMatcher, DMConfig

# Custom version of `clinc-oos` where non-banking classes are assigned to oos.
ds = load_dataset("GlowstickAI/banking-clinc-oos", "plus")
config = DMConfig(text_column='text', label_column='intent', oos_class='oos')
dmatcher = DomainMatcher(config)
# Fit DM on your training data; see our blog for what happens under the hood.
dmatcher.fit(ds['train'])

# Predict: You can predict on a string, List[str] or Dataset
dmatcher.transform("Can you cancel my credit card?")['in_domain']
# >>> True
dmatcher.transform("Can you cancel my reservation at Giorgi's?")['in_domain']
# >>> False
```
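
To filter a batch of strings using only the single-string call shown above, a small helper could look like the sketch below. `keep_in_domain` and `StubMatcher` are hypothetical names that are not part of the library; the stub exists only so the snippet runs without a fitted model. The sketch assumes that `transform(text)` returns a mapping with a boolean `in_domain` entry, as in the usage example.

```python
from typing import Any, Dict, List, Protocol


class Matcher(Protocol):
    """Anything exposing the single-string `transform` call from the usage example."""

    def transform(self, text: str) -> Dict[str, Any]: ...


def keep_in_domain(matcher: Matcher, texts: List[str]) -> List[str]:
    """Return only the texts that the matcher flags as in-domain."""
    return [text for text in texts if matcher.transform(text)["in_domain"]]


class StubMatcher:
    """Stand-in for a fitted DomainMatcher (assumption: a plain keyword check)."""

    def transform(self, text: str) -> Dict[str, Any]:
        return {"in_domain": "credit card" in text.lower()}


kept = keep_in_domain(
    StubMatcher(),
    ["Can you cancel my credit card?", "Can you cancel my reservation at Giorgi's?"],
)
print(kept)
```

With a real, fitted `DomainMatcher`, the same helper would be called with `dmatcher` in place of `StubMatcher()`.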

### Troubleshooting

For troubleshooting, please see our [wiki](https://github.com/GlowstickAI/domain-matcher/wiki) or [submit an issue](https://github.com/GlowstickAI/domain-matcher/issues) if you can't find what you're looking for.

## Development

* Install Pyenv
  * `curl https://pyenv.run | bash`
  * `pyenv install 3.9.13 && pyenv global 3.9.13`
* [Install Poetry](https://python-poetry.org/docs/master/#installing-with-the-official-installer)
  * `poetry install`
* Add precommits
  * `poetry run pre-commit install`

### Tooling

* `make format`: format the code with Ruff, clean the notebooks with nb-clean, then lint.
* `make test`: run linting, mypy, and the unit tests.
84 changes: 84 additions & 0 deletions docs/blog.md
@@ -0,0 +1,84 @@
# Domain Matching to filter out-of-distribution observations


More than ever, protecting models from out-of-distribution samples is of tremendous importance.

While your model can accurately predict in-domain data, users will often throw out-of-distribution data at it. It only takes a couple of wrong interactions to discourage users from using your service ever again.

Here's how domain matching can quickly and efficiently save you a lot of headaches.





| ![](overview.png) |
|:--------------------------------------------------------------------:|
| Domain Matcher overview |


**References**

To my knowledge, Kawahara et al. (1) were the first to propose this idea. The main difference between the two approaches is that Kawahara et al. train an SVM to detect "in-domain" topics, whereas this implementation automatically finds topics based on similarity and is thus compatible with online topic modeling.

### Problem setting

You have an intent classification model $$f(x) \rightarrow y$$ where $$y \in \{y_0, \ldots, y_N, \mathit{oos}\}$$. *oos* is the "out-of-scope" class, meaning that the input is out of distribution.

We can imagine that the *oos* class has a lot of support in our dataset. Due to the nature of this class, we need a wide range of examples to accurately predict it.

For this example, let's imagine that our model is a Banking chatbot guiding users to the relevant resources. Users can ask about canceling their credit card, making a payment or asking for a loan.

Now, what happens if someone asks "I want to cancel my phone plan"? If this particular example wasn't in our dataset, our model might predict "cancel_credit_card" instead of the "oos" class.

### How do we solve it?

Domain Matching works by using topic modeling and semantic embeddings. Using the excellent BERTopic (2) and Sentence Transformers (3), we can train a topic model on our training dataset. Then, we define our domains. In our case, we have a Banking domain made of keywords related to banking: credit card, loan, payment, money, etc.

Using our topic model, we assign topics to each keyword creating a mapping between topics and our domain.
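
The keyword-to-topic mapping can be sketched in a few lines. This is a toy stand-in, not the library's implementation: the topics are hard-coded word sets rather than BERTopic output, and token overlap replaces embedding similarity. `TOPICS`, `assign_topic`, and `build_domain_topics` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topics (assumption: a real setup would obtain these from a
# topic model such as BERTopic, not from a hard-coded dictionary).
TOPICS: Dict[int, Set[str]] = {
    0: {"credit", "card", "freeze", "cancel"},
    1: {"loan", "payment", "money", "interest"},
    2: {"flight", "hotel", "reservation", "restaurant"},
}


def assign_topic(text: str) -> Optional[int]:
    """Return the topic sharing the most tokens with `text`, or None if none overlap."""
    tokens = set(text.lower().split())
    best_topic: Optional[int] = None
    best_overlap = 0
    for topic_id, words in TOPICS.items():
        overlap = len(tokens & words)
        if overlap > best_overlap:
            best_topic, best_overlap = topic_id, overlap
    return best_topic


def build_domain_topics(keywords: List[str]) -> Set[int]:
    """Collect the topics that the domain keywords are assigned to."""
    assigned = (assign_topic(keyword) for keyword in keywords)
    return {topic for topic in assigned if topic is not None}


# The Banking domain from the text, expressed as keywords.
banking_topics = build_domain_topics(["credit card", "loan", "payment", "money"])
print(banking_topics)
```

The result is the set of topic ids that count as "Banking"; any utterance assigned to a topic outside this set would be treated as out of scope.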





| ![](topic_assignment.png) |
|:----------------------------------------------------------------------------:|
| How topics get assigned to a domain |



At inference time, we assign a topic to the utterance. If the topic is part of our domains, we let the model predict, otherwise we return *oos* as shown below.
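
The inference-time gate is a thin wrapper around the classifier. The sketch below only illustrates the control flow: `gated_predict`, `toy_topic_of`, and `toy_classifier` are hypothetical names, and the substring-based topic function and the stub classifier are assumptions standing in for the real topic model and intent model.

```python
from typing import Callable, Optional, Set

OOS = "oos"


def gated_predict(
    text: str,
    topic_of: Callable[[str], Optional[int]],
    domain_topics: Set[int],
    classifier: Callable[[str], str],
) -> str:
    """Run the classifier only when the utterance's topic belongs to the domain."""
    topic = topic_of(text)
    if topic is None or topic not in domain_topics:
        return OOS
    return classifier(text)


def toy_topic_of(text: str) -> Optional[int]:
    """Toy topic assignment (assumption: substring checks instead of a topic model)."""
    lowered = text.lower()
    if "credit card" in lowered:
        return 0
    if "phone plan" in lowered:
        return 7
    return None


def toy_classifier(text: str) -> str:
    """Stub that mimics the failure mode: any 'cancel' request maps to a banking intent."""
    return "cancel_credit_card" if "cancel" in text.lower() else "make_payment"


banking_topics = {0, 1}
in_domain = gated_predict("I want to cancel my credit card", toy_topic_of, banking_topics, toy_classifier)
out_of_domain = gated_predict("I want to cancel my phone plan", toy_topic_of, banking_topics, toy_classifier)
print(in_domain, out_of_domain)
```

Without the gate, the stub classifier would label the phone-plan request as `cancel_credit_card`; with it, the request never reaches the classifier.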

As you can see, it is a simple technique that is inexpensive and easy to set up.

### Examples

I trained a topic model on Banking CLINC, a reduced version of CLINC-151 focusing on banking intents. In the table below, I show each snippet, its predicted domain and the model's prediction.

| Snippet | Domain | Prediction |
|-----------------------------------------|------------|------------------|
| Freeze my credit Card | Banking ✅ | Damage Card ✅ |
| I need a Christmas card for my children | OOS ✅ | Replace Card ❌ |
| Upgrade my RuneScape account | OOS ✅ | Freeze Account ❌ |

By using Domain Matching, we were able to protect our model from oos examples without adding more examples to our dataset.

This solution is not foolproof and one should verify that the topics are matched correctly.

**Domain Matching at Glowstick**

We've used Domain Matching extensively at [Glowstick](https://glowstick.ai). Our goal was to detect sales opportunities from transcripts. If we have two chatbot companies, what matters to one can be useless to the other.

To deliver maximum value, we created "Domains" for what each customer cares about and filtered out *oos* items.

**Conclusion**

We've covered a simple, yet effective out-of-distribution detection approach, Domain Matching. By using great libraries like BERTopic and SBERT, we were able to reduce the number of out-of-scope utterances that could reach our model.

More work remains to be done: the intent classification community lacks complex datasets, and building them is a much-needed improvement to showcase the benefits of these approaches.


### Resources

1. Kawahara et al. [Topic classification and verification modeling for out-of-domain utterance detection](https://www.isca-archive.org/interspeech_2004/kawahara04c_interspeech.html)
2. The most excellent [BERTopic](https://maartengr.github.io/BERTopic/) library
3. Now a staple in all my projects [Sentence Transformers](https://sbert.net/)
Binary file added docs/overview.png
Binary file added docs/topic_assignment.png
Empty file added domain_matcher/__init__.py