Merge pull request #217 from KennethEnevoldsen/paper
Added paper
KennethEnevoldsen authored Dec 9, 2023
2 parents 053affa + ae5870f commit fad364a
Showing 5 changed files with 171 additions and 0 deletions.
23 changes: 23 additions & 0 deletions .github/workflows/draft_pdf.yml
@@ -0,0 +1,23 @@
on: [push]

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper/paper.md
      - name: Upload
        uses: actions/upload-artifact@v1
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper/paper.pdf
59 changes: 59 additions & 0 deletions paper/paper.md
@@ -0,0 +1,59 @@
---
title: 'Augmenty: A Python Library for Structured Text Augmentation'
tags:
  - Python
  - natural language processing
  - spacy
  - augmentation
authors:
  - name: Kenneth Enevoldsen
    orcid: 0000-0001-8733-0966
    affiliation: "1"

affiliations:
  - name: Center for Humanities Computing, Aarhus University, Aarhus, Denmark
    index: 1
date: 7 December 2023
bibliography: paper.bib
---

# Summary
Text augmentation is a useful tool for training [@wei-zou-2019-eda] and evaluating [@ribeiro-etal-2020-beyond] natural language processing models and systems. Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility, being confined to basic tasks such as text classification or catering to specific downstream use cases such as estimating robustness [@goel-etal-2021-robustness]. To address these constraints, `Augmenty` provides structured text augmentation, augmenting the text together with its annotations. `Augmenty` integrates seamlessly with the popular NLP library `spaCy` [@honnibal_efficient_2020] and seeks to be compatible with all models and tasks supported by `spaCy`. `Augmenty` provides a wide range of augmenters which can be combined in a flexible manner to create complex augmentation pipelines. It also includes a set of primitives, such as word replacement augmenters, that can be used to create custom augmenters. This functionality allows for augmentations within a range of applications such as named entity recognition (NER), part-of-speech tagging, and dependency parsing.

# Statement of need
<!-- augmentation is useful -->
Augmentation is a powerful tool within disciplines such as computer vision [@wang2017effectiveness] and speech recognition [@Park2019SpecAugmentAS], where it is used both for training more robust models and for evaluating models' ability to handle perturbations. Within natural language processing (NLP), augmentation has seen some use as a tool for generating additional training data [@wei-zou-2019-eda], but has really shone as a tool for model evaluation, such as estimating robustness [@goel-etal-2021-robustness] and bias [@lassen-etal-2023-detecting], or for creating novel datasets [@nielsen-2023-scandeval].

Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility. Commonly, they only provide pure string augmentation, which typically leads to the annotations becoming misaligned with the text. This has limited the use of augmentation to tasks such as text classification, while neglecting structured prediction tasks such as named entity recognition (NER) or coreference resolution. As a result, augmentation has been unavailable for a wide range of tasks, both for training and for evaluation.

<!-- limitation of existing methods -->
Existing tools such as `textgenie` [@pandya_hetpandyatextgenie_2023] and `textaugment` [@marivate2020improving] implement powerful techniques such as backtranslation and paraphrasing, which are useful for augmentation in text-classification tasks. However, these tools neglect a category of tasks which require that the annotations stay aligned with the augmented text. For instance, even a simple augmentation such as replacing the named entity "Jane Doe" with "John" will lead to a misalignment of the NER annotations, part-of-speech tags, etc., which, if not properly handled, will lead to a misinterpretation of model performance or the generation of incorrect training samples.
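
The misalignment described above can be made concrete with a minimal sketch in plain Python. It does not use `Augmenty` or `spaCy`: the `Entity` tuple and the `replace_entity` helper are hypothetical names introduced only for illustration, and annotations are assumed to be stored as character offsets.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    """A character-offset span annotation (hypothetical, for illustration)."""

    start: int
    end: int
    label: str


def replace_entity(
    text: str, entities: List[Entity], index: int, replacement: str
) -> Tuple[str, List[Entity]]:
    """Replace one entity's text and shift the offsets of later annotations."""
    target = entities[index]
    shift = len(replacement) - (target.end - target.start)
    new_text = text[: target.start] + replacement + text[target.end :]
    new_entities = []
    for i, ent in enumerate(entities):
        if i == index:
            # The replaced entity keeps its start but gets a new end.
            new_end = ent.start + len(replacement)
            new_entities.append(Entity(ent.start, new_end, ent.label))
        elif ent.start >= target.end:
            # Annotations after the edit move by the length difference.
            new_entities.append(Entity(ent.start + shift, ent.end + shift, ent.label))
        else:
            new_entities.append(ent)
    return new_text, new_entities


text = "Jane Doe lives in Aarhus."
entities = [Entity(0, 8, "PERSON"), Entity(18, 24, "GPE")]

# Pure string augmentation: the text changes but the offsets do not.
naive_text = text.replace("Jane Doe", "John")
stale_span = naive_text[entities[1].start : entities[1].end]  # no longer "Aarhus"

# Structured augmentation: text and annotations are updated together.
aligned_text, aligned_entities = replace_entity(text, entities, 0, "John")
aligned_spans = [aligned_text[e.start : e.end] for e in aligned_entities]
```

With the naive replacement, the second annotation no longer covers "Aarhus", whereas the structured variant keeps both spans pointing at the intended text.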

`Augmenty` seeks to remedy this by providing a flexible and easy-to-use interface for structured text augmentation. `Augmenty` is built to integrate well with `spaCy` [@honnibal_efficient_2020] and seeks to be compatible with the broad set of tasks supported by `spaCy`. `Augmenty` provides augmenters which take in a spaCy `Doc` object (but work just as well with `string` objects) and return a new `Doc` object with the augmentations applied. This allows for augmentation of both the text and the annotations present in the `Doc` object.

Other tools for data augmentation focus on specific downstream applications, such as `textattack` [@morris2020textattack], which is useful for adversarial attacks on classification systems, or `robustnessgym` [@goel-etal-2021-robustness], which is useful for evaluating the robustness of classification systems. `Augmenty` does not seek to replace any of these tools but seeks to provide a general-purpose tool for augmentation of both the text and its annotations. This allows for augmentations within a range of applications such as named entity recognition, part-of-speech tagging, and dependency parsing.

# Features & Functionality

`Augmenty` is a Python library that implements augmentation based on `spaCy`'s `Doc` object. The `Doc` object is a container for a text and its annotations, which makes it easy to augment text and annotations simultaneously. The `Doc` object can easily be extended to include custom augmentation not available in `spaCy` by adding custom attributes to the `Doc` object. While `Augmenty` is built to augment `Doc`s, the object is easily converted into strings, lists or other formats. The annotations within a `Doc` can come either from existing annotations or from the predictions of an existing model.

`Augmenty` implements a series of augmenters for token-, span- and sentence-level augmentation. These augmenters range from primitive augmentations, such as word replacement, which can be used to quickly construct new augmenters, to language-specific augmenters such as keystroke error augmentations based on a French keyboard layout. `Augmenty` also integrates with other libraries such as `NLTK` [@bird2009natural] to allow for augmentations based on WordNet [@miller-1994-wordnet] and allows for the specification of static word vectors [@pennington-etal-2014-glove] to allow for augmentations based on word similarity. Lastly, `Augmenty` provides a set of utility functions for repeating augmentations, combining augmenters, or adjusting the percentage of documents that should be augmented. This allows for the flexible construction of augmentation pipelines specific to the task at hand.
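
As a rough illustration of how such utilities can compose, the sketch below models an augmenter as a function from a list of tokens to a new list of tokens. It is written in plain Python and is not `Augmenty`'s implementation or API: `Augmenter`, `upper_case`, `double_last_letter`, `combine`, `with_doc_level` and `repeat` are hypothetical names chosen for this example.

```python
import random
from typing import Callable, Iterable, Iterator, List

# Hypothetical type for this sketch: an augmenter maps tokens to new tokens.
Augmenter = Callable[[List[str]], List[str]]


def upper_case(tokens: List[str]) -> List[str]:
    """Upper-case every token."""
    return [t.upper() for t in tokens]


def double_last_letter(tokens: List[str]) -> List[str]:
    """Simulate a simple typo by doubling the last character of each token."""
    return [t + t[-1] if t else t for t in tokens]


def combine(augmenters: List[Augmenter]) -> Augmenter:
    """Chain augmenters so each one receives the previous one's output."""

    def combined(tokens: List[str]) -> List[str]:
        for augmenter in augmenters:
            tokens = augmenter(tokens)
        return tokens

    return combined


def with_doc_level(augmenter: Augmenter, level: float, rng: random.Random) -> Augmenter:
    """Apply the augmenter to roughly a `level` share (0.0 to 1.0) of documents."""

    def wrapped(tokens: List[str]) -> List[str]:
        return augmenter(tokens) if rng.random() < level else list(tokens)

    return wrapped


def repeat(augmenter: Augmenter, docs: Iterable[List[str]], n: int) -> Iterator[List[str]]:
    """Yield n augmented versions of every document."""
    for tokens in docs:
        for _ in range(n):
            yield augmenter(tokens)


docs = [["augmenty", "is", "neat"], ["hello", "world"]]
pipeline = combine([upper_case, double_last_letter])
always = with_doc_level(pipeline, level=1.0, rng=random.Random(0))
never = with_doc_level(pipeline, level=0.0, rng=random.Random(0))

augmented = [always(d) for d in docs]
untouched = [never(d) for d in docs]
repeated = list(repeat(always, docs, n=2))
```

Treating every augmenter as a plain function keeps the utilities small: combining is function composition, and the document-level percentage is a wrapper that decides per document whether to apply the wrapped augmenter.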

`Augmenty` is furthermore designed to be compatible with `spaCy`, and thus its augmenters can easily be utilized during the training of `spaCy` models.

# Example Use Cases

`Augmenty` has already been used in a number of projects. The code base was initially developed for evaluating the robustness and bias of `DaCy` [@Enevoldsen_DaCy_A_Unified_2021], a state-of-the-art Danish NLP pipeline. It is also continually used to evaluate Danish NER systems for biases and robustness on the DaCy website.
`Augmenty` has also been used to detect intersectional biases [@lassen-etal-2023-detecting] and within a benchmark of Danish language models [@sloth_dadebiasgenda-lens_2023].

Besides its existing use cases, `Augmenty` could, for example, also be used to a) upsample minority classes without duplicating samples, b) train less biased models, e.g. by replacing names with names of minority groups, c) train more robust models, e.g. by augmenting with typos, or d) generate pseudo-historical data by augmenting with known spelling variations of words.


# Target Audience

The package is mainly targeted at NLP researchers and practitioners who wish to augment their data for training or evaluation. The package is also targeted at researchers who wish to evaluate their models, either through augmentations or by generating new datasets.


# Acknowledgements
The authors thank the [contributors](https://github.com/KennethEnevoldsen/augmenty/graphs/contributors) of the package, notably Lasse Hansen, who provided meaningful feedback on the design of the package at early stages of development.
Binary file added paper/paper.pdf
Binary file not shown.
70 changes: 70 additions & 0 deletions paper/paper.qmd
@@ -0,0 +1,70 @@
---
title: 'Augmenty: A Python Library for Structured Text Augmentation'
format:
  arxiv-pdf:
    keep-tex: true
    linenumbers: false
    doublespacing: false
    runninghead: "Augmenty"
  arxiv-html: default

authors:
  - name: Kenneth Enevoldsen
    orcid: 0000-0001-8733-0966
    department: Center for Humanities Computing
    corresponding: true
    city: Aarhus, Denmark

abstract: |
  Augmenty is a Python library for structured text augmentation. It is built on top of spaCy and allows for augmentation of both the text and its annotations. Augmenty provides a wide range of augmenters which can be combined in a flexible manner to create complex augmentation pipelines. It also includes a set of primitives, such as word replacement augmenters, that can be used to create custom augmenters. This functionality allows for augmentations within a range of applications such as named entity recognition (NER), part-of-speech tagging, and dependency parsing.
keywords:
  - Python
  - natural language processing
  - spacy
  - augmentation

bibliography: paper.bib
---


# Summary
Text augmentation is a useful tool for training [@wei-zou-2019-eda] and evaluating [@ribeiro-etal-2020-beyond] natural language processing models and systems. Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility, being confined to basic tasks such as text classification or catering to specific downstream use cases such as estimating robustness [@goel-etal-2021-robustness]. To address these constraints, `Augmenty` provides structured text augmentation, augmenting the text together with its annotations. `Augmenty` integrates seamlessly with the popular NLP library `spaCy` [@spacy] and seeks to be compatible with all models and tasks supported by `spaCy`. `Augmenty` provides a wide range of augmenters which can be combined in a flexible manner to create complex augmentation pipelines. It also includes a set of primitives, such as word replacement augmenters, that can be used to create custom augmenters. This functionality allows for augmentations within a range of applications such as named entity recognition (NER), part-of-speech tagging, and dependency parsing.

# Statement of need
<!-- augmentation is useful -->
Augmentation is a powerful tool within disciplines such as computer vision [@wang2017effectiveness] and speech recognition [@Park2019SpecAugmentAS], where it is used both for training more robust models and for evaluating models' ability to handle perturbations. Within natural language processing (NLP), augmentation has seen some use as a tool for generating additional training data [@wei-zou-2019-eda], but has really shone as a tool for model evaluation, such as estimating robustness [@goel-etal-2021-robustness] and bias [@lassen-etal-2023-detecting], or for creating novel datasets [@nielsen-2023-scandeval].

Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility. Commonly, they only provide pure string augmentation, which typically leads to the annotations becoming misaligned with the text. This has limited the use of augmentation to tasks such as text classification, while neglecting structured prediction tasks such as named entity recognition (NER) or coreference resolution. As a result, augmentation has been unavailable for a wide range of tasks, both for training and for evaluation.

<!-- limitation of existing methods -->
Existing tools such as `textgenie` [@pandya_hetpandyatextgenie_2023] and `textaugment` [@marivate2020improving] implement powerful techniques such as backtranslation and paraphrasing, which are useful for augmentation in text-classification tasks. However, these tools neglect a category of tasks which require that the annotations stay aligned with the augmented text. For instance, even a simple augmentation such as replacing the named entity "Jane Doe" with "John" will lead to a misalignment of the NER annotations, part-of-speech tags, etc., which, if not properly handled, will lead to a misinterpretation of model performance or the generation of incorrect training samples.

`Augmenty` seeks to remedy this by providing a flexible and easy-to-use interface for structured text augmentation. `Augmenty` is built to integrate well with `spaCy` [@spacy] and seeks to be compatible with the broad set of tasks supported by `spaCy`. `Augmenty` provides augmenters which take in a spaCy `Doc` object (but work just as well with `string` objects) and return a new `Doc` object with the augmentations applied. This allows for augmentation of both the text and the annotations present in the `Doc` object.

Other tools for data augmentation focus on specific downstream applications, such as `textattack` [@morris2020textattack], which is useful for adversarial attacks on classification systems, or `robustnessgym` [@goel-etal-2021-robustness], which is useful for evaluating the robustness of classification systems. `Augmenty` does not seek to replace any of these tools but seeks to provide a general-purpose tool for augmentation of both the text and its annotations. This allows for augmentations within a range of applications such as named entity recognition, part-of-speech tagging, and dependency parsing.

# Features & Functionality

`Augmenty` is a Python library that implements augmentation based on `spaCy`'s `Doc` object. The `Doc` object is a container for a text and its annotations, which makes it easy to augment text and annotations simultaneously. The `Doc` object can easily be extended to include custom augmentation not available in `spaCy` by adding custom attributes to the `Doc` object. While `Augmenty` is built to augment `Doc`s, the object is easily converted into strings, lists or other formats. The annotations within a `Doc` can come either from existing annotations or from the predictions of an existing model.

`Augmenty` implements a series of augmenters for token-, span- and sentence-level augmentation. These augmenters range from primitive augmentations, such as word replacement, which can be used to quickly construct new augmenters, to language-specific augmenters such as keystroke error augmentations based on a French keyboard layout. `Augmenty` also integrates with other libraries such as `NLTK` [@bird2009natural] to allow for augmentations based on WordNet [@miller-1994-wordnet] and allows for the specification of static word vectors [@pennington-etal-2014-glove] to allow for augmentations based on word similarity. Lastly, `Augmenty` provides a set of utility functions for repeating augmentations, combining augmenters, or adjusting the percentage of documents that should be augmented. This allows for the flexible construction of augmentation pipelines specific to the task at hand.

`Augmenty` is furthermore designed to be compatible with `spaCy`, and thus its augmenters can easily be utilized during the training of `spaCy` models.

# Example Use Cases

`Augmenty` has already been used in a number of projects. The code base was initially developed for evaluating the robustness and bias of `DaCy` [@Enevoldsen_DaCy_A_Unified_2021], a state-of-the-art Danish NLP pipeline. It is also continually used to evaluate Danish NER systems for biases and robustness on the DaCy website.
`Augmenty` has also been used to detect intersectional biases [@lassen-etal-2023-detecting] and within a benchmark of Danish language models [@sloth_dadebiasgenda-lens_2023].

Besides its existing use cases, `Augmenty` could, for example, also be used to a) upsample minority classes without duplicating samples, b) train less biased models, e.g. by replacing names with names of minority groups, c) train more robust models, e.g. by augmenting with typos, or d) generate pseudo-historical data by augmenting with known spelling variations of words.


# Target Audience

The package is mainly targeted at NLP researchers and practitioners who wish to augment their data for training or evaluation. The package is also targeted at researchers who wish to evaluate their models, either through augmentations or by generating new datasets.


# Acknowledgements
The authors thank the [contributors](https://github.com/KennethEnevoldsen/augmenty/graphs/contributors) of the package, notably Lasse Hansen, who provided meaningful feedback on the design of the package at early stages of development.
19 changes: 19 additions & 0 deletions paper/readme.md
@@ -0,0 +1,19 @@
# Paper

`paper.md` contains the paper submitted to JOSS.

`paper.qmd` contains the same content as `paper.md` but with a different YAML preamble that allows it to be converted to a nice arXiv-style PDF. To render the arXiv preprint, do the following:

```
# install the quarto-arxiv template (https://github.com/mikemahoney218/quarto-arxiv)
quarto install extension mikemahoney218/quarto-arxiv
# render to pdf
quarto render paper.qmd --to arxiv-pdf
```

To render to PDF with the JOSS template, do the following:

```
pandoc --citeproc paper.md -o paper.pdf
```

1 comment on commit fad364a

@github-actions
Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty | | | | |
| `__init__.py` | 22 | 0 | 100% | |
| `augment_utilities.py` | 36 | 1 | 97% | 57 |
| `keyboard.py` | 41 | 1 | 98% | 40 |
| `util.py` | 41 | 1 | 98% | 71 |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/character | | | | |
| `__init__.py` | 5 | 0 | 100% | |
| `casing.py` | 20 | 0 | 100% | |
| `replace.py` | 34 | 0 | 100% | |
| `spacing.py` | 20 | 0 | 100% | |
| `swap.py` | 21 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/doc | | | | |
| `__init__.py` | 3 | 0 | 100% | |
| `casing.py` | 27 | 0 | 100% | |
| `subset.py` | 44 | 5 | 89% | 30, 49, 53, 55, 61 |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang | | | | |
| `__init__.py` | 17 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/da | | | | |
| `__init__.py` | 3 | 0 | 100% | |
| `augmenters.py` | 17 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/de | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/el | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/en | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/es | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/fr | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/hu | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/it | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/lt | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/mk | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/nb | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/nl | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/pl | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/pt | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/ro | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/lang/ru | | | | |
| `__init__.py` | 1 | 0 | 100% | |
| `keyboard.py` | 5 | 0 | 100% | |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/span | | | | |
| `__init__.py` | 3 | 0 | 100% | |
| `entities.py` | 150 | 10 | 93% | 35, 53–61, 364 |
| `utils.py` | 15 | 3 | 80% | 20, 31, 37 |
| /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/augmenty/token | | | | |
| `__init__.py` | 9 | 0 | 100% | |
| `casing.py` | 42 | 1 | 98% | 111 |
| `insert.py` | 103 | 6 | 94% | 36, 53–54, 198, 265, 283 |
| `replace.py` | 103 | 6 | 94% | 36, 201, 205, 241, 287, 301 |
| `spacing.py` | 39 | 0 | 100% | |
| `static_embedding_util.py` | 35 | 1 | 97% | 67 |
| `swap.py` | 62 | 4 | 94% | 50, 80, 91, 104 |
| `wordnet_util.py` | 10 | 3 | 70% | 7–9 |
| TOTAL | 1017 | 42 | 96% | |
