
Releases: UCREL/pymusas

v0.3.0

04 May 14:22 · 85a2855

What's new

Added 🎉

  • Roadmap added.
  • Defined the MWE template and its syntax; this is documented in Notes -> Multi Word Expression Syntax in the Usage section of the documentation. This is the first task of issue #24.
  • PEP 561 (Distributing and Packaging Type Information) compatible through the addition of a py.typed file.
  • Added srsly as a pip requirement. srsly is used to serialise components to bytes; for example, the pymusas.lexicon_collection.LexiconCollection.to_bytes function uses srsly to serialise the LexiconCollection to bytes.
  • An abstract class, pymusas.base.Serialise, that requires sub-classes to implement two methods, to_bytes and from_bytes, so that the class can be serialised.
  • pymusas.lexicon_collection.LexiconCollection has three new methods: to_bytes, from_bytes, and __eq__. These allow the collection to be serialised and compared to other collections.
  • A Lexicon Collection class for Multi Word Expressions (MWE), pymusas.lexicon_collection.MWELexiconCollection, which allows a user to easily create and/or load an MWE lexicon from a TSV file, such as the MWE lexicons from the Multilingual USAS repository. In addition, it can match an MWE template to the templates stored in the MWELexiconCollection class following the MWE special syntax rules; this is all done through the mwe_match method. It also supports Part Of Speech mapping, so that you can map from the lexicon's POS tagset to the tagset of your choice, in both a one-to-one and one-to-many mapping. Like pymusas.lexicon_collection.LexiconCollection, it contains to_bytes, from_bytes, and __eq__ methods for serialisation and comparison.
  • The rule based taggers have been componentised so that they are built from a List of Rules and a Ranker, whereby each Rule defines how one or more tokens in a text can be matched to a semantic category. Given the matches from the Rules, of which each token can have zero or more, the Ranker ranks each match and finds the global best match for each token in the text. The taggers now support direct match and wildcard Multi Word Expressions (a minimal sketch of how these pieces fit together is shown after this list). Due to this:
    • pymusas.taggers.rule_based.USASRuleBasedTagger has been changed and re-named to pymusas.taggers.rule_based.RuleBasedTagger and now only has a __call__ method.
    • pymusas.spacy_api.taggers.rule_based.USASRuleBasedTagger has been changed and re-named to pymusas.spacy_api.taggers.rule_based.RuleBasedTagger.
  • A Rule system, of which all rules can be found in pymusas.taggers.rules:
    • pymusas.taggers.rules.rule.Rule an abstract class that describes how sub-classes should define the __call__ method and its signature. This abstract class is sub-classed from pymusas.base.Serialise.
    • pymusas.taggers.rules.single_word.SingleWordRule a concrete sub-class of Rule for finding Single word lexicon entry matches.
    • pymusas.taggers.rules.mwe.MWERule a concrete sub-class of Rule for finding Multi Word Expression entry matches.
  • A Ranking system, of which all of the components that are linked to ranking can be found in pymusas.rankers:
    • pymusas.rankers.ranking_meta_data.RankingMetaData describes a lexicon entry match, typically generated from pymusas.taggers.rules.rule.Rule classes being called. Such a match indicates that some part of a text, one or more tokens, matches a lexicon entry, whether that is a Multi Word Expression or a single word lexicon entry.
    • pymusas.rankers.lexicon_entry.LexiconEntryRanker an abstract class that describes how other sub-classes should rank each token in the text and the expected output through the class's __call__ method. This abstract class is sub-classed from pymusas.base.Serialise.
    • pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker a concrete sub-class of LexiconEntryRanker based on the ranking rules from Piao et al. 2003.
    • pymusas.rankers.lexical_match.LexicalMatch describes the lexical match within a pymusas.rankers.ranking_meta_data.RankingMetaData object.
  • pymusas.utils.unique_pos_tags_in_lexicon_entry a function that, given a lexicon entry (either Multi Word Expression or single word), returns a Set[str] of the unique POS tags in the lexicon entry.
  • pymusas.utils.token_pos_tags_in_lexicon_entry a function that, given a lexicon entry (either Multi Word Expression or single word), yields a Tuple[str, str] of word and POS tag from the lexicon entry.
  • A mapping from USAS core to Universal Part Of Speech (UPOS) tagset.
  • A mapping from USAS core to basic CorCenCC POS tagset.
  • A mapping from USAS core to Penn Chinese Treebank POS tagset.
  • pymusas.lexicon_collection.LexiconMetaData, object that contains all of the meta data about a single or Multi Word Expression lexicon entry.
  • pymusas.lexicon_collection.LexiconType which describes the different types of single and Multi Word Expression (MWE) lexicon entries and templates that PyMUSAS uses, or will use in the case of curly braces.
  • The usage documentation, for the "How-to Tag Text", has been updated so that it includes an Indonesian example which does not use spaCy but instead uses the Indonesian TreeTagger.
  • spaCy registered functions for reading in a LexiconCollection or MWELexiconCollection from a TSV. These can be found in pymusas.spacy_api.lexicon_collection.
  • spaCy registered functions for creating SingleWordRule and MWERule. These can be found in pymusas.spacy_api.taggers.rules.
  • spaCy registered function for creating ContextualRuleBasedRanker. This can be found in pymusas.spacy_api.rankers.
  • spaCy registered function for creating a List of Rules; this can be found at pymusas.spacy_api.taggers.rules.rule_list.
  • LexiconCollection and MWELexiconCollection now open the TSV file downloaded through the from_tsv method using utf-8 encoding by default.
  • pymusas_rule_based_tagger is now a spaCy registered factory through an entry point.
  • MWELexiconCollection warns users that it does not support curly braces MWE template expressions.
  • All of the POS mappings can now be called through a spaCy registered function; all of these functions can be found in the pymusas.spacy_api.pos_mapper module.
  • Updated the Introduction and How-to Tag Text usage documentation with the new features that PyMUSAS now supports, e.g. MWEs. The How-to Tag Text guide has also been updated so that it uses the pre-configured spaCy components that have been created for each language; these components can be found and downloaded from the pymusas-models repository.
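
As referenced above, here is a minimal sketch of how the componentised pieces (lexicon collections, Rules, Ranker, and the new RuleBasedTagger) are intended to fit together. The class and module names are taken from the notes above, but the exact constructor arguments, the call signature, and the output shape are assumptions and may differ from the released API:

```python
from pymusas.lexicon_collection import LexiconCollection, MWELexiconCollection
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rule_based import RuleBasedTagger
from pymusas.taggers.rules.mwe import MWERule
from pymusas.taggers.rules.single_word import SingleWordRule

# Hypothetical lexicon files; in practice these would be single word and MWE
# lexicon TSVs, e.g. from the Multilingual USAS repository.
single_word_lexicon = LexiconCollection.from_tsv('single_word_lexicon.tsv')
mwe_lexicon = MWELexiconCollection.from_tsv('mwe_lexicon.tsv')

# One Rule per lexicon type; the second argument to SingleWordRule is assumed
# to be a lemma based lexicon lookup. Both rules can also take a POS mapper to
# map the incoming POS tagset onto the lexicon's tagset.
rules = [
    SingleWordRule(single_word_lexicon, single_word_lexicon),
    MWERule(mwe_lexicon),
]

# The Ranker picks the single best match for each token out of the zero or
# more matches the Rules produce (constructor arguments are assumptions).
ranker = ContextualRuleBasedRanker(maximum_n_gram_length=3, maximum_number_wildcards=1)

tagger = RuleBasedTagger(rules, ranker)

# Tag a text given its tokens, lemmas, and POS tags; the call signature and
# output shape are assumptions.
token_tags = tagger(['The', 'river', 'bank'], ['the', 'river', 'bank'], ['DET', 'NOUN', 'NOUN'])
```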

Removed 🗑

  • pymusas.taggers.rule_based.USASRuleBasedTagger this is now replaced with pymusas.taggers.rule_based.RuleBasedTagger.
  • pymusas.spacy_api.taggers.rule_based.USASRuleBasedTagger this is now replaced with pymusas.spacy_api.taggers.rule_based.RuleBasedTagger.
  • The Using PyMUSAS usage documentation page, as it requires updating.

Commits

cc52c6d Added languages that we support
a0f748b Merge pull request #32 from UCREL/mwe
5feb6ef Added the changes to the documentation
39b88ae Added link to MWE syntax notes
9b63279 Updated so that it uses the pre-configured models
91a7089 Added that we support MWE and have models that can be downloaded
61b8265 Needs to be updated before being added back into the documentation
4ff95aa version 0.3.0
2ab0d4b Added spacy registered functions for pos mappers
0b288bb Changed API loading page to the base module
6da04a9 MWE Lexicon Collection can handle curly braces being added but will be ignored
5042323 @reader to @misc due to config file format
f186803 isort
1e2d045 spacy factory entry point
17f7821 spacy factory entry point
014f73d Added rule_list spacy registered function
37fb15e No longer use OS default encoding
8c21fc8 CI does not fail on windows when it should, Fixed
67ee480 CI does not fail on windows when it should, DEBUGGING
e48017a isort
745b57a spacy registered function for ContextualRuleBasedRanker
fed00b2 Click issue with version 8.1.0
543b251 spacy registered functions for tagger rules
89d59ec Click issue with version 8.1.0
4b8a22c pytest issue with version 7.1.0
e4b75a5 Click issue with version 8.1.0
787496e spacy registered functions for lexicon collections
6fb5882 Added roadmap link
ca53cc6 ROADMAP from main branch
1626496 update
5a98ccd Now up to date
404da49 PEP 561 compatible by adding py.typed file
c55d991 Added py.typed
2575138 Added srsly as a requirement
bdc84bb Added srsly as a requirement
03ddc79 Moved the new_rule_based tagger into rule_based
d97aa08 Moved the new_rule_based tagger into rule_based
67e60ba flake8
92b43ab Updated examples
ea7fd40 Updated examples
f2a7d47 Added lexicon TSV file that was deleted after removing old tagger
20ba93b Removed old tagger
f316ef5 Serialised methods for custom classes
4135e67 eq methods for the LexiconEntryRanker classes
4ea243b eq methods for the LexiconCollection classes
15a9013 to and from bytes method for the ranker classes
85f96b6 to and from bytes method for SingleWordRule and LexiconCollection
2f1275b Compare meta data directly rather than through a for loop
c71c3b5 to and from bytes method for rule and MWE rule
75e341d to and from bytes methods for MWELexiconCollection
8b862e8 Added srsly as known third party package
dcd90e0 First version of roadmap
e631dd0 ignore abstract method in code coverage results
b017816 update_factory_attri...


v0.2.0

18 Jan 12:09 · 78db9ca

What's new

Added 🎉

  • Release process guide adapted from the AllenNLP release process guide; many thanks to the AllenNLP team for creating the original guide.
  • A mapping from the basic CorCenCC POS tagset to the USAS core POS tagset (a small lookup sketch is shown after this list).
  • The usage documentation, for the "How-to Tag Text", has been updated so that it includes a Welsh example which does not use spaCy but instead uses the CyTag toolkit.
  • A mapping from the Penn Chinese Treebank POS tagset to the USAS core POS tagset.
  • The documentation now clarifies that we use the Universal Dependencies Treebank version of the UPOS tagset rather than the original version from the paper by Petrov et al. 2012.
  • The usage documentation, for the "How-to Tag Text", has been updated so that the Chinese example includes using POS information.
  • A CHANGELOG file has been added. The format of the CHANGELOG file will now be used for the formats of all current and future GitHub release notes. For more information on the CHANGELOG file format see Keep a Changelog.
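
As an illustration of how the tagset mappings added in this release can be used, the lookup below maps a basic CorCenCC tag and a Penn Chinese Treebank tag onto USAS core POS tags. The module path, dictionary names, and value shape are assumptions based on the descriptions above rather than confirmed API:

```python
# Assumed module and dictionary names; each mapping is expected to be a plain
# dict from a source tagset tag to one or more USAS core POS tags.
from pymusas.pos_mapper import (
    BASIC_CORCENCC_TO_USAS_CORE,
    PENN_CHINESE_TREEBANK_TO_USAS_CORE,
)

print(BASIC_CORCENCC_TO_USAS_CORE.get('E', []))          # 'E' (noun) used as a hypothetical CorCenCC tag
print(PENN_CHINESE_TREEBANK_TO_USAS_CORE.get('NN', []))  # 'NN' is the Penn Chinese Treebank common noun tag
```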

Commits

9283107 Changed the publish release part of the instructions
fea9510 Prepare for release v0.2.0
5581882 Prepare for release v0.2.0
bd4c74f Prepare for release v0.2.0
3fa0346 Prepare for release v0.2.0
f548e08 Publish to PyPI only on releases rather than tags
85ac891 Merge pull request #23 from UCREL/welsh-example
e6efe4a Welsh USAS example
e476cde Welsh usage example
854bce6 Merge pull request #22 from UCREL/chinese-pos-tagset-mapping
e5e33bf Updated CHANGELOG
e6afd2d Corrected English
e7b0502 Clarification on UPOS tagset used
eab65a6 Changed name from Chinese Penn Treebank to Penn Chinese Treebank
aa8dc1d Added Chinese Penn Treebank to USAS core POS mapping
8344730 Updated the Chinese tag-text example to include POS information #19
c726a06 Benchmarking the welsh tagger
4bcc239 Added CHANGELOG file fixes #17
4ad6ab2 Added PyPI downloads badge
cf38ff0 Updated to docusaurus version 2.0.0-beta.14
60f0bb5 Updated to latest alpine and node
4b09298 Merge pull request #20 from UCREL/language-documentation
c18f092 Alphabetical order
d11b332 Portuguese example
a7170d7 English syntax mistake
d697df9 Spanish example
ea601c4 Italian example
b7af3b5 French example
6c441b0 Dutch example
01ccf6f position of the sections
547fd08 renamed
77e9a9c Clarification in the introduction
a20eac3 Chinese example
0202c5d Added better note formatting
207e023 Merge pull request #16 from UCREL/citation
b7299de added RSPM environment variable
ae02909 package missing error in Validate-CITATION-cff job
0a5eebf Added sudo
f3aeb47 Added citation.cff validator
12e7e65 Citation file and how to validate it
5033c98 Increased version number in preparation for next release
df67477 Changed homepage URL and removed bug tracker
414b3a4 Added badges from pypi and changed the emojis

v0.1.0 Initial Release

07 Dec 14:38 · 17d462a

In the initial release we have created a rule based tagger that has been built in two different ways:

  1. As a spaCy component that can be added to a spaCy pipeline; this is called pymusas.spacy_api.taggers.rule_based.USASRuleBasedTagger
  2. A non-spaCy version, called pymusas.taggers.rule_based.USASRuleBasedTagger

In this initial release we have concentrated on the first tagger, the spaCy version, and all of the usage guides are built around this version of the tagger. However, the second, non-spaCy version does work, but has fewer capabilities, e.g. no easy way of saving the tagger to a JSON file.
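
As a rough illustration of the spaCy route, the intended usage pattern looks something like the sketch below. The factory name ('usas_tagger'), the lexicon_lookup attribute, and the custom token extension are assumptions for illustration, so consult the usage guides for the exact names:

```python
import spacy

import pymusas.spacy_api.taggers.rule_based  # noqa: F401  importing registers the tagger factory
from pymusas.lexicon_collection import LexiconCollection

# Any spaCy pipeline that produces lemmas and UPOS tags will do; the small
# English model is used here purely as an example.
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])

# Factory name and lexicon attribute are assumptions.
usas_tagger = nlp.add_pipe('usas_tagger')
usas_tagger.lexicon_lookup = LexiconCollection.from_tsv('path/to/semantic_lexicon.tsv')

doc = nlp('The river bank was muddy.')
# The component is expected to write USAS tags to a custom token extension,
# e.g. token._.usas_tags (extension name assumed).
```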

We have also created a LexiconCollection class that allows a user to easily create and/or load a lexicon from a TSV file, such as the single word lexicons from the Multilingual USAS repository. This LexiconCollection can be used to format a lexicon file so that it can be used within the rule based tagger, as shown in the Using PyMUSAS tutorial.
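
A minimal sketch of that workflow, assuming the collection exposes a from_tsv method with an include_pos flag (the file path is a placeholder):

```python
from pymusas.lexicon_collection import LexiconCollection

# Load a single word lexicon TSV, e.g. one from the Multilingual USAS
# repository; the path and keyword are illustrative assumptions.
lexicon_lookup = LexiconCollection.from_tsv('path/to/semantic_lexicon.tsv', include_pos=True)

# Entries map a token (and, when included, its POS tag) to a ranked list of
# USAS semantic tags.
for key, usas_tags in list(lexicon_lookup.items())[:3]:
    print(key, usas_tags)
```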

Lastly, we have created a POS mapping module that contains a mapping between the Universal Part Of Speech (UPOS) tagset and the USAS core POS tagset. This can be used within the spaCy component version of the rule based tagger to convert the POS tags output by the spaCy POS model, which uses the UPOS tagset, to the USAS core tagset, which is used by the single word lexicons from the Multilingual USAS repository. For more information on the use of this mapping feature in the rule based tagger, see the Using PyMUSAS tutorial.
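
To make the direction of that conversion concrete, the fragment below maps a sequence of UPOS tags, as produced by a spaCy POS model, onto USAS core POS tags. The dictionary name UPOS_TO_USAS_CORE and the shape of its values are assumptions based on the description above:

```python
from pymusas.pos_mapper import UPOS_TO_USAS_CORE  # module and name are assumptions

# UPOS tags as output by a spaCy POS model.
upos_tags = ['DET', 'NOUN', 'VERB', 'ADP', 'PROPN']

# Each UPOS tag is assumed to map to one or more USAS core POS tags; take the
# first as the primary tag for this illustration.
usas_core_tags = [UPOS_TO_USAS_CORE.get(tag, ['NULL'])[0] for tag in upos_tags]
print(usas_core_tags)
```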