Changelog
Xav edited this page Jul 22, 2015 · 25 revisions
- Improvements of `interrogative` type detection (fix of some test cases, add new test cases)
- Numeric tokens now provide a `value` attribute representing the real value of the number, typed as a JavaScript `number`
- Fix `singular` attributes for noun token not being set
- Detectors can now be executed before dependency parsing
  - `compendium.detectors.add` is deprecated in favor of `compendium.detectors.after`, will be removed in v1.0.0
  - `compendium.detectors.after` registers detectors that will be executed after dependency parsing
  - `compendium.detectors.before` registers detectors that will be executed before dependency parsing
- Better `like` token handling: transform into preposition when possible (`I like that` vs `It's like that`)
- Better `have` token handling: rarely a noun
- Roman numerals handling (`Chapter IV`, `Henri III`)
- Improved Named Entity Recognition (more patterns such as `IO2009`, `CamelCased Inc.`, `Henry III`...)
- Bug fixes
- Avoid duplicate items in lexicon (leading to wrong PoS tagging and sentiment analysis)
- Avoid `raw` tokens being normalized
- Fix missing infinitives for some verb tokens
- Add `tense` attribute to verb tokens
- Fix #3: `raw` field is a reconstruction of the sentence, not the actual `raw` string. Fixed by providing the real raw string.
- Scaffold some code for multilingual use of compendium (for now, one build per language)
- Add post processors to lexer for language-specific tokens handling
- Reorganize sources to have a clean multilingual directory structure (to be continued)
- Create gulp build tasks for French language
- Add initial tests for French language
- Minor improvements of English dependency parsing with new tests
- Minor improvements of profiling
- NER fixes
- Fix infinite loop in dependency parsing
- New dependency parsing rules and tests
- New PoS rules
- Fix some token sentiment scores being skipped when building lexicon
- Add experimental dependency-based sentiment score propagation
- Allow lexicon sentiment scores to be floats
- Sentiment analysis: better "mixed" tagging by comparing amplitude to score in the case of low score + medium amplitude
- Better handling of quotes (lexer, PoS)
- Slight cleanup of some lexicon symbols
- Sentence types: add `refusal` type
- Negation detection slight refactoring (negation is expanded to the negation mark's master verb)
- Remove cleaner step (replaced by synonyms handler)
- Sentence types: add `approval` type
- Dependency parsing: add new governor ranks
- Token attributes: add `is_punc` attribute
- Add new Brill rules (+0.1% on Penn Treebank)
- Statistics: add `words` stat (number of actual words in a sentence: tokens length - punc, emots...)
- New tests + some tests refactoring
- Fix issues
  - Missing `'re` contraction
  - Lexer a bit too greedy with emoticons (was catching `-s` in `inter-sport`)
- Improved dependency parsing
- Third rank of governors
- More governor tag candidates
- Sentence type `imperative` detected by looking for `VB` governors
- New Brill rules
- Remove annoying `console.log`
- A few new Brill rules
- Better looking example page + readme screenshot
- Fix bug that skipped a lot of emoticons when building lexicons
- Verbs
- Irregular verbs conjugation + integration in lexicon
- Regular verbs in Lexicon
- Basic tense detection (for simple sentences, based on dependency parsing)
- Numerous new Brill's rules for PoS tagging (92.519% on Penn Treebank)
- Improved dependency parsing
- Trie class interface
- Bit of code documentation
- Sentence detectors are now applied directly in the analysis sentence loop (no longer in a dedicated second loop)
- New attributes for tokens (`is_verb`, `infinitive`, `is_noun`, `plural`, `singular`)
- `*in` > `*ing` inference (if a word ends with `in`, is not in lexicon, and the same word plus `g` exists in lexicon, then infer it as `VBG`)
- New tests
- Basically working dependency parsing
- Bug fixes/improvements
- Bit of project cleanup
- Move benchmark folder to test/
- Remove find utility (use grep!)
- Improved token PoS tagging (+0.8% on Penn Treebank!):
- Order of detectors changed
- Better management of composed words
- First step of scaffolding for dependency parsing feature
- All regular verbs now conjugated (and/or conjugable)
- PoS tagging for verbs greatly improved
- Better packing of verbs and nationalities (-2 kB)
- Better filtering of lexicon (-1 kB)
- Reorganised the project a bit
- Lexicon data files moved to src/lexicon
- Compendium data files moved to src/dictionaries
- Lots of new tests (isSingular, verbs, lexicon...)
- Refactored detectors API so it's a bit less verbose
- Better sentiment profiling for mixed sentiment, in particular when using multiple adverbs
- Politeness, dirtiness scores
- Synonyms feature for tokens normalization
- Used by PoS tagger in case no other method returned a tag
- Add `interrogative` and `exclamatory` sentence types
- Fix low confidence for obvious PoS tagging (CD, SYM...)
- [Gulpfile] Add test run on live rebuild
- Statistics skips punctuation tokens
- Improve verb inflector
- Better sentiment profiling
- Better breakpoint detection
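
The `*in` > `*ing` inference entry above describes a small rule that can be sketched in a few lines. The following is only an illustration of the rule as worded in the changelog, not Compendium's actual code: the `lexicon` map, its sample entries, and the `inferIngTag` function name are all hypothetical.

```javascript
// Hypothetical lexicon mapping known words to PoS tags (illustrative data only,
// not Compendium's real lexicon format).
const lexicon = new Map([
  ['walking', 'VBG'],
  ['going', 'VBG'],
  ['cabin', 'NN'],
]);

// Sketch of the rule: a word ending with "in" that is unknown to the lexicon,
// while the same word plus "g" is known, is inferred as VBG.
// Returns 'VBG' when the rule applies, otherwise null.
function inferIngTag(word, lexicon) {
  const lower = word.toLowerCase();
  if (!lower.endsWith('in')) return null;  // must end with "in"
  if (lexicon.has(lower)) return null;     // must not already be in the lexicon
  return lexicon.has(lower + 'g') ? 'VBG' : null; // "<word>g" must be in the lexicon
}

console.log(inferIngTag('walkin', lexicon)); // VBG
console.log(inferIngTag('cabin', lexicon));  // null (already in the lexicon)
console.log(inferIngTag('muffin', lexicon)); // null ("muffing" is not in the lexicon)
```

Checking that the word itself is absent from the lexicon first keeps ordinary nouns such as `cabin` from being retagged.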
Compendium-js, English NLP for Node.js and the browser, MIT Licensed