ja-compromise
(妥協) is a Japanese port of compromise (nlp-compromise), the English-language JavaScript NLP library.
The goal of this project is to provide a small, basic, rule-based POS tagger.
import nlp from 'ja-compromise'
let doc = nlp('小さな子供は食料品店に歩いた')
doc.match('#Noun').out('array')
// [ '子', '食料品店']
or in the browser:
<script src="https://unpkg.com/ja-compromise"></script>
<script>
let txt = '小さな子供が食料品を買いました。 彼はとても怖がっていた'
let doc = jaCompromise(txt)
console.log(doc.sentences(1).json())
// { text:'小さな子供が食...', terms:[ ... ] }
</script>
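To make "rule-based" concrete, here is a minimal sketch of how a tagger of this kind can work: an ordered list of rules is tried for each term, and the first one that fires decides the tag. Everything in it (the `LEXICON` map, the `PARTICLES` set, the `tagTerm` and `tagTerms` helpers, and the rules themselves) is hypothetical and written for illustration only. It is not ja-compromise's actual lexicon, rule set, or API, and it assumes the text has already been split into terms.

```javascript
// Toy rule-based tagger. NOT ja-compromise's real rules or lexicon.
// Assumption: the input is already split into terms.
const LEXICON = new Map([
  ['子供', 'Noun'],
  ['食料品店', 'Noun'],
  ['小さな', 'Adjective'],
  ['歩いた', 'Verb'],
])
const PARTICLES = new Set(['は', 'が', 'を', 'に', 'で', 'の', 'と', 'も'])

function tagTerm(term) {
  // Rule 1: an exact lexicon hit wins
  if (LEXICON.has(term)) return LEXICON.get(term)
  // Rule 2: closed-class particles
  if (PARTICLES.has(term)) return 'Particle'
  // Rule 3: an all-katakana term is guessed to be a (loanword) noun
  if (/^[\u30A0-\u30FF]+$/.test(term)) return 'Noun'
  // Rule 4: a plain past-tense ending is guessed to be a verb
  if (/(た|だ)$/.test(term)) return 'Verb'
  return 'Unknown'
}

function tagTerms(terms) {
  return terms.map((text) => ({ text, tag: tagTerm(text) }))
}

const tagged = tagTerms(['小さな', '子供', 'は', '食料品店', 'に', '歩いた'])
const nouns = tagged.filter((t) => t.tag === 'Noun').map((t) => t.text)
console.log(nouns)
// [ '子供', '食料品店' ]
```

The rule order matters: the lexicon is checked first so that known words are never overridden by the looser suffix and script heuristics.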
See en-compromise/api for full API documentation.
ja-compromise includes all the methods of compromise/one:
- .text() - return the document as text
- .json() - return the document as data
- .debug() - pretty-print the interpreted document
- .out() - a named or custom output
- .html({}) - output custom html tags for matches
- .wrap({}) - produce custom output for document matches
- .found [getter] - is this document empty?
- .docs [getter] - get term objects as json
- .length [getter] - count the # of characters in the document (string length)
- .isView [getter] - identify a compromise object
- .compute() - run a named analysis on the document
- .clone() - deep-copy the document, so that no references remain
- .termList() - return a flat list of all Term objects in match
- .cache({}) - freeze the current state of the document, for speed-purposes
- .uncache() - un-freezes the current state of the document, so it may be transformed
- .all() - return the whole original document ('zoom out')
- .terms() - split-up results by each individual term
- .first(n) - use only the first result(s)
- .last(n) - use only the last result(s)
- .slice(n,n) - grab a subset of the results
- .eq(n) - use only the nth result
- .firstTerms() - get the first word in each match
- .lastTerms() - get the end word in each match
- .fullSentences() - get the whole sentence for each match
- .groups() - grab any named capture-groups from a match
- .wordCount() - count the # of terms in the document
- .confidence() - an average score for pos tag interpretations
(match methods use the match-syntax.)
- .match('') - return a new Doc, with this one as a parent
- .not('') - return all results except for this
- .matchOne('') - return only the first match
- .if('') - return each current phrase, only if it contains this match ('only')
- .ifNo('') - Filter-out any current phrases that have this match ('notIf')
- .has('') - Return a boolean if this match exists
- .before('') - return all terms before a match, in each phrase
- .after('') - return all terms after a match, in each phrase
- .union() - return combined matches without duplicates
- .intersection() - return only duplicate matches
- .complement() - get everything not in another match
- .settle() - remove overlaps from matches
- .growRight('') - add any matching terms immediately after each match
- .growLeft('') - add any matching terms immediately before each match
- .grow('') - add any matching terms before or after each match
- .sweep(net) - apply a series of match objects to the document
- .splitOn('') - return a Document with three parts for every match ('splitOn')
- .splitBefore('') - partition a phrase before each matching segment
- .splitAfter('') - partition a phrase after each matching segment
- .lookup([]) - quick find for an array of string matches
- .autoFill() - create type-ahead assumptions on the document
- .tag('') - Give all terms the given tag
- .tagSafe('') - Only apply tag to terms if it is consistent with current tags
- .unTag('') - Remove this tag from the given terms
- .canBe('') - return only the terms that can be this tag
- .toLowerCase() - turn every letter of every term to lower case
- .toUpperCase() - turn every letter of every term to upper case
- .toTitleCase() - upper-case the first letter of each term
- .toCamelCase() - remove whitespace and title-case each term
- .pre('') - add this punctuation or whitespace before each match
- .post('') - add this punctuation or whitespace after each match
- .trim() - remove start and end whitespace
- .hyphenate() - connect words with hyphen, and remove whitespace
- .dehyphenate() - remove hyphens between words, and set whitespace
- .toQuotations() - add quotation marks around these matches
- .toParentheses() - add brackets around these matches
- .map(fn) - run each phrase through a function, and create a new document
- .forEach(fn) - run a function on each phrase, as an individual document
- .filter(fn) - return only the phrases that return true
- .find(fn) - return a document with only the first phrase that matches
- .some(fn) - return true or false if there is one matching phrase
- .random(n) - sample a subset of the results
- .replace(match, replace) - search and replace match with new content
- .replaceWith(replace) - substitute-in new text
- .remove() - fully remove these terms from the document
- .insertBefore(str) - add these new terms to the front of each match (prepend)
- .insertAfter(str) - add these new terms to the end of each match (append)
- .concat() - add these new things to the end
- .swap(fromLemma, toLemma) - smart replace of root-words, using proper conjugation
- .sort('method') - re-arrange the order of the matches (in place)
- .reverse() - reverse the order of the matches, but not the words
- .normalize({}) - clean-up the text in various ways
- .unique() - remove any duplicate matches
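The three split methods in the list above (.splitOn(), .splitBefore(), .splitAfter()) differ only in where the cut falls relative to each match. The sketch below models that difference on a plain array of terms. The helper names and the single-term `isMatch` predicate are assumptions made for illustration. The real methods take a match-syntax string and return documents, not arrays.

```javascript
// Hypothetical helpers on a plain array of terms, for illustration only.
// isMatch is a predicate on a single term (real matches may span several terms).

// Cut just before each matching term: the match starts a new part.
function splitBefore(terms, isMatch) {
  const parts = []
  let current = []
  for (const term of terms) {
    if (isMatch(term) && current.length > 0) {
      parts.push(current)
      current = []
    }
    current.push(term)
  }
  if (current.length > 0) parts.push(current)
  return parts
}

// Cut just after each matching term: the match ends the current part.
function splitAfter(terms, isMatch) {
  const parts = []
  let current = []
  for (const term of terms) {
    current.push(term)
    if (isMatch(term)) {
      parts.push(current)
      current = []
    }
  }
  if (current.length > 0) parts.push(current)
  return parts
}

// Cut on both sides: before-part, the match itself, after-part.
function splitOn(terms, isMatch) {
  const parts = []
  let current = []
  for (const term of terms) {
    if (isMatch(term)) {
      if (current.length > 0) parts.push(current)
      parts.push([term])
      current = []
    } else {
      current.push(term)
    }
  }
  if (current.length > 0) parts.push(current)
  return parts
}

const terms = ['子供', 'は', '店', 'に', '歩いた']
const isNi = (term) => term === 'に'

console.log(splitOn(terms, isNi))
// [ [ '子供', 'は', '店' ], [ 'に' ], [ '歩いた' ] ]
console.log(splitBefore(terms, isNi))
// [ [ '子供', 'は', '店' ], [ 'に', '歩いた' ] ]
console.log(splitAfter(terms, isNi))
// [ [ '子供', 'は', '店', 'に' ], [ '歩いた' ] ]
```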
(these methods are on the main nlp object)
- nlp.tokenize(str) - parse text without running POS-tagging
- nlp.lazy(str, match) - scan through a text with minimal analysis
- nlp.plugin({}) - mix in a compromise-plugin
- nlp.parseMatch(str) - pre-parse any match statements into json
- nlp.world() - grab or change library internals
- nlp.model() - grab all current linguistic data
- nlp.methods() - grab or change internal methods
- nlp.hooks() - see which compute methods run automatically
- nlp.verbose(mode) - log our decision-making for debugging
- nlp.version - current semver version of the library
- nlp.addWords(obj) - add new words to the lexicon
- nlp.addTags(obj) - add new tags to the tagSet
- nlp.typeahead(arr) - add words to the auto-fill dictionary
- nlp.buildTrie(arr) - compile a list of words into a fast lookup form
- nlp.buildNet(arr) - compile a list of matches into a fast match form
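To show why compiling a word list into a trie (as nlp.buildTrie is described as doing for .lookup()) makes lookup fast, here is a minimal, self-contained sketch of the general technique. The names `buildToyTrie`, `toyLookup`, and `END` are hypothetical, and nothing here reflects the library's internal data structures. It scans unsegmented text character by character, which suits Japanese, and reports the longest dictionary word starting at each position.

```javascript
// Toy trie-based lookup, for illustration only (not the library's internals).
const END = Symbol('end') // marks a node where a complete word ends

// Compile a word list into nested Maps, one level per character.
function buildToyTrie(words) {
  const root = new Map()
  for (const word of words) {
    let node = root
    for (const ch of word) {
      if (!node.has(ch)) node.set(ch, new Map())
      node = node.get(ch)
    }
    node.set(END, true)
  }
  return root
}

// Walk the trie from every start position and keep the longest word found there.
function toyLookup(text, trie) {
  const chars = Array.from(text)
  const found = []
  for (let i = 0; i < chars.length; i++) {
    let node = trie
    let longest = null
    for (let j = i; j < chars.length; j++) {
      node = node.get(chars[j])
      if (!node) break // no dictionary word continues with this character
      if (node.has(END)) longest = chars.slice(i, j + 1).join('')
    }
    if (longest !== null) found.push(longest)
  }
  return found
}

const trie = buildToyTrie(['子供', '食料品', '食料品店'])
console.log(toyLookup('小さな子供は食料品店に歩いた', trie))
// [ '子供', '食料品店' ]
```

Each start position costs at most as many steps as the longest word in the list, regardless of how many words were compiled, which is the point of building the trie once up front.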
Please join to help!
git clone https://github.com/nlp-compromise/ja-compromise.git
cd ja-compromise
npm install
npm test
npm run watch
- spacy/japanese - Python tagger/tokenizer, by Explosion AI
- MeCab - C/C++ tokenizer/tagger, by Taku Kudo
- fugashi - Cython wrapper for MeCab, by Paul O'Leary McCann
- janome - Python tokenizer/tagger, by Tomoko Uchida
- sudachi - tokenizer/tagger, by Arseny Tolmachev