A utility for analyzing frequency of text chunks on the web.
Supply a bit o' text to the Methodius class, and it will determine your bigrams, trigrams, ngrams, letter frequencies, word frequencies, and bigram relationships, and build ngram trees.
const { Methodius } = require('methodius');
// or import { Methodius } from 'methodius';
const udhr1 = `
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
`;
const nGrams = new Methodius(udhr1);
const topLetters = nGrams.getTopLetters(10);
const topWords = nGrams.getTopWords(10);
Global Class
new Methodius(text)
Parameters
name | type | Description |
---|---|---|
text | string | raw text to be analyzed |
Characters to ignore when analyzing text: period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, some spaces
\\.,;:!?‽¡¿⸘()\\[\\]{}<>’'…\"\n\t\r
Characters to ignore AND CONSUME when trying to find words: em-dash, period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, space
—\\.,;:!?‽¡¿⸘()\\[\\]{}<>…"\\s
determines if string contains punctuation
Parameters
name | type | Description |
---|---|---|
string | string |
Returns
boolean
determines if string contains symbols
Parameters
name | type | Description |
---|---|---|
string | string |
Returns
boolean
determines if a string has a space
Parameters
name | type | Description |
---|---|---|
string | string |
Returns
boolean
lowercases text and removes diacritics and other characters that would throw off n-gram analysis
Parameters
name | type | Description |
---|---|---|
string | string |
Returns
string
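A minimal sketch of one way to lowercase text and strip diacritics, using Unicode normalization. This is an assumption about the approach, not the library's implementation: `sanitizeSketch` is a hypothetical name, and it only removes combining marks, whereas Methodius may remove other characters as well.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function sanitizeSketch(string) {
  return string
    .toLowerCase()
    .normalize('NFD')        // split accented letters into base letter + combining mark
    .replace(/\p{M}/gu, ''); // drop the combining marks
}

console.log(sanitizeSketch('Crème Brûlée')); // "creme brulee"
```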
extracts an array of words from a string
Parameters
name | type | Description |
---|---|---|
text | string |
Returns
Array<string>
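As a rough illustration, word extraction can be sketched by splitting on the word-separator characters listed earlier. `wordsSketch` is a hypothetical name, and whether the library also lowercases or otherwise sanitizes the words at this step is not stated here, so this sketch leaves the text untouched.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
// The character class mirrors the "ignore AND CONSUME" separators documented above.
const WORD_SEPARATORS_SKETCH = /[—.,;:!?‽¡¿⸘()\[\]{}<>…"\s]+/;

function wordsSketch(text) {
  // Split on runs of separators and drop the empty strings left at the edges.
  return text.split(WORD_SEPARATORS_SKETCH).filter(Boolean);
}

console.log(wordsSketch('Hello, world! (again)')); // [ 'Hello', 'world', 'again' ]
```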
gets ngrams from text
Parameters
name | type | Description |
---|---|---|
text | string | |
gramSize | number | default = 2 |
Returns
Array<string>
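Letter ngrams are typically collected with a sliding window. The sketch below assumes that windows containing whitespace are skipped so that ngrams never straddle two words; `letterNgramsSketch` is a hypothetical name, and the library's own filtering (for example of punctuation) may differ.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function letterNgramsSketch(text, gramSize = 2) {
  const ngrams = [];
  for (let i = 0; i + gramSize <= text.length; i++) {
    const gram = text.slice(i, i + gramSize);
    // Assumed behavior: ignore windows that cross a word boundary.
    if (!/\s/.test(gram)) ngrams.push(gram);
  }
  return ngrams;
}

console.log(letterNgramsSketch('to be', 2)); // [ 'to', 'be' ]
```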
Gets average size of a word
Parameters
name | type | Description |
---|---|---|
wordArray | string[] |
Returns
number
Gets the median (middle) size of a word
Parameters
name | type | Description |
---|---|---|
wordArray | string[] |
Returns
number
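The mean and median word sizes can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: empty input returns 0, and an even number of words averages the two middle sizes.

```javascript
// Hypothetical helpers for illustration; not part of Methodius.
function meanWordSizeSketch(wordArray) {
  if (wordArray.length === 0) return 0; // assumption: empty input yields 0
  const total = wordArray.reduce((sum, word) => sum + word.length, 0);
  return total / wordArray.length;
}

function medianWordSizeSketch(wordArray) {
  if (wordArray.length === 0) return 0; // assumption: empty input yields 0
  const sizes = wordArray.map((word) => word.length).sort((a, b) => a - b);
  const middle = Math.floor(sizes.length / 2);
  // Odd count: take the middle size. Even count: average the two middle sizes.
  return sizes.length % 2 === 1
    ? sizes[middle]
    : (sizes[middle - 1] + sizes[middle]) / 2;
}

const sampleWords = ['all', 'human', 'beings', 'are', 'born'];
console.log(meanWordSizeSketch(sampleWords));   // 4.2
console.log(medianWordSizeSketch(sampleWords)); // 4
```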
Gets word ngrams from text (2-word pairs by default).
Note: This doesn't use sentence punctuation as a boundary. Should it?
Parameters
name | type | Description |
---|---|---|
text | string | |
gramSize | number | default=2 |
Returns
Array<string>
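A sketch of word-level ngrams, consistent with the note above in that sentence punctuation is not treated as a boundary. `wordNgramsSketch` is a hypothetical name, and joining each group with a single space is an assumption about the output format.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function wordNgramsSketch(text, gramSize = 2) {
  const words = text.split(/\s+/).filter(Boolean);
  const ngrams = [];
  for (let i = 0; i + gramSize <= words.length; i++) {
    // Assumption: each ngram is returned as its words joined by one space.
    ngrams.push(words.slice(i, i + gramSize).join(' '));
  }
  return ngrams;
}

console.log(wordNgramsSketch('born free and equal'));
// [ 'born free', 'free and', 'and equal' ]
```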
converts an array of strings into a map of those strings and their number of occurrences
Parameters
name | type | Description |
---|---|---|
ngramArray | Array<string> |
Returns
Map<string, number>
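Counting occurrences into a `Map` can be sketched like this (`frequencyMapSketch` is a hypothetical name, shown only to illustrate the shape of the result):

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function frequencyMapSketch(ngramArray) {
  const frequencies = new Map();
  for (const ngram of ngramArray) {
    frequencies.set(ngram, (frequencies.get(ngram) || 0) + 1);
  }
  return frequencies;
}

console.log([...frequencyMapSketch(['th', 'he', 'th'])]);
// [ [ 'th', 2 ], [ 'he', 1 ] ]
```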
converts a frequency map into a map of percentages
Parameters
name | type | Description |
---|---|---|
frequencyMap | Map<string, number> |
Returns
Map<string, number>
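A sketch of the conversion, assuming each value is expressed as a fraction of the total count (0 to 1). Whether the library returns fractions or values scaled to 100 is not stated here, and `percentMapSketch` is a hypothetical name.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function percentMapSketch(frequencyMap) {
  let total = 0;
  for (const count of frequencyMap.values()) total += count;

  const percentMap = new Map();
  for (const [key, count] of frequencyMap) {
    // Assumption: a share of the total, expressed as a fraction between 0 and 1.
    percentMap.set(key, total === 0 ? 0 : count / total);
  }
  return percentMap;
}

console.log([...percentMapSketch(new Map([['th', 3], ['he', 1]]))]);
// [ [ 'th', 0.75 ], [ 'he', 0.25 ] ]
```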
filters a frequency map into only a small subset of the most frequent ones
Parameters
name | type | Description |
---|---|---|
frequencyMap | Map<string, number> | |
limit | number | default=20 |
Returns
Map<string, number>
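Filtering to the most frequent entries can be sketched by sorting the entries by count and keeping the first `limit` of them. `topGramsSketch` is a hypothetical name, and how the library orders entries with equal counts is not documented here.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function topGramsSketch(frequencyMap, limit = 20) {
  // Sort a copy of the entries from most to least frequent.
  const sorted = [...frequencyMap.entries()].sort((a, b) => b[1] - a[1]);
  // A Map preserves insertion order, so the result iterates from the top down.
  return new Map(sorted.slice(0, limit));
}

const bigramCounts = new Map([['he', 2], ['th', 5], ['er', 3]]);
console.log([...topGramsSketch(bigramCounts, 2)]);
// [ [ 'th', 5 ], [ 'er', 3 ] ]
```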
Returns an array of items that occur in both iterables
Parameters
name | type | Description |
---|---|---|
iterable1 | `Map | Array` |
iterable2 | `Map | Array` |
Returns
Array<any>
An array of items that occur in both iterables. It will compare the keys, if sent a map
Returns an array that is the union of two iterables
Parameters
name | type | Description |
---|---|---|
iterable1 | `Map | Array` |
iterable2 | `Map | Array` |
Returns
Array<any>
A union of the items that occur in both iterables.
Returns an array of arrays of the unique items in either iterable. Also known as the symmetric difference
Parameters
name | type | Description |
---|---|---|
iterable1 | `Map | Array` |
iterable2 | `Map | Array` |
Returns
Array<Array<any>>
An array of arrays of the unique items. The first item is the first parameter, 2nd item second param
Returns an array of items that are unique only to the first parameter.
Parameters
name | type | Description |
---|---|---|
iterable1 | `Map | Array` |
iterable2 | `Map | Array` |
Returns
Array<any>
An array of items unique only to the first parameter
Returns a map containing various comparisons between two iterables
Parameters
name | type | Description |
---|---|---|
iterable1 | `Map | Array` |
iterable2 | `Map | Array` |
Returns
Map<string, Array>
A map containing various comparisons between two iterables. Those comparisons will be arrays of intersection, disjunctiveUnion, difference, and union.
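The four set operations and the combined comparison map can be sketched as below. All function names are hypothetical, and removing duplicate items from the results is an assumption. As described above, maps are compared by key.

```javascript
// Hypothetical helpers for illustration; not part of Methodius.
function toItems(iterable) {
  // Maps are compared by their keys; arrays by their values.
  return iterable instanceof Map ? [...iterable.keys()] : [...iterable];
}

function intersectionSketch(a, b) {
  const inB = new Set(toItems(b));
  return [...new Set(toItems(a))].filter((item) => inB.has(item));
}

function unionSketch(a, b) {
  return [...new Set([...toItems(a), ...toItems(b)])];
}

function differenceSketch(a, b) {
  const inB = new Set(toItems(b));
  return [...new Set(toItems(a))].filter((item) => !inB.has(item));
}

function disjunctiveUnionSketch(a, b) {
  // First array: unique to the first parameter. Second array: unique to the second.
  return [differenceSketch(a, b), differenceSketch(b, a)];
}

function comparisonSketch(a, b) {
  return new Map([
    ['intersection', intersectionSketch(a, b)],
    ['disjunctiveUnion', disjunctiveUnionSketch(a, b)],
    ['difference', differenceSketch(a, b)],
    ['union', unionSketch(a, b)],
  ]);
}

const lettersA = ['a', 'b', 'c'];
const lettersB = new Map([['b', 4], ['c', 1], ['d', 2]]);
console.log(comparisonSketch(lettersA, lettersB));
```

With these inputs the intersection is `['b', 'c']`, the disjunctive union is `[['a'], ['d']]`, the difference is `['a']`, and the union is `['a', 'b', 'c', 'd']`.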
determines the placement of a single ngram in an array of words
Parameters
name | type | Description |
---|---|---|
ngram | string | |
wordsArray | Array<string> | |
Returns
Map<string, number>
a map with the keys 'start', 'middle', and 'end' whose values correspond to how often the provided ngram occurs in this position
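One way to count placements is to scan each word for the ngram and classify every match by its index. This is a sketch under assumptions: `placementSketch` is a hypothetical name, overlapping matches are all counted, and a word that is exactly the ngram counts as a 'start'. The library may resolve those cases differently.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function placementSketch(ngram, wordsArray) {
  const placements = new Map([['start', 0], ['middle', 0], ['end', 0]]);
  if (!ngram) return placements; // nothing to search for

  for (const word of wordsArray) {
    let index = word.indexOf(ngram);
    while (index !== -1) {
      let position = 'middle';
      if (index === 0) position = 'start'; // assumption: a whole-word match is a 'start'
      else if (index + ngram.length === word.length) position = 'end';
      placements.set(position, placements.get(position) + 1);
      index = word.indexOf(ngram, index + 1); // keep scanning for later matches
    }
  }
  return placements;
}

console.log([...placementSketch('on', ['on', 'nation', 'onto', 'monday'])]);
// [ [ 'start', 2 ], [ 'middle', 1 ], [ 'end', 1 ] ]
```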
determines the placement of ngrams in an array of words
Parameters
name | type | Description |
---|---|---|
ngram | Array<string> | |
wordsArray | Array<string> | |
Returns
Map<string, Map<string, number>>
a map with the key of the ngram, and the value that is a map containing start, middle, end
gets ngrams from an array of words
Parameters
name | type | Description |
---|---|---|
wordArray | Array<string> | an array of words |
ngramSize | number | default = 2. The size of the ngrams to return |
Returns
Array<Array<string>>
An array containing arrays of ngrams, each array corresponds to a word.
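The returned shape, one inner array of ngrams per word, can be sketched as follows. `ngramCollectionsSketch` is a hypothetical name and this is not the library's implementation.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function ngramCollectionsSketch(wordArray, ngramSize = 2) {
  return wordArray.map((word) => {
    const ngrams = [];
    for (let i = 0; i + ngramSize <= word.length; i++) {
      ngrams.push(word.slice(i, i + ngramSize));
    }
    return ngrams; // words shorter than ngramSize yield an empty array
  });
}

console.log(ngramCollectionsSketch(['nation', 'on'], 2));
// [ [ 'na', 'at', 'ti', 'io', 'on' ], [ 'on' ] ]
```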
using a collection returned from getNgramCollections, searches for a string and returns what comes before and after it
Parameters
name | type | Description |
---|---|---|
searchText | string | the string to search for |
ngramCollections | `Array | Array<Array>` | |
siblingSize | number | default = 1. How many siblings to find in front or behind |
Returns
Map<'before'|'after',Map<string, number>>
a Map with the keys 'before' and 'after' which contain maps of what comes before and after
Example
const words = ['revolution', 'nation'];
const ngramCollections = Methodius.getNgramCollections(words, 2);
const ioSiblings = Methodius.getNgramSiblings('io', ngramCollections);
/*
new Map([
  ['before', new Map([
    ['ti', 2]
  ])],
  ['after', new Map([
    ['on', 2]
  ])]
])
*/
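To show how a result like the one above could be produced, here is a standalone sketch of a sibling search. It is not the library's implementation: `siblingsSketch` is a hypothetical name, and it assumes that `siblingSize` means how many neighboring positions within the same word's collection to count on each side.

```javascript
// Hypothetical helper for illustration; not part of Methodius.
function siblingsSketch(searchText, ngramCollections, siblingSize = 1) {
  const before = new Map();
  const after = new Map();
  const bump = (map, key) => map.set(key, (map.get(key) || 0) + 1);

  for (const collection of ngramCollections) {
    collection.forEach((ngram, index) => {
      if (ngram !== searchText) return;
      // Assumption: count every neighbor up to siblingSize positions away.
      for (let offset = 1; offset <= siblingSize; offset++) {
        if (index - offset >= 0) bump(before, collection[index - offset]);
        if (index + offset < collection.length) bump(after, collection[index + offset]);
      }
    });
  }
  return new Map([['before', before], ['after', after]]);
}

// Bigram collections for 'revolution' and 'nation', written out by hand.
const sketchCollections = [
  ['re', 'ev', 'vo', 'ol', 'lu', 'ut', 'ti', 'io', 'on'],
  ['na', 'at', 'ti', 'io', 'on'],
];
const sketchSiblings = siblingsSketch('io', sketchCollections);
console.log([...sketchSiblings.get('before')]); // [ [ 'ti', 2 ] ]
console.log([...sketchSiblings.get('after')]);  // [ [ 'on', 2 ] ]
```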
Gets the ngrams that will occur before or after other ngrams. Useful for finding patterns of ngrams.
Parameters
name | type | Description |
---|---|---|
words | Array<string> | an array of words to evaluate |
ngrams | Map<string, number> | a frequency map of ngrams |
ngramSize | number | default = 2. the size of the ngram |
Returns
Map<string, number>
A frequency map of how often ngrams occurred before or after other ngrams
Example
This requires several steps. You'll need an array of words and a frequency map of ngrams.
const ngrams = getNGrams('the revolution of the nation was on television. It was about pollution and the terrible situation ', 2);
const frequencyMap = getFrequencyMap(ngrams);
const topNgrams = getTopGrams(frequencyMap, 5);
const words = ['the', 'revolution', 'of', 'the', 'nation', 'was', 'on', 'television', 'it', 'was', 'about', 'pollution', 'and', 'the', 'terrible', 'situation' ];
const relatedNgrams = getRelatedNgrams(words, topNgrams, 2, 5);
Gets a nested map of maps that breaks down unique words into their smallest ngrams
Parameters
name | type | Description |
---|---|---|
words | Array<string> | an array of words to evaluate |
Returns
Map<string, Array<string>| Map<string, <Array|string>>
A nested map of maps that breaks down unique words into their smallest ngrams.
lowercased text with diacritics removed
string
an array of letters in the text
Array<string>
an array of words in the text
Array<string>
an array of letter bigrams in the text
Array<string>
an array of letter trigrams in the text
Array<string>
an array of unique letters in the text
Array<string>
an array of unique bigrams in the text
Array<string>
an array of unique trigrams in the text
Array<string>
a map of placements of letters within words
Map<string, Map<string, number>>
a map of placements of bigrams within words
Map<string, Map<string, number>>
a map of placements of trigrams within words
Map<string, Map<string, number>>
an array of unique words in the text
Array<string>
a map of letter frequencies in the sanitized text
Map<string, number>
a map of bigram frequencies in the sanitized text
Map<string, number>
a map of trigram frequencies in the sanitized text
Map<string, number>
a map of word frequencies in the sanitized text
Map<string, number>
a map of letter percentages in the sanitized text
Map<string, number>
a map of bigram percentages in the sanitized text
Map<string, number>
a map of trigram percentages in the sanitized text
Map<string, number>
a map of word percentages in the sanitized text
Map<string, number>
The average size of a word
number
The median (middle) size of a word
number
A nested map of maps that breaks down unique words into their smallest ngrams.
gets an array of ngrams of a customizable size from the text
Parameters
name | type | Description |
---|---|---|
size | number | default = 2. size of the n-gram to return |
Returns
Array<string>
a map of the most used letters in the text
Parameters
name | type | Description |
---|---|---|
limit | number | default = 20. number of top letters to return |
Returns
Map<string, number>
a map of the most used bigrams in the text
Parameters
name | type | Description |
---|---|---|
limit | number | default = 20. number of top bigrams to return |
Returns
Map<string, number>
a map of the most used trigrams in the text
Parameters
name | type | Description |
---|---|---|
limit | number | default = 20. number of top trigrams to return |
Returns
Map<string, number>
a map of the most used words in the text
Parameters
name | type | Description |
---|---|---|
limit | number | default = 20. number of top words to return |
Returns
Map<string, number>
Compare this methodius instance to another
Parameters
name | type | Description |
---|---|---|
methodius | Methodius | another Methodius instance |
Returns
Map<string, Map>
A map of property names and their comparisons (intersection, disjunctiveUnions, etc) for a set of properties
Gets the ngrams that will occur before or after other ngrams based on what the most frequent ngrams are. Useful for finding patterns of ngrams.
Parameters
name | type | Description |
---|---|---|
ngramSize | number | default = 2. the size of the ngram |
limit | number | default = 20. the number of top ngrams to use |
Returns
Map<string, number>
A frequency map of how often the most common ngrams occurred before or after other common ngrams