Skip to content

Tool for calculating n-gram frequency and associated metric

License

Notifications You must be signed in to change notification settings

NZRS/n-gram-frequency

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

n-gram-frequency

A Ruby command line tool for working with n-gram frequencies.

An n-gram is a sequence of characters 1 or more characters long. The default length used is 2 and a different size can be specified with the -n option. It is often useful to look at text from the perspective of the frequencies of certain n-grams and how those compare to the frequencies for a reference body of text, which this program helps to do by carrying out the following calculations on n-gram frequencies:

  1. calculating an n-gram frequency table for a file of strings
  2. Uses that frequency table (or another) to calculate a metric for the same or another file of strings.

calculating the n-gram frequency table

The program takes a file which has one word per list and for each words splits it into all the n-grams and records the total number of each n-gram found in the file in a hash table. For example, the word 'doctor' is split into the following 2-grams (n-grams of size 2): do oc ct to or.

This has table can be written out using the -w option. For example:

./n-gram-frequency.rb top-5000.csv -w top-5000-freq.csv

generates the file top-freq.csv, which has the following first five lines:

th,171
he,157
be,80
an,344
nd,182

calculating the metric

The second function is to calculate a metric for either the same file of strings or another file. Once again the file should have one word per line. To output the metric the -o option is used and optionally the name of a file can be given.

This metric is defined as

sum(1 / frequency_of_n_gram) / number_of_n_grams

For example:

./n-gram-frequency.rb test.csv -o - t top-5000-freq.csv

generates the file output.csv, which has the following first five lines:

trademe,0.005851064944013533
stuff,0.045435705559868325
nzherald,0.4342321309988272
westpac,0.08982746716824135
kiwibank,0.3035078751121378

Where an n-gram is not found in the frequency table then a score is added to the metric, defaulting to 1 and changed by the -n option.

Example data

The example data used to create the frequency table in the example is a list of 5000 most common words from http://www.wordfrequency.info/top5000.asp in file called top-5000.csv. The list of test names is 10 popular domain names in .nz with the TLD suffices removed.

command line syntax

usage: n-gram-frequency input [-o [output]] [-t [input_table]] [-n [size]] [-m [misses]] [-w [output_table]]
        
input:
      The name of the file containing one string per line.

Options:
  -o  Calculate the metric for each string in the inout file and save the
      output in the specified file name.  Defaults to 'output.csv'.

  -t  Use the specified frequency table file instead of one generated from
      the input file.  Defaults to 'table.csv'.  

  -n  Use n-grams of the specified size.  Defaults to 2.

  -m  Add the specified score to the metric for hash table misses.  
      Defaults to 2.0.

  -w  Write out generated hash table.  Pointless using this if -t
      has also been specified.  Defaults to 'table.csv'.

About

Tool for calculating n-gram frequency and associated metric

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages