A typing trainer boasting procedurally generated sentences created using a bigram language model.
Type Trials is a typing trainer built in Java that generates sentences based on a bigram language model.
When it comes to the typable content used in a typing trainer, there are two options:
- static selection of curated texts
- procedurally generated sentences
Although simple and easy to implement, a pre-determined selection of text comes with some drawbacks. Namely, static content requires many samples of text to reduce the amount of repetition a user encounters while typing. If a user sees the same section of text multiple times throughout their training, they may begin to memorize it, making it easier to type and artificially inflating their score. To mitigate this, we can use procedurally generated sentences to achieve a much wider range of unique typable content than static text samples alone can offer.
A trivial approach to procedural generation is to take a big dictionary and randomly select words from it to build a sentence. The problem is that such sentences are difficult and unnatural to type, since real sentences follow patterns and structures that a sequence of random words lacks.
So, to address this problem, we can use a bigram language model to generate the sentences. A bigram model maps each word in a text sample to every word that may come after it. For example, the text "The fox jumped over the lazy dog" generates the following bigram:
```json
{
  "the": ["fox", "lazy"],
  "fox": ["jumped"],
  "jumped": ["over"],
  "over": ["the"],
  "lazy": ["dog"],
  "dog": []
}
```
With this information, a sentence can be generated by choosing a starting word and randomly picking one of the words it maps to as the next word, repeating until the sentence is long enough. By generating our bigram over a large body of text, we allow for a huge range of possible sentences that all feel natural to type.
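To make the idea concrete, here is a minimal sketch of that random walk in Java. The class and method names are illustrative rather than Type Trials' actual API, and the hand-built map simply mirrors the example above:

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

public class BigramWalk {
    // Starting from a seed word, repeatedly pick a random successor
    // from the bigram mapping until we hit a dead end or maxWords.
    public static String generate(Map<String, List<String>> bigram, String start, int maxWords) {
        Random rng = new Random();
        StringBuilder sentence = new StringBuilder(start);
        String current = start;
        for (int i = 1; i < maxWords; i++) {
            List<String> next = bigram.get(current);
            if (next == null || next.isEmpty()) break; // word is never followed by anything
            current = next.get(rng.nextInt(next.size()));
            sentence.append(' ').append(current);
        }
        return sentence.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> bigram = Map.of(
            "the", List.of("fox", "lazy"),
            "fox", List.of("jumped"),
            "jumped", List.of("over"),
            "over", List.of("the"),
            "lazy", List.of("dog"),
            "dog", List.of());
        System.out.println(generate(bigram, "the", 8));
    }
}
```

With this tiny map the walk might print something like "the fox jumped over the lazy dog", or loop back through "over" and "the" a few times before the dead-end check at "dog" stops it.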
Generating the bigram begins with cleaning up the text it's based on: removing or replacing punctuation and getting rid of extra whitespace. The cleaned text is then read word by word, and each word is mapped to a list of the words that may come after it.
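A rough sketch of that cleanup-and-mapping step, again with illustrative names; the exact cleanup rules here (lowercasing, replacing every non-letter character with a space) are assumptions, not necessarily the ones Type Trials uses:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramBuilder {
    public static Map<String, List<String>> build(String text) {
        String cleaned = text.toLowerCase()
                .replaceAll("[^a-z\\s]", " ")  // remove/replace punctuation
                .trim()
                .replaceAll("\\s+", " ");      // collapse extra whitespace
        String[] words = cleaned.split(" ");
        Map<String, List<String>> bigram = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            // Map each word to the list of words observed after it.
            List<String> next = bigram.computeIfAbsent(words[i], w -> new ArrayList<>());
            if (i + 1 < words.length) next.add(words[i + 1]);
        }
        return bigram;
    }
}
```

Feeding it the example sentence from earlier reproduces exactly the mapping shown above.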
Since we're measuring how fast the user can type, sentence generation must be quick enough that it never bottlenecks the user's typing. This introduces a new problem: the bigram could be too large to store entirely in memory, and searching through a file each time we choose the next word in a sentence is rather slow.
So, when generating the bigram, each word is given a numeric ID in the range [0, *n*), where *n* is the number of unique words. We can then use that ID as an index into a file containing information about the word. The file has three sections:
- **Indices**: *n* 36-byte blocks, one per word, where each block contains the word's ID, file offsets to the start/end of its bigram mapping, and file offsets to the start/end of its string representation. Since this section is made up of evenly sized blocks, it can be indexed using a word's numeric ID, letting us look up words in constant time.
- **Mappings**: the bigram mapping information, made up of *n* sequences of word IDs whose boundaries are specified by the file offsets found in the indices section.
- **Strings**: the ASCII text of each word. Since this file structure operates mainly on word IDs, we need a section that lets us convert a word ID back into text to display to the user. A word's string can be found by following the file offsets stored in the indices section.
Using this structure, given a word ID we can quickly find both the ASCII text it corresponds to and the words that can come after it, without storing any of the bigram in memory. First, we find the byte offset of the word's index block using the expression `ID * 36`. Then we parse the block and extract the offsets to the word's ASCII text and to its list of mapped words. Finally, we read the regions of the file specified by those offsets and pick a new word ID from the mapping to use as the next word in the sentence.
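A sketch of that lookup, assuming the indices section starts at the beginning of the file (as the `ID * 36` expression implies), that each block is one 4-byte ID followed by four 8-byte offsets (4 + 4×8 = 36), and that successor IDs are stored as 4-byte ints; the real field widths and ordering in Type Trials may differ:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class WordFileLookup {
    private static final int BLOCK_SIZE = 36;

    // Returns a randomly chosen successor's ID, or -1 if the word has none.
    public static int nextWordId(RandomAccessFile file, int wordId, Random rng) throws IOException {
        file.seek((long) wordId * BLOCK_SIZE); // constant-time index into the indices section
        file.readInt();                        // skip the word's own ID
        long mapStart = file.readLong();
        long mapEnd = file.readLong();
        int count = (int) ((mapEnd - mapStart) / 4); // successor IDs assumed to be 4-byte ints
        if (count == 0) return -1;
        file.seek(mapStart + (long) rng.nextInt(count) * 4);
        return file.readInt();
    }

    // Converts a word ID back into its ASCII text via the strings section.
    public static String wordText(RandomAccessFile file, int wordId) throws IOException {
        file.seek((long) wordId * BLOCK_SIZE + 4 + 16); // skip the ID and the mapping offsets
        long strStart = file.readLong();
        long strEnd = file.readLong();
        byte[] bytes = new byte[(int) (strEnd - strStart)];
        file.seek(strStart);
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

Only the one block and the two small regions it points at are ever read, so each step of sentence generation touches a handful of bytes regardless of how large the bigram grows.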