A while ago, I had a 4-page school assignment that I was not very motivated to do. Since I figured the teacher would not be reading it anyway, I decided to generate the text with a Markov chain. I had one problem, though: in English, tokenizing words is easy, since they are separated by spaces. But in Japanese, words are written without any separation, which makes splitting them much harder. I ended up using an open source Japanese lexical analyzer: Wakame.

Although my school assignment got submitted, I had a few more ambitions: I wanted to port this to the web. I first looked into compiling Wakame to WebAssembly, but since it relies heavily on Linux syscalls, I would have ended up porting much of Linux to the web, which is inefficient to say the least. The next option I came up with was to build something similar to Wakame from the ground up.

Fortunately for me, Wakame is open source and well documented, so I was able to learn a few things about its inner workings. First, it uses specialized dictionaries that contain all possible conjugations and permutations as separate entries. Second, it uses a bi-gram Markov model, which means it estimates the likely type of the next word from the previous word and its type. I realized I could roughly replicate this behavior using a connectivity graph of parts of speech, which I have constructed right here. (insert an image here)
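To make the idea concrete, here is a minimal sketch in TypeScript of what such a connectivity graph could look like. The part-of-speech categories and connection rules below are made up for illustration; they are not Wakame's actual tables, and real Japanese analyzers use far more fine-grained categories.

```typescript
// A minimal sketch of the connectivity-graph idea (hypothetical categories and rules).

type PartOfSpeech = "noun" | "particle" | "verb" | "auxiliary" | "punctuation";

interface DictionaryEntry {
  surface: string;   // literal token text, with the conjugated form as its own entry
  pos: PartOfSpeech; // part of speech of this entry
}

// Connectivity graph: for each part of speech, which parts of speech may follow it.
const connectivity: Record<PartOfSpeech, PartOfSpeech[]> = {
  noun:        ["particle", "auxiliary", "punctuation"],
  particle:    ["noun", "verb"],
  verb:        ["auxiliary", "punctuation", "noun"],
  auxiliary:   ["punctuation", "particle"],
  punctuation: ["noun", "verb"],
};

// Bi-gram style check: given the previous token's part of speech,
// keep only the dictionary candidates that are allowed to follow it.
function allowedNext(prev: PartOfSpeech, candidates: DictionaryEntry[]): DictionaryEntry[] {
  const allowed = new Set(connectivity[prev]);
  return candidates.filter((entry) => allowed.has(entry.pos));
}

// Example: after a noun, the particle "は" is a valid continuation,
// while the bare verb stem "食べ" is filtered out under these toy rules.
const candidates: DictionaryEntry[] = [
  { surface: "は", pos: "particle" },
  { surface: "食べ", pos: "verb" },
];
console.log(allowedNext("noun", candidates).map((e) => e.surface)); // ["は"]
```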