High-quality frequency dictionaries ready to be imported into Yomichan.
Generate frequency dictionaries from source for customization.
A frequency dictionary displays the ranked frequency (1st most frequent, 2nd most frequent, ...) of a word inside a context (written language, spoken language, web, Showa era, Heisei era, ...).
Frequency dictionaries can help language learners distinguish common words from uncommon ones.
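As a minimal sketch of what "ranked frequency" means, the following Python snippet turns raw occurrence counts into ranks. The function name `rank_by_frequency` and the sample counts are made up for illustration and are not taken from this repo's script.

```python
from collections import Counter


def rank_by_frequency(counts):
    """Map each word to its rank (1 = most frequent) based on its count."""
    # Sort by descending count; ties are broken by the word itself so the
    # result is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


# Made-up counts, purely for illustration.
counts = Counter({"の": 500, "する": 320, "猫": 12})
ranks = rank_by_frequency(counts)
print(ranks)  # {'の': 1, 'する': 2, '猫': 3}
```

A lower rank means a more common word, so a learner would typically prioritize 猫 (rank 3 here) over a word ranked 40,000th.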
The data is kept up to date with NINJAL.
Learn how words changed in frequency throughout history (CHJ, SHW).
Learn about frequent words on the Japanese web (NWJC).
When compiling a frequency dictionary, one has to be careful not to count the same word occurrence twice. Doing so would inflate the counts and skew the resulting word frequencies.
The dictionaries in this repo are vetted against double-counting.
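To illustrate how double-counting can creep in, here is a minimal sketch under an assumed layout: a frequency list that has both a total column and per-subcorpus columns. The column names (`frequency`, `books`, `web`) and the numbers are hypothetical and do not describe any particular NINJAL file. Summing every numeric column would count each occurrence twice, once in the total and once in its subcorpus.

```python
import csv
import io

# Hypothetical TSV: "frequency" is already the sum of "books" and "web".
SAMPLE = "lemma\tfrequency\tbooks\tweb\n猫\t12\t7\t5\n犬\t9\t4\t5\n"


def read_counts(tsv_text, columns):
    """Sum the given columns per lemma."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["lemma"]: sum(int(row[c]) for c in columns) for row in reader}


# Wrong: the total column and its parts are added together.
inflated = read_counts(SAMPLE, ["frequency", "books", "web"])
# Right: use either the total or the parts, never both.
correct = read_counts(SAMPLE, ["frequency"])

print(inflated)  # {'猫': 24, '犬': 18}
print(correct)   # {'猫': 12, '犬': 9}
```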
The default dictionaries include only the 50k most frequent words. This keeps the files small and keeps the learner focused on what is important: frequent words. As a rough guide, language fluency requires around 10k to 20k words of vocabulary.
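The cutoff itself is simple to express. A minimal sketch, assuming the counts are held in a plain dict; `keep_most_frequent` is an illustrative name, not a function from this repo:

```python
import heapq


def keep_most_frequent(counts, limit=50_000):
    """Return the `limit` most frequent words as (word, count) pairs,
    most frequent first."""
    return heapq.nlargest(limit, counts.items(), key=lambda item: item[1])


# Made-up counts, purely for illustration.
counts = {"の": 500, "する": 320, "猫": 12, "鸚鵡": 1}
print(keep_most_frequent(counts, limit=2))  # [('の', 500), ('する', 320)]
```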
You can find the dictionaries of the following corpora as GitHub releases.
Each dictionary file shares the license of its source data.
A corpus that covers different eras of Japanese history.
The corpus ranges from the Nara period through the Edo period and Meiji era up to the Taishō era.
To track words across eras, two dictionaries are generated:
- A dictionary for the premodern part (Nara to Edo)
- A dictionary for the modern part (Meiji to Taishō)
The corpus is likely too small to generate dictionaries for each era.
A corpus that covers the Showa and Heisei eras of Japanese history.
There is one dictionary for both eras.
A corpus that was created by crawling the web.
The licenses of the following corpora don't allow me to upload derived dictionaries.
My solution is to publish the raw data in a separate repo.
Use my script to generate a frequency dictionary on your local machine.
One of the largest and most popular corpora out there. It focuses on written language.
Another popular corpus with a focus on spoken language.
Enter the provided nix shell.
nix-shell
Create a virtual environment and use pip to install the dependencies.
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
Run the script on the command line with the desired arguments.
python3 main.py [arguments...]
For example, generate the frequency dictionary for BCCWJ (short-unit words) like so:
python3 main.py bccjw BCCWJ_frequencylist_suw_ver1_1.tsv
Built-in help is available in case you get stuck.
python3 main.py --help
python3 main.py bccjw --help
Open the Yomichan settings in your browser and click "Import Dictionary".
Select the zip file and wait for it to be processed.
The dictionary should now be working.
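For readers curious about what such a zip file contains, here is a minimal sketch of a frequency dictionary archive. It assumes the commonly documented Yomichan layout: an `index.json` plus `term_meta_bank_N.json` files whose entries have the shape `[term, "freq", value]`. The title, revision, file name, and words are placeholders, and this is not necessarily how this repo's script builds its output.

```python
import json
import zipfile


def write_frequency_dictionary(path, title, ranks):
    """Write a zip with an index.json and one term_meta_bank file."""
    # Placeholder metadata; a real dictionary would use its own revision.
    index = {"title": title, "format": 3, "revision": "example-1"}
    # Each entry has the shape [term, "freq", value].
    bank = [[word, "freq", rank] for word, rank in ranks.items()]
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.json", json.dumps(index, ensure_ascii=False))
        archive.writestr(
            "term_meta_bank_1.json", json.dumps(bank, ensure_ascii=False)
        )


write_frequency_dictionary("example_freq.zip", "Example Freq", {"の": 1, "する": 2})
```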