This project provides word frequency lists generated from cleaned-up Wikipedia dumps for several languages. The following table shows the number of words (types) for each language and mutation (2nd–5th column) with links to the list files, as well as the number of tokens and articles (6th and 7th column).
| Language (tokenization) | no norm. | no norm., lowercased | NFKC norm. | NFKC norm., lowercased | #tokens | #articles |
|---|---|---|---|---|---|---|
| Czech (regex) | 866,635 | 772,788 | 866,619 | 772,771 | 137,564,164 | 832,967 |
| English (regex) | 2,419,333 | 2,162,061 | 2,419,123 | 2,161,820 | 2,489,387,103 | 16,699,990 |
| English (Penn) | 2,988,260 | 2,709,385 | 2,988,187 | 2,709,302 | 2,445,526,919 | 16,699,990 |
| French (regex) | 1,187,843 | 1,061,089 | 1,187,646 | 1,060,849 | 842,907,281 | 4,108,861 |
| German (regex) | 2,690,869 | 2,556,353 | 2,690,793 | 2,556,249 | 893,385,641 | 4,455,795 |
| Italian (regex) | 960,238 | 852,087 | 960,149 | 851,996 | 522,839,613 | 2,783,290 |
| Japanese (Unidic Lite) | 549,745 | 522,590 | 549,358 | 522,210 | 610,467,200 | 2,177,257 |
| Japanese (Unidic 3.1.0) | 561,212 | 535,726 | 560,821 | 535,341 | 609,365,356 | 2,177,257 |
| Portuguese (regex) | 668,333 | 580,948 | 668,262 | 580,862 | 300,324,703 | 1,852,956 |
| Russian (regex) | 2,069,646 | 1,854,875 | 2,069,575 | 1,854,793 | 535,032,557 | 4,483,522 |
| Spanish (regex) | 1,124,168 | 987,078 | 1,124,055 | 986,947 | 685,158,870 | 3,637,655 |
| Chinese (jieba, experimental) | 1,422,002 | 1,403,896 | 1,421,875 | 1,403,791 | 271,230,431 | 2,456,160 |
| Indonesian (regex) | 433,387 | 373,475 | 433,376 | 373,461 | 117,956,650 | 1,314,543 |
The word lists for all the above languages are generated from dumps dated 20 October 2022, with the exception of Indonesian, which is generated from a dump dated 1 August 2024.
Furthermore, the project provides a script for generating the lists that can be applied to other Wikipedia languages.
For each word, the files (linked in the table above) list:
- number of occurrences,
- number of documents (Wikipedia articles).
Words occurring in fewer than 3 articles are not included. The lists are sorted by the number of occurrences. The data is tab-separated with a header, and each file is compressed with LZMA2 (`xz`).
Important: The last row, labeled `[TOTAL]`, lists the total numbers of tokens and articles, and thus may require special handling. Also note that the totals are not sums of the previous rows' values.
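For illustration, a minimal Python sketch for reading one of the lists could look like this (the file name and the column names used below are assumptions; check the header of the file you actually download):

```python
import csv
import lzma

# Minimal sketch: read a word list, skipping the special [TOTAL] row.
# "wordlist.tsv.xz", "word" and "count" are assumed names, not the actual ones.
with lzma.open("wordlist.tsv.xz", "rt", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    counts = {}
    totals = None
    for row in reader:
        if row["word"] == "[TOTAL]":
            totals = row  # overall token/article totals, not a word
            continue
        counts[row["word"]] = int(row["count"])

print(f"{len(counts)} words loaded")
```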
How is this different from wikipedia-word-frequency?
We strive for data that is cleaner (not containing spurious “words” such as `br` or `colspan`), and linguistically meaningful (correctly segmented, with consistent criteria for inclusion in the list). Here are the specific differences:
- Cleanup: We remove HTML/wikitext tags (such as `<br>`, `<ref>`, etc.), table formatting (e.g. `colspan`, `rowspan`), some non-textual content (such as musical scores), placeholders for formulas and code (`formula_…`, `codice_…`), and ruby (furigana).
- Tokenization: We tokenize Japanese and Chinese (see About mutations below). This is necessary because these languages do not separate words with spaces. (The wikipedia-word-frequency script simply extracts and counts any contiguous chunks of characters, which can range from a single word to a whole sentence.)

  We tokenize other languages using a regular expression for orthographic words, consistently treating the hyphen `-` and the apostrophe `'` as punctuation that cannot occur inside a word. (The wikipedia-word-frequency script allows these characters except at the start or end of a word, thus allowing `women's` but excluding `mens'`. It also blindly converts en-dashes to hyphens, e.g. tokenizing `New York–based` as `New` and `York-based`, and right single quotation marks to apostrophes, resulting in further discrepancies.)

  For English, in addition to the default regex tokenization, we also provide the Penn Treebank tokenization (e.g. `can't` is segmented as `ca` and `n't`); a short illustration follows this list. In this case, apostrophes are allowed, and we also do a smart conversion of right single quotation marks to apostrophes (to distinguish the intended apostrophe in `can’t` from the actual quotation mark in `‘tuna can’`).

- Normalization: For all languages, we provide mutations that are lowercased and/or normalized to NFKC.
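As mentioned above, the Penn Treebank mutation builds on `nltk.tokenize.word_tokenize`. A minimal illustration of the contraction splitting (this sketch does not reproduce the project's quote-conversion step):

```python
from nltk.tokenize import word_tokenize  # requires the NLTK punkt tokenizer data

# Penn Treebank tokenization splits contractions into two tokens:
print(word_tokenize("I can't do that."))
# ['I', 'ca', "n't", 'do', 'that', '.']
```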
Additionally, the script for generating the word lists supports multiprocessing (processing several dump files of the same language in parallel), greatly reducing the wall-clock time necessary to process the dumps.
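The following is only a rough, hypothetical sketch of the parallel-counting idea (function and file names are made up; this is not the actual `word_frequency.py` implementation):

```python
from collections import Counter
from multiprocessing import Pool

def count_tokens(dump_path):
    """Hypothetical worker: count whitespace-separated tokens in one extracted file."""
    counts = Counter()
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())  # stand-in for real tokenization
    return counts

if __name__ == "__main__":
    files = ["dump-part1.txt", "dump-part2.txt"]  # hypothetical file names
    total = Counter()
    with Pool() as pool:
        for partial in pool.imap_unordered(count_tokens, files):
            total.update(partial)
    print(total.most_common(10))
```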
For each language we provide several mutations.
- All languages have the following mutations, distinguished by the filename suffixes:
  - `….tsv.xz`: no normalization
  - `…-lower.tsv.xz`: no normalization, lowercased
  - `…-nfkc.tsv.xz`: NFKC normalization
  - `…-nfkc-lower.tsv.xz`: NFKC normalization, lowercased
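In Python terms, the four mutations correspond roughly to the following transformations (a sketch; the exact order of lowercasing and normalization in the actual script is not shown here):

```python
import unicodedata

word = "Ｗｉｋｉｐｅｄｉａ"  # full-width example to make the NFKC effect visible

no_norm    = word                                         # ….tsv.xz
lowercased = word.lower()                                 # …-lower.tsv.xz
nfkc       = unicodedata.normalize("NFKC", word)          # …-nfkc.tsv.xz
nfkc_lower = unicodedata.normalize("NFKC", word).lower()  # …-nfkc-lower.tsv.xz

print(nfkc, nfkc_lower)  # Wikipedia wikipedia
```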
In addition, there are two tokenization variants each for English and Japanese:

- English:
  - regex tokenization (same as for Czech, French, etc.)
  - filename contains `-penn`: improved Penn Treebank tokenization from `nltk.tokenize.word_tokenize`
- Japanese: We do the same processing and provide the same mutations for Japanese as in TUBELEX-JA:
  - Unidic Lite tokenization
  - filename contains `-310`: Unidic 3.1.0 tokenization
Chinese is tokenized using the `jieba` tokenizer. See Further work and similar lists for caveats about experimental Chinese support.
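A rough illustration of the Japanese and Chinese tokenization (assuming the `jieba` package, and the `fugashi` package with `unidic-lite` as one common way to use UniDic from Python; the project's actual invocation may differ):

```python
import jieba                 # Chinese word segmentation
from fugashi import Tagger   # assumption: fugashi + unidic-lite for UniDic tokenization

# Chinese: jieba segmentation
print(list(jieba.cut("自然语言处理很有趣")))

# Japanese: UniDic-based segmentation via fugashi
tagger = Tagger()
print([token.surface for token in tagger("自然言語処理は面白い")])
```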
Which tokens are included depends on the tokenization:

- English with Penn Treebank tokenization: tokens that fulfil the following conditions:
  - do not contain digits,
  - contain at least one word character (`\w`).

  E.g. `a`, `o'clock`, `'coz`, `pre-/post-`, `U.S.A.`, `LGBTQ+`, but not `42`, `R2D2`, `...`, or `.`.
- Japanese and Chinese: tokens that fulfil the following conditions:
  - do not contain digits (characters such as 一二三 are not considered digits),
  - start and end with a word character (`\w`), or with a wave dash (`〜`) in the case of Japanese (e.g. `あ〜`).
- Other languages and English with the default regex tokenization: tokens that consist of word characters (`\w`) except digits.

The default regex tokenization considers all non-word characters (`\W`, i.e. not `\w`) to be word separators. Therefore, while in English with Penn Treebank tokenization some tokens (e.g. `R2D2`) are excluded, with the regex tokenization the tokens that would have to be excluded do not occur in the first place (e.g. `R2D2` is tokenized as `R` and `D`).
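A rough sketch of what the default regex tokenization described above amounts to (an illustration, not the project's exact regular expression):

```python
import re

# Sketch: a token is a maximal run of word characters excluding digits,
# so digit-containing strings never yield whole tokens in the first place.
TOKEN_RE = re.compile(r"[^\W\d]+")   # \w minus digits

def regex_tokens(text):
    return TOKEN_RE.findall(text)

print(regex_tokens("New York–based R2D2 can't"))
# ['New', 'York', 'based', 'R', 'D', 'can', 't']
```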
- Install requirements:

  `pip install -r requirements.txt`

- Download and process dumps (default date and languages):

  `zsh run.sh`

  Alternatively, download and process dumps from a specific date and languages:

  `zsh run.sh 20221020 cs sk`

The `run.sh` script also outputs the table in this README.

For usage of the Python script for processing the dumps, see `python word_frequency.py --help`.
The word lists contain only the surface forms of the words (segments). For many purposes, lemmas, POS, and other information would be more useful. We plan to add further processing later.
Support for Chinese is only experimental. Chinese is currently processed “as is”, without any conversion, which means that it is a mix of traditional and simplified characters (and also of the different varieties of Chinese used on the Chinese Wikipedia). We also do not filter vocabulary/script variant markup (e.g. `-{zh-cn:域;zh-tw:體}-` or `-{A|zh-hans:用户; zh-hant:使用者}-`), which has the side effect of increasing the occurrences of tokens such as `zh`, `hans`, etc. The word list may still be fine for some NLP applications.
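For illustration, tokenizing such variant markup with a default-style regex (again a sketch, not the project's exact expression) produces precisely these spurious tokens:

```python
import re

# Naively tokenizing the unfiltered variant markup yields tokens such as
# 'zh', 'cn', 'tw', which is the side effect described above.
TOKEN_RE = re.compile(r"[^\W\d]+")
print(TOKEN_RE.findall("-{zh-cn:域;zh-tw:體}-"))
# ['zh', 'cn', '域', 'zh', 'tw', '體']
```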
We are using wikiextractor to extract plain text from Wikipedia dumps. Ideally, almost no cleanup would be necessary after using this tool, but in practice there is a substantial amount of non-textual content such as maps, musical scores, tables, math formulas and random formatting that wikiextractor doesn't remove, or removes in a haphazard fashion (see the issue on GitHub). We try to remove both the legitimate placeholders and markup, and also the most common markup that ought to be filtered by wikiextractor but isn't. The results are still imperfect, but rather than extending the removal in this tool, it would be better to fix wikiextractor. Another option would be to use the Wikipedia Cirrus search dumps instead (see this issue and my comment). Note that both approaches have been used to obtain pretraining data for large language models.
In the current version we have added Indonesian from a later dump. We observed the string `https` among the relatively high-frequency words, which indicates that our cleanup is less effective for the more recent Wikipedia dumps.
You may also like TUBELEX-JA, a large word list based on Japanese subtitles for YouTube videos, which is processed in a similar way.