Merge pull request #17 from neocl/dev
[Version 0.1a7]
- Added Japanese Proper Names Dictionary (JMnedict) support
- Included built-in KRADFILE/RADKFile support
- Improved command line tools (json, compact mode, etc.)
letuananh authored Jun 1, 2020
2 parents 4e13bbb + bce89f0 commit 21f8452
Showing 20 changed files with 1,390 additions and 130 deletions.
10 changes: 8 additions & 2 deletions CHANGES.md
@@ -1,5 +1,11 @@
2020-05-31
- [Version 0.1a7]
- Added Japanese Proper Names Dictionary (JMnedict) support
- Included built-in KRADFILE/RADKFile support
- Improved command line tools (json, compact mode, etc.)

2017-08-18
- Support for KanjiDic2 (XML/SQLite formats)
- Support KanjiDic2 (XML/SQLite formats)

2016-11-09
- Release first demo to Github
- Release first version to Github
238 changes: 175 additions & 63 deletions README.md
@@ -1,101 +1,213 @@
Python library for manipulating Jim Breen's JMdict & KanjiDic2

# Main features
* Query JMDict and KanjiDic2 in XML format directly (but slow)
* Convert JMDict and KanjiDic2 into SQLite format for faster access
* Basic console lookup tool
* jamdol (jamdict-online) - REST API using Python/Flask (jamdol-flask)

# Installation
* Support querying different Japanese language resources
- Japanese-English dictionary JMDict
- Kanji dictionary KanjiDic2
- Kanji-radical and radical-kanji maps KRADFILE/RADKFILE
- Japanese Proper Names Dictionary (JMnedict)
* Data is stored in an SQLite database
* Console lookup tool
* jamdol (jamdol-flask) - a Python/Flask server that provides Jamdict lookup via REST API (experimental state)

Homepage: [https://github.com/neocl/jamdict](https://github.com/neocl/jamdict)

Contributors are welcome! 🙇

# Installation

Jamdict is available on PyPI at [https://pypi.org/project/jamdict/](https://pypi.org/project/jamdict/) and can be installed using pip:

```bash
pip install jamdict
# pip script sometimes doesn't work properly, so you may want to try this instead
python3 -m pip install jamdict
```

# initial setup (this command will create ~/.jamdict for you)
# it will also tell you where to copy the data files
python3 -m jamdict.tools info
## Install data file

1. Download the official, pre-compiled jamdict database (`jamdict-0.1a7.tar.xz`) from Google Drive [https://drive.google.com/drive/u/1/folders/1z4zF9ImZlNeTZZplflvvnpZfJp3WVLPk](https://drive.google.com/drive/u/1/folders/1z4zF9ImZlNeTZZplflvvnpZfJp3WVLPk)
2. Extract it and copy `jamdict.db` to the jamdict data folder (by default `~/.jamdict/data/jamdict.db`)
3. To find out where to copy the data files, run the `info` command:

```bash
# initial setup (this command will create ~/.jamdict for you)
# it will also tell you where to copy the data files
python3 -m jamdict info
# Jamdict 0.1a7
# Python library for manipulating Jim Breen's JMdict, KanjiDic2, KRADFILE and JMnedict
#
# Basic configuration
# ------------------------------------------------------------
# JAMDICT_HOME : /home/tuananh/.jamdict
# Config file location: /home/tuananh/.jamdict/config.json
#
# Data files
# ------------------------------------------------------------
# Jamdict DB location: /home/tuananh/.jamdict/data/jamdict.db - [OK]
# JMDict XML file : /home/tuananh/.jamdict/data/JMdict_e.gz - [OK]
# KanjiDic2 XML file : /home/tuananh/.jamdict/data/kanjidic2.xml.gz - [OK]
# JMnedict XML file : /home/tuananh/.jamdict/data/JMnedict.xml.gz - [OK]
```

## Command line tools

To make sure that jamdict is configured properly, try looking up a word from the command line:

# to look up a word using command line
python3 -m jamdict.tools lookup たべる
```bash
python3 -m jamdict.tools lookup 言語学
========================================
Found entries
========================================
Entry: 1358280 | Kj: 食べる, 喰べる | Kn: たべる
Entry: 1264430 | Kj: 言語学 | Kn: げんごがく
--------------------
1. to eat ((Ichidan verb|transitive verb))
2. to live on (e.g. a salary)/to live off/to subsist on
1. linguistics ((noun (common) (futsuumeishi)))

========================================
Found characters
========================================
Char: 食 | Strokes: 9
Char: 言 | Strokes: 7
--------------------
Readings: shi2, si4, sig, sa, 식, 사, Thực, Tự, ショク, ジキ, く.う, く.らう, た.べる, は.む
Meanings: eat, food
Char: 喰 | Strokes: 12
Readings: yan2, eon, 언, Ngôn, Ngân, ゲン, ゴン, い.う, こと
Meanings: say, word
Char: 語 | Strokes: 14
--------------------
Readings: shi2, si4, sig, 식, Thặc, Thực, Tự, く.う, く.らう
Meanings: eat, drink, receive (a blow), (kokuji)
Readings: yu3, yu4, eo, 어, Ngữ, Ngứ, ゴ, かた.る, かた.らう
Meanings: word, speech, language
Char: 学 | Strokes: 8
--------------------
Readings: xue2, hag, 학, Học, ガク, まな.ぶ
Meanings: study, learning, science

No name was found.
```

## Data
XML files (JMdict_e.xml, kanjidic2.xml) must be downloaded and copied into `~/.jamdict/data`
# Sample jamdict Python code

I have mirrored these files to Google Drive so you can download there too:
[https://drive.google.com/drive/folders/1z4zF9ImZlNeTZZplflvvnpZfJp3WVLPk](https://drive.google.com/drive/folders/1z4zF9ImZlNeTZZplflvvnpZfJp3WVLPk)
```python
from jamdict import Jamdict
jmd = Jamdict()

Official website
# use wildcard matching to find anything that starts with 食べ and ends with る
result = jmd.lookup('食べ%る')

# print all word entries
for entry in result.entries:
    print(entry)

# [id#1358280] たべる (食べる) : 1. to eat ((Ichidan verb|transitive verb)) 2. to live on (e.g. a salary)/to live off/to subsist on
# [id#1358300] たべすぎる (食べ過ぎる) : to overeat ((Ichidan verb|transitive verb))
# [id#1852290] たべつける (食べ付ける) : to be used to eating ((Ichidan verb|transitive verb))
# [id#2145280] たべはじめる (食べ始める) : to start eating ((Ichidan verb))
# [id#2449430] たべかける (食べ掛ける) : to start eating ((Ichidan verb))
# [id#2671010] たべなれる (食べ慣れる) : to be used to eating/to become used to eating/to be accustomed to eating/to acquire a taste for ((Ichidan verb))
# [id#2765050] たべられる (食べられる) : 1. to be able to eat ((Ichidan verb|intransitive verb)) 2. to be edible/to be good to eat ((pre-noun adjectival (rentaishi)))
# [id#2795790] たべくらべる (食べ比べる) : to taste and compare several dishes (or foods) of the same type ((Ichidan verb|transitive verb))
# [id#2807470] たべあわせる (食べ合わせる) : to eat together (various foods) ((Ichidan verb))

* JMdict: [http://edrdg.org/jmdict/edict_doc.html](http://edrdg.org/jmdict/edict_doc.html)
* kanjidic2: [http://www.edrdg.org/kanjidic/kanjd2index.html](http://www.edrdg.org/kanjidic/kanjd2index.html)
* KRADFILE: [http://www.edrdg.org/krad/kradinf.html](http://www.edrdg.org/krad/kradinf.html)
# print all related characters
for c in result.chars:
    print(repr(c))

# 食:9:eat,food
# 喰:12:eat,drink,receive (a blow),(kokuji)
# 過:12:overdo,exceed,go beyond,error
# 付:5:adhere,attach,refer to,append
# 始:8:commence,begin
# 掛:11:hang,suspend,depend,arrive at,tax,pour
# 慣:14:accustomed,get used to,become experienced
# 比:4:compare,race,ratio,Philippines
# 合:6:fit,suit,join,0.1
```
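
The `%` in the lookup above reads like the `%` of an SQL `LIKE` pattern (any run of characters, including none). As a rough illustration only, the same pattern can be tried against an in-memory SQLite table. The table name `headword` and its contents are made up for this sketch, and it is an assumption, not something this README states, that Jamdict translates its wildcards this way.

```python
import sqlite3

# Hypothetical toy table; this is NOT Jamdict's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE headword (text TEXT)")
words = ["食べる", "食べ過ぎる", "食べ物", "飲む"]
conn.executemany("INSERT INTO headword VALUES (?)", [(w,) for w in words])


def like_search(pattern):
    """Return all headwords matching an SQL LIKE pattern, in insertion order."""
    rows = conn.execute(
        "SELECT text FROM headword WHERE text LIKE ? ORDER BY rowid", (pattern,)
    )
    return [row[0] for row in rows]


# Starts with 食べ and ends with る; % may match zero or more characters.
matches = like_search("食べ%る")
print(matches)
```

Here 食べ物 is excluded because it does not end with る, while 食べる still matches because `%` can match an empty string.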

## Using KRAD/RADK mapping

Jamdict has built-in support for KRAD/RADK (i.e. kanji-radical and radical-kanji mapping).
The terminology of radicals/components used by Jamdict may differ from that used elsewhere.

- A radical in Jamdict is a principal component; each character has only one radical.
- A character may be decomposed into several writing components.

By default jamdict provides two maps:

# Sample codes
- `jmd.krad` is a Python dict that maps each character to a list of its components.
- `jmd.radk` is a Python dict that maps each available component to the characters that contain it.
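
The two maps are inverses of each other. A minimal sketch of that relationship, using a toy dict rather than the real `jmd.krad` (the 雲 entry comes from the example output in this README; the 雪 entry is illustrative and not checked against KRADFILE), might look like this. It is not how Jamdict itself builds `jmd.radk`.

```python
from collections import defaultdict

# Toy character -> components map standing in for jmd.krad.
krad = {
    "雲": ["一", "雨", "二", "厶"],  # from the README example
    "雪": ["雨", "ヨ"],              # illustrative entry (assumption)
}


def invert(char_to_components):
    """Build a component -> set of characters map from a character -> components map."""
    component_to_chars = defaultdict(set)
    for char, components in char_to_components.items():
        for component in components:
            component_to_chars[component].add(char)
    return dict(component_to_chars)


radk = invert(krad)
print(sorted(radk["雨"]))
```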

```python
>>> from jamdict import Jamdict
>>> jmd = Jamdict()
# use wildcard matching to find anything that starts with 食べ and ends with る
>>> result = jmd.lookup('食べ%る')
# print all found word entries
>>> for entry in result.entries:
...     print(entry)
...
[id#1358280] たべる (食べる) : 1. to eat ((Ichidan verb|transitive verb)) 2. to live on (e.g. a salary)/to live off/to subsist on
[id#1358300] たべすぎる (食べ過ぎる) : to overeat ((Ichidan verb|transitive verb))
[id#1852290] たべつける (食べ付ける) : to be used to eating ((Ichidan verb|transitive verb))
[id#2145280] たべはじめる (食べ始める) : to start eating ((Ichidan verb))
[id#2449430] たべかける (食べ掛ける) : to start eating ((Ichidan verb))
[id#2671010] たべなれる (食べ慣れる) : to be used to eating/to become used to eating/to be accustomed to eating/to acquire a taste for ((Ichidan verb))
[id#2765050] たべられる (食べられる) : 1. to be able to eat ((Ichidan verb|intransitive verb)) 2. to be edible/to be good to eat ((pre-noun adjectival (rentaishi)))
[id#2795790] たべくらべる (食べ比べる) : to taste and compare several dishes (or foods) of the same type ((Ichidan verb|transitive verb))
[id#2807470] たべあわせる (食べ合わせる) : to eat together (various foods) ((Ichidan verb))
# print all related characters
>>> for c in result.chars:
...     print(repr(c))
...
食:9:eat,food
喰:12:eat,drink,receive (a blow),(kokuji)
過:12:overdo,exceed,go beyond,error
付:5:adhere,attach,refer to,append
始:8:commence,begin
掛:11:hang,suspend,depend,arrive at,tax,pour
慣:14:accustomed,get used to,become experienced
比:4:compare,race,ratio,Philippines
合:6:fit,suit,join,0.1
# Find all writing components (often called "radicals") of the character 雲
print(jmd.krad['雲'])
# ['一', '雨', '二', '厶']

# Find all characters with the component 鼎
chars = jmd.radk['鼎']
print(chars)
# {'鼏', '鼒', '鼐', '鼎', '鼑'}

# look up the characters info
result = jmd.lookup(''.join(chars))
for c in result.chars:
    print(c, c.meanings())
# 鼏 ['cover of tripod cauldron']
# 鼒 ['large tripod cauldron with small']
# 鼐 ['incense tripod']
# 鼎 ['three legged kettle']
# 鼑 []
```

## Finding name entities

```python
# Find all names with 鈴木 inside
result = jmd.lookup('%鈴木%')
for name in result.names:
    print(name)

# [id#5025685] キューティーすずき (キューティー鈴木) : Kyu-ti- Suzuki (1969.10-) (full name of a particular person)
# [id#5064867] パパイヤすずき (パパイヤ鈴木) : Papaiya Suzuki (full name of a particular person)
# [id#5089076] ラジカルすずき (ラジカル鈴木) : Rajikaru Suzuki (full name of a particular person)
# [id#5259356] きつねざきすずきひなた (狐崎鈴木日向) : Kitsunezakisuzukihinata (place name)
# [id#5379158] こすずき (小鈴木) : Kosuzuki (family or surname)
# [id#5398812] かみすずき (上鈴木) : Kamisuzuki (family or surname)
# [id#5465787] かわすずき (川鈴木) : Kawasuzuki (family or surname)
# [id#5499409] おおすずき (大鈴木) : Oosuzuki (family or surname)
# [id#5711308] すすき (鈴木) : Susuki (family or surname)
# ...
```
## Exact matching
Use exact matching for faster search
```python
# Find an entry (word, name entity) by idseq
result = jmd.lookup('id#5711308')
print(result.names[0])
# [id#5711308] すすき (鈴木) : Susuki (family or surname)
result = jmd.lookup('id#1467640')
print(result.entries[0])
# ねこ (猫) : 1. cat 2. shamisen 3. geisha 4. wheelbarrow 5. clay bed-warmer 6. bottom/submissive partner of a homosexual relationship

# use exact matching to increase searching speed (thanks to @reem-codes)
result = jmd.lookup('食べる')
result = jmd.lookup('猫')

for entry in result.entries:
    print(entry)

>>> for entry in result.entries:
...     print(entry)
...
[id#1358280] たべる (食べる) : 1. to eat ((Ichidan verb|transitive verb)) 2. to live on (e.g. a salary)/to live off/to subsist on
# [id#1467640] ねこ (猫) : 1. cat ((noun (common) (futsuumeishi))) 2. shamisen 3. geisha 4. wheelbarrow 5. clay bed-warmer 6. bottom/submissive partner of a homosexual relationship
# [id#2698030] ねこま (猫) : cat ((noun (common) (futsuumeishi)))
```
See `jamdict_demo.py` and `jamdict/tools.py` for more information.
# Official website
* JMdict: [http://edrdg.org/jmdict/edict_doc.html](http://edrdg.org/jmdict/edict_doc.html)
* kanjidic2: [https://www.edrdg.org/wiki/index.php/KANJIDIC_Project](https://www.edrdg.org/wiki/index.php/KANJIDIC_Project)
* JMnedict: [https://www.edrdg.org/enamdict/enamdict_doc.html](https://www.edrdg.org/enamdict/enamdict_doc.html)
* KRADFILE: [http://www.edrdg.org/krad/kradinf.html](http://www.edrdg.org/krad/kradinf.html)
# Contributors
- [Matteo Fumagalli](https://github.com/matteofumagalli1275)
- [Reem Alghamdi](https://github.com/reem-codes)
4 changes: 2 additions & 2 deletions jamdict/__version__.py
@@ -6,10 +6,10 @@
__copyright__ = "Copyright (c) 2016, Le Tuan Anh"
__credits__ = []
__license__ = "MIT License"
__description__ = "Python library for manipulating Jim Breen's JMdict & KanjiDic2"
__description__ = "Python library for manipulating Jim Breen's JMdict, KanjiDic2, KRADFILE and JMnedict"
__url__ = "https://github.com/neocl/jamdict"
__maintainer__ = "Le Tuan Anh"
__version_major__ = "0.1"
__version__ = "{}a6".format(__version_major__)
__version__ = "{}a7".format(__version_major__)
__version_long__ = "{} - Alpha".format(__version_major__)
__status__ = "Prototype"
14 changes: 13 additions & 1 deletion jamdict/config.py
@@ -75,7 +75,18 @@ def read_config():


def home_dir():
    ''' Find JAMDICT_HOME folder.
    If there is an environment variable that points to an existing directory
    (e.g. export JAMDICT_HOME=/home/user/jamdict),
    that folder will be used instead of the one configured in the jamdict JSON config file.
    '''
    _config = read_config()
    # [2020-06-01] Allow JAMDICT_HOME to be overridden by environment variables
    if 'JAMDICT_HOME' in os.environ:
        _env_jamdict_home = os.path.abspath(os.path.expanduser(os.environ['JAMDICT_HOME']))
        if os.path.isdir(_env_jamdict_home):
            getLogger().debug("JAMDICT_HOME: {}".format(_env_jamdict_home))
            return _env_jamdict_home
    return _config.get('JAMDICT_HOME', __jamdict_home)
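
The precedence rule in `home_dir()` can be tried in isolation. The sketch below re-implements the same logic as a standalone function so it runs without jamdict installed; `resolve_home`, its `environ` parameter, and the sample paths are hypothetical names introduced only for this illustration.

```python
import os
import tempfile


def resolve_home(config, default="~/.jamdict", environ=None):
    """Prefer JAMDICT_HOME from the environment if it names an existing directory."""
    environ = os.environ if environ is None else environ
    if "JAMDICT_HOME" in environ:
        candidate = os.path.abspath(os.path.expanduser(environ["JAMDICT_HOME"]))
        if os.path.isdir(candidate):
            return candidate
    # Otherwise fall back to the config value, then to the built-in default.
    return config.get("JAMDICT_HOME", default)


config = {"JAMDICT_HOME": "/from/config"}
with tempfile.TemporaryDirectory() as tmp:
    # An existing directory in the environment wins over the config file.
    from_env = resolve_home(config, environ={"JAMDICT_HOME": tmp})
    env_used = from_env == os.path.abspath(tmp)
    # A path that does not exist is ignored and the config value is used.
    missing = os.path.join(tmp, "does-not-exist")
    fallback = resolve_home(config, environ={"JAMDICT_HOME": missing})

print(env_used, fallback)
```

Note that a misspelled or missing directory in `JAMDICT_HOME` is skipped silently, so the configured folder is used without any warning.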


@@ -88,4 +99,5 @@ def data_dir():
def get_file(file_key):
    _config = read_config()
    _data_dir = data_dir()
    return _config.get(file_key).format(JAMDICT_DATA=_data_dir)
    _value = _config.get(file_key)
    return _value.format(JAMDICT_DATA=_data_dir) if _value else ''
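
The new guard in `get_file()` matters when a key is absent from the config, presumably the case for `JMNEDICT_XML` in a config file written by an earlier version: `dict.get` then returns `None`, and calling `.format` on it would raise `AttributeError`. A small standalone sketch of the guarded behaviour, where `format_path` and the sample config are hypothetical stand-ins:

```python
def format_path(config, file_key, data_dir):
    """Return the formatted path for file_key, or '' if the key is missing or empty."""
    value = config.get(file_key)
    return value.format(JAMDICT_DATA=data_dir) if value else ''


# A config that lacks the newer JMNEDICT_XML key.
old_config = {"JAMDICT_DB": "{JAMDICT_DATA}/jamdict.db"}

db_path = format_path(old_config, "JAMDICT_DB", "/tmp/data")
name_path = format_path(old_config, "JMNEDICT_XML", "/tmp/data")
print(db_path)
print(repr(name_path))
```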
13 changes: 7 additions & 6 deletions jamdict/data/config_template.json
@@ -1,8 +1,9 @@
{
"JAMDICT_HOME": "~/.jamdict",
"JAMDICT_DATA": "{JAMDICT_HOME}/data",
"JAMDICT_DB": "{JAMDICT_DATA}/jamdict.db",
"JMDICT_XML": "{JAMDICT_DATA}/JMdict_e.gz",
"KD2_XML": "{JAMDICT_DATA}/kanjidic2.xml.gz",
"KRADFILE": "{JAMDICT_DATA}/kradfile-u.gz"
"JAMDICT_HOME": "~/.jamdict",
"JAMDICT_DATA": "{JAMDICT_HOME}/data",
"JAMDICT_DB": "{JAMDICT_DATA}/jamdict.db",
"JMDICT_XML": "{JAMDICT_DATA}/JMdict_e.gz",
"JMNEDICT_XML": "{JAMDICT_DATA}/JMnedict.xml.gz",
"KD2_XML": "{JAMDICT_DATA}/kanjidic2.xml.gz",
"KRADFILE": "{JAMDICT_DATA}/kradfile-u.gz"
}
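
The template values refer to each other through `{JAMDICT_HOME}` and `{JAMDICT_DATA}` placeholders. One way such a chain could be expanded is sketched below with `str.format`; the `resolve` helper is hypothetical, and the diff shown here only confirms that `get_file()` substitutes `JAMDICT_DATA`.

```python
import json
import os

TEMPLATE = '''
{
    "JAMDICT_HOME": "~/.jamdict",
    "JAMDICT_DATA": "{JAMDICT_HOME}/data",
    "JAMDICT_DB": "{JAMDICT_DATA}/jamdict.db",
    "JMNEDICT_XML": "{JAMDICT_DATA}/JMnedict.xml.gz"
}
'''


def resolve(config):
    """Expand ~ in JAMDICT_HOME, then substitute placeholders in the other values."""
    resolved = {"JAMDICT_HOME": os.path.expanduser(config["JAMDICT_HOME"])}
    resolved["JAMDICT_DATA"] = config["JAMDICT_DATA"].format(**resolved)
    for key, value in config.items():
        if key not in resolved:
            # Unused keyword arguments are ignored by str.format.
            resolved[key] = value.format(**resolved)
    return resolved


paths = resolve(json.loads(TEMPLATE))
print(paths["JAMDICT_DB"])
```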