A neural network that hyphenates German words. Following B. Fritzke and C. Nasahl, "A neural network that learns to do hyphenation", it uses a window of 8 characters and determines whether the word can be hyphenated after position 4 of the current window. Currently the network achieves a success rate of 99.2 %.
The network input is a one-hot encoding over the character set; the output is a single value indicating the probability that a hyphenation at the current position is valid.
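For illustration, here is a minimal Keras sketch of this scheme. It is not the actual model from train.py; the alphabet, padding character, layer sizes, optimizer and loss are assumptions.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Assumed character set; the real alphabet is derived from the training data.
ALPHABET = "_abcdefghijklmnopqrstuvwxyzäöüß"   # "_" doubles as padding
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
WINDOW = 8   # window size from the paper

def encode_window(window):
    """One-hot encode an 8-character window into a flat input vector."""
    x = np.zeros((WINDOW, len(ALPHABET)), dtype=np.float32)
    for i, c in enumerate(window):
        x[i, CHAR_INDEX.get(c, 0)] = 1.0
    return x.reshape(-1)

# A small fully connected network; the single sigmoid output is the
# probability that a hyphen is valid after position 4 of the window.
model = Sequential([
    Dense(128, activation='relu', input_dim=WINDOW * len(ALPHABET)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])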
- Python 3 with TensorFlow and Keras installed
- Rust
- Download the Wiktionary dump from https://dumps.wikimedia.org/dewiktionary/latest/ (look for dewiktionary-{timestamp}-pages-articles.xml.bz2) and unpack it
- Compile and run prepare_data:
$ cd prepare_data
$ cargo run --release ../data/dewiktionary-*-pages-articles.xml ../wordlist.txt
Notes: There is a fair amount of post-processing involved to clean up the data. The whole process may work with other languages, but the data cleanup will probably need some adaptation. Also note that the preprocessing randomizes the order of entries.
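To make the training data concrete, here is a hypothetical sketch of how a wordlist entry with marked hyphenation points (e.g. Sil·ben·tren·nung) could be expanded into (window, label) pairs. The separator, padding character and exact windowing are assumptions; the real logic lives in prepare_data and train.py.

PAD = "_"   # assumed padding character
SEP = "·"   # assumed hyphenation marker in wordlist.txt
WINDOW = 8  # a hyphen is decided after position 4 of the window

def training_pairs(marked_word):
    """Yield (window, label) pairs for every possible split point of a word."""
    letters = ""
    valid = set()            # letter counts after which a hyphen is allowed
    for ch in marked_word:
        if ch == SEP:
            valid.add(len(letters))
        else:
            letters += ch.lower()
    padded = PAD * 3 + letters + PAD * 4
    for k in range(1, len(letters)):            # split after letter k
        window = padded[k - 1:k + WINDOW - 1]   # letter k sits at window position 4
        yield window, 1 if k in valid else 0

# Example: list(training_pairs("Sil·ben·tren·nung"))
# -> [("___silbe", 0), ("__silben", 0), ("_silbent", 1), ("silbentr", 0), ...]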
To train the network, run:
$ python train.py
Using TensorFlow backend.
Building model...
Done
Training model...
2017-05-13 21:58:40.012420: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] OS X does not support NUMA - returning NUMA node zero
2017-05-13 21:58:40.012582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 1001.96MiB
2017-05-13 21:58:40.012595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-13 21:58:40.012600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-13 21:58:40.012610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2000, pci bus id: 0000:02:00.0)
Epoch 1/50
1253797/1253797 [==============================] - 7s - loss: 0.0450 - acc: 0.9468
Epoch 2/50
1253797/1253797 [==============================] - 7s - loss: 0.0218 - acc: 0.9753
...
Epoch 48/50
1253797/1253797 [==============================] - 7s - loss: 0.0076 - acc: 0.9920
Epoch 49/50
1253797/1253797 [==============================] - 7s - loss: 0.0076 - acc: 0.9919
Epoch 50/50
1253797/1253797 [==============================] - 7s - loss: 0.0076 - acc: 0.9920
Done
Time: 0:06:05.55
The trained network weights are stored in data/model.h5.
$ python validate.py
Using TensorFlow backend.
Building model...
Done
2017-05-14 00:50:01.683992: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] OS X does not support NUMA - returning NUMA node zero
2017-05-14 00:50:01.684174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 720.18MiB
2017-05-14 00:50:01.684190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-14 00:50:01.684195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-14 00:50:01.684206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2000, pci bus id: 0000:02:00.0)
Validating model...
1253536/1253797 [============================>.] - ETA: 0s
Done
Result: [0.0073155245152076989, 0.99240706430147785]
Time: 0:00:40.17
Note: The first value is the mean squared error; the second is the achieved accuracy.
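The same pair of numbers can be reproduced by loading the stored model and calling Keras' evaluate(). A sketch, assuming data/model.h5 was written with model.save():

from keras.models import load_model

# Load the trained network; if only the weights were written, rebuild the
# model first and call model.load_weights('data/model.h5') instead.
model = load_model('data/model.h5')

# x_val, y_val are the one-hot encoded windows and 0/1 labels
# (see train.py / validate.py for the actual encoding).
# evaluate() returns [loss, accuracy] -- the two numbers shown as "Result".
# loss, accuracy = model.evaluate(x_val, y_val, batch_size=1024)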
$ python predict.py Silbentrennung
Using TensorFlow backend.
Building model...
Done
2017-05-14 00:49:08.660959: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] OS X does not support NUMA - returning NUMA node zero
2017-05-14 00:49:08.661122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 738.18MiB
2017-05-14 00:49:08.661136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-14 00:49:08.661141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-14 00:49:08.661152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2000, pci bus id: 0000:02:00.0)
Input: Silbentrennung
Hyphenation: Sil·ben·tren·nung
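Conceptually, predict.py scores one window per possible split point of the input word and inserts a separator wherever the predicted probability exceeds a threshold. A sketch of that post-processing step; the encode_window helper from the encoding sketch above, the padding and the 0.5 threshold are assumptions:

import numpy as np
from keras.models import load_model

def hyphenate(word, model, encode_window, threshold=0.5):
    """Insert "·" wherever the network predicts a valid split point."""
    padded = "_" * 3 + word.lower() + "_" * 4
    # one 8-character window per possible split point (after letter 1 .. n-1)
    windows = np.array([encode_window(padded[k - 1:k + 7])
                        for k in range(1, len(word))])
    probs = model.predict(windows).reshape(-1)
    out = word[0]
    for k in range(1, len(word)):
        if probs[k - 1] > threshold:
            out += "·"
        out += word[k]
    return out

# Example usage (with encode_window from the encoding sketch above):
# model = load_model('data/model.h5')
# print(hyphenate("Silbentrennung", model, encode_window))   # Sil·ben·tren·nung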
Setup: See https://gist.github.com/Mistobaan/dd32287eeb6859c6668d
Configure:
PYTHON_BIN_PATH=$(which python) CUDA_TOOLKIT_PATH="/usr/local/cuda" CUDNN_INSTALL_PATH="/usr/local/cuda" TF_UNOFFICIAL_SETTING=1 TF_NEED_CUDA=1 TF_CUDA_COMPUTE_CAPABILITIES="3.0" TF_CUDNN_VERSION="6" TF_CUDA_VERSION="8.0" TF_CUDA_VERSION_TOOLKIT=8.0 ./configure
Note: Use defaults everywhere.
Build:
bazel build -c opt --copt=-mavx --copt=-msse4.2 --config=cuda //tensorflow/tools/pip_package:build_pip_package
If you encounter problems with Library not loaded: @rpath/libcudart.8.0.dylib, follow http://stackoverflow.com/a/40007947/997063.
If you encounter problems regarding -lgomp, replace -lgomp with -L/usr/local/Cellar/llvm/4.0.0/lib/libiomp5.dylib in third_party/gpus/cuda/BUILD.tpl, or comment out the -lgomp line.
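Once the build succeeds, the standard TensorFlow packaging step should produce an installable wheel, roughly (the output directory is arbitrary):
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install /tmp/tensorflow_pkg/tensorflow-*.whl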
- http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
- http://blog.aloni.org/posts/backprop-with-tensorflow/
- B. Fritzke and C. Nasahl, "A neural network that learns to do hyphenation," IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 1991, vol. 2, pp. 960. doi: 10.1109/IJCNN.1991.155602