
Goruut Phonemization System - Documentation

Overview

Goruut is an advanced open-source phonemization system that provides bidirectional conversion between written text and IPA phonetic representations. Designed for both research and production use, it offers:

  • High-accuracy phonemization for 130+ languages
  • Compact, efficient models that run on CPU
  • Self-hostable architecture with no external dependencies
  • Support for both word-level and sentence-level processing

Core Capabilities

1. Phonemization (G2P)

Converts orthographic text to International Phonetic Alphabet (IPA) representations. Handles:

  • Regular vocabulary words
  • Out-of-vocabulary words through learned patterns
  • Language-specific phonetic rules
  • Homograph disambiguation (in supported languages only)
  • Tonal languages

2. Dephonemization (P2G)

Reverse process that converts IPA back to written text. Features:

  • Bidirectional consistency with phonemization
  • Handling of multiple possible orthographic representations
  • Context-aware disambiguation

3. Advanced Word Analysis

  • Pronunciation explanation for OOV words
  • Word decomposition explaining the phonetic rules applied

Language Support

Goruut provides specialized models for:

Major Language Families

  • Indo-European (English, Spanish, Hindi, Russian, ...)
  • Sino-Tibetan (Mandarin, ...)
  • Afro-Asiatic (Arabic, Hebrew, ...)
  • Uralic (Finnish, Hungarian, ...)
  • And 20+ other language families

Special Features

  • English: Full homograph support (e.g., "read" → /ɹiːd/ or /ɹɛd/)
  • Chinese: Handles character input
  • Arabic: Supports diacritic restoration
  • Dialects: Separate models for major dialect groups

Technical Architecture

Learning System

Goruut uses a hybrid approach:

  1. Statistical Grammar Induction

    • Automatically learns language-specific G2P/P2G rules
    • Builds longest prefix finite-state-transducer grammar from training data
    • Identifies productive morphological patterns
    • Generates alignments between words and their phonetic representations for much of the lexicon (see the sketch after this list)
  2. Hashtron Transformers

    • Compact weightless networks using integer arithmetic
    • Specialized architectures for different tasks:
      • Phonemization (16-phoneme window)
      • Dephonemization (16-phoneme window)
      • Homograph disambiguation (16-word window)
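The grammar lookup in step 1 can be pictured as repeated longest-prefix matching over learned grapheme-to-phoneme rules. Below is a minimal sketch of that idea in Go; the rule table is hypothetical and purely illustrative, not Goruut's actual induced grammar format.

    package main

    import "fmt"

    // A toy longest-prefix G2P rule table. Real induced grammars are
    // learned from data; these entries are purely illustrative.
    var rules = map[string]string{
        "ch": "ʃ", "eau": "o", "ou": "u", "on": "ɔ̃",
        "b": "b", "j": "ʒ", "r": "ʁ", "a": "a", "t": "",
    }

    // phonemize greedily consumes the longest matching grapheme
    // prefix and emits the corresponding phonemes.
    func phonemize(word string) string {
        out := ""
        runes := []rune(word)
        for len(runes) > 0 {
            matched := false
            for l := len(runes); l > 0; l-- {
                if ph, ok := rules[string(runes[:l])]; ok {
                    out += ph
                    runes = runes[l:]
                    matched = true
                    break
                }
            }
            if !matched { // no rule matches: drop the grapheme
                runes = runes[1:]
            }
        }
        return out
    }

    func main() {
        fmt.Println(phonemize("chat"))    // ʃa (final t silent)
        fmt.Println(phonemize("bonjour")) // bɔ̃ʒuʁ
    }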

Model Specifications

Component            Layers   Attention Heads   Context Window   Model Size
Phonemization        7        4                 16 phonemes      ~1.2MB
Dephonemization      7        4                 16 phonemes      ~1.2MB
Homograph Resolver   7        4                 16 words         ~1.0MB

Training System

Data Requirements

For each language, Goruut needs:

  1. Core Lexicon (lexicon.tsv)

    • Raw word-to-IPA mappings
    • Recommended minimum: 10,000 entries
    • Format: word<tab>pronunciation
  2. Aligned Subset (clean.tsv)

    • High-quality aligned pairs
    • Automatically generated by the system from lexicon.tsv
    • Used for model training
    • Typically 20%-80% of the core lexicon
  3. Homograph Data (multi.tsv) [Optional]

    • Sentence-level examples
    • Format: sentence<tab>pronunciation
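
For illustration, a few hypothetical entries in each format (columns are tab-separated; the IPA shown is illustrative, not taken from the actual dataset):

    lexicon.tsv:
      chat        ʃa
      bonjour     bɔ̃ʒuʁ
      eau         o

    multi.tsv:
      Il a lu le livre.    il a ly lə livʁ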

Training Process

  1. Grammar Induction
    ./study_language.sh french (in cmd/analysis2 folder)

  2. Data Preparation and Alignment
    ./clean_language.sh french (in cmd/analysis2 folder)

  3. Coverage Evaluation
    go build && ./coverage.sh french (in cmd/backtest folder)

  4. Transformer Training
    ./train_language.sh french --maxpremodulo [NUMBER] (in cmd/analysis2 folder; NUMBER is set to 5 times the cleaning complexity)

  5. Homograph Transformer Training (optional)
    ./train_homograph.sh french (in cmd/backtest folder)

API Reference

HTTP Endpoints

  1. Phonemize Sentence
    POST /tts/phonemize/sentence
    Content-Type: application/json
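
A minimal Go client sketch for this endpoint, using the request fields shown in the Docker quickstart below; the raw JSON response is printed as-is, since its schema is not reproduced in this document:

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Request body as in the Docker quickstart example below.
        body := []byte(`{"Language":"English","Sentence":"Test sentence"}`)

        resp, err := http.Post(
            "http://localhost:18080/tts/phonemize/sentence",
            "application/json",
            bytes.NewReader(body),
        )
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // Print the raw JSON response; consult the API schema for
        // the exact field names.
        out, _ := io.ReadAll(resp.Body)
        fmt.Println(string(out))
    }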

Deployment Options

Docker Quickstart

  1. Clone repository
    git clone https://github.com/neurlang/goruut.git
    cd goruut

  2. Build and run
    docker compose up -d --build

  3. Verify
    curl -X POST http://localhost:18080/tts/phonemize/sentence -H "Content-Type: application/json" -d '{"Language":"English","Sentence":"Test sentence"}'

Bare Metal Installation

  1. Requirements:
  • Go 1.20+
  • Python 3.8+ (for inference only)
  • 2GB RAM minimum

  2. Installation:
  • cd ./cmd/goruut
  • go build
  • ./goruut --configfile ../../configs/config.json

Performance Characteristics

Operation         Initial Latency     Average Latency     Memory Usage
Phonemization     236 ms/sentence     2 ms/sentence       100MB
Dephonemization   228 ms/sentence     1 ms/sentence       100MB
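
A minimal sketch of how these latencies might be reproduced against a local instance (endpoint and payload as in the Docker quickstart above; that the first request includes model loading is an assumption here):

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    // timeRequest measures the wall-clock latency of one
    // phonemization request.
    func timeRequest() time.Duration {
        body := []byte(`{"Language":"English","Sentence":"Test sentence"}`)
        start := time.Now()
        resp, err := http.Post(
            "http://localhost:18080/tts/phonemize/sentence",
            "application/json",
            bytes.NewReader(body),
        )
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
        return time.Since(start)
    }

    func main() {
        // First request: initial latency (assumed to include
        // model loading).
        fmt.Println("initial:", timeRequest())
        // Subsequent requests: average (warm) latency.
        for i := 0; i < 5; i++ {
            fmt.Println("warm:   ", timeRequest())
        }
    }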

Contribution Guide

Adding New Languages

  1. Create a language directory in neurlang/dataset, mirroring the directories in dicts/ used for training
  2. Add training data:
  • lexicon.tsv (required)
  • clean.tsv (recommended)
  • multi.tsv (optional, for homographs)
  3. Submit a pull request to neurlang/dataset

Improving Existing Models

  1. Fork repository
  2. Train improved models using:
    ./cmd/analysis2/train_language.sh [LANGUAGE] --maxpremodulo [NUMBER]
  3. Include validation results
  4. Submit PR with new weight files to neurlang/goruut

Limitations and Roadmap

Current Limitations

  • No built-in number pronunciation
  • Limited handling of abbreviations
  • No automatic language detection

Planned Features

  • Numeric expression handling
  • Abbreviation handling
  • Improved OOV learning (reinforcement learning)
  • Enhanced dialect blending

Frequently Asked Questions

Q: How accurate is the English homograph handling?
A: Current models achieve 85% accuracy on the Google Homograph Benchmark.

Q: Can I use this commercially?
A: Yes. Goruut itself is licensed under MIT. However, keep in mind that some of the datasets we train on (e.g., Tibetan) permit only non-commercial use. Consult a lawyer before commercial use of models trained on those languages.

Q: What's the smallest language model?
A: Most models are 0.8-1.5MB. The smallest is probably Esperanto at ~60KB.

Q: How often are models updated?
A: We release quarterly updates with improved models.

Technical Implementation

Q: What algorithms/models power the phonemization?
A: Goruut uses a hybrid system combining Statistical Grammar Induction and Hashtron Transformers (compact weightless neural networks). The architecture includes specialized 7-layer transformers for phonemization, dephonemization, and homograph disambiguation.

Q: How are language models structured?
A: Each language has forward (weights4.json.zlib) and reverse (weights4_reverse.json.zlib) model files containing transformer weights. These are trained separately for G2P and P2G tasks.
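
For illustration, a sketch of inspecting such a weight file in Go, assuming only what the filename implies (a zlib-compressed JSON document); the inner weight schema is not documented here, so the sketch decodes into a generic value:

    package main

    import (
        "compress/zlib"
        "encoding/json"
        "fmt"
        "os"
    )

    func main() {
        // Open a model file, e.g. a language's forward weights.
        f, err := os.Open("weights4.json.zlib")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // The .zlib suffix suggests a zlib stream wrapping JSON.
        zr, err := zlib.NewReader(f)
        if err != nil {
            panic(err)
        }
        defer zr.Close()

        // Decode into a generic value; no concrete struct is
        // assumed for the undocumented weight layout.
        var weights interface{}
        if err := json.NewDecoder(zr).Decode(&weights); err != nil {
            panic(err)
        }
        fmt.Printf("decoded weights of type %T\n", weights)
    }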

Data & Training

Q: How are new languages added?
A: By submitting to neurlang/dataset with:

  • lexicon.tsv (raw word-IPA pairs)
  • clean.tsv (aligned high-quality subset)
  • Optional multi.tsv for homographs

Q: What's the training workflow?
A: Four-stage pipeline:

  1. Grammar induction (study_language.sh)
  2. Data alignment (clean_language.sh)
  3. Coverage evaluation (coverage.sh)
  4. Model training (train_language.sh)

Usage

Q: How to use as a library?
A: Import github.com/neurlang/goruut/lib in Go code. The library exposes phonemization/dephonemization functions with language-specific contexts. In Python, use pygoruut.

Q: HTTP service setup?
A: Run the binary with a config file:

./goruut --configfile configs/config.json

Then query endpoints such as POST /tts/phonemize/sentence (see the Docker quickstart above for a curl example).

Performance

Q: What accuracy metrics are available?
A: The system generates:

  • coverage_*.txt files showing lexicon coverage percentages
  • Validation reports comparing model performance on test sets
  • Homograph disambiguation accuracy (e.g., 85% for English)

Q: How are models evaluated during training?
A: Through:

  1. Automatic generation of weights*.best files (best-performing snapshots)
  2. Alignment quality scores in cleaning reports
  3. Out-of-vocabulary word handling tests

Contributing

Q: How can I improve existing language models?
A: Three-step process:

  1. Fork the repository
  2. Retrain models using:
    ./train_language.sh [LANGUAGE] --maxpremodulo [NUMBER]
  3. Submit pull request with new weight files

Q: What's needed to add a new language?
A: Minimum requirements:

  • lexicon.tsv (10,000+ word-IPA pairs)
  • clean.tsv (aligned subset)
  • Optional multi.tsv for homographs
  • Must follow existing directory structure

Q: How are contributions validated?
A: All submissions are tested via:

  • Coverage verification scripts
  • Alignment quality checks
  • Benchmark comparisons against existing models

Customization

Q: Can I adjust phoneme representations for specific use cases?
A: An IPA Flavors mechanism exists (via configuration) to translate output into a custom IPA-like phone set. The default config file contains an example of how to output espeak phones using this mechanism.
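
The config schema itself is not reproduced here; conceptually, a flavor is a phone-to-phone mapping applied to the IPA output. A hypothetical sketch of that idea in Go (the mapping entries are made up, not Goruut's actual espeak flavor):

    package main

    import (
        "fmt"
        "strings"
    )

    // A hypothetical IPA flavor: maps standard IPA phones to a
    // custom phone set. These entries are illustrative only.
    var flavor = map[string]string{
        "ɹ":  "r",
        "iː": "i:",
        "ɛ":  "E",
    }

    // applyFlavor rewrites each mapped phone in the IPA string.
    // Order-independent here because the keys do not overlap.
    func applyFlavor(ipa string) string {
        for from, to := range flavor {
            ipa = strings.ReplaceAll(ipa, from, to)
        }
        return ipa
    }

    func main() {
        fmt.Println(applyFlavor("ɹiːd")) // prints "ri:d"
    }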

Limitations

Q: Can it process mixed-language text?
A: Yes, but not automatically: the caller must supply the candidate languages in priority order. Automatic language detection is a planned feature.

Q: How does it handle rare/unseen words?
A: Uses learned phonetic patterns, but accuracy declines for:

  • Highly irregular pronunciations
  • Words with no similar training examples
  • Obscure proper nouns

Comparison

Q: How does this compare to eSpeak/Festival?
A: Key differentiators:

Feature           Goruut                     eSpeak/Festival
Architecture      Hashtron Transformers      Rule-based
Bidirectional     Yes (G2P + P2G)            G2P only
Model Size        0.8-1.5MB per language     Larger
Homographs        Context-aware (85% acc.)   Limited handling
Tonal Languages   Full support               Basic support
Self-hostable     Yes, no external deps      Depends on system

Support and Community

License

MIT - See LICENSE file for complete terms. See neurlang/dataset regarding the licenses for the training data used.