Goruut is an advanced open-source phonemization system that provides bidirectional conversion between written text and IPA phonetic representations. Designed for both research and production use, it offers:
- High-accuracy phonemization for 130+ languages
- Compact, efficient models that run on CPU
- Self-hostable architecture with no external dependencies
- Support for both word-level and sentence-level processing
Phonemization converts orthographic text to International Phonetic Alphabet (IPA) representations. It handles:
- Regular vocabulary words
- Out-of-vocabulary words through learned patterns
- Language-specific phonetic rules
- Homograph disambiguation (in supported languages only)
- Tonal languages
Dephonemization is the reverse process, converting IPA back to written text. Features:
- Bidirectional consistency with phonemization
- Handling of multiple possible orthographic representations
- Context-aware disambiguation
- Pronunciation explanation for OOV words
- Word decomposition explaining the applied phonetic rules
Goruut provides specialized models for:
- Indo-European (English, Spanish, Hindi, Russian, ...)
- Sino-Tibetan (Mandarin, ...)
- Afro-Asiatic (Arabic, Hebrew, ...)
- Uralic (Finnish, Hungarian, ...)
- And 20+ other language families
- English: Full homograph support (e.g., "read" → /ɹiːd/ or /ɹɛd/)
- Chinese: Handles character input
- Arabic: Supports diacritic restoration
- Dialects: Separate models for major dialect groups
Goruut uses a hybrid approach:

- Statistical Grammar Induction
  - Automatically learns language-specific G2P/P2G rules
  - Builds a longest-prefix finite-state-transducer grammar from training data (see the sketch after this list)
  - Identifies productive morphological patterns
  - Generates alignments between words and their phonetic representations for much of the lexicon
- Hashtron Transformers
  - Compact weightless networks using integer arithmetic
  - Specialized architectures for different tasks:
    - Phonemization (16-phoneme window)
    - Dephonemization (16-phoneme window)
    - Homograph disambiguation (16-word window)
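To give an intuition for the longest-prefix strategy, the following Go sketch applies a toy rule table greedily, always consuming the longest matching orthographic fragment. The rule table is hypothetical and far simpler than Goruut's induced grammars:

```go
package main

import "fmt"

// rules maps orthographic fragments to phoneme fragments.
// These entries are hypothetical examples, not Goruut's induced grammar.
var rules = map[string]string{
	"ch": "tʃ", "ee": "iː", "s": "s", "e": "ɛ", "c": "k", "h": "h",
}

// phonemize greedily consumes the longest matching prefix at each
// position, mirroring a longest-prefix transducer traversal.
func phonemize(word string) string {
	out := ""
	runes := []rune(word)
	for i := 0; i < len(runes); {
		matched := false
		// Try the longest remaining prefix first, then shorter ones.
		for j := len(runes); j > i; j-- {
			if ph, ok := rules[string(runes[i:j])]; ok {
				out += ph
				i = j
				matched = true
				break
			}
		}
		if !matched {
			i++ // skip characters with no rule (simplification)
		}
	}
	return out
}

func main() {
	fmt.Println(phonemize("cheese")) // tʃiːsɛ (toy grammar, toy output)
}
```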
| Component | Layers | Attention Heads | Context Window | Model Size |
|---|---|---|---|---|
| Phonemization | 7 | 4 | 16 phonemes | ~1.2MB |
| Dephonemization | 7 | 4 | 16 phonemes | ~1.2MB |
| Homograph Resolver | 7 | 4 | 16 words | ~1.0MB |
For each language, Goruut needs:

- Core Lexicon (`lexicon.tsv`)
  - Raw word-to-IPA mappings (see the sample after this list)
  - Minimum: 10,000 entries recommended
  - Format: `word<tab>pronunciation`
- Aligned Subset (`clean.tsv`)
  - High-quality aligned pairs
  - Automatically generated by the system from `lexicon.tsv`
  - Used for model training
  - Typically 20%-80% of the core lexicon
- Homograph Data (`multi.tsv`) [Optional]
  - Sentence-level examples
  - Format: `sentence<tab>pronunciation`
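For illustration, `lexicon.tsv` entries are tab-separated word/IPA pairs like the following (the transcriptions here are illustrative examples, not taken from any shipped lexicon):

```
hello	həloʊ
world	wɜːld
```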
1. Grammar Induction: run `./study_language.sh french` (in the `cmd/analysis2` folder)
2. Data Preparation and Alignment: run `./clean_language.sh french` (in the `cmd/analysis2` folder)
3. Coverage Evaluation: run `go build && ./coverage.sh french` (in the `cmd/backtest` folder)
4. Transformer Training: run `./train_language.sh french --maxpremodulo [NUMBER]` (in the `cmd/analysis2` folder; NUMBER is set to 5 times the cleaning complexity)
5. Homograph Transformer Training (optional): run `./train_homograph.sh french` (in the `cmd/backtest` folder)
- Phonemize Sentence: `POST /tts/phonemize/sentence` with `Content-Type: application/json`
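The request body is a JSON object with `Language` and `Sentence` fields (the same shape used by the curl command in the quick start below):

```json
{
  "Language": "English",
  "Sentence": "Test sentence"
}
```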
1. Clone the repository:

   ```bash
   git clone https://github.com/neurlang/goruut.git
   cd goruut
   ```

2. Build and run:

   ```bash
   docker compose up -d --build
   ```

3. Verify:

   ```bash
   curl -X POST http://localhost:18080/tts/phonemize/sentence \
     -H "Content-Type: application/json" \
     -d '{"Language":"English","Sentence":"Test sentence"}'
   ```
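For programmatic access, the same request can be issued from Go. This is a minimal sketch that assumes only the request shape shown above; it prints the raw response body because the response schema is not documented here:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Request shape taken from the curl example above.
	body, _ := json.Marshal(map[string]string{
		"Language": "English",
		"Sentence": "Test sentence",
	})
	resp, err := http.Post(
		"http://localhost:18080/tts/phonemize/sentence",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// Print the raw JSON response; the schema is not documented here.
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```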
- Requirements:
  - Go 1.20+
  - Python 3.8+ (for inference only)
  - 2GB RAM minimum
- Installation:

  ```bash
  cd ./cmd/goruut
  go build
  ./goruut --configfile ../../configs/config.json
  ```
| Operation | Initial Latency | Average Latency | Memory Usage |
|---|---|---|---|
| Phonemization | 236 ms/sentence | 2 ms/sentence | 100MB |
| Dephonemization | 228 ms/sentence | 1 ms/sentence | 100MB |

The higher initial latency most likely reflects one-time model loading on the first request.
- Create a language directory in neurlang/dataset, named the same as the corresponding directory in `dicts/` used for training (see the hypothetical layout below)
- Add training data:
  - `lexicon.tsv` (required)
  - `clean.tsv` (recommended)
  - `multi.tsv` (optional, for homographs)
- Submit a pull request to neurlang/dataset
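As a purely hypothetical illustration of the expected layout (mirror the existing directories in `dicts/` for the authoritative structure), a French submission might look like:

```
French/
├── lexicon.tsv   (required)
├── clean.tsv     (recommended)
└── multi.tsv     (optional, for homographs)
```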
- Fork the repository
- Train improved models using `./cmd/analysis2/train_language.sh [LANGUAGE] --maxpremodulo [NUMBER]`
- Include validation results
- Submit a PR with the new weight files to neurlang/goruut
Current limitations:
- No built-in number pronunciation
- Limited handling of abbreviations
- No automatic language detection

Planned improvements:
- Numeric expression handling
- Abbreviation handling
- Improved OOV learning (reinforcement learning)
- Enhanced dialect blending
Q: How accurate is the English homograph handling?
A: Current models achieve 85% accuracy on the Google Homograph Benchmark.
Q: Can I use this commercially?
A: Yes, Goruut itself is licensed under MIT. However, keep in mind that some of the datasets used for training (e.g., Tibetan) permit only non-commercial use. Consult a lawyer if you need the models trained on those languages.
Q: What's the smallest language model?
A: Most models are 0.8-1.5MB. The smallest is probably Esperanto at ~60KB.
Q: How often are models updated?
A: We release quarterly updates with improved models.
Q: What algorithms/models power the phonemization?
A: Goruut uses a hybrid system combining Statistical Grammar Induction and Hashtron Transformers (compact weightless neural networks). The architecture includes specialized 7-layer transformers for phonemization, dephonemization, and homograph disambiguation.
Q: How are language models structured?
A: Each language has a forward (`weights4.json.zlib`) and a reverse (`weights4_reverse.json.zlib`) model file containing transformer weights. These are trained separately for the G2P and P2G tasks.
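Because these weight files are zlib-compressed JSON, they can be inspected with standard tooling. A minimal Go sketch, assuming a hypothetical `dicts/english/` path (point it at an actual language directory):

```go
package main

import (
	"compress/zlib"
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Path is illustrative; point it at an actual language's weight file.
	f, err := os.Open("dicts/english/weights4.json.zlib")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	r, err := zlib.NewReader(f) // weight files are zlib-compressed JSON
	if err != nil {
		panic(err)
	}
	defer r.Close()
	var weights any
	if err := json.NewDecoder(r).Decode(&weights); err != nil {
		panic(err)
	}
	fmt.Printf("decoded top-level type: %T\n", weights)
}
```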
Q: How are new languages added?
A: By submitting to neurlang/dataset with:
- `lexicon.tsv` (raw word-IPA pairs)
- `clean.tsv` (aligned high-quality subset)
- Optional `multi.tsv` for homographs
Q: What's the training workflow?
A: Four-stage pipeline:
1. Grammar induction (`study_language.sh`)
2. Data alignment (`clean_language.sh`)
3. Coverage evaluation (`coverage.sh`)
4. Model training (`train_language.sh`)
Q: How to use as a library?
A: Import `github.com/neurlang/goruut/lib` in Go code. The library exposes phonemization/dephonemization functions with language-specific contexts. In Python, use `pygoruut`.
Q: HTTP service setup?
A: Run the binary with a config file (`./goruut --configfile configs/config.json`), then query endpoints like `POST /tts/phonemize/sentence`.
Q: What accuracy metrics are available?
A: The system generates:
- `coverage_*.txt` files showing lexicon coverage percentages
- Validation reports comparing model performance on test sets
- Homograph disambiguation accuracy figures (e.g., 85% for English)
Q: How are models evaluated during training?
A: Through:
- Automatic generation of `weights*.best` files (best-performing snapshots)
- Alignment quality scores in cleaning reports
- Out-of-vocabulary word handling tests
Q: How can I improve existing language models?
A: Three-step process:
1. Fork the repository
2. Retrain models using `./train_language.sh [LANGUAGE] --maxpremodulo [NUMBER]`
3. Submit a pull request with the new weight files
Q: What's needed to add a new language?
A: Minimum requirements:
- `lexicon.tsv` (10,000+ word-IPA pairs)
- `clean.tsv` (aligned subset)
- Optional `multi.tsv` for homographs
- Must follow the existing directory structure
Q: How are contributions validated?
A: All submissions are tested via:
- Coverage verification scripts
- Alignment quality checks
- Benchmark comparisons against existing models
Q: Can I adjust phoneme representations for specific use cases?
A: Yes. An IPA Flavors mechanism exists (via configuration) to translate to a custom IPA-like phone set. The default config file contains an example of how to output espeak phones using this system.
Q: Can it process mixed-language text?
A: Yes, but not automatically: the input languages must be specified by the user, in priority order. Automatic language detection is a planned feature.
Q: How does it handle rare/unseen words?
A: Uses learned phonetic patterns, but accuracy declines for:
- Highly irregular pronunciations
- Words with no similar training examples
- Obscure proper nouns
Q: How does this compare to eSpeak/Festival?
A: Key differentiators:
| Feature | Goruut | eSpeak/Festival |
|---|---|---|
| Architecture | Hashtron Transformers | Rule-based |
| Bidirectional | Yes (G2P + P2G) | G2P only |
| Model Size | 0.8-1.5MB per language | Larger |
| Homographs | Context-aware (85% acc.) | Limited handling |
| Tonal Languages | Full support | Basic support |
| Self-hostable | Yes, no external deps | Depends on system |
- GitHub Discussions: https://github.com/neurlang/goruut/discussions
- Issue Tracker: https://github.com/neurlang/goruut/issues
- Contributor Chat: Discord invite in README.md
MIT - See the LICENSE file for complete terms, and see neurlang/dataset for the licenses of the training data used.