
Goruut Phonemization System - Documentation

Overview

Goruut is an advanced open-source phonemization system that provides bidirectional conversion between written text and IPA phonetic representations. Designed for both research and production use, it offers:

  • High-accuracy phonemization for 130+ languages
  • Compact, efficient models that run on CPU
  • Self-hostable architecture with no external dependencies
  • Support for both word-level and sentence-level processing

Core Capabilities

1. Phonemization (G2P)

Converts orthographic text to International Phonetic Alphabet (IPA) representations. Handles:

  • Regular vocabulary words
  • Out-of-vocabulary words through learned patterns
  • Language-specific phonetic rules
  • Homograph disambiguation (in supported languages only)
  • Tonal languages

2. Dephonemization (P2G)

Reverse process that converts IPA back to written text. Features:

  • Bidirectional consistency with phonemization
  • Handling of multiple possible orthographic representations
  • Context-aware disambiguation

3. Advanced Word Analysis

  • Pronunciation explanation for OOV words
  • Word decomposition explaining the phonetic rules applied

Language Support

Goruut provides specialized models for:

Major Language Families

  • Indo-European (English, Spanish, Hindi, Russian, ...)
  • Sino-Tibetan (Mandarin, ...)
  • Afro-Asiatic (Arabic, Hebrew, ...)
  • Uralic (Finnish, Hungarian, ...)
  • And 20+ other language families

Special Features

  • English: Full homograph support (e.g., "read" → /ɹiːd/ or /ɹɛd/)
  • Chinese: Handles character input
  • Arabic: Supports diacritic restoration
  • Dialects: Separate models for major dialect groups

Technical Architecture

Learning System

Goruut uses a hybrid approach:

  1. Statistical Grammar Induction

    • Automatically learns language-specific G2P/P2G rules
    • Builds longest prefix finite-state-transducer grammar from training data
    • Identifies productive morphological patterns
    • Generates alignments between words and their phonetic representations for much of the lexicon (see the sketch after this list)
  2. Hashtron Transformers

    • Compact weightless networks using integer arithmetic
    • Specialized architectures for different tasks:
      • Phonemization (16-phoneme window)
      • Dephonemization (16-phoneme window)
      • Homograph disambiguation (16-word window)
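The grammar lookup in step 1 can be pictured as repeated longest-prefix matching over learned grapheme-to-phoneme rules. Below is a minimal sketch of that idea in Go; the rule table is hypothetical and purely illustrative, not Goruut's actual induced grammar format.

    package main

    import "fmt"

    // A toy longest-prefix G2P rule table. Real induced grammars are
    // learned from data; these entries are purely illustrative.
    var rules = map[string]string{
        "ch": "ʃ", "eau": "o", "ou": "u", "on": "ɔ̃",
        "b": "b", "j": "ʒ", "r": "ʁ", "a": "a", "t": "",
    }

    // phonemize greedily consumes the longest matching grapheme
    // prefix and emits the corresponding phonemes.
    func phonemize(word string) string {
        out := ""
        runes := []rune(word)
        for len(runes) > 0 {
            matched := false
            for l := len(runes); l > 0; l-- {
                if ph, ok := rules[string(runes[:l])]; ok {
                    out += ph
                    runes = runes[l:]
                    matched = true
                    break
                }
            }
            if !matched { // no rule matches: drop the grapheme
                runes = runes[1:]
            }
        }
        return out
    }

    func main() {
        fmt.Println(phonemize("chat"))    // ʃa (final t silent)
        fmt.Println(phonemize("bonjour")) // bɔ̃ʒuʁ
    }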

Model Specifications

Component            Layers   Attention Heads   Context Window   Model Size
Phonemization        7        4                 16 phonemes      ~1.2MB
Dephonemization      7        4                 16 phonemes      ~1.2MB
Homograph Resolver   7        4                 16 words         ~1.0MB

Training System

Data Requirements

For each language, Goruut needs:

  1. Core Lexicon (lexicon.tsv)

    • Raw word-to-IPA mappings
    • Recommended minimum: 10,000 entries
    • Format: word<tab>pronunciation
  2. Aligned Subset (clean.tsv)

    • High-quality aligned pairs
    • Automatically generated by the system from lexicon.tsv
    • Used for model training
    • Typically 20%-80% of the core lexicon
  3. Homograph Data (multi.tsv) [Optional]

    • Sentence-level examples
    • Format: sentence<tab>pronunciation
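
For illustration, a few hypothetical entries in each format (columns are tab-separated; the IPA shown is illustrative, not taken from the actual dataset):

    lexicon.tsv:
      chat        ʃa
      bonjour     bɔ̃ʒuʁ
      eau         o

    multi.tsv:
      Il a lu le livre.    il a ly lə livʁ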

Training Process

  1. Grammar Induction
    ./study_language.sh french (in cmd/analysis2 folder)

  2. Data Preparation and Alignment
    ./clean_language.sh french (in cmd/analysis2 folder)

  3. Coverage Evaluation
    go build && ./coverage.sh french (in cmd/backtest folder)

  4. Transformer Training
    ./train_language.sh french --maxpremodulo [NUMBER] (in cmd/analysis2 folder; NUMBER is set to 5 times the cleaning complexity)

  5. Homograph Transformer Training (optional)
    ./train_homograph.sh french (in cmd/backtest folder)

API Reference

HTTP Endpoints

  1. Phonemize Sentence
    POST /tts/phonemize/sentence
    Content-Type: application/json
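
A minimal Go client sketch for this endpoint, using the request fields shown in the Docker quickstart below; the raw JSON response is printed as-is, since its schema is not reproduced in this document:

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Request body as in the Docker quickstart example below.
        body := []byte(`{"Language":"English","Sentence":"Test sentence"}`)

        resp, err := http.Post(
            "http://localhost:18080/tts/phonemize/sentence",
            "application/json",
            bytes.NewReader(body),
        )
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // Print the raw JSON response; consult the API schema for
        // the exact field names.
        out, _ := io.ReadAll(resp.Body)
        fmt.Println(string(out))
    }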

Deployment Options

Docker Quickstart

  1. Clone repository
    git clone https://github.com/neurlang/goruut.git
    cd goruut

  2. Build and run
    docker compose up -d --build

  3. Verify
    curl -X POST http://localhost:18080/tts/phonemize/sentence -H "Content-Type: application/json" -d '{"Language":"English","Sentence":"Test sentence"}'

Bare Metal Installation

  1. Requirements:
  • Go 1.20+
  • Python 3.8+ (for inference only)
  • 2GB RAM minimum

  2. Installation:
  • cd ./cmd/goruut
  • go build
  • ./goruut --configfile ../../configs/config.json

Performance Characteristics

Operation         Initial Latency     Average Latency     Memory Usage
Phonemization     236 ms/sentence     2 ms/sentence       100MB
Dephonemization   228 ms/sentence     1 ms/sentence       100MB
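
A minimal sketch of how these latencies might be reproduced against a local instance (endpoint and payload as in the Docker quickstart above; that the first request includes model loading is an assumption here):

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    // timeRequest measures the wall-clock latency of one
    // phonemization request.
    func timeRequest() time.Duration {
        body := []byte(`{"Language":"English","Sentence":"Test sentence"}`)
        start := time.Now()
        resp, err := http.Post(
            "http://localhost:18080/tts/phonemize/sentence",
            "application/json",
            bytes.NewReader(body),
        )
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
        return time.Since(start)
    }

    func main() {
        // First request: initial latency (assumed to include
        // model loading).
        fmt.Println("initial:", timeRequest())
        // Subsequent requests: average (warm) latency.
        for i := 0; i < 5; i++ {
            fmt.Println("warm:   ", timeRequest())
        }
    }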

Contribution Guide

Adding New Languages

  1. Create a language directory in neurlang/dataset, mirroring the directories in dicts/ used for training
  2. Add training data:
  • lexicon.tsv (required)
  • clean.tsv (recommended)
  • multi.tsv (optional, for homographs)
  3. Submit a pull request to neurlang/dataset

Improving Existing Models

  1. Fork repository
  2. Train improved models using:
    ./cmd/analysis2/train_language.sh [LANGUAGE] --maxpremodulo [NUMBER]
  3. Include validation results
  4. Submit PR with new weight files to neurlang/goruut

Limitations and Roadmap

Current Limitations

  • No built-in number pronunciation
  • Limited handling of abbreviations
  • No automatic language detection

Planned Features

  • Numeric expression handling
  • Abbreviation handling
  • Improved OOV learning (reinforcement learning)
  • Enhanced dialect blending

Frequently Asked Questions

Q: How accurate is the English homograph handling?
A: Current models achieve 85% accuracy on the Google Homograph Benchmark.

Q: Can I use this commercially?
A: Yes. Goruut itself is licensed under MIT. However, keep in mind that some of the datasets we train on (e.g., Tibetan) permit only non-commercial use. Consult a lawyer before commercial use of models trained on those languages.

Q: What's the smallest language model?
A: Most models are 0.8-1.5MB. The smallest is probably Esperanto at ~60KB.

Q: How often are models updated?
A: We release quarterly updates with improved models.

Technical Implementation

Q: What algorithms/models power the phonemization?
A: Goruut uses a hybrid system combining Statistical Grammar Induction and Hashtron Transformers (compact weightless neural networks). The architecture includes specialized 7-layer transformers for phonemization, dephonemization, and homograph disambiguation.

Q: How are language models structured?
A: Each language has forward (weights4.json.zlib) and reverse (weights4_reverse.json.zlib) model files containing transformer weights. These are trained separately for G2P and P2G tasks.
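
For illustration, a sketch of inspecting such a weight file in Go, assuming only what the filename implies (a zlib-compressed JSON document); the inner weight schema is not documented here, so the sketch decodes into a generic value:

    package main

    import (
        "compress/zlib"
        "encoding/json"
        "fmt"
        "os"
    )

    func main() {
        // Open a model file, e.g. a language's forward weights.
        f, err := os.Open("weights4.json.zlib")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // The .zlib suffix suggests a zlib stream wrapping JSON.
        zr, err := zlib.NewReader(f)
        if err != nil {
            panic(err)
        }
        defer zr.Close()

        // Decode into a generic value; no concrete struct is
        // assumed for the undocumented weight layout.
        var weights interface{}
        if err := json.NewDecoder(zr).Decode(&weights); err != nil {
            panic(err)
        }
        fmt.Printf("decoded weights of type %T\n", weights)
    }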

Data & Training

Q: How are new languages added?
A: By submitting to neurlang/dataset with:

  • lexicon.tsv (raw word-IPA pairs)
  • clean.tsv (aligned high-quality subset)
  • Optional multi.tsv for homographs

Q: What's the training workflow?
A: Four-stage pipeline:

  1. Grammar induction (study_language.sh)
  2. Data alignment (clean_language.sh)
  3. Coverage evaluation (coverage.sh)
  4. Model training (train_language.sh)

Usage

Q: How to use as a library?
A: Import github.com/neurlang/goruut/lib in Go code. The library exposes phonemization/dephonemization functions with language-specific contexts. In Python, use pygoruut.

Q: HTTP service setup?
A: Run the binary with a config file:

./goruut --configfile configs/config.json

Then query endpoints such as POST /tts/phonemize/sentence (see the Docker quickstart above for a curl example).

Performance

Q: What accuracy metrics are available?
A: The system generates:

  • coverage_*.txt files showing lexicon coverage percentages
  • Validation reports comparing model performance on test sets
  • Homograph disambiguation accuracy (e.g., 85% for English)

Q: How are models evaluated during training?
A: Through:

  1. Automatic generation of weights*.best files (best-performing snapshots)
  2. Alignment quality scores in cleaning reports
  3. Out-of-vocabulary word handling tests

Contributing

Q: How can I improve existing language models?
A: Three-step process:

  1. Fork the repository
  2. Retrain models using:
    ./train_language.sh [LANGUAGE] --maxpremodulo [NUMBER]
  3. Submit pull request with new weight files

Q: What's needed to add a new language?
A: Minimum requirements:

  • lexicon.tsv (10,000+ word-IPA pairs)
  • clean.tsv (aligned subset)
  • Optional multi.tsv for homographs
  • Must follow existing directory structure

Q: How are contributions validated?
A: All submissions are tested via:

  • Coverage verification scripts
  • Alignment quality checks
  • Benchmark comparisons against existing models

Customization

Q: Can I adjust phoneme representations for specific use cases?
A: An IPA Flavors mechanism exists (via configuration) to translate output into a custom IPA-like phone set. The default config file contains an example of how to output espeak phones using this mechanism.
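
The config schema itself is not reproduced here; conceptually, a flavor is a phone-to-phone mapping applied to the IPA output. A hypothetical sketch of that idea in Go (the mapping entries are made up, not Goruut's actual espeak flavor):

    package main

    import (
        "fmt"
        "strings"
    )

    // A hypothetical IPA flavor: maps standard IPA phones to a
    // custom phone set. These entries are illustrative only.
    var flavor = map[string]string{
        "ɹ":  "r",
        "iː": "i:",
        "ɛ":  "E",
    }

    // applyFlavor rewrites each mapped phone in the IPA string.
    // Order-independent here because the keys do not overlap.
    func applyFlavor(ipa string) string {
        for from, to := range flavor {
            ipa = strings.ReplaceAll(ipa, from, to)
        }
        return ipa
    }

    func main() {
        fmt.Println(applyFlavor("ɹiːd")) // prints "ri:d"
    }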

Limitations

Q: Can it process mixed-language text?
A: Yes, but not automatically: the caller must supply the candidate languages in priority order. Automatic language detection is a planned feature.

Q: How does it handle rare/unseen words?
A: Uses learned phonetic patterns, but accuracy declines for:

  • Highly irregular pronunciations
  • Words with no similar training examples
  • Obscure proper nouns

Comparison

Q: How does this compare to eSpeak/Festival?
A: Key differentiators:

Feature           Goruut                     eSpeak/Festival
Architecture      Hashtron Transformers      Rule-based
Bidirectional     Yes (G2P + P2G)            G2P only
Model Size        0.8-1.5MB per language     Larger
Homographs        Context-aware (85% acc.)   Limited handling
Tonal Languages   Full support               Basic support
Self-hostable     Yes, no external deps      Depends on system

Support and Community

License

MIT - See LICENSE file for complete terms. See neurlang/dataset regarding the licenses for the training data used.