URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base

URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.

For more information, see our full paper.

Citation

If you use this code for your research, please cite the following work:

@inproceedings{khan-etal-2025-uriel,
    title = "{URIEL}+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base",
    author = {Khan, Aditya  and
      Shipton, Mason  and
      Anugraha, David  and
      Duan, Kaiyao  and
      Hoang, Phuong H.  and
      Khiu, Eric  and
      Do{\u{g}}ru{\"o}z, A. Seza  and
      Lee, En-Shiun Annie},
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.463/",
    pages = "6937--6952",
    abstract = "URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies."
}

If you have any questions, you can open a GitHub Issue or send us an email.

Check out ExploRIEL, our online UI for URIEL+: https://uriel-leelab.streamlit.app/

Environment

Python 3.10 or later is required. If you use the MIDASpy extra dependencies, the Python version must be below 3.11. Details of the dependencies are in setup.py. NOTE: There are known issues with the MIDASpy extra dependencies, so for the time being please use a Python version from 3.10 up to, but not including, 3.11.

Setup Instruction

To get started with URIEL+:

```
pip install urielplus
```

```
from urielplus import urielplus

u = urielplus.URIELPlus()
```

Configuration Options Examples

URIEL+ offers various configurations that you can adjust:
- Caching: Enable or disable caching (True or False).
- Aggregation Method: Choose the method for aggregating data across sources ('U' for union, 'A' for average).
- Fill Missing Data: Decide whether to fill missing data using parent language data (True or False).
- Distance Metric: Specify the distance metric to be used ("angular" or "cosine").
Changing A Configuration:
```
u.set_{configuration}({option})
```
Checking A Configuration:
```
u.get_{configuration}()
```
Replace {configuration} with cache, aggregation, fill_with_base_lang, or distance_metric.
Replace {option} with your desired value for the selected configuration.
Note: the default configurations are cache=False, aggregation='U', fill_with_base_lang=True, and distance_metric="angular".
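
To give a feel for how the two distance metric options differ, here is a minimal sketch using common textbook definitions: cosine distance as one minus cosine similarity, and angular distance as the angle between the vectors divided by pi. These formulas and the toy vectors are assumptions for illustration only; the README does not state the exact formulas URIEL+ uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """One minus cosine similarity (assumed definition)."""
    return 1.0 - cosine_similarity(a, b)

def angular_distance(a, b):
    """Angle between the vectors divided by pi (assumed definition)."""
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    sim = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(sim) / math.pi

# Toy binary feature vectors, not real URIEL+ data.
lang_a = [1.0, 0.0, 1.0, 1.0]
lang_b = [1.0, 1.0, 0.0, 1.0]

print("cosine: ", cosine_distance(lang_a, lang_b))
print("angular:", angular_distance(lang_a, lang_b))
```

Both metrics rank language pairs in the same order, since each is a monotonic function of cosine similarity; they differ only in how the values are scaled.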

Retrieving Loaded Features Examples

Retrieving A Loaded Feature:

```
u.get_{vector_type}_{feature_type}_array()
```

Replace {vector_type} with phylogeny, typological, or geography.
Replace {feature_type} with features, languages, data, or sources.
Example:
```
u.get_typological_languages_array()
```
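
The getter names follow directly from the two placeholders, so the full set can be listed by filling in the template. The sketch below only builds the method names as strings from the values listed above; it does not import urielplus, and it assumes every combination of vector type and feature type is available.

```python
from itertools import product

# Placeholder values as listed in this README.
VECTOR_TYPES = ("phylogeny", "typological", "geography")
FEATURE_TYPES = ("features", "languages", "data", "sources")

def getter_names():
    """Fill the get_{vector_type}_{feature_type}_array template for every pair."""
    return [
        f"get_{vector_type}_{feature_type}_array"
        for vector_type, feature_type in product(VECTOR_TYPES, FEATURE_TYPES)
    ]

names = getter_names()
for name in names:
    print(f"u.{name}()")
```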

Database Integration Examples

Integrating One Database:
```
u.integrate_{database}()
```

Integrating Some Databases:

```
u.integrate_custom_databases({databases})
```

Integrating All Databases:
```
u.integrate_databases()
```
Set Language Codes to Glottocodes:
```
u.set_glottocodes()
```
Reset all changes:
```
u.reset()
```
Import (and replace all existing) data from a custom CSV file:
```
u.import_csv({file_path}, {index})
```
Replace {database} with saphon, bdproto, grambank, apics, or ewave.
Replace {databases} with arguments "UPDATED_SAPHON", "BDPROTO", "GRAMBANK", "APICS", and/or "EWAVE" (e.g., "UPDATED_SAPHON", "BDPROTO", "EWAVE").
Replace {index} with 0 for genetic data, 1 for typological data, or 2 for geographic data.
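
As an illustration of how the `integrate_{database}` template expands, the sketch below calls the per-database methods by name for a chosen subset. `integrate_selected` and `StubURIELPlus` are hypothetical helpers written for this example (the stub only records calls so the snippet runs without urielplus installed); with a real `URIELPlus` object, `integrate_custom_databases` is the built-in way to integrate several databases.

```python
# Database names as listed in this README.
DATABASES = ("saphon", "bdproto", "grambank", "apics", "ewave")

def integrate_selected(u, databases):
    """Call u.integrate_<name>() for each requested database, in order."""
    unknown = [name for name in databases if name not in DATABASES]
    if unknown:
        raise ValueError(f"Unknown database(s): {unknown}")
    for name in databases:
        getattr(u, f"integrate_{name}")()

class StubURIELPlus:
    """Hypothetical stand-in that records calls instead of loading data."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, attr):
        # Only reached for attributes that are not defined normally.
        if attr.startswith("integrate_"):
            return lambda: self.calls.append(attr)
        raise AttributeError(attr)

stub = StubURIELPlus()
integrate_selected(stub, ["grambank", "ewave"])
print(stub.calls)  # ['integrate_grambank', 'integrate_ewave']
```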

Imputation Examples

Aggregate Typological Data:

```
u.set_aggregation({aggregation})
u.aggregate()
```

Impute Missing Values:
```
u.{imputation_strategy}_imputation()
```
Replace {aggregation} with 'U' (union) or 'A' (average).
Replace {imputation_strategy} with midaspy, knn, softimpute, or mean.
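
The difference between the two aggregation options, and the simplest imputation strategy, can be sketched on toy data. This is an illustration under stated assumptions, not URIEL+'s implementation: features are taken to be binary, missing values are represented as `None`, union is read as "1 if any source reports 1", and average as the mean of the reported values.

```python
def aggregate(values, method):
    """Combine one feature's values across sources; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # no source reports this feature
    if method == "U":
        return max(present)  # union: 1 if any source reports 1
    if method == "A":
        return sum(present) / len(present)  # average of reported values
    raise ValueError(f"Unknown aggregation method: {method!r}")

def mean_impute(column):
    """Replace missing entries in one feature column with the column mean."""
    present = [v for v in column if v is not None]
    if not present:
        return list(column)  # nothing to estimate the mean from
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

# One feature of one language as reported by three hypothetical sources.
sources = [1, 0, None]
print(aggregate(sources, "U"))  # 1
print(aggregate(sources, "A"))  # 0.5

# One feature across four hypothetical languages, one value missing.
print(mean_impute([1.0, None, 0.0, 1.0]))
```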

Language Distance Calculation Examples

Calculate a Specific Distance:

```
print(u.new_distance({distance_type}, {languages}))
```

Calculate Distance Using Specific Features:

```
print(u.new_custom_distance({features}, {languages}, {source}))
```

Retrieve Language Vectors:

```
u.get_vector({distance_type}, {languages})
```

View URIEL+ Feature Coverage:
```
u.feature_coverage()
```

Calculate Confidence Scores for Distances:
```
print(u.confidence_score({language 1}, {language 2}, {distance_type}))
```

Replace {distance_type} with a distance type (e.g., "featural") or a list of distance types (e.g., ["syntactic", "phonological"]). When retrieving language vectors, {distance_type} must be a single distance type.
Replace {features} with a list of features (e.g., ["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]).
Replace {languages}, {language 1}, and {language 2} with language codes (e.g., "stan1293", "hind1269").
Replace {source} with one database (e.g., "WALS") or all databases ('A').
Note: the default {source} is all databases.
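
Conceptually, a distance over specific features restricts each language vector to the chosen feature columns before comparing them. The sketch below shows that idea on toy data. The feature list, the placeholder language keys, the vectors, and the use of angular distance (angle divided by pi) are all assumptions for illustration; it does not call urielplus and is not a description of how `new_custom_distance` is implemented.

```python
import math

# Toy feature names and vectors, not real URIEL+ data.
FEATURE_NAMES = ["S_SVO", "S_SOV", "P_NASAL_VOWELS", "F_Germanic"]
VECTORS = {
    "lang_a": [1.0, 0.0, 0.0, 1.0],
    "lang_b": [0.0, 1.0, 1.0, 0.0],
    "lang_c": [1.0, 0.0, 1.0, 1.0],
}

def select(vector, features):
    """Keep only the entries of `vector` that belong to the named features."""
    index = {name: i for i, name in enumerate(FEATURE_NAMES)}
    return [vector[index[name]] for name in features]

def angular_distance(a, b):
    """Angle between two vectors divided by pi (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("Distance is undefined for an all-zero vector")
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    return math.acos(max(-1.0, min(1.0, dot / norm))) / math.pi

def custom_distance(features, lang_1, lang_2):
    """Distance between two languages using only the selected features."""
    a = select(VECTORS[lang_1], features)
    b = select(VECTORS[lang_2], features)
    return angular_distance(a, b)

# Word-order features only: lang_a and lang_b share none of them.
print(custom_distance(["S_SVO", "S_SOV"], "lang_a", "lang_b"))
# All features: lang_a and lang_c differ in a single feature.
print(custom_distance(FEATURE_NAMES, "lang_a", "lang_c"))
```

Restricting the feature set can change which languages look close: here `lang_a` and `lang_c` are identical on the word-order and family features and differ only once the phonological feature is included.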

