URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
If you are interested for more information, check out our full paper.
If you use this code for your research, please cite the following work:
@inproceedings{khan-etal-2025-uriel,
title = "{URIEL}+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base",
author = {Khan, Aditya and
Shipton, Mason and
Anugraha, David and
Duan, Kaiyao and
Hoang, Phuong H. and
Khiu, Eric and
Do{\u{g}}ru{\"o}z, A. Seza and
Lee, En-Shiun Annie},
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.463/",
pages = "6937--6952",
abstract = "URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies."
}
If you have any questions, you can open a GitHub Issue or send us an email.
Check out ExploRIEL, our online UI for URIEL+: https://uriel-leelab.streamlit.app/
- Environment
- Setup Instruction
- Configuration Options Examples
- Retrieving Loaded Features Examples
- Database Integration Examples
- Imputation Examples
- Language Distance Calculation Examples
Python 3.10 or later. If you're working with the MIDASpy
extra dependencies, version of Python must be less than 3.11. Details of dependencies are in setup.py
. NOTE: There are known issues with the MIDASpy extra dependencies. Please use between Python 3.10 and Python 3.11 for the time being.
-
To get started with URIEL+:
pip install urielplus
from urielplus import urielplus u = urielplus.URIELPlus()
-
URIEL+ offers various configurations that you can adjust:
- Caching: Enable or disable caching (True or False).
- Aggregation Method: Choose the method for aggregating data across sources ('U' for unweighted, 'A' for weighted).
- Fill Missing Data: Decide whether to fill missing data using parent language data (True or False).
- Distance Metric: Specify the distance metric to be used ("angular" or "cosine").
-
Changing A Configuration:
u.set_{configuration}({option})
-
Checking A Configuration:
u.get_{configuration}({option})
-
Replace
{configuration}
withcache
,aggregation
,fill_with_base_lang
, ordistance_metric
. -
Replace
{option}
with your desired value for the selected configuration. -
Note: the default configurations are
cache=False
,aggregation='U'
,fill_with_base_lang=True
, anddistance_metric="angular"
.
-
Retrieving A Loaded Feature:
u.get_{vector_type}_{feature_type}_array()
-
Replace
{vector_type}
withphylogeny
,typological
, orgeography
. -
Replace
{feature_type}
withfeatures
,languages
,data
, orsources
. -
Example:
u.get_typological_languages_array()
-
Integrating One Database:
u.integrate_{database}()
-
Integrating Some Databases:
u.integrate_custom_databases({databases})
-
Integrating All Databases:
u.integrate_databases()
-
Set Language Codes to Glottocodes:
u.set_glottocodes()
-
Reset all changes:
u.reset()
-
Import (and replace all existing) data from a custom CSV file:
u.import_csv({file_path}, {index})
-
Replace
{database}
withsaphon
,bdproto
,grambank
,apics
, orewave
. -
Replace
{databases}
with arguments"UPDATED_SAPHON"
,"BDPROTO"
,"GRAMBANK"
,"APICS"
, and/or"EWAVE"
(e.g.,"UPDATED_SAPHON"
,"BDPROTO"
,"EWAVE"
). -
Replace
{index}
with0
for genetic data,1
for typological data, or2
for geographic data.
-
Aggregate Typological Data:
u.set_aggregation({aggregation}) u.aggregate()
-
Impute Missing Values:
u.{imputation_strategy}_imputation()
-
Replace
{aggregation}
with'U'
(union) or'A'
(average). -
Replace
{imputation_strategy}
withmidaspy
,knn
,softimpute
, ormean
.
-
Calculate a Specific Distance:
print(u.new_distance({distance_type}, {languages}))
-
Calculate Distance Using Specific Features:
print(u.new_custom_distance({features}, {languages}, {source}))
-
Retrieve Language Vectors:
u.get_vector({distance_type}, {languages})
-
View URIEL+ Feature Coverage:
u.feature_coverage()
-
Calculate Confidence Scores for Distances
print(u.confidence_score({language 1}, {language 2}, {distance_type}))
-
Replace
{distance_type}
with a distance type (e.g.,"featural"
) or a list (e.g.,["syntactic"
,"phonological"]
). Must be single distance type for retrieving language vectors. -
Replace
{features}
with a list of features (e.g.,["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]
). -
Replace
{languages}
,{language 1}
, and{language 2}
with language codes (e.g.,"stan1293"
,"hind1269"
). -
Replace
{source}
with one database (e.g.,"WALS"
) or all databases ('A'
). -
Note: the default
{source}
is all databases.