This repo contains a synonyms file covering all countries in all languages, for analyzers that accept the Solr synonym format. It can be used to configure a synonym token filter that explicitly maps country names in various languages to their English names.
The country data was gathered from country-list.
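To illustrate what an explicit mapping in Solr synonym format looks like, here is a minimal Ruby sketch that parses a few such lines into a lookup table. The sample lines and the `parse_synonyms` helper are hypothetical and only for illustration; they are not copied from the synonyms file in this repo.

```ruby
# Hypothetical sample lines in Solr synonym format (not taken from the repo's file).
SAMPLE = <<~SYNONYMS
  # Lines starting with "#" are comments.
  Deutschland, Allemagne, Alemania => Germany
  Frankreich, Francia => France
SYNONYMS

# Parse explicit mappings ("a, b => c") into a lookup table keyed by the
# lowercased source term.
def parse_synonyms(text)
  table = {}
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")
    left, right = line.split("=>", 2).map(&:strip)
    next if right.nil? # skip lines that are not explicit mappings
    left.split(",").map(&:strip).each { |term| table[term.downcase] = right }
  end
  table
end

TABLE = parse_synonyms(SAMPLE)
puts TABLE["allemagne"] # prints "Germany"
```

Everything to the left of `=>` is rewritten to the term on the right, which is how a name in another language ends up as the English name at analysis time.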
If you use Elasticsearch, you can define a synonym token filter like this:
"filter" : {
    "countries_synonyms" : {
        "type" : "synonym",
        "synonyms_path" : "countries_synonyms.txt"
    }
}
Then use countries_synonyms in any custom analyzer. You can find more information about the Synonym Token Filter in the documentation.
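As a rough sketch of how the filter could be wired into a custom analyzer, the Ruby snippet below builds index settings and prints them as JSON. The analyzer name `countries_analyzer` and the choice of the `standard` tokenizer with a `lowercase` filter are assumptions made for illustration; adjust them to your own analysis chain.

```ruby
require "json"

# "countries_analyzer" is a hypothetical name chosen for this example.
# "standard" (tokenizer) and "lowercase" (filter) are built-in Elasticsearch
# components; using them here is an assumption, not a requirement.
SETTINGS = {
  "settings" => {
    "analysis" => {
      "filter" => {
        "countries_synonyms" => {
          "type" => "synonym",
          "synonyms_path" => "countries_synonyms.txt"
        }
      },
      "analyzer" => {
        "countries_analyzer" => {
          "type" => "custom",
          "tokenizer" => "standard",
          "filter" => ["lowercase", "countries_synonyms"]
        }
      }
    }
  }
}

# Print the JSON body that could be sent when creating an index.
puts JSON.pretty_generate(SETTINGS)
```

Elasticsearch resolves `synonyms_path` relative to its config directory, so the synonyms file has to be placed there on every node.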
If you need a target language other than English, you can generate a synonyms file yourself:
Download countries data
wget https://github.com/umpirsky/country-list/archive/master.zip
Extract it
unzip -e master.zip
Check all available languages
ls country-list-master/data
Run the generator with a language option. Here is an example for Russian:
ruby main.rb ru_RU
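The idea behind generating a file for another target language can be sketched as follows. This is not the code in main.rb; it is a simplified illustration that works on a small hypothetical in-memory data set (`NAMES`) instead of the downloaded country-list files, and `synonym_lines` is a made-up helper name.

```ruby
# Hypothetical in-memory data: country code => name, one hash per locale.
NAMES = {
  "en"    => { "DE" => "Germany", "FR" => "France" },
  "ru_RU" => { "DE" => "Германия", "FR" => "Франция" },
  "fr"    => { "DE" => "Allemagne", "FR" => "France" }
}

# Build one "a, b => target" line per country code, mapping the names from
# every other locale to the name in the target locale.
def synonym_lines(names, target)
  names.fetch(target).map do |code, target_name|
    sources = names.reject { |locale, _| locale == target }
                   .map { |_, by_code| by_code[code] }
                   .compact.uniq - [target_name]
    next if sources.empty?
    "#{sources.join(', ')} => #{target_name}"
  end.compact
end

LINES = synonym_lines(NAMES, "ru_RU")
puts LINES
```

Duplicate names (such as "France" in both English and French here) are collapsed so that each source term appears once per line.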