DiCountries introduction

Note

This repository on github is just a task for job candidates. It's not for production and it's not exactly the version we use.

DiCountries introduction

About

Country name normalization for DataIntelligence project

Author: Koptev, Roman (roman.koptev@softwareone.com)

Quick start

Installation

To use the module from this repository you need to have a working python [1] installation (version >= 3.6.0) with pre-installed pip [2] package installer. You can run the python and pip executables using python and pip commands, or, depending on your setup, you may have to run them like python3 and pip3 or using other aliases. The recommended way to install these scripts is to execute commands like:

% python -m pip install -U pip
% python -m pip install -U dicountries

To access the azure artifact feed you need to set up your pip to use appropriate indices. For example you can set PIP_EXTRA_INDEX_URL environment variable like:

% export PIP_EXTRA_INDEX_URL="https://di_libraries:<access token>@pkgs.dev.azure.com/swodataintelligence/71b5c973-6f2c-42b7-a0a9-8af59f1bf7ee/_packaging/di_libraries_test/pypi/simple/"

or you can add a file pip.ini (Windows) or pip.conf (Mac/Linux) to your virtualenv with the content like this:

[global]
extra-index-url=https://di_libraries:<access token>@pkgs.dev.azure.com/swodataintelligence/71b5c973-6f2c-42b7-a0a9-8af59f1bf7ee/_packaging/di_libraries_test/pypi/simple/

The template <access token> should be substituted with your valid access token that should have Packaging Read access rights. You can create this token in your AzureDevOps account.

Usage

A simple usage example:

from dicountries.whoosh_index import CountryIndex

country_index = CountryIndex()
country_index.refresh()

print(country_index.normalize_country('Russia'))
print(country_index.normalize_country('Korea, Republic of'))
print(country_index.refine_country('Korea, Republic of'))

The expected output will have lines like these in the end:

Russian Federation
Korea, Republic of
Republic of Korea

The method :py:meth:`dicountries.whoosh_index.CountryIndex.normalize_country` returns a normalized country name if possible (looking in country indexes and then using fuzzy search if had not found), otherwise it returns the country name from the incoming parameter.

This function will try to return the country name accordingly to the ISO 3166 [3] standard, but if a substitution for this name is determined in the file post_process_country_mapping.json in the package's data directory that substitution will be returned.

You can force to not use substitutions from the post_process_country_mapping.json using the input parameter postprocess=False:

print(country_index.normalize_country('Russia', postprocess=False))

The method :py:meth:`dicountries.whoosh_index.CountryIndex.refine_country` will return the same value as the :py:meth:`dicountries.whoosh_index.CountryIndex.normalize_country`, but if there is a comma [,] in the returned name it will recombine the name so that the part after the comma will precede the part before the comma. The comma will be deleted.

You can also do this transformation on any string using function :py:func:`dicountries.utils.reorder_name` from the :py:mod:`dicountries.utils` module.

Every time you run this script it will create a subdirectory indexes in the current working directory to backup indexes there. You can pass the index directory explicitly to the :py:class:`dicountries.whoosh_index.CountryIndex` constructor like this:

country_index = CountryIndex(index_path="<Your index directory>")

If you don't want the index being rebuilt every time the script is running just omit the line:

country_index.refresh()

Without this line the index will be rebuilt only if it doesn't exist, otherwise it will be read from the index directory (it's faster).

If you want the index to be updated as a background process or you want to have :py:mod:`asyncio` integration you can pass the parameter use_async=True to the :py:class:`dicountries.whoosh_index.CountryIndex` constructor. Also there is an async function for index refreshing:

await country_index.refresh_async()

The search process is normally optimized and uses a cache. You can control the size of the cache using the max_search_cache parameter, e.g.:

country_index = CountryIndex(max_search_cache=1000)

During the normalization the search process usually checks the cache first. If some country isn't found in the cache more complicated techniques will be used. Every found country is placed to the simple cache, but if the cache reaches max_search_cache size it will be cleared and the search process will be reinitialized.

[1]	https://www.python.org/

[2]	https://pypi.org/project/pip/

[3]	https://en.wikipedia.org/wiki/ISO_3166

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
deploy		deploy
dicountries		dicountries
docs		docs
requirements		requirements
tests		tests
.bandit.cfg		.bandit.cfg
.gitignore		.gitignore
.vale.ini		.vale.ini
README.rst		README.rst
noxfile.py		noxfile.py
setup.cfg		setup.cfg
setup.py		setup.py
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DiCountries introduction

About

Quick start

Installation

Usage

About

Releases

Packages

Languages

romikforest/runtime-common-library-dicountries

Folders and files

Latest commit

History

Repository files navigation

DiCountries introduction

About

Quick start

Installation

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages