Cache char ASCII equivalence to speed up transliteration #12

Open
wants to merge 2 commits into master
Conversation

dukebody

Trying to make the transliteration process faster without C compilation, I've come up with something good enough.

Following the existing section caching, I've added another dict cache that stores per-character ASCII equivalences.
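
Roughly, the idea looks like the sketch below. This is only my illustration built on top of the public API (the actual change presumably lives inside the library's own decoding loop), and names like `_char_cache` and `cached_unidecode` are made up for the example:

```python
from unidecode import unidecode

# Hypothetical per-character cache, shown on top of the public API for illustration.
_char_cache = {}

def cached_unidecode(string):
    """Transliterate `string` to ASCII, memoizing each non-ASCII char's result."""
    out = []
    for char in string:
        if ord(char) < 0x80:
            # Plain ASCII passes through unchanged, no lookup needed.
            out.append(char)
            continue
        repl = _char_cache.get(char)
        if repl is None:
            # First time we see this char: do the normal table lookup and cache it.
            repl = unidecode(char)
            _char_cache[char] = repl
        out.append(repl)
    return ''.join(out)
```

Any character that has been seen before then costs only a dict lookup instead of the full section-table lookup, which is where the speedup comes from.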

Speed tests

Basic case

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 4.09 usec per loop

New code with char caching:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 2.41 usec per loop

This represents roughly a 40% reduction in per-call time (4.09 usec down to 2.41 usec, about 1.7x faster).

Best-case scenario (all identical ASCII chars)

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 1.05 usec per loop

New caching code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 0.848 usec per loop

Bad-case scenario (random ASCII letters)

Old code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.9 usec per loop

New code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.2 usec per loop

Considerations

  1. I know the bad-case scenario I presented is not really the worst: the true worst case would be transliterating different characters on every call. That's harder to test, but also extremely unlikely in practice. I believe most uses will transliterate a reduced set of characters, so the char cache will kick in and make things faster. Bigger speedups probably correspond to transliterating non-ASCII chars. For example, transliterating "áéíóúñ", the most common non-ASCII chars in Spanish, takes less than 1 usec on average with caching and almost 3 usec without it (roughly a 3x speedup).
  2. The character cache takes memory, and I haven't conducted memory tests because I don't know how; a rough way to estimate the footprint is sketched after this list. However, I believe that (a) a {char → char} dictionary shouldn't take much memory, and (b) most practical uses will transliterate a reduced set of characters (from a reduced set of languages), so the cache shouldn't grow too big.
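
For reference, here is a crude way one could estimate the cache footprint. This is my own back-of-the-envelope check with `sys.getsizeof`, not part of the patch, and it only counts the shallow sizes of the dict, its keys, and its values:

```python
import sys
from unidecode import unidecode

# Build a small {char: replacement} cache for the chars mentioned above and
# estimate its memory footprint (shallow sizes only, so this is a rough figure).
chars = "áéíóúñŀ¢"
cache = {c: unidecode(c) for c in chars}
approx_bytes = sys.getsizeof(cache) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in cache.items()
)
print(f"{len(cache)} entries, roughly {approx_bytes} bytes")
```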
