trying with .rst...
mircealungu committed Jul 10, 2020
1 parent b445726 commit ca71a78
Showing 3 changed files with 90 additions and 4 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -1,4 +1,5 @@
-Statistics about word frequency in different languages based on a corpus of
+
+Statistics about word frequencies in different languages based on a corpus of
 movie subtitles as extracted by the Frequency Words (https://github.com/hermitdave/FrequencyWords) project.

 Currently supported languages:
85 changes: 85 additions & 0 deletions README.rst
@@ -0,0 +1,85 @@
Statistics about word frequencies in different languages based on a
corpus of movie subtitles as extracted by the `Frequency Words`_
project.

Currently supported languages:

::

    "da", "de", "el", "en", "es", "fr", "it", "nl", "no", "pl", "pt", "ro", "zh-CN"

Usage Examples
~~~~~~~~~~~~~~

Getting the info about a given word
'''''''''''''''''''''''''''''''''''

::

    >> from wordstats import Word
    >> print(Word.stats('bleu', 'fr'))
    bleu: (lang: fr, rank: 1521, freq: 9.42, imp: 9.42, diff: 0.03, klevel: 2)

Comparing the difficulty of two German words
''''''''''''''''''''''''''''''''''''''''''''

::

    >> from wordstats import Word
    >> Word.stats('blauzungekrankenheit', 'de').difficulty > Word.stats('blau', 'de').difficulty
    True

Top 10 most used words in Dutch
'''''''''''''''''''''''''''''''

::

    >> from wordstats import LanguageInfo
    >> Dutch = LanguageInfo.load('nl')
    >> print(Dutch.all_words()[:10])
    ['ik', 'je', 'het', 'de', 'dat', 'is', 'een', 'niet', 'en', 'van']

Words common across all the languages
'''''''''''''''''''''''''''''''''''''

Because the corpus is based on subtitles, some common names have
slipped in. The ``common_words()`` function returns the words shared by
all languages as a list.

::

    >> from wordstats.common_words import common_words
    >> for each in common_words():
    >>     if len(each) > 9:
    >>         print(each)
    washington
    christopher
    enterprise

Words that are the same in Polish and Romanian
''''''''''''''''''''''''''''''''''''''''''''''

::

    >> from wordstats import LanguageInfo
    >> from wordstats.common_words import common_words
    >> Polish = LanguageInfo.load("pl")
    >> Romanian = LanguageInfo.load("ro")
    >> for each in Polish.all_words():
    >>     if each in Romanian.all_words():
    >>         if len(each) > 5 and each not in common_words():
    >>             print(each)
    telefon
    moment
    prezent
    interes
    ...
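
The nested loop above tests membership against lists on every iteration, which is linear in the list length each time. A minimal sketch of the same filter using sets is shown below. ``shared_words`` is a hypothetical helper, not part of wordstats, and it assumes that ``all_words()`` and ``common_words()`` return plain lists of strings, as the examples above suggest. The toy word lists are made up for illustration.

```python
def shared_words(words_a, words_b, excluded=(), min_length=6):
    """Return the words of words_a that also occur in words_b.

    Hypothetical helper (not part of wordstats). The order of words_a is
    preserved; words shorter than min_length or listed in excluded are dropped.
    """
    in_b = set(words_b)   # constant-time membership tests
    skip = set(excluded)
    return [
        word
        for word in words_a
        if len(word) >= min_length and word in in_b and word not in skip
    ]


# Toy data standing in for Polish.all_words() and Romanian.all_words().
polish = ["telefon", "moment", "kot", "washington", "prezent"]
romanian = ["moment", "telefon", "prezent", "washington", "pisica"]

print(shared_words(polish, romanian, excluded=["washington"]))
# ['telefon', 'moment', 'prezent']
```

With wordstats installed, the call would presumably be ``shared_words(Polish.all_words(), Romanian.all_words(), common_words())``. The default ``min_length=6`` mirrors the ``len(each) > 5`` test above.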

Installation
~~~~~~~~~~~~

::

    pip install wordstats

.

.. _Frequency Words: https://github.com/hermitdave/FrequencyWords
6 changes: 3 additions & 3 deletions setup.py
@@ -22,21 +22,21 @@ def package_files(directory):

 extra_files = package_files('wordstats/language_data/')

-with open('README.md') as f:
+with open('README.rst') as f:
     long_description = f.read()

 setuptools.setup(
     name="wordstats",
     packages=setuptools.find_packages(),
-    version="1.0.3",
+    version="1.0.4",
     license="MIT",
     description="Multilingual word frequency statistics for Python based on subtitles corpora",
     long_description=long_description,
     long_description_content_type='text/markdown',
     author="Mircea Lungu",
     author_email="me@mir.lu",
     url="https://github.com/zeeguu-ecosystem/Python-Wordstats",
-    download_url="https://github.com/zeeguu-ecosystem/Python-Wordstats/archive/v_1.0.3.tar.gz",
+    download_url="https://github.com/zeeguu-ecosystem/Python-Wordstats/archive/v_1.0.4.tar.gz",
     include_package_data=True,
     zip_safe=False,
     keywords="natural language processing, multilingual",
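
The hunk above now reads `README.rst`, while `long_description_content_type` is still `'text/markdown'`. PyPI renders the long description according to that field, and the value for reStructuredText is `text/x-rst`. One way to keep the two in step is to derive the content type from the file extension. The sketch below is not part of this commit, and the helper name `long_description_content_type_for` is hypothetical.

```python
from pathlib import Path

# Content types PyPI accepts for a package's long description.
CONTENT_TYPES = {
    ".rst": "text/x-rst",
    ".md": "text/markdown",
    ".txt": "text/plain",
}


def long_description_content_type_for(readme_path):
    """Return the content type matching the README's file extension.

    Hypothetical helper, not part of setuptools or of this repository.
    """
    suffix = Path(readme_path).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"unsupported README extension: {suffix!r}")
    return CONTENT_TYPES[suffix]


print(long_description_content_type_for("README.rst"))
# text/x-rst
```

In `setup.py`, the result could then be passed as `long_description_content_type=long_description_content_type_for('README.rst')`.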