How to store the data #1

Open
goodmami opened this issue Feb 22, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@goodmami
Member

We need to decide on a good way to store the gwadoc data, but it's not yet clear what the intended uses are, or who the intended users are, beyond generating the HTML documentation.
The current (not checked-in) data is a Python file that fills dictionaries with data. If generating documentation is the only use, we may as well put it directly into reStructuredText. If we want a Python API, e.g., to request the localized name, definition, reverse, etc. from OMW, then it might make sense to make Python classes (Sphinx's autodoc could then possibly be used to generate the docs).

In either case we could store the data in a data file and transform it (perhaps with validation) into the target representation. I propose using TOML. Even though it is relatively new and not in the standard library, it was chosen for Rust's package manager and for the future of Python packaging (see PEP 518), so it is backed by major projects.

Here's what (part of) hypernym would look like:

[hypernym]

  [hypernym.name]
    en = "Hypernym"
    symbol = ""
    ja = "上位語"

  [hypernym.def]
    en = "a word that is more general than a given word"
    pl = "Relacja łącząca znaczenie z drugim, ogólniejszym, niż to pierwsze, ale należącym do tej samej części mowy, co ono"
    ja = "当該synsetが相手synsetに包含される"

TOML allows some flexibility in layout (though less than YAML, which is a good thing). Something like this would be equivalent, e.g., if you want to group all attributes by language:

[hypernym]
name.en = "Hypernym"
def.en = "a word that is more general than a given word"
# etc...

And while I would like to place this file (gwadoc.toml or whatever) at the top level so it's more prominent for non-Python users/contributors, that would make it much more difficult to distribute with the project and for the Python code to find at runtime. So it might go under gwadoc/gwadoc.toml instead.
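
As a rough sketch of why keeping the file inside the package helps: the code can then locate it relative to the module itself instead of guessing where the repository root is. The helper name data_file is hypothetical, and this assumes the file is actually included when the package is built and installed.

```python
from pathlib import Path


def data_file(module_file: str, name: str = "gwadoc.toml") -> Path:
    """Return the path of a data file that sits next to the given module.

    `module_file` would normally be the module's own `__file__`.
    """
    return Path(module_file).resolve().parent / name


# Inside gwadoc/__init__.py this would be called as data_file(__file__).
example = data_file("gwadoc/__init__.py")
```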

As an alternative, if we don't care much about non-Python users, we could make a Python class like Relation and do things like this:

rels['hypernym'] = Relation(
    name={
        "en": "Hypernym",
        "ja": "上位語",
    },
    definition={  # "def" is a reserved word in Python, so it cannot be a keyword argument
        "en": "a word that is more general than a given word",
    }
)

Then query it like this:

>>> hypernym = rels['hypernym']
>>> hypernym.name['en']
'Hypernym'
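
A minimal sketch of what such a Relation class might look like, assuming a dataclass with one dictionary per attribute. The field name definition is used because def is a reserved word in Python, and the localized_name helper with its English fallback is hypothetical, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Relation:
    # Each attribute maps a language code to the localized text.
    name: Dict[str, str] = field(default_factory=dict)
    definition: Dict[str, str] = field(default_factory=dict)

    def localized_name(self, lang: str, fallback: str = "en") -> str:
        """Return the name in `lang`, or in the fallback language if missing."""
        if lang in self.name:
            return self.name[lang]
        return self.name[fallback]


rels = {}
rels["hypernym"] = Relation(
    name={"en": "Hypernym", "ja": "上位語"},
    definition={"en": "a word that is more general than a given word"},
)

hypernym = rels["hypernym"]
```

The fallback is one possible policy; raising an error for a missing language would be just as reasonable if silent fallbacks are not wanted.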
@fcbond fcbond added the enhancement New feature or request label Feb 22, 2019
@fcbond
Member

fcbond commented Feb 22, 2019

I think we might leave it as a Python dictionary for the moment, and concentrate on using and extending it.

Converting to TOML looks like it may make it easier to edit down the road.

@goodmami
Member Author

For now I've settled on data structures that behave like both dictionaries and classes, in that they allow key lookup (e.g. rels['hypernym']['name']['en']) as well as dot access (rels.hypernym.name.en). The former is useful when you have the relation or property name in a variable and prefer rels[relation] over getattr(rels, relation), while the latter is much simpler and makes editing the file easier. I also made the data structures raise errors on invalid keys/attributes and defined inventories of valid relations, forms, projects, languages, etc., in order to reduce errors caused by simple typos.
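
To illustrate the idea, here is a simplified sketch of a structure that supports both access styles and rejects names outside a fixed inventory. It is not the actual gwadoc code: the class name StrictNamespace, the inventories, and the field name df (standing in for "def", which cannot be used with dot access because it is a reserved word) are all assumptions made for the example.

```python
# Hypothetical inventories; the real lists are not shown in this thread.
RELATIONS = frozenset({"hypernym", "hyponym"})
FIELDS = frozenset({"name", "df"})
LANGUAGES = frozenset({"en", "ja", "pl"})


class StrictNamespace:
    """Mapping with dot access that only accepts keys from an inventory."""

    def __init__(self, allowed, factory=None):
        # Bypass our own __setattr__, which treats every name as a data key.
        object.__setattr__(self, "_allowed", frozenset(allowed))
        object.__setattr__(self, "_factory", factory)
        object.__setattr__(self, "_data", {})

    def __getitem__(self, key):
        if key not in self._allowed:
            raise KeyError(f"invalid key: {key!r}")
        if key not in self._data:
            if self._factory is None:
                raise KeyError(f"no value set for: {key!r}")
            # Create nested levels lazily on first access.
            self._data[key] = self._factory()
        return self._data[key]

    def __setitem__(self, key, value):
        if key not in self._allowed:
            raise KeyError(f"invalid key: {key!r}")
        self._data[key] = value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(str(exc)) from exc

    def __setattr__(self, name, value):
        try:
            self[name] = value
        except KeyError as exc:
            raise AttributeError(str(exc)) from exc


def make_relations():
    """Build the three levels: relation -> field -> language."""
    return StrictNamespace(
        RELATIONS,
        lambda: StrictNamespace(FIELDS, lambda: StrictNamespace(LANGUAGES)),
    )


rels = make_relations()
rels.hypernym.name.en = "Hypernym"          # dot access
rels["hypernym"]["name"]["ja"] = "上位語"    # key lookup
```

With this sketch a typo such as rels.hypernm raises AttributeError and rels['hypernm'] raises KeyError, rather than silently creating a new entry.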

I'll leave this issue open as a feature request for future versions.
