No part of speech tags for German non-lemma entries #205

Open
justinsilvestre opened this issue Feb 11, 2025 · 2 comments


@justinsilvestre

An example: The entry for 2nd-person singular "willst" points to the lemma (infinitive) "wollen" in Wiktionary: https://en.wiktionary.org/wiki/willst#German

However, "wollen" has a homonym "wollen", an adjective meaning "woolen".

In Wiktionary it is clear that this "wollen" is unrelated, as "willst" is marked as a verb there. But in the kty-de-en dictionary, the part of speech of "willst" is only marked implicitly in the "deinflection" rules. So when performing a lookup on the word "wollen" as found in the "willst" entry, there is no easy way to narrow down the search to include only verbs.

If explicit part of speech markers were present in these non-lemma entries (like the definition tags in lemma entries), it would be much easier to recover their parts of speech.
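To make the ambiguity concrete, here is a minimal sketch using simplified, hypothetical entry shapes (the `pos` field, the gloss strings, and the `lookup` helper are assumptions for illustration, not the actual kty-de-en or Yomitan data layout). Without a part of speech on the non-lemma entry, following "willst" to "wollen" returns both homonyms; with one, the adjective can be filtered out.

```python
# Hypothetical, simplified entries; NOT the real Yomitan term bank layout.
LEMMAS = [
    {"term": "wollen", "pos": "verb", "gloss": "to want"},
    {"term": "wollen", "pos": "adjective", "gloss": "woolen"},
]

# Non-lemma entries map an inflected form to its lemma and inflection rules.
# The "pos" key is the explicit marker proposed in this issue (an assumption).
NON_LEMMAS = {
    "willst": {
        "lemma": "wollen",
        "rules": ["second-person", "singular", "present"],
        "pos": "verb",
    },
}


def lookup(form, use_pos=False):
    """Resolve an inflected form to the glosses of its lemma candidates."""
    entry = NON_LEMMAS[form]
    candidates = [e for e in LEMMAS if e["term"] == entry["lemma"]]
    if use_pos:
        # Keep only lemmas whose part of speech matches the non-lemma entry.
        candidates = [e for e in candidates if e["pos"] == entry["pos"]]
    return [e["gloss"] for e in candidates]


print(lookup("willst"))                # ['to want', 'woolen']
print(lookup("willst", use_pos=True))  # ['to want']
```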

@StefanVukovic99
Collaborator

StefanVukovic99 commented Feb 11, 2025

If I'm understanding this right, this is similar to yomidevs/yomitan#1509 and yomidevs/yomitan#1507, i.e. it's about making the dictionary deinflection format more precise.

I think this is a worthwhile goal, but it would first require changing:

  1. the dictionary schema in yomitan to allow extra data (reading, tags) for dict deinflections
  2. the logic for looking them up

{
    "type": "array",
    "description": "Deinflection of the term to an uninflected term.",
    "minItems": 2,
    "maxItems": 2,
    "items": [
        {
            "type": "string",
            "description": "The uninflected term."
        },
        {
            "type": "array",
            "description": "A chain of inflection rules that produced the inflected term",
            "items": {
                "type": "string",
                "description": "A single inflection rule."
            }
        }
    ]
}

Maybe the way to do it would be by adding more items to this array?
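One way to picture "adding more items to this array" is sketched below. This is only an illustration of the idea, not the Yomitan schema or its validator: the optional third item (a list of tag strings), the tag name `"v"`, and the function `is_valid_deinflection` are all assumptions. The real schema would also need its `maxItems` raised for such a form to validate.

```python
def is_valid_deinflection(value):
    """Accept the 2-item form quoted above, plus a hypothetical 3rd item.

    Item 0: the uninflected term (string).
    Item 1: the chain of inflection rules (list of strings).
    Item 2 (assumed extension): extra tags such as part of speech.
    """
    if not isinstance(value, list) or len(value) not in (2, 3):
        return False
    term, rules = value[0], value[1]
    if not isinstance(term, str):
        return False
    if not isinstance(rules, list) or not all(isinstance(r, str) for r in rules):
        return False
    if len(value) == 3:
        tags = value[2]
        return isinstance(tags, list) and all(isinstance(t, str) for t in tags)
    return True


# The form the schema allows today.
current = ["wollen", ["second-person", "singular"]]
# A hypothetical extended form carrying a part-of-speech tag.
extended = ["wollen", ["second-person", "singular"], ["v"]]

print(is_valid_deinflection(current))   # True
print(is_valid_deinflection(extended))  # True
```

Keeping the third item optional would let existing dictionaries remain valid, at the cost of the lookup code having to handle both shapes.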

@justinsilvestre
Author

Thanks for the links + reply. I suppose changing the schema would indeed be necessary if the idea is to allow one term entry to have definitions corresponding to terms of varying parts of speech.

From my uninformed perspective, though, it would seem more sensible to consider homonyms/homographs having different parts of speech as different lemmas altogether, and to have their definitions listed in separate term entries. If that is indeed how it works with yomitan dictionaries, I don't see why we can't use the "rules" field to mark the part of speech of these inflected term entries, described in the term banks schema as "String of space-separated rule identifiers for the definition which is used to validate deinflection. An empty string should be used for words which aren't inflected.". I guess this field may originally have been envisioned as just for use in lemma entries, but in the yomitan repo test dictionary data, as well as kty-de-en, it appears that this field is entirely unused in non-lemma entries.

In any event, yes, the lookup code would still need some changes. Depending on how these changes are implemented, putting this part of speech data in this rules field could be a good idea because, as a simple string field, it could work well as an indexed field. Assuming again that 1 entry = 1 part of speech, search results could easily be narrowed down by checking if the rules field matches a string query, rather than requiring that all results have their entire definitions field loaded/analyzed.
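As a rough sketch of that idea, assuming one entry per part of speech and a non-empty "rules" string on non-lemma entries: the row layout, the rule identifiers `"v"` and `"adj"`, and the helpers `build_index` and `find` are hypothetical, and an in-memory dict stands in for whatever database index the lookup code would actually use.

```python
from collections import defaultdict

# Hypothetical rows: (term, rules, definitions). The "rules" value is a
# space-separated string, as in the term bank schema description.
ROWS = [
    ("wollen", "v", ["to want"]),
    ("wollen", "adj", ["woolen"]),
    # Non-lemma entry: its definition is a deinflection pointing at "wollen".
    ("willst", "v", [["wollen", ["second-person", "singular"]]]),
]


def build_index(rows):
    """Index row ids by (term, rule) so lookups never touch definitions."""
    index = defaultdict(list)
    for row_id, (term, rules, _definitions) in enumerate(rows):
        for rule in rules.split():
            index[(term, rule)].append(row_id)
    return index


INDEX = build_index(ROWS)


def find(term, rule):
    """Return only the rows for `term` whose rules field contains `rule`."""
    return [ROWS[i] for i in INDEX.get((term, rule), [])]


inflected = find("willst", "v")[0]
lemma = inflected[2][0][0]  # "wollen", taken from the deinflection
print(find(lemma, "v"))     # [('wollen', 'v', ['to want'])]
```

The adjective row is never loaded, because the filter runs on the indexed key and not on the contents of the definitions.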
