No part of speech tags for German non-lemma entries #205

justinsilvestre · 2025-02-11T21:15:06Z

An example: The entry for 2nd-person singular "willst" points to the lemma (infinitive) "wollen" in Wiktionary: https://en.wiktionary.org/wiki/willst#German

However, "wollen" has a homonym "wollen", an adjective meaning "woolen".

In Wiktionary it is clear that this "wollen" is unrelated, as "willst" is marked as a verb there. But in the kty-de-en dictionary, the part of speech of "willst" is only marked implicitly in the "deinflection" rules. So when performing a lookup on the word "wollen" as found in the "willst" entry, there is no easy way to narrow down the search to include only verbs.

If explicit part of speech markers were present in these non-lemma entries (like the definition tags in lemma entries), it would be much easier to recover their parts of speech.
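To make the ambiguity concrete, here is a minimal Python sketch. The entries are simplified, hypothetical stand-ins for term bank rows (the tag strings and glosses are illustrative, not copied from kty-de-en), and `lookup` is a made-up helper rather than yomitan's actual lookup code.

```python
# Simplified, hypothetical entries: (term, definition_tags, definitions).
# A non-lemma entry stores a deinflection pair [lemma, [inflection rules]]
# instead of a gloss, and carries no part-of-speech tag of its own.
ENTRIES = [
    ("wollen", "v", ["to want"]),
    ("wollen", "adj", ["woolen"]),
    ("willst", "", [["wollen", ["second-person", "singular"]]]),
]


def lookup(term):
    """Return every entry whose headword matches `term`."""
    return [entry for entry in ENTRIES if entry[0] == term]


# Follow the deinflection stored in the "willst" entry.
lemma, inflection_rules = lookup("willst")[0][2][0]
candidates = lookup(lemma)

# Both homonyms come back; nothing in the "willst" entry says "verb".
candidate_tags = [tags for _, tags, _ in candidates]
print(candidate_tags)
```

With data shaped like this, the only way to discard the adjective is to interpret the inflection rules themselves, which is the implicit marking described above.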

StefanVukovic99 · 2025-02-11T21:23:08Z

If I'm understanding this right, it's similar to yomidevs/yomitan#1509 and yomidevs/yomitan#1507, i.e. it's about making the dictionary deinflection format more precise.

I think this is a worthwhile goal, but it would first require changing:

- the dictionary schema in yomitan to allow extra data (reading, tags) for dict deinflections
- the logic for looking them up

{
    "type": "array",
    "description": "Deinflection of the term to an uninflected term.",
    "minItems": 2,
    "maxItems": 2,
    "items": [
        {
            "type": "string",
            "description": "The uninflected term."
        },
        {
            "type": "array",
            "description": "A chain of inflection rules that produced the inflected term",
            "items": {
                "type": "string",
                "description": "A single inflection rule."
            }
        }
    ]
}

Maybe the way to do it would be by adding more items to this array?
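As a rough sketch of what that could look like on the consuming side, assuming a hypothetical optional third item that holds part-of-speech tags (this is not part of the current schema, and `parse_deinflection` is an invented name):

```python
def parse_deinflection(item):
    """Split a deinflection array into (lemma, inflection_rules, pos_tags).

    The third item is a hypothetical extension; today's schema allows
    exactly two items, so `pos_tags` falls back to an empty list.
    """
    if not isinstance(item, list) or not 2 <= len(item) <= 3:
        raise ValueError("expected [lemma, rules] or [lemma, rules, tags]")
    lemma, inflection_rules = item[0], item[1]
    pos_tags = item[2] if len(item) == 3 else []
    return lemma, inflection_rules, pos_tags


# A two-item array in the current format still parses.
current = parse_deinflection(["wollen", ["second-person", "singular"]])
# A three-item array carries the extra tags.
extended = parse_deinflection(["wollen", ["second-person", "singular"], ["v"]])
```

Keeping the extra item optional would let existing dictionaries stay valid, though the schema's `maxItems` would have to be raised.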

justinsilvestre · 2025-02-12T09:56:24Z

Thanks for the links + reply. I suppose changing the schema would indeed be necessary if the idea is to allow one term entry to have definitions corresponding to terms of varying parts of speech.

From my uninformed perspective, though, it would seem more sensible to consider homonyms/homographs having different parts of speech as different lemmas altogether, and to have their definitions listed in separate term entries. If that is indeed how it works with yomitan dictionaries, I don't see why we can't use the "rules" field to mark the part of speech of these inflected term entries, described in the term banks schema as "String of space-separated rule identifiers for the definition which is used to validate deinflection. An empty string should be used for words which aren't inflected.". I guess this field may originally have been envisioned as just for use in lemma entries, but in the yomitan repo test dictionary data, as well as kty-de-en, it appears that this field is entirely unused in non-lemma entries.

In any event, yes, the lookup code would still need some changes. Depending on how these changes are implemented, putting this part of speech data in this rules field could be a good idea because, as a simple string field, it could work well as an indexed field. Assuming again that 1 entry = 1 part of speech, search results could easily be narrowed down by checking if the rules field matches a string query, rather than requiring that all results have their entire definitions field loaded/analyzed.
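A minimal sketch of that narrowing, assuming 1 entry = 1 part of speech and that non-lemma entries carry a rule identifier. The rows, the identifiers "v" and "adj", and the `lookup` helper are all hypothetical, not yomitan's actual storage or API:

```python
# Hypothetical rows: (term, rules, definitions). "v" and "adj" are
# illustrative rule identifiers; the non-lemma row carries "v" as proposed.
ROWS = [
    ("wollen", "v", ["to want"]),
    ("wollen", "adj", ["woolen"]),
    ("willst", "v", [["wollen", ["second-person", "singular"]]]),
]


def lookup(term, required_rule=None):
    """Match on headword, optionally requiring a rule identifier."""
    return [
        row for row in ROWS
        if row[0] == term
        and (required_rule is None or required_rule in row[1].split())
    ]


inflected = lookup("willst")[0]
lemma = inflected[2][0][0]
# Reuse the inflected entry's rules to narrow the lemma lookup.
matches = [m for rule in inflected[1].split() for m in lookup(lemma, rule)]
```

Here the filter only touches the space-separated rules string, so the definitions of the rejected adjective entry never need to be inspected.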
