Many alternate headwords #10

Open
funderburkjim opened this issue Jul 9, 2015 · 7 comments

@funderburkjim
Contributor

In checking a potential correction for PW, I noticed a feature from this example:

<H1>100{srotaISa}1{*srotaISa}¦ ‹und› #{srotaHpati} •m. {%das Meer.%} PW131930

The feature is that srotaHpati is an additional headword, presented in the text as an alternate to srotaISa.

The pattern ¦ ‹und› occurs in 2975 cases, and, from a brief examination, appears usually to indicate an alternate headword.

I'm not sure how specifically to handle these cases. But in a more perfect coding of PW, these alternate headwords would be accessible as headwords, and I wanted to mention this here as the subject of some future enhancement to PW.
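
As a rough illustration of how such cases might be picked out, here is a minimal sketch in Python. The line layout it assumes (`{key1}1{key2}¦ ‹und› #{alternate}`) is inferred only from the single example above, and the names `ALT_RE` and `find_alternate` are hypothetical, not part of any existing PW tooling.

```python
import re

# Assumed layout, inferred from the one example in this issue:
#   <H1>100{key1}1{key2}¦ ‹und› #{alternate} ...
ALT_RE = re.compile(
    r"\{(?P<key1>[^}]*)\}1\{(?P<key2>[^}]*)\}¦ ‹und› #\{(?P<alt>[^}]*)\}"
)

def find_alternate(line):
    """Return (headword, alternate) if the line has the '¦ ‹und› #{...}' pattern, else None."""
    m = ALT_RE.search(line)
    if m is None:
        return None
    return m.group("key1"), m.group("alt")

sample = "<H1>100{srotaISa}1{*srotaISa}¦ ‹und› #{srotaHpati} •m. {%das Meer.%} PW131930"
print(find_alternate(sample))  # ('srotaISa', 'srotaHpati')
```

Lines where other material sits between `¦` and `‹und›` would not be caught by this pattern.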

@gasyoun
Member

gasyoun commented Jul 11, 2015

Interesting observation. I've seen these before, and if I have any say in it, I would love to see them handled sooner rather than later. I also wonder whether these 2975 cases match entries in MW or lie beyond his lexicon's reach. In the .xml file (http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/downloads/pwxml.zip) the marker is <noti>und</noti>, but that gives 9105 cases.
There are good ones, like

<H1><h><key1>hrAduni</key1><key2>hrAdu/ni</key2></h><body><noti>und</noti> <s>hrAdu/nI</s>

There are harder cases, that are not covered with Jim's regex, but still should be counted:

<H1><h><key1>hOtrakalpadruma</key1><key2>hOtrakalpadruma</key2></h><body><gram n="n">n.</gram> <noti>und</noti> <s>hOtrasUtra</s>

And false positives that do not relate to headwords:

<H1><h><key1>hvArya</key1><key2>(hvArya)</key2></h><body><s>hvAria/</s> <gram n="Adj">Adj.</gram> <i>colubrinus</i> <noti>oder</noti> <i>geschmeidig , sich durchwindend</i> , <gram n="m">m.</gram> <noti>angeblich</noti> <i>Ross</i> <noti>und</noti> <i>Schlange.</i>

What I wanted to show with these samples is that one or two other tags can occur between <body> and the <noti>und</noti> sequence.
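
To make the three cases concrete, here is a hedged Python sketch run against the samples above. It assumes that the only tags allowed between `<body>` and `<noti>und</noti>` are up to two `<gram>` elements (the only kind seen in these samples) and that the alternate headword follows directly in an `<s>` element. The names `XML_ALT_RE` and `xml_alternate` are made up for the illustration.

```python
import re

# Assumption: at most two <gram> elements may sit between <body> and
# <noti>und</noti>, and the alternate headword follows in <s>...</s>.
XML_ALT_RE = re.compile(
    r"<key1>(?P<key1>[^<]*)</key1>.*?<body>"
    r"(?:<gram\b[^>]*>[^<]*</gram> ){0,2}"
    r"<noti>und</noti> <s>(?P<alt>[^<]*)</s>"
)

def xml_alternate(line):
    """Return (key1, alternate in key2 form) for a pw.xml line, or None."""
    m = XML_ALT_RE.search(line)
    if m is None:
        return None
    return m.group("key1"), m.group("alt")

good = '<H1><h><key1>hrAduni</key1><key2>hrAdu/ni</key2></h><body><noti>und</noti> <s>hrAdu/nI</s>'
harder = ('<H1><h><key1>hOtrakalpadruma</key1><key2>hOtrakalpadruma</key2></h>'
          '<body><gram n="n">n.</gram> <noti>und</noti> <s>hOtrasUtra</s>')
false_positive = ('<H1><h><key1>hvArya</key1><key2>(hvArya)</key2></h><body><s>hvAria/</s> '
                  '<gram n="Adj">Adj.</gram> <i>colubrinus</i> <noti>oder</noti> '
                  '<i>geschmeidig , sich durchwindend</i> , <gram n="m">m.</gram> '
                  '<noti>angeblich</noti> <i>Ross</i> <noti>und</noti> <i>Schlange.</i>')

for line in (good, harder, false_positive):
    print(xml_alternate(line))
```

Because the pattern is tied to the start of `<body>`, the hvArya line is rejected: there the `und` joins two glosses rather than introducing a headword.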

@funderburkjim
Contributor Author

  • There are 1153 cases that are similar to hOtrakalpadruma, in that there is a gender specifier:
¦ •[mfn][.] ‹und› #
  • In cases like hrAduni, the alternate headword is fully spelled out as hrAdu/nI in a 'key2' form, from which the key1 hrAdunI is readily derived.
  • In some cases, the headword is only partially given, and so key1 would be harder to derive:
{agnipraveSa}1{agnipraveSa}¦ •m. ‹und› #{°praveSana} •n. 
  where the full alternate headword is agnipraveSana
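
The three observations above could be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the optional gender specifier is matched as `•[mfn]\. `, the key1 form is obtained by deleting only the `/` accent mark, and a partial headword marked with `°` is completed by overlapping it with the end of the main headword. The helper names are hypothetical.

```python
import re

# Optional gender specifier (•m. / •f. / •n.) between ¦ and ‹und›.
ALT_RE = re.compile(
    r"\{(?P<key1>[^}]*)\}1\{[^}]*\}¦ (?:•[mfn]\. )?‹und› #\{(?P<alt>[^}]*)\}"
)

def key1_from_key2(key2):
    """Assumption: dropping the '/' accent mark is enough to get key1."""
    return key2.replace("/", "")

def expand_partial(headword, partial, min_overlap=2):
    """Complete '°praveSana' against 'agnipraveSa' via the longest overlap.

    Heuristic only: returns None when no overlap of at least
    min_overlap characters is found.
    """
    stem = partial.lstrip("°")
    for i in range(len(headword) - min_overlap + 1):
        if stem.startswith(headword[i:]):
            return headword[:i] + stem
    return None

def alternate_key1(line):
    """Return the key1 form of the alternate headword, or None."""
    m = ALT_RE.search(line)
    if m is None:
        return None
    alt = m.group("alt")
    if alt.startswith("°"):
        return expand_partial(m.group("key1"), alt)
    return key1_from_key2(alt)

print(alternate_key1("{agnipraveSa}1{agnipraveSa}¦ •m. ‹und› #{°praveSana} •n. "))
print(key1_from_key2("hrAdu/nI"))
```

The overlap heuristic will fail for partial headwords that share no letters with the main headword; those would need manual review.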

@gasyoun
Member

gasyoun commented Nov 14, 2015

Would love to see the list. This, plus prefix-root verbs in PW and PWK, is what I need most right now for my Reverse Dictionary. And AS replaced by Unicode in etymologies once and for all. That's all I ask for.

@funderburkjim
Contributor Author

It seems that at least a first pass at a partial list of alternate headwords for PW could be derived programmatically based on the observations made above. It would be a matter of applying a regex to either pw.txt or pw.xml.

To perfect this list would doubtless involve numerous revisions, whose details cannot be estimated ahead of time.

@gasyoun
Member

gasyoun commented Nov 17, 2015

Please apply the regex voodoo. I will review the result files to see where it fails.

@gasyoun
Member

gasyoun commented May 5, 2017

these alternate headwords would be accessible as headwords

Is this still a million years away, @funderburkjim?

@Andhrabharati

To perfect this list would doubtless involve numerous revisions, whose details cannot be estimated ahead of time.

@funderburkjim

With my v.2 data (which has all the 'grouped' words identified and marked) incorporated into the cdsl file, this issue can be closed.

And this doesn't involve numerous revisions, but just a single one!
