AE alternate headword patterns #19

Open
drdhaval2785 opened this issue Apr 23, 2017 · 66 comments

@drdhaval2785
Contributor

drdhaval2785 commented Apr 23, 2017

  1. Key2 has two headwords e.g. {@Afire, Aflame,@}

  2. {@-ment@} etc. English suffixes.

Both need to be examined in detail. English suffix application must have been a trodden path. Need to do a review of the literature.

@gasyoun
Member

gasyoun commented Apr 23, 2017

English suffix application must have been a trodden path

Please explain what you want to get.

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 24, 2017 via email

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 24, 2017

Three kinds of regexes in ae.xml gave me what I wanted (regex, then an example match, then its role):
`<b>-` matches `<b>-ed.</b>` [Headword separator]
`<i>-([^<]*)</i>` matches `<i>-prep.</i>` [Meaning sense separator]
`<b>([0-9]+)</b>` matches `<b>2</b>` [Synonym separator]

advance
Advance, v. i. प्र-उत्-चल् 1 P, प्र-अभि-या 2 P, अभिमुखं या, or -गम् व्रज् 1 P, प्रगम्, प्रस्था 1 A; शत्रोरभिमुखं याति ‘a. s against the enemy.’ 
		2 वृध् 1 A, प्र-उप- -चि pass. उन्नतिं-उत्कर्षं-या; ‘a. ed some steps to receive’ सादरमभिमुखदत्तावि- -रलपदः (Ka. 95); ‘a. ed in age’ परिणत- -वयस्, वयोवृद्ध, प्रवयस्; ‘a. ed in know-ledge’ ज्ञानवृद्ध. 
	v. t. उच्चल् c., पुरस्कृ 8 U, अग्रे नी 1 P. 
		2 वृध् c., परिपुष् 4 P 
		3 उपक्षिप् 6 P, उपन्यस् 4 P, पुरो निधा 3 U. 
		4 प्राक् दा 
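
As a rough illustration of how the three patterns listed in this comment behave, here is a minimal Python sketch. The sample string is made up in the style of ae.xml (it is not copied from the file), and the constant names are only illustrative.

```python
import re

# Hypothetical fragment written in the style of ae.xml; not real data.
sample = (
    "<b>Above,</b> <i>adv.</i> upari <b>2</b> Urdhvam "
    "<i>-prep.</i> upari <b>-board,</b> <i>adv.</i> prakASam"
)

# The three patterns from the comment above.
HEADWORD_SEP = re.compile(r"<b>-")              # e.g. <b>-ed.</b>
SENSE_SEP = re.compile(r"<i>-([^<]*)</i>")      # e.g. <i>-prep.</i>
SYNONYM_SEP = re.compile(r"<b>([0-9]+)</b>")    # e.g. <b>2</b>

subheadword_starts = [m.start() for m in HEADWORD_SEP.finditer(sample)]
sense_labels = SENSE_SEP.findall(sample)
synonym_numbers = SYNONYM_SEP.findall(sample)

print(subheadword_starts, sense_labels, synonym_numbers)
```

Note that `<b>-` does not match `<b>Above,</b>` or `<b>2</b>`, so the numbered senses and the main headword are not confused with subheadwords.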

@drdhaval2785
Contributor Author

Also see one more peculiar tendency of the dictionary which we may have to parse.

carve
Carve, v. t. कॄ 6 P, तक्ष् 1, 5 P, उत्कॄ, (उत्कीर्णा इव वासयष्टिषु V. 3); उल्लिख 6 P. 
		2 कृत् 6 P, छिट् 7 P, कॢप् c., विभज् 1 U; ‘c. out (a plan)’ युज् 10; प्रकॢप् c., प्रतियुज् 10; चिंत् 10, निरूप् 10 
-er, s. तक्षकः, कारुः, त्वष्टृ m., तट्ट् m-, 
		2 परिकल्पकः.

Here c. out (a plan) stands for carve out (a plan).
The dictionary writes only the first letter of the preceding headword / subheadword, and we have to expand it when reading. In those days this may have been necessary to save money / paper. Now we can afford to have carve out which is what was intended by the author.

@drdhaval2785
Contributor Author

@funderburkjim and @gasyoun
Can you have a look at the structure of ae.txt and ae.xml and jot down such peculiarities?
Once that is stabilized, I can start some implementation to extract subheadwords or to separate senses / synonyms.

@gasyoun
Member

gasyoun commented Apr 24, 2017

Excite -ment
Excitement
Excite -ed
Excited

Understood.

Now we can afford to have carve out which is what was intended by the author.

Agree, and again: add a tag stating what it was in the book. Or make it c[arve] out instead of c. out, or just carve out. The first would be the most precise; one would know exactly what to look for in the book afterwards.
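
The expansion discussed above (plain carve out, or the bracketed c[arve] out) could be sketched roughly as follows. This is only an illustration under assumptions: the function name `expand_abbrev` is hypothetical, the abbreviation is taken to be the headword's first letter plus a period, and only text inside ‘…’ is touched. Forms such as ‘a. ed’ (advanced), where a suffix follows the abbreviation, are not handled.

```python
import re

# Text between typographic single quotes, as used in AE examples.
QUOTED = re.compile(r"‘([^’]*)’")

def expand_abbrev(text, headword, bracketed=False):
    """Expand 'c.' to the headword, but only inside ‘...’ spans."""
    first, rest = headword[0].lower(), headword[1:].lower()
    full = f"{first}[{rest}]" if bracketed else first + rest
    # The initial letter as a standalone token, followed by a period and a space.
    abbrev = re.compile(rf"(?<![A-Za-z]){re.escape(first)}\.(?=\s)")

    def fix_span(match):
        return "‘" + abbrev.sub(lambda _m: full, match.group(1)) + "’"

    return QUOTED.sub(fix_span, text)

print(expand_abbrev("‘c. out (a plan)’ yuj 10", "Carve"))
print(expand_abbrev("‘c. out (a plan)’ yuj 10", "Carve", bracketed=True))
```

The grammatical label `c.` (causal) outside the quotes is left alone, which is the main reason for restricting the substitution to quoted spans.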

@gasyoun
Member

gasyoun commented Apr 26, 2017

I opened akzara and I see in Apte Practical Sanskrit-English Dictionary, revised edition, 1957

-BAj a. having a share in the syllables (of a prayer ?). 
-BUmikA tablet; nyastAkzarAmakzaraBUmikAyAm R. 18. 46. 
-muKaH [akzarARi tanmayAni SAstrARi vA muKe yasya] a scholar, student. 
-Kam [za. ta.] the beginning of the alphabet; the letter a. 
-muzwikA ‘finger-speech’, speaking by means of finger-signs. 
-varjita a. unlettered, illiterate, not knowing how to read or write. a. of an epithet of paramAtman. 
-vyaktiH f. [za. ta.] distinct articulation of syllables. 
-SikzA [za. ta.] the science of (mystic) syllables; theory of brahma (brahmatattva); mahyaM °kzAM viDAya Dk. 11. 
-saMsTAnam [akzarARAM saMsTAnaM yatra] arrangement of letters, writing, alphabet.

As I understand it, the work done for MW and partly for PWG has not been done here and compounds have not been formed, right? In hwnorm1c I see that akzaraBUmikA:akzaraBUmikA:PD belongs only to PD, but it could be in Apte as well. And there is no akzarasaMsTAnam at all.

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 26, 2017 via email

@gasyoun
Member

gasyoun commented Apr 26, 2017

misposted AP issue in AE issue.

All equal, right @funderburkjim ?

@funderburkjim
Contributor

Good observation on AP, but agree that it needs reposting under an AP issue.

The AP question is similar to AE question in one important aspect: solution requires more advanced parsing of the entry, and addition of tags.

@funderburkjim
Contributor

In working a lot with AE corrections with Sampada, I've also felt the need for 'improving' Apte's presentation.

@funderburkjim
Contributor

single quote pattern

Another pattern is material in opening and closing single quotes, like

‘c. out (a plan)’   NOTE: these are not the apostrophe character.

Within such a context, the X+period pattern usually means some abbreviation.

The first thing I would do is to look for errors in matching open-closed single quotes; these would need to be cleaned up before parsing this pattern.
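
A first pass at the quote check suggested above might look like the sketch below. It assumes, as a simplification, that a quoted phrase never spans two lines and that quotes do not nest; the function name is hypothetical.

```python
def unbalanced_quote_lines(lines):
    """Return 1-based numbers of lines whose ‘ and ’ do not pair up."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        depth = 0
        bad = False
        for ch in line:
            if ch == "‘":
                depth += 1
            elif ch == "’":
                depth -= 1
            # A close before any open, or a nested open, is suspicious.
            if depth < 0 or depth > 1:
                bad = True
        if bad or depth != 0:
            problems.append(lineno)
    return problems

sample_lines = [
    "‘c. out (a plan)’ yuj 10",
    "‘a. ed in age pariRata",
    "no quotes here",
    "stray’ close",
]
print(unbalanced_quote_lines(sample_lines))
```

Lines flagged this way would still need a human look, since a ’ used as an apostrophe would also be reported.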

@funderburkjim
Contributor

do we need pyparsing ?

It might be that we need to learn how to use more sophisticated parsers to help us in such tasks. In particular, pyparsing. It seems likely, at least in case of AE, that each entry could be parsed into a meaningful data structure; but this can't be done just with regular expressions. pyparsing (and other similar parsers, such as 'ply') have a learning curve.

@funderburkjim
Contributor

A vote in favor of sub headwords in AE

@Shalu411 recently submitted via the Correction Form the following:

[L=3099] [p= 117], hw=distinct
Not a typo. But the head word "Distinct" has become a part of "Distinguish" 
and it is not appearing when searched for. Please mark it as a separate head word.

Adding such 'embedded' headwords as 'distinct' to the searchable words for AE would solve this problem.

This note just made to emphasize the interest and importance of doing such an enhancement.

Incidentally, Sampada is now past the half-way point in the page-by-page corrections to AE (reference), with 3500+ corrections made thus far.

image

@gasyoun
Member

gasyoun commented Aug 22, 2017

Adding such 'embedded' headwords as 'distinct' to the searchable words for AE would solve this problem.

The subheadword issue. The one I'm longing for the most + upasarga dhatu combinations extracted.

3500+ corrections

Means we can expect 7000 corrections. Sampada is a hero.

@Shalu411

Shalu411 commented Sep 1, 2017

Namaste
Happy to write back after so long. Would like to join in. Please let me know the needful task expected from me. Could not follow the whole thing though, will soon join in the flow. Hope the team is doing great.
Abhinandanani to Sampada.

@gasyoun
Member

gasyoun commented Sep 1, 2017

Please let me know the needful task expected from me.

Some verification, @funderburkjim ?

@funderburkjim
Contributor

funderburkjim commented Sep 2, 2017

@Shalu411 Hi!

We'll have to think together how to organize things so you can help. This comment is to get us started.

We have two kinds of extra headwords in AE:

  • Alternate headwords like {@Afire, Aflame,@}. 'Afire' is the primary headword; 'Aflame' is the
    alternate.
  • Subheadwords like {@-ment@} under headword 'excite'.
    I think Dhaval has done some preliminary work here.

The Alternate headwords are probably easiest to start with.

Suggested first task

Here's a good starter task, I think.
Here is a link to the list of raw alternate headwords, 217 cases in all.

In a few cases, the alternate is spelled incompletely, like 'Amid,-st'. In such cases, the task is to
expand the abbreviated form. Do this just by adding an extra field to the line. Here's an example:
Before edit:

013:Amid,-st,:2209,2214

After edit:

013:Amid,-st,:2209,2214:Amidst

So the first task is just to do this for all the cases where it is relevant to do so.

@Shalu411 OK?
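
For anyone processing the list by program, the before/after lines above can be read as colon-separated fields. The sketch below is only an assumption about the layout based on the two example lines: the field names are hypothetical, and the meaning of the two numbers is not stated in this thread.

```python
from typing import NamedTuple, Optional, Tuple

class AltRecord(NamedTuple):
    page: str                    # e.g. "013"
    raw: str                     # raw headword text, e.g. "Amid,-st,"
    numbers: Tuple[int, ...]     # e.g. (2209, 2214); meaning assumed unknown here
    expanded: Optional[str]      # the extra field added by the editor, if any

def parse_alt_line(line):
    parts = line.rstrip("\n").split(":")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected number of fields: {line!r}")
    page, raw, nums = parts[:3]
    numbers = tuple(int(n) for n in nums.split(","))
    expanded = parts[3] if len(parts) == 4 else None
    return AltRecord(page, raw, numbers, expanded)

before = parse_alt_line("013:Amid,-st,:2209,2214")
after = parse_alt_line("013:Amid,-st,:2209,2214:Amidst")
print(before.expanded, after.expanded)
```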

@gasyoun
Member

gasyoun commented Sep 4, 2017

@funderburkjim I'm acting as translator here. So if a word is misspelled (or not glued together properly), she adds : and the correct form after it, and that's all? Only English words, right?

@funderburkjim
Contributor

Yes -- Just need to have the correct spelling for alternate headwords. Just English words.

@gasyoun
Member

gasyoun commented Sep 6, 2017

correct spelling for alternate headwords. Just English words.

Understood, @Shalu411 ?

@Shalu411

Shalu411 commented Sep 8, 2017

Namaste, sorry for my silence for a while. Yes, I understand things clearly now. Will do the needful, and will be here for any clarifications. Yes, it's a simple and quick one. Happy to be back, really. :)

@Shalu411

Shalu411 commented Sep 9, 2017

First issue-

037:BliTe, BliTesome,:5513,5517

Capital letters.
I shall make them both lower case.

And blite is a very Old English word :) Keep it?

After Edit-

037:BliTe, BliTesome,:5513,5517:Blite, Blitesome

@Shalu411

Shalu411 commented Sep 9, 2017

Next issue-
057:Chirk, (1):8313,8314

What is (1) here?

@Shalu411

Shalu411 commented Sep 9, 2017

I see that in this list, there are words that are not related to same head-word as well.
021:Askance, Askew, Aslant,:3292,3295
112:Discussive, Discutient,:15663,15665
302:Nacre, Naker,:40521,40522

Ignore, once I see that all spelling is right?

@Shalu411

Shalu411 commented Sep 9, 2017

Next doubt-
070:Concentric,-cal,:10149,10150:Concentric,Concentrical,
In the pre-edit stage, all the words have commas (even when there is none after). Shall I put comma in the end in the edited-stage? Or just no comma at the end?
070:Concentric,-cal,:10149,10150:Concentric,Concentrical

@SergeA
Collaborator

SergeA commented Sep 9, 2017

037:BliTe, BliTesome,:5513,5517

Why does capital T stand for th in BliTe, BliTesome? Is it some bug of processing?
The headword is Blithe, Blithesome

@Shalu411

Shalu411 commented Sep 9, 2017

Is it some bug of processing?
May be yes.. may be SLP effect.
Well, let me correct it.

@Shalu411

Shalu411 commented Sep 9, 2017

Next issue-
The pre-edit words are separated by comma with space.
So the after-edit words will bear the comma with space? Right?
Or will it bear no effect on the out-put? Perfectly OK either way?

@Shalu411

Shalu411 commented Sep 9, 2017

One issue off-task--
What might be the reason that the scanned image is not shown on the Mozilla browser, when I want to check the image?
It just loads and remains blank.
Was checking this. (http://www.sanskrit-lexicon.uni-koeln.de/scans/AEScan/2014/web/webtc/servepdf.php?page=099)

[L=2697] [p= 099] Demain, Demesne, s. svādhīnā bhūmiḥ.

Same issue with all Cologne pages in my browser. It works fine in Chrome. :)

@SergeA
Collaborator

SergeA commented Sep 9, 2017

A conversion byproduct I guess, so a bug.

So there can be numerous cases affected by this bug of mixing T/th?
Current AE basic interface shows it ok: "Blithe, Blithesome" /// not found: bliTe
http://sanskrit-lexicon.uni-koeln.de/scans/AEScan/2014/web/webtc/indexcaller.php
Then where does it come from?

BTW why AE list interface does not allow to input English words and only Devanagari?
http://sanskrit-lexicon.uni-koeln.de/scans/AEScan/2014/web/webtc1/index.php

@gasyoun
Member

gasyoun commented Sep 10, 2017

So there can be numerous cases affected by this bug of mixing T/th?

Guess so.

BTW why AE list interface does not allow to input English words and only Devanagari?

A good point. None of the English dictionaries allow English words in list mode.

@Shalu411

Hyphens- I would remove for our needs.
Sure. Ok. Done.
closing the tab and reopening may help.
Hmm.. Will try. Thanks.
such a good feeling to see you back.
Same my side. Thanks
Now waiting Shree Jim to speak up on this.. if anything :)

@funderburkjim
Contributor

@Shalu411 Here are my comments on your questions. If I overlooked any, please remind me.

Review your solutions in light of these comments, and make any changes needed.

BliTe make lower case 'T'

It is blithe [p= 037] : Blithe, Blithesome,. (I see Serge noticed this).
The 'BliTe' spelling was probably an error that has since been corrected. The current headword is
correctly spelled 'blithe':
image

Probably the list you are working with was generated by my local version of AE, before BliTe was corrected.

Chirk , (1) What is 1?

Although our digitization has the digit '1', I think it should be the letter 'l'
image
And by using Google's 'define' function ('define chirl'), I see it has a meaning similar to 'define chirk':

chirl:
a trilling or quavering sound. the soothing chirl of doves. verb (transitive) to produce or make a trilling 
or quavering sound. Collins English Dictionary.
chirk: 
& intr.v. chirked, chirk·ing, chirks. To make or become cheerful. Used with up. [Middle English chirken, to chirp, chirrup, 

So in this case the alternate headword must be 'chirl' -- that's my guess.
Please mark the alternate as 'chirl' and precede with a question mark so I'll know it needs special
attention (such as changing that '1' to 'l').

Askance, Askew, Aslant

not related to same head-word as well ?

I think these are adjectives which are loosely related in meaning (check definitions to confirm). Thus,
put two alternates for Askance.

Discussive, Discutient

Again, using Google for definitions, I see that these have similar (medical) meanings. Put Discutient as alternate.

Nacre, Naker

Again, using Google, both words relate to some kind of drum. (nacre has other meanings also). So put Naker as alternate
I see by your later comment that you are also using Google to help with these old words.
In one of the later entries for 'nacre' , I see this from the Free Dictionary:

[French, from Old French nacle, from Old Italian naccaro, drum, nacre, from Arabic naqqāra, small drum, from naqara, to bore, pierce; see nqr in Semitic roots.

So that explains how 'nacre' can also mean drum.

070:Concentric,-cal,:10149,10150:Concentric,Concentrical

Don't put comma at end.

Spaces

So the after-edit words will bear the comma with space? Right?
Or will it bear no effect on the out-put? Perfectly OK either way?

Spaces between words in your after-edit form are optional. OK either way.

Blank scan page 99 in FireFox

Since I show the page just fine using Mozilla browser, I suspect that it was some kind of temporary glitch in the internet transmission that caused you to have a blank page. Try it again. I suspect you will get this page now.

image

212:Hysterics, (Hysteria,)

Leave the paren?
No, remove the paren in the Alternate that you enter. That will make it easier for me to process with
a program.

Do similarly for Neither (nor).

219:Inamorata,-Inamorato,:29750,29751

Leave the '-'? No, drop it. Similarly for the other two you mentioned.

340:Pers, ire,:45552,45559:Perspire

This is a print error. Please mark it with a '?' so I'll remember to give it different handling:
45559:?Perspire. The digitization needs to be corrected in this case -- there is no alternate word.

Also, similarly mark any others that need special handling.

435:Somer-sault, -set,:57776,57777:Somer-sault, Somer-set

Leave the hyphen or drop?
Your solution seems right. Cologne headwords have two forms: a 'raw' form (like Somer-sault) and
a 'lookup' or processed form (somersault) [No capital 'S' and no '-']. We call the 'raw' form 'key2' and
the 'citation' or 'lookup' form 'key1'. For purpose of looking up a form, we use 'key1'.
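
As a very small sketch of the key2-to-key1 relation just described, using only the two rules stated here (no capitals, no hyphen). The function name is hypothetical, and the actual Cologne normalization may well do more than this.

```python
def key1_from_key2(key2):
    """Lookup form from raw form: drop hyphens and lower-case (assumed rules only)."""
    return key2.replace("-", "").lower()

print(key1_from_key2("Somer-sault"))
```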

484:Vantage,-ground,:64418,64420:Vantage,Vantage-ground,

Your solution looks good.
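
The conventions agreed in this comment for the added field (no trailing comma, no parentheses, no leading '-', optional spaces, and a leading '?' for cases needing special handling) could be checked mechanically. The following is a hedged sketch; the function name and the wording of the problem messages are made up for illustration.

```python
def check_alternate_field(field):
    """Return (needs_attention, alternates, problems) for an after-edit field."""
    problems = []
    needs_attention = field.startswith("?")   # '?' marks special handling
    body = field[1:] if needs_attention else field
    if body.rstrip().endswith(","):
        problems.append("trailing comma")
    if "(" in body or ")" in body:
        problems.append("parenthesis")
    # Spaces around the comma are optional, so strip them.
    alternates = [w.strip() for w in body.split(",") if w.strip()]
    for word in alternates:
        if word.startswith("-"):
            problems.append(f"leading hyphen: {word}")
    return needs_attention, alternates, problems

print(check_alternate_field("Concentric,Concentrical"))
print(check_alternate_field("?Perspire"))
print(check_alternate_field("Hysterics, (Hysteria,)"))
```

Hyphens inside a word, as in Somer-sault, are deliberately not reported, since the raw form keeps them.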

@funderburkjim
Contributor

funderburkjim commented Sep 13, 2017

@Shalu411 After you make any adjustments (such as per previous comment), you need to get the
corrected file to me somehow.

The best way might be to add your file as a second file in the Gist that you started from.
That way, the result will be available for others besides me to review if they wish to do so.

To do this,

  • go to the gist link above.
  • click 'Edit'
  • scroll to bottom of screen
  • click add file
  • Enter a 'file name' (such as 'Corrected AE alternate headword list')
  • copy paste your file
  • click Update secret gist.

Give it a try! If you get stuck, we can go to some plan B.

Then post a comment here so I'll know the corrections are ready for me.

@funderburkjim
Contributor

@drdhaval2785

This is a question regarding your 'subheadword' work on AE (e.g. aehw3.txt).

Is this ready for further work? I'm thinking that

  • Your 6548 solutions could be first examined by means of Enchant.
    • If Enchant finds the computed subword, we can say it likely doesn't need further examination
      by a human
  • There will be some (probably much smaller) list of cases that Enchant doesn't confirm.
    Examination of these might be a good next task for @Shalu411 , if she's game.

What do you think? (also, I'm not sure whether aehw3.txt is the file to work further with).
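
The triage proposed above could be sketched as below. To keep the example runnable without any external library, the dictionary check is passed in as a function and a tiny stand-in word list is used; with pyenchant installed, `enchant.Dict("en_US").check` could presumably be supplied instead. The names `triage`, `KNOWN` and `in_known` are hypothetical.

```python
def triage(candidates, is_word):
    """Split computed subheadwords into (confirmed, needs_review)."""
    confirmed, needs_review = [], []
    for word in candidates:
        (confirmed if is_word(word) else needs_review).append(word)
    return confirmed, needs_review

# Stand-in word list, only so that the sketch runs on its own.
KNOWN = {"abounding", "excitement", "excited"}

def in_known(word):
    return word.lower() in KNOWN

confirmed, needs_review = triage(["Abounding", "Aboundly", "Excitement"], in_known)
print(confirmed, needs_review)
```

Only the `needs_review` list would then go to a human reader.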

@funderburkjim
Contributor

BTW why AE list interface does not allow to input English words and only Devanagari?

This has to do with Preferences. For English-Sanskrit dictionaries, set
preferences as follows:

  • Keyboard input: 'Phonetic: SLP1'
  • Input display: 'Same as keyboard input'
  • Server display: whatever form (e.g. Devanagari) you want for the Sanskrit words.

image

image

@drdhaval2785
Contributor Author

Is this ready for further work?

Left it for quite some time. Will need time to figure out where I left off and what remains to be done. Will give the green signal when I myself am confident.

@drdhaval2785
Contributor Author

@funderburkjim

Started to use pyenchant for English dictionaries.
Only 1572 / 6945 unexplained in AEehw3.txt now.

@gasyoun
Member

gasyoun commented Sep 13, 2017

Probably the list you are working with was generated by my local version of AE, before BliTe was corrected.

Seems so

So in this case the alternate headword must be 'chirl' -- that's my guess.

Agree

This has to do with Preferences. For English-Sanskrit dictionaries, set

Too tough even for me

Only 1572 / 6945 unexplained in AEehw3.txt now.

Where can Usha find the 1572 words in question, Dhaval?

@funderburkjim
Contributor

Where can Usha find the 1572 words in question,

By experiment, there are (in aehw3.txt) 1572 instances of '@0'.
For instance Abound@ly@Aboundly@683@0.
So, these must be the ones PyEnchant couldn't account for.

image

From this we see an unexpected phenomenon. The author introduces (under headword abound) the related
word 'abundant'. And there is no doubt that the subsequent '-ly' is meant to apply to 'abundant', thereby
giving 'abundantly'. In other words, the '-sfx' does not always apply to the 'key1' (abound).

Also the aehw3.txt does not mention
'-Abundant' or '-Abundance' under 'abound' . Only '-ing' and '-ly' are mentioned.

Abound@ing@Abounding@681@1
Abound@ly@Aboundly@683@0
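
Judging only from the two lines above, each aehw3.txt record seems to have five '@'-separated fields. The sketch below reads them under that assumption; the field names are guesses (in particular, what the fourth field counts is not stated here), and the last field is taken to be 1 when the computed word was confirmed and 0 otherwise.

```python
def parse_aehw3(line):
    """Split one record into named fields (names are assumptions)."""
    head, suffix, computed, number, flag = line.strip().split("@")
    return {
        "headword": head,
        "suffix": suffix,
        "computed": computed,
        "number": int(number),
        "confirmed": flag == "1",
    }

def unconfirmed(lines):
    """Records ending in @0, i.e. the ones left for manual examination."""
    return [rec for rec in map(parse_aehw3, lines) if not rec["confirmed"]]

records = ["Abound@ing@Abounding@681@1", "Abound@ly@Aboundly@683@0"]
print([rec["computed"] for rec in unconfirmed(records)])
```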

Maybe the fact that '-Abundant' starts with capital letter is significant.
Suggest a listing that includes the -[CAPITAL LETTER] patterns, along with the others, and all in sequence of occurrence.

Maybe the rule for subwords is:

  • a (lower-case) suffix ('-sfx') applies to the current headword within an entry
  • The current headword within an entry is initially the main headword of the entry (e.g. Abound)
  • an (upper-case) suffix ('-Sfx') within an entry does two things:
    • supplies an additional subheadword completely formed (i.e., it is fully spelled. like Abundant)
    • resets the current headword to which subsequent lower-case '-sfx' suffixes are applied.
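
The tentative rule above can be written down as a short sketch. The function name and the order of the markers in the example are assumptions for illustration, and the suffix is simply concatenated, so spelling adjustments (for instance Excite + -ed giving Excited) are not handled.

```python
def expand_subheadwords(headword, markers):
    """Apply the tentative rule: '-sfx' extends the current headword,
    '-Sfx' is a fully spelled subheadword that becomes the new current one."""
    current = headword
    result = []
    for marker in markers:
        body = marker.lstrip("-")
        if body[:1].isupper():
            current = body              # reset the current headword
            result.append(body)
        else:
            result.append(current + body)
    return result

# Hypothetical marker sequence under headword Abound.
print(expand_subheadwords("Abound", ["-ing", "-Abundant", "-ly"]))
```

Under this rule the '-ly' after '-Abundant' yields Abundantly rather than Aboundly.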

@funderburkjim
Contributor

Too tough even for me

Probably indicates a need for change to UI for list displays for English-Sanskrit dictionaries.

My current thinking in regard to the 'Preferences' used in list displays is that it should be completely
replaced in favor of the list-0.2.html display (with dictionary fixed). list-0.2 has provision for choosing 'input=deva', and then it is up to the user to choose how he wants to type unicode Devanagari into the citation field. This might be a drawback to some users who want to type slp1 but see Devanagari.

In regard to English-Sanskrit dictionaries, the list Display preferences is currently really not applicable,
even though it is possible to use the settings in the manner I described above.

Have added this concern to my TODO list. Perhaps it should also be mentioned in a separate 'Cologne' repository issue labeled enhancement.

@gasyoun
Member

gasyoun commented Sep 13, 2017

Maybe the rule for subwords is:

@Shalu411 nothing about it in the Preface?

is that it should be completely
replaced in favor of the list-0.2.html display (with dictionary fixed).

Yes, yes, yes!

drawback to some users who want to type slp1 but see Devanagari.

Jim would be the only one I can imagine who feels comfortable enough with SLP1.

In regard to English-Sanskrit dictionaries, the list Display preferences is currently really not applicable,
even though it is possible to use the settings in the manner I described above.

Yap, next to unusable.

Have added this concern to my TODO list.

Hail to number 35 💯

@drdhaval2785
Contributor Author

rule for subwords

As it turns out, this is a bit deeper than what I thought it would be.
Let me finish regenerating the files on this logic before someone starts looking at it.

@gasyoun
Member

gasyoun commented Sep 14, 2017

Let me finish regenerating the files on this logic before someone starts looking at it.

Thanks, Dhaval.

@drdhaval2785
Contributor Author

@funderburkjim and @gasyoun
The majority of the automatic deductions are now done, and only 1000-odd entries remain for manual examination.
They can be found via the @0$ regex in the AEehw3.txt file.

Jim may like to generate the displays for @Shalu411.

@gasyoun
Member

gasyoun commented Sep 14, 2017

Jim, @Shalu411 is ready and waiting for your instructions.

@funderburkjim
Contributor

ready and waiting

Want to get the first task results before beginning another task (see this comment above).

@funderburkjim
Contributor

@Shalu411 I have not yet received your final corrections on alternate headwords?

@Shalu411

Namaste
Extremely sorry for the gap. There was a net connection problem and then, as usual, some personal issues. Now I am back. Yes, I have read the feedback. Will provide the final file soon.

@Shalu411

Shalu411 commented Sep 24, 2017

Greetings-
All changes done.
Will add the file in a while.. with a smile.. :)

attention (such as changing that '1' to 'l').
Done

put two alternates for Askance.
Done. No change required. It's perfect

Put Discutient as alternate.
Done

put Naker as alternate
Done

Don't put comma at end.
Done

Spaces between words in your after-edit form are optional. OK either way.
Ok.. Then, with space

No, remove the paren in the Alternate that you enter.
Done

Do similarly for Neither (nor).
Done

Leave the '-'? No, drop it.
Done

a 'raw' form (like Somer-sault)
Well, carefully ignored :)

@Shalu411

#19 (comment)
I have tried.. But it shows no edit option. Marcis says, may be I do not have enough rights. Please see to this issue..

@drdhaval2785
Contributor Author

@Shalu411,
GitHub allows uploading files in the issue itself. You can upload the file here by dragging and dropping. I guess you do not have access to Jim's gist.

@Shalu411

Apte-AE-1.txt
Namaste
Then fine. Here is the file.. I request Shree Jim to please let the file reach its correct destination.

@Shalu411

Ting tong.. Everything Ok?

@gasyoun
Member

gasyoun commented Sep 28, 2017

@funderburkjim seems to be busy, but guess it's fine.

@funderburkjim
Contributor

@Shalu411 Yes, I didn't read my email for a couple of days, busy with PW.

In looking at your Apte-AE-1.txt file I notice a problem with this line:
013:Among,-st,:2228,2233

Did you forget to expand the Among line like you did for the preceding line?
013:Amid,-st,:2209,2214:Amidst

Would you correct the Among line, and check other lines for similar omissions?
Thanks, Jim
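
A check for lines like the Among one could be sketched as follows, assuming the colon-separated layout seen in the examples in this thread: a line is reported when its raw headword field contains an abbreviated alternate (a part starting with '-') but no expansion field has been added. The function name is hypothetical.

```python
def unexpanded_lines(lines):
    """Lines with an abbreviated alternate such as ',-st' but no added field."""
    missing = []
    for line in lines:
        parts = line.strip().split(":")
        if len(parts) < 3:
            continue                      # not a record line
        raw = parts[1]
        has_abbrev = any(p.strip().startswith("-") for p in raw.split(",")[1:])
        has_expansion = len(parts) >= 4 and parts[3].strip() != ""
        if has_abbrev and not has_expansion:
            missing.append(line.strip())
    return missing

sample = [
    "013:Amid,-st,:2209,2214:Amidst",
    "013:Among,-st,:2228,2233",
    "302:Nacre, Naker,:40521,40522",
]
print(unexpanded_lines(sample))
```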
