Simple-search: hanumat #167

funderburkjim · 2017-07-31T22:07:18Z

This issue is devoted to the question raised about the handling of 'hanumat' in simple search.
[reference] (#156 (comment))

If in MW I search for hanumat I get nothing (H2) हनू-मत् [p= 1288] : m. &c = हनु-मत्. [L=260538]

2 results: हनूमत् हनुमत्

Strange that हनुमत् is second, because I would think "having (large) jaws" is the most wanted one.

Agree that we should get hanumat as the best choice. We should also get this if user supplied 'hanuman'.

How to accomplish this is not known. Maybe the comments will come up with a solution.

funderburkjim · 2017-07-31T22:11:10Z

Technical reason why word frequency fails here:

Actually, neither spelling (hanumat, hanUmat) appears in the word frequency file:

9 matches for "hanuma" in buffer: word_frequency_adj.txt (case-insensitive match)
  66049:hanuman 2
  66050:hanumant 10   <<<<
  66051:hanumatI 0
  66063:hanUmatI 0
  66064:hanUmata 0
  66065:hanUmantI 0
  66066:hanUmantavana 0
  66067:hanUmanteSvara 0
  66068:hanUmanteSvaratIrTamAhAtmyavarRana 0

The word frequency file uses 'ant' instead of 'at' in the spelling: hanumant.
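
A minimal sketch of how a lookup could bridge the two spellings, assuming the frequency file has been loaded into a dict. `FREQ` holds only counts quoted in the excerpt above, and `stem_variants` / `frequency_for` are hypothetical helpers, not functions from the existing simple search code.

```python
# Counts copied from the word_frequency_adj.txt excerpt above (SLP1 spelling).
FREQ = {"hanuman": 2, "hanumant": 10, "hanumatI": 0}


def stem_variants(headword):
    """Return the headword plus an 'ant' spelling if it ends in 'at'."""
    variants = [headword]
    if headword.endswith("at"):
        # hanumat -> hanumant (the spelling used by the frequency file)
        variants.append(headword[:-2] + "ant")
    return variants


def frequency_for(headword, freq=FREQ):
    """Highest count found under any spelling variant; 0 if none is listed."""
    return max(freq.get(v, 0) for v in stem_variants(headword))


print(frequency_for("hanumat"))  # 10, found via 'hanumant'
print(frequency_for("hanUmat"))  # 0, 'hanUmant' is not in the excerpt
```

With a lookup of this kind the short-u headword would outrank the long-u one on frequency, which is the ordering asked for above.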

gasyoun · 2017-08-01T04:09:04Z

How to accomplish this is not known.

Can't we have a variant with n before endings in t?

gasyoun · 2018-01-22T20:48:01Z

When checking which form it should be, tādṛ́śī or tādṛ́śā:

Entered tadrsi

0 no results found

Entered tadrsa and found

(H2) tādṛ́śa [p= 442] : mf (ī) n.

what I wanted. No simple solution, @funderburkjim ?

gasyoun · 2018-11-06T17:33:20Z

When searching for KRISN nothing found.

Because of capital K? Nope, kRISN does not work as well.
kRSN worked. Can't we have RI = R rule, @funderburkjim , please?

funderburkjim · 2018-11-06T21:56:22Z

One part of the simple search has to do with HK. The assumption is that the user might be using a
spelling with HK assumptions. But HK uses some capital letters. Here is the HK alphabet

a A i I u U R RR lR lRR e ai o au M H 
k kh g gh G 
c ch j jh J 
T Th D Dh N 
t th d dh n 
p ph b bh m 
y r l v z S s h

I wonder what would happen if the user's input string was lower-cased first.

Does this sound worth a try? This would work for 'krisn'.
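
A rough sketch of the lower-casing idea, combined with the RI = R rule requested in the previous comment. It tries the spelling as typed first so that genuine HK capitals are not lost. `query_variants` is a hypothetical name, and the blanket replacement of 'ri' by 'r' is an assumption that would need checking for false positives.

```python
def query_variants(user_input):
    """Spellings to try, in order: as typed, lower-cased, then 'ri' read as vocalic r."""
    variants = [user_input]
    lowered = user_input.lower()
    for candidate in (lowered, lowered.replace("ri", "r")):
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(query_variants("KRISN"))  # ['KRISN', 'krisn', 'krsn']
print(query_variants("kRSN"))   # ['kRSN', 'krsn']
```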

funderburkjim · 2018-11-06T22:09:03Z

hanumat problem appears solved

Simple search with 'hanumat' yields: hanumat hanūmat, so the more common short-u word is now first,
as desired

Simple search with 'hanumant' yields: hanumat hanūmat hanumanta
again, as desired.

funderburkjim · 2018-11-06T22:11:45Z

No simple solution to tadrsi

The desired word is tadrSa, which can be found from 'tadrsa'.

But how to generalize? If we allow a = i everywhere, will there be many false positives?

funderburkjim · 2018-11-06T22:25:22Z

@gasyoun Are there any other open questions on simple search besides KRISN and tadrsi?

gasyoun · 2018-11-07T02:58:14Z

besides KRISN and tadrsi?

None that I'm aware of.

I wonder what would happen if the user's input string was lower-cased first.

I guess this is not the only case where a user will try a capital first letter or even all capitals. Let's have a solution for that; HK is still not an issue. I was surprised he could not find a thing.

gasyoun · 2019-06-23T06:33:23Z

When searching in GRA for virupa: 0 no results found.
So virupa is not found and only virUpa is found, strange.

funderburkjim · 2019-06-24T00:05:53Z

virupa no results in GRA.

Also, no results in MW. Clearly a bug, probably related to various representations of vocalic 'r' (SLP1 'f').

When we get the Cologne reorg done (as discussed today), I'll spend some time trying to make simple search more robust. This will probably be 3+ months from now.

gasyoun · 2019-06-24T04:38:41Z

3+ months

Great, half a year is not an issue. Simple search is what makes a difference, so we can wait for sure.

gasyoun · 2019-07-20T04:30:39Z

In MW, rather strange

0 no results found: rupasampanna
1 result: rUpasampanna
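
One way to make short-vowel input reach long-vowel headwords is to index headwords under a key that ignores vowel length. The sketch below assumes SLP1 headwords; `fold` and `build_index` are hypothetical helpers, the headword list is a tiny stand-in, and this does not claim to be the cause of the bug reported here.

```python
from collections import defaultdict

# SLP1 long vowels folded onto their short counterparts.
FOLD = str.maketrans({"A": "a", "I": "i", "U": "u", "F": "f", "X": "x"})


def fold(word):
    """Key that ignores vowel length."""
    return word.translate(FOLD)


def build_index(headwords):
    """Map each folded key to the headwords that share it."""
    index = defaultdict(list)
    for hw in headwords:
        index[fold(hw)].append(hw)
    return index


index = build_index(["rUpasampanna", "virUpa"])  # stand-in headword list
print(index[fold("rupasampanna")])  # ['rUpasampanna']
print(index[fold("virupa")])        # ['virUpa']
```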

gasyoun · 2019-08-10T14:59:30Z

I was looking for shamsudeen and Google did not recognize it was shamsuddin
(Sheikh Shams Uddin). I had to fix it manually. So even Google's AI is not as AI.

P.S. If I would search for Cyrillic кришна and get krsna

А а a a a
Б б b b b
В в v v v
Г г g g g
Д д d d d
Е е e e e
Ё ё jo yo e
Ж ж zh zh zh
З з z z z
И и i i i
Й й jj j i
К к k k k
Л л l l l
М м m m m
Н н n n n
О о o o o
П п p p p
Р р r r r
С с s s s
Т т t t t
У у u u u
Ф ф f f f
Х х kh x kh
Ц ц c cz (c)³ ts
Ч ч ch ch ch
Ш ш sh sh sh
Щ щ shh shh shch
Ъ ъ ʺ ʺ ie
Ы ы y y' y
Ь ь ʹ ʹ –
Э э eh e' e
Ю ю ju yu iu
Я я ja ya ia
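
A small sketch of a Cyrillic pre-processing step built from the table above. It uses the first romanization column and only the letters whose value there is plain ASCII; `CYR2LAT` and `romanize` are hypothetical names. The output would still have to pass through the usual simple search rules to get from 'krishna' to the headword.

```python
# Lower-case Cyrillic to Latin, taken from the first romanization column above.
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "kh", "ц": "c", "ч": "ch", "ш": "sh",
    "ы": "y", "я": "ja",
}


def romanize(text):
    """Transliterate Cyrillic input; unknown characters pass through unchanged."""
    return "".join(CYR2LAT.get(ch, ch) for ch in text.lower())


print(romanize("кришна"))  # krishna
print(romanize("Кришна"))  # krishna
```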

gasyoun · 2020-02-08T18:31:25Z

gir for WIL
returns
2 results: giri gir
but not
gṝ
Guess should.

gasyoun · 2020-03-09T16:14:02Z

6 results: nī miṣ nis mī miś mi

Now if I search for nis I still get nī shown first. @funderburkjim, by default let's show what I searched for first if it's an exact match? At least as an option. Twice in one hour I did not notice why I could not find what I was looking for. That was the reason.

funderburkjim · 2020-03-10T00:46:36Z

Notice that 'nis' is in bold?

That's the UI clue that it is what you typed. Not enough of a clue?

gasyoun · 2020-03-10T10:56:52Z

Notice that 'nis' is in bold?

Sometimes I notice too late. We invented it. And even I tend to forget.

Not enough of a clue?

Seems so. When I'm 100% sure that there will be the word I search for, I forget to check if a more frequent word is shown. Especially with shorter words.

Ref: sanskrit-lexicon/COLOGNE#167 (comment)

funderburkjim · 2020-03-10T21:59:51Z

nis is now first
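
The behaviour described here (user spelling first, the rest as before) can be sketched as a two-part sort key. This is an illustration only, not the code of the actual commit; `rank` is a hypothetical function and the frequency counts are invented for the example. Words are in SLP1.

```python
def rank(query, candidates, freq):
    """Exact match first; everything else by descending frequency."""
    # sorted() is stable, so ties keep their incoming order.
    return sorted(candidates, key=lambda w: (w != query, -freq.get(w, 0)))


# Invented counts, chosen only to reproduce the order reported above.
freq = {"nI": 900, "miz": 400, "nis": 300, "mI": 200, "miS": 100, "mi": 50}
print(rank("nis", ["nI", "miz", "nis", "mI", "miS", "mi"], freq))
# ['nis', 'nI', 'miz', 'mI', 'miS', 'mi']
```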

gasyoun · 2020-03-13T19:08:52Z

nis is now first

Thanks, so it is above all frequency, right? The 2nd one is still based on frequency, right?

Another thought. If I search chava, 0 found, and a similar (actually the only possible word with chav in beginning) chavi will never be found. Good when I heard a word, but not sure what the exact ending was or should be.
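
A possible fallback for the chava case, sketched under the assumption that the headwords are available as an in-memory set: when the query is not a headword, shorten it from the right until some headword shares the prefix. `prefix_fallback` and the `min_prefix` cutoff are hypothetical, and the headword set is a tiny sample in plain romanization.

```python
def prefix_fallback(query, headwords, min_prefix=3):
    """If the query is not a headword, shorten it until some headword shares the prefix."""
    if query in headwords:
        return [query]
    for end in range(len(query) - 1, min_prefix - 1, -1):
        prefix = query[:end]
        hits = sorted(hw for hw in headwords if hw.startswith(prefix))
        if hits:
            return hits
    return []


headwords = {"chavi", "chattra", "chaya"}  # tiny hypothetical sample
print(prefix_fallback("chava", headwords))  # ['chavi']
```

The cutoff keeps very short prefixes from flooding the result list.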

gasyoun · 2020-08-01T19:12:33Z

If I enter meta, a Pali word in Sanskrit dictionary, all I will get is meṭa and I will never be aware that it's Sanskrit maitra that corresponds to the Pali meta. So there is no such place or engine that could help me in that. Have no simple only-Cologne solution, and we are not ready for https://dsal.uchicago.edu/dictionaries/pali/

Another example: if I've heard a word and am not quite sure how it should be written, like vyudpati, I get 0 no results found. What I would have liked it to find was actually vyutpatti, whose spelling I was unaware of. So it's 2 points:
t instead of d
and tt instead of t.
This seems to be more realistic, @funderburkjim ?

funderburkjim · 2020-08-01T19:57:18Z

This question made me think of spelling checker (for Sanskrit).
Wonder if Norvig's intro could be adapted to Sanskrit?

funderburkjim · 2020-08-01T20:09:12Z

vyudpati has two problems:

'dp' normally does not occur because of sandhi (d + p -> tp). So maybe the program could know about such transforms.
pati should be patti, so maybe the program could know to try doubling the 't' -- but in what contexts?

funderburkjim · 2020-08-01T20:19:58Z

If we simple search 'vyutpati', we get result 'vyutpatti' ! so the 'dp' -> 'tp' transformation would
probably be sufficient. The program seems to already know about 'tt' as alternate to 't'.

For instance 'pati' also solves as 'pati', 'patti' and
'patti' solves as 'patti', 'pati'. Similar for 'citi'.
And 'gatti' solves as 'gati', etc.
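
The 'dp' -> 'tp' transformation could be sketched as a pre-processing pass that devoices a voiced stop standing directly before a voiceless consonant. `sandhi_repair` is a hypothetical helper and the letter sets are a simplified assumption covering lower-case input only; the 't' / 'tt' alternation is left to the existing program.

```python
# Voiced stop -> voiceless counterpart (simplified, lower-case input only).
DEVOICE = {"g": "k", "j": "c", "d": "t", "b": "p"}
VOICELESS = set("kctps")


def sandhi_repair(word):
    """Devoice a voiced stop that stands directly before a voiceless consonant."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in DEVOICE and nxt in VOICELESS:
            out.append(DEVOICE[ch])
        else:
            out.append(ch)
    return "".join(out)


print(sandhi_repair("vyudpati"))  # vyutpati
print(sandhi_repair("pati"))      # pati
```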

gasyoun · 2020-08-02T17:21:42Z

Wonder if Norvig's intro could be adapted to Sanskrit?

Let's plan a call on that. @drdhaval2785 are you there?

'dp' normally does not occur because of sandhi

So the spellchecker should kill anti-sandhi cases first.

pati should be patti, so maybe program could know to try doubling the 't' -- but in what contexts?

Like try to double any consonant?

'tp' transformation would probably be sufficient.

so anti-sandhi would do the job, got it.

drdhaval2785 · 2020-08-03T03:18:56Z

I am OK with call.

kill anti-sandhi

I think we have already discovered the way long ago.

Prepare bigrams / trigrams from sanhw1.txt
Anything which is anti-sandhi will not be in valid bigrams or trigrams.
So out of Norvig items, we can remove transposes, replaces etc which are not in valid bigrams or trigrams.
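
The three steps above can be sketched as follows. `edits1` follows the shape of Norvig's one-edit candidate generator; the bigram filter and the helper names are assumptions for illustration, and the short headword list stands in for sanhw1.txt, whose file format is not assumed here.

```python
def bigrams(word):
    """Set of two-letter sequences in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def valid_bigrams(headwords):
    """All two-letter sequences that occur in some headword."""
    valid = set()
    for hw in headwords:
        valid |= bigrams(hw)
    return valid


def edits1(word, alphabet):
    """Norvig-style candidates one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def plausible_edits(word, headwords):
    """Keep only candidates whose every bigram is attested in the headwords."""
    valid = valid_bigrams(headwords)
    alphabet = sorted(set("".join(headwords)))
    return {c for c in edits1(word, alphabet) if bigrams(c) <= valid}


headwords = ["vyutpatti", "pati", "patti", "gati"]  # stand-in for sanhw1.txt
candidates = plausible_edits("vyudpati", headwords)
print("vyutpati" in candidates)  # True
print("vyudpati" in candidates)  # False, 'ud' and 'dp' are never attested
```

Trigrams could be checked the same way; bigrams alone already discard the anti-sandhi spelling in this example.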

gasyoun · 2020-08-04T12:46:55Z

I think we have already discovered the way long ago.

And lost a few times as well. How about a call on 23rd of August?

funderburkjim · 2020-08-04T17:50:00Z

23rd of August tentatively ok with me.

funderburkjim · 2020-08-04T17:56:20Z

bigrams, trigrams

The 2grams and 3grams for headwords are in https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1 .

These are used in simple_search.php.

gasyoun · 2020-08-05T07:57:32Z

The 2grams and 3grams for headwords are in https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1 .

The last update was 2 years ago. Have we not cleaned up any dirt in the meantime that could keep the list shorter?

gasyoun · 2020-11-13T15:40:08Z

vrisapha gives varṣapa instead of vṛṣabha

gasyoun · 2020-12-16T20:50:27Z

Jim made a capital first letter possible; its capitalization is ignored, finally!

funderburkjim · 2020-12-17T03:22:51Z

This capital-letter issue is only partially solved. I was going to work further on it this week, but
the flurry of comments by Dhaval and the sqlite3 bug have taken precedence.

gasyoun · 2020-12-31T16:29:22Z

There is one field where simple search has not yet been used. As it is a separate library, it can be used for searching for Sanskrit words outside Cologne as well. One of the most notorious places for writing each and every Sanskrit word in 100 wrong ways is DLI, including DLI at archive.org.
For example, to find 2015.445538.SanskritHindiKoushAC45191966.pdf one needs to be aware of Koush for koṣa. @funderburkjim would you mind adding ou as an entry option for o? That is, letters that can be dropped without losing the word originally meant.

gasyoun · 2021-01-19T15:17:37Z

If I enter vacakah I will never know vācaka exists. One can mix ḥ with h.
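
Both requests (ou for o, and a final h typed for visarga) could be expressed as entries in a small rule table. The sketch below is only an illustration: `LOOSE_RULES` and `loose_variants` are hypothetical, and the final-h rule is restricted to an h after a, i or u so that a trailing 'sh' is left alone.

```python
import re

# (pattern, replacement) pairs; both rules are proposals from this thread.
LOOSE_RULES = [
    (re.compile(r"ou"), "o"),           # koush -> kosh
    (re.compile(r"(?<=[aiu])h$"), ""),  # vacakah -> vacaka
]


def loose_variants(user_input):
    """The lower-cased input followed by each distinct spelling the rules produce."""
    word = user_input.lower()
    variants = [word]
    for pattern, replacement in LOOSE_RULES:
        word = pattern.sub(replacement, word)
        if word not in variants:
            variants.append(word)
    return variants


print(loose_variants("Koush"))    # ['koush', 'kosh']
print(loose_variants("vacakah"))  # ['vacakah', 'vacaka']
```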

gasyoun · 2021-01-19T15:56:30Z

If I enter gṛhatsha for gṛhastha, I will get nothing. But I guess, @funderburkjim, our algo will not manage it anyway.

gasyoun · 2021-01-19T16:31:07Z

If I enter false form yoginaḥ I would want to get yogin, makes sense?

gasyoun · 2021-01-19T17:35:56Z

If I enter hariṇyagarbha instead of hiraṇyagarbha, I guess that is too hard for a non-Google suggest algo.

funderburkjim · 2021-01-19T22:32:21Z

All good examples of current limitations.

I still haven't had opportunity to work on simple search in nearly a month now. Too many strings pulling me in other directions.

gasyoun · 2021-01-20T02:34:05Z

I still haven't had opportunity to work on simple search in nearly a month now. Too many strings pulling me in other directions.

I know. If my voice has any value, I would stop all the integrations of the corrigenda
for a month or so, until we finalize the API and move on with simple search.

gasyoun · 2021-01-20T22:07:26Z

I was tracking a word from a book kūṭstha, but it should have been written as kūṭastha. So a vowel was missing in the middle of a word.

gasyoun · 2021-11-29T07:44:29Z

The closeness of the spelling of a word should rank higher in the sorting algorithm than just pure frequency, @funderburkjim

Searching for bahu does not show that there are results like bāhu, strange.
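
A sketch of what ranking by closeness could look like, using plain Levenshtein distance as the closeness measure and frequency only as a tie-breaker. The choice of distance measure is an assumption, `rank_by_closeness` is a hypothetical function, and the counts are invented. Words are in SLP1.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rank_by_closeness(query, candidates, freq):
    """Closest spelling first; frequency only breaks ties."""
    return sorted(candidates, key=lambda w: (edit_distance(query, w), -freq.get(w, 0)))


freq = {"bahu": 500, "bAhu": 300, "bahula": 800}  # invented counts
print(rank_by_closeness("bahu", ["bahula", "bAhu", "bahu"], freq))
# ['bahu', 'bAhu', 'bahula']
```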

gasyoun · 2022-02-17T21:45:48Z

@funderburkjim If I look for sandhimant in MW simple, sandhimat will not be found

gasyoun · 2023-03-06T07:40:56Z

@funderburkjim in addition to 6 others, these 4 remain there as well. As per UI I believe it's top-5. @drdhaval2785 agree?

gasyoun · 2024-09-24T16:11:22Z

@funderburkjim
When I search for vraj I get vraja as well.
When I search for vraja I would want to see vraj as well, but I do not as of now.

gasyoun · 2024-09-28T14:01:27Z

@funderburkjim
brahman is there, even bhrāmaṇa is, but there is no brāhmaṇa in MW, strange:
16 results: brahman vraṇa bhrama bhramaṇa bharaṇa varaṇa brahma bhrāmaṇa varaṇā bhrāma vraṇana vraṇaha vrāṇa vrahman varāṇa bharama

gasyoun · 2024-10-06T17:50:42Z

@funderburkjim Cyrillic mode:
"сурья" or "суря" should find a way to "sūrya", but it does not show up:
сурья gives 12 results, but lacks the one we need: sur sura śūra sūri sūra śura sūrin śūr suri surī sūr sūrī

If we type "surya" with Latin letters, the needed one is the 2-nd:
4 results: surya sūrya sūryā surīya

funderburkjim added a commit to sanskrit-lexicon/csl-apidev that referenced this issue Mar 10, 2020

If user spelling found in simple search, put it first.

bacbde1

Ref: sanskrit-lexicon/COLOGNE#167 (comment)

funderburkjim mentioned this issue Mar 14, 2020

simple-search: show 'chavi' if request 'chava' #302

Open

drdhaval2785 added the enhancement New website features label Dec 17, 2020

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) #325

Open

funderburkjim mentioned this issue Jan 25, 2021

simple search, v1.1 sanskrit-lexicon/csl-apidev#26

Open
