Simple-search: hanumat #167

Open
funderburkjim opened this issue Jul 31, 2017 · 47 comments
Labels
enhancement New website features

Comments

@funderburkjim
Contributor

This issue is devoted to the question raised regarding the handling of 'hanumat' in simple search.
Reference: #156 (comment)

If in MW I search for hanumat I get nothing (H2) हनू-मत् [p= 1288] : m. &c = हनु-मत्. [L=260538]

2 results: हनूमत् हनुमत्

Strange that हनुमत् is second, because I would think "having (large) jaws" is the most wanted one.

Agree that we should get hanumat as the best choice. We should also get this if user supplied 'hanuman'.

How to accomplish this is not known. Maybe the comments will come up with a solution.

@funderburkjim
Contributor Author

funderburkjim commented Jul 31, 2017

Technical reason why word frequency fails here:

Actually, neither spelling (hanumat, hanUmat) appears in the word frequency file:

9 matches for "hanuma" in buffer: word_frequency_adj.txt (case-insensitive match)
  66049:hanuman 2
  66050:hanumant 10   <<<<
  66051:hanumatI 0
  66063:hanUmatI 0
  66064:hanUmata 0
  66065:hanUmantI 0
  66066:hanUmantavana 0
  66067:hanUmanteSvara 0
  66068:hanUmanteSvaratIrTamAhAtmyavarRana 0

The word frequency file uses 'ant' instead of 'at' in the spelling: hanumant.

@gasyoun
Member

gasyoun commented Aug 1, 2017

How to accomplish this is not known.

Can't we have a variant with n before endings on t?
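
The at/ant mismatch described above could be bridged by looking up both stem forms. This is only a minimal sketch: `stem_variants` and `frequency` are hypothetical names, not functions from simple_search.php, and the two counts are the ones quoted from word_frequency_adj.txt.

```python
def stem_variants(word: str) -> list[str]:
    """Return the word plus its at/ant alternate, if it has one (hypothetical helper)."""
    variants = [word]
    if word.endswith("ant"):
        variants.append(word[:-3] + "at")    # hanumant -> hanumat
    elif word.endswith("at"):
        variants.append(word[:-2] + "ant")   # hanumat -> hanumant
    return variants


def frequency(word: str, freq: dict[str, int]) -> int:
    """Use the highest count found under any stem variant of the word."""
    return max(freq.get(v, 0) for v in stem_variants(word))


# Counts as quoted from the word frequency file in this thread.
freq = {"hanuman": 2, "hanumant": 10}
print(frequency("hanumat", freq))
```

With this lookup, 'hanumat' would inherit the count stored under 'hanumant'. Whether a blanket at/ant rule creates false matches for other stems is untested.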

@gasyoun
Member

gasyoun commented Jan 22, 2018

When checking which form it should be: tādṛ́śī or tādṛ́śā

Entered tadrsi

0 no results found

Entered tadrsa and found

(H2) tādṛ́śa [p= 442] : mf (ī) n.

what I wanted. No simple solution, @funderburkjim ?

@gasyoun
Member

gasyoun commented Nov 6, 2018

When searching for KRISN, nothing is found.

  1. Because of the capital K? Nope, kRISN does not work either.
  2. kRSN worked. Can't we have an RI = R rule, @funderburkjim, please?

@funderburkjim
Contributor Author

One part of the simple search has to do with HK (Harvard-Kyoto). The assumption is that the user might be
spelling the word according to HK conventions. But HK uses some capital letters. Here is the HK alphabet:

a A i I u U R RR lR lRR e ai o au M H 
k kh g gh G 
c ch j jh J 
T Th D Dh N 
t th d dh n 
p ph b bh m 
y r l v z S s h

I wonder what would happen if the user's input string was lower-cased first.

Does this sound worth a try? This would work for 'krisn'.
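
A rough sketch of the idea, combined with the RI = R rule requested earlier: generate a few alternate spellings of the input and try each in turn. `spelling_candidates` is a hypothetical name, and this is not how simple_search.php is known to work.

```python
def spelling_candidates(user_input: str) -> list[str]:
    """Return the input plus lower-cased and RI->R variants, without duplicates."""
    lowered = user_input.lower()
    raw = [
        user_input,                      # as typed (may already be valid HK)
        lowered,                         # ignore capitalization entirely
        user_input.replace("RI", "R"),   # 'RI' typed for vocalic r (HK 'R')
        lowered.replace("ri", "R"),      # same rule applied to the lower-cased form
    ]
    return list(dict.fromkeys(raw))      # drop duplicates, keep order


print(spelling_candidates("kRISN"))
```

Keeping the original input first means a correct HK spelling is still tried before any lower-cased guess.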

@funderburkjim
Contributor Author

hanumat problem appears solved

Simple search with 'hanumat' yields: hanumat hanūmat, so the more common short-u word is now first,
as desired

Simple search with 'hanumant' yields: hanumat hanūmat hanumanta
again, as desired.

@funderburkjim
Contributor Author

No simple solution to tadrsi

The desired word is tadrSa, which can be found from 'tadrsa'.

But how to generalize? If we allow a = i everywhere, will there be many false positives?

@funderburkjim
Contributor Author

@gasyoun Are there any other open questions on simple search besides KRISN and tadrsi?

@gasyoun
Member

gasyoun commented Nov 7, 2018

besides KRISN and tadrsi?

None that I'm aware of.

I wonder what would happen if the user's input string was lower-cased first.

I guess this is not the only case where a user will try an initial capital or even all capitals. Let's have a solution for that; HK is still not the issue. I was surprised he could not find a thing.

@gasyoun
Member

gasyoun commented Jun 23, 2019

When searching GRA for virupa: 0 no results found.
So virupa is not found and only virUpa is, strange.

@funderburkjim
Contributor Author

virupa no results in GRA.

Also, no results in MW. Clearly a bug, probably related to various representations of vocalic 'r' (SLP1 'f').

When we get the Cologne reorg done (as discussed today), I'll spend some time trying to make simple search more robust. This will probably be 3+ months from now.

@gasyoun
Member

gasyoun commented Jun 24, 2019

3+ months

Great, half a year is not an issue. Simple search is what makes a difference, so we can wait for sure.

@gasyoun
Member

gasyoun commented Jul 20, 2019

In MW, rather strange

0 no results found: rupasampanna
1 result: rUpasampanna
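
Whatever the actual cause of the bug, one way to make short-vowel input reach long-vowel headwords is to index headwords under a key with vowel length removed. This is a sketch under assumptions: headwords are taken to be spelled with capital A, I, U for long vowels, and `fold`, `build_index` and `lookup` are hypothetical names.

```python
from collections import defaultdict

# Map long vowels to short ones; all other letters are left untouched.
LONG_TO_SHORT = str.maketrans({"A": "a", "I": "i", "U": "u"})


def fold(word: str) -> str:
    """Remove vowel-length distinctions from a spelling."""
    return word.translate(LONG_TO_SHORT)


def build_index(headwords: list[str]) -> dict[str, list[str]]:
    """Group headwords by their length-folded key."""
    index = defaultdict(list)
    for w in headwords:
        index[fold(w)].append(w)
    return index


def lookup(query: str, index: dict[str, list[str]]) -> list[str]:
    """Return every headword that matches the query once length is ignored."""
    return index.get(fold(query), [])


index = build_index(["virUpa", "rUpasampanna"])
print(lookup("virupa", index), lookup("rupasampanna", index))
```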

@gasyoun
Member

gasyoun commented Aug 10, 2019

I was looking for shamsudeen and Google did not recognize that it was shamsuddin
(Sheikh Shams Uddin). I had to fix it manually. So even Google's AI is not that intelligent.

P.S. What if I could search for Cyrillic кришна and get krsna?

А а a a a
Б б b b b
В в v v v
Г г g g g
Д д d d d
Е е e e e
Ё ё jo yo e
Ж ж zh zh zh
З з z z z
И и i i i
Й й jj j i
К к k k k
Л л l l l
М м m m m
Н н n n n
О о o o o
П п p p p
Р р r r r
С с s s s
Т т t t t
У у u u u
Ф ф f f f
Х х kh x kh
Ц ц c cz (c)³ ts
Ч ч ch ch ch
Ш ш sh sh sh
Щ щ shh shh shch
Ъ ъ ʺ ʺ ie
Ы ы y y' y
Ь ь ʹ ʹ –
Э э eh e' e
Ю ю ju yu iu
Я я ja ya ia
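
A minimal sketch of how Cyrillic input could be romanized before being handed to simple search. The mapping loosely follows the table above, but dropping the hard and soft signs and choosing between the alternative romanizations are assumptions made for this example; `cyrillic_to_latin` is a hypothetical name.

```python
# One Latin spelling per Cyrillic letter; 'ъ' and 'ь' are dropped (assumption).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "c",
    "ч": "ch", "ш": "sh", "щ": "shh", "ъ": "", "ы": "y", "ь": "",
    "э": "e", "ю": "yu", "я": "ya",
}


def cyrillic_to_latin(text: str) -> str:
    """Romanize Cyrillic letters; anything not in the table passes through."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())


print(cyrillic_to_latin("кришна"))
```

The romanized string would then go through the normal Latin-input path, so 'sh', 'ri' and similar spellings still depend on what that path already accepts.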

@gasyoun
Member

gasyoun commented Feb 8, 2020

Searching WIL for gir returns
2 results: giri gir
but not gṝ. I guess it should.

@gasyoun
Member

gasyoun commented Mar 9, 2020

nis

6 results: nī miṣ nis mī miś mi

Now if I search for nis, a different word is still shown first. @funderburkjim, by default let's show what I searched for first whenever it is an exact match. At least as an option. Twice in one hour I did not notice why I could not find what I was looking for. That was the reason.
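
The requested ordering can be sketched as a two-part sort key: exact match first, then descending frequency. `rank_results` is a hypothetical name, the spellings are plain ASCII stand-ins, and the counts below are invented purely for illustration.

```python
def rank_results(query: str, candidates: list[str], freq: dict[str, int]) -> list[str]:
    """Put an exact match first; order the rest by descending frequency."""
    # (False, ...) sorts before (True, ...), so the exact match leads.
    return sorted(candidates, key=lambda w: (w != query, -freq.get(w, 0)))


# Invented counts, for illustration only.
freq = {"nI": 50, "mI": 20, "nis": 5}
print(rank_results("nis", ["nI", "mI", "nis"], freq))
```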

@funderburkjim
Contributor Author

Notice that 'nis' is in bold?

That's the UI clue that it is what you typed. Not enough of a clue?

@gasyoun
Member

gasyoun commented Mar 10, 2020

Notice that 'nis' is in bold?

Sometimes I notice too late. We invented it. And even I tend to forget.

Not enough of a clue?

Seems so. When I'm 100% sure that there will be the word I search for, I forget to check if a more frequent word is shown. Especially with shorter words.

funderburkjim added a commit to sanskrit-lexicon/csl-apidev that referenced this issue Mar 10, 2020
@funderburkjim
Contributor Author

nis is now first

image

@gasyoun
Member

gasyoun commented Mar 13, 2020

nis is now first

Thanks, so the exact match now ranks above frequency, right? And from the second result on, the order is still based on frequency, right?

Another thought. If I search chava, 0 are found, and the similar chavi (actually the only possible word beginning with chav) will never be found. Finding it would be good when I have heard a word but am not sure what the exact ending was or should be.
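
One possible fallback for this case, sketched under assumptions: when a query finds nothing, shorten it from the end until some headword starts with what remains. `prefix_fallback` and the `min_prefix` cutoff are hypothetical, and the headword list is a toy example.

```python
def prefix_fallback(query: str, headwords: list[str], min_prefix: int = 3) -> list[str]:
    """Return headwords sharing the longest possible prefix with the query."""
    for n in range(len(query), min_prefix - 1, -1):
        prefix = query[:n]
        hits = [w for w in headwords if w.startswith(prefix)]
        if hits:
            return hits
    return []


print(prefix_fallback("chava", ["chavi", "gati"]))
```

The cutoff keeps very short prefixes from matching a large share of the dictionary.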

@gasyoun
Member

gasyoun commented Aug 1, 2020

If I enter meta, a Pali word, in a Sanskrit dictionary, all I will get is meṭa, and I will never be aware that it is Sanskrit maitra that corresponds to the Pali meta. So there is no place or engine that could help me with that. I have no simple Cologne-only solution, and we are not ready for https://dsal.uchicago.edu/dictionaries/pali/

Another example is when I have heard a word and am not quite sure how it should be written: for vyudpati I get 0 no results found. What I would have liked it to find was actually vyutpatti, whose spelling I was unaware of. So there are 2 points:
t instead of d
and tt instead of t.
This seems to be more realistic, @funderburkjim ?

@funderburkjim
Contributor Author

This question made me think of spelling checker (for Sanskrit).
Wonder if Norvig's intro could be adapted to Sanskrit?

@funderburkjim
Contributor Author

vyudpati has two problems:

  1. 'dp' normally does not occur because of sandhi (d + p -> tp). So maybe the program could
    know about such transforms
  2. pati should be patti, so maybe program could know to try doubling the 't' -- but in what contexts?
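
The first point could be sketched as a small rewrite rule: an unaspirated voiced stop directly before a voiceless consonant is replaced by its voiceless counterpart. Only 'dp' -> 'tp' comes from this thread; extending it to g and b, and the set of following consonants, are assumptions, and `regularize_sandhi` is a hypothetical name.

```python
import re

# Voiced stop -> voiceless counterpart (g and b are assumed by analogy with d).
DEVOICE = {"g": "k", "d": "t", "b": "p"}

# A voiced stop immediately followed by a voiceless consonant letter.
ANTI_SANDHI = re.compile(r"[gdb](?=[kctps])")


def regularize_sandhi(word: str) -> str:
    """Rewrite clusters such as 'dp' that sandhi would normally turn into 'tp'."""
    return ANTI_SANDHI.sub(lambda m: DEVOICE[m.group(0)], word)


print(regularize_sandhi("vyudpati"))
```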

@funderburkjim
Contributor Author

If we simple search 'vyutpati', we get result 'vyutpatti' ! so the 'dp' -> 'tp' transformation would
probably be sufficient. The program seems to already know about 'tt' as alternate to 't'.

For instance 'pati' also solves as 'pati', 'patti' and
'patti' solves as 'patti', 'pati'. Similar for 'citi'.
And 'gatti' solves as 'gati', etc.
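
The t/tt behaviour observed here can be imitated with a helper that toggles single and double consonants. This is an illustration only, not the logic of simple_search.php; `doubling_variants` is a hypothetical name and restricting it to 't' by default is an assumption.

```python
def doubling_variants(word: str, letters: str = "t") -> list[str]:
    """Return the word plus spellings with one consonant doubled or undoubled."""
    variants = {word}
    for i, ch in enumerate(word):
        if ch not in letters:
            continue
        if word[i + 1:i + 2] == ch:
            # First letter of a doubled pair: also try the single spelling.
            variants.add(word[:i] + word[i + 1:])
        elif word[i - 1:i] != ch:
            # A single letter: also try the doubled spelling.
            variants.add(word[:i] + ch + word[i:])
    return sorted(variants)


print(doubling_variants("pati"))
print(doubling_variants("gatti"))
```

Passing a longer `letters` string would try doubling other consonants as well, at the cost of more candidates to check.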

@gasyoun
Member

gasyoun commented Aug 2, 2020

Wonder if Norvig's intro could be adapted to Sanskrit?

Let's plan a call on that. @drdhaval2785 are you there?

'dp' normally does not occur because of sandhi

So the spellchecker should kill anti-sandhi cases first.

pati should be patti, so maybe program could know to try doubling the 't' -- but in what contexts?

Like try to double any consonant?

'tp' transformation would probably be sufficient.

so anti-sandhi would do the job, got it.

@drdhaval2785
Contributor

I am OK with call.

kill anti-sandhi

I think we have already discovered the way long ago.

  1. Prepare bigrams / trigrams from sanhw1.txt
  2. Anything which is anti-sandhi will not be in valid bigrams or trigrams.
  3. So out of Norvig items, we can remove transposes, replaces etc which are not in valid bigrams or trigrams.
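
The three steps above can be sketched as follows. The edit generator follows the well-known shape of Norvig's `edits1`; the alphabet is derived from the headwords themselves (an assumption), the headword list is a toy stand-in for sanhw1.txt, and all function names are hypothetical.

```python
def bigrams(word: str) -> set[str]:
    """All adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def build_model(headwords: list[str]) -> tuple[set[str], list[str]]:
    """Collect the valid bigrams and the alphabet from the headword list."""
    valid = set()
    for w in headwords:
        valid |= bigrams(w)
    alphabet = sorted({ch for w in headwords for ch in w})
    return valid, alphabet


def edits1(word: str, alphabet: list[str]) -> set[str]:
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def plausible_edits(word: str, valid: set[str], alphabet: list[str]) -> set[str]:
    """Keep only edits whose every bigram occurs in some headword."""
    return {e for e in edits1(word, alphabet) if bigrams(e) <= valid}


def suggest(word: str, headwords: list[str]) -> list[str]:
    """Headwords reachable from word by one plausible edit."""
    valid, alphabet = build_model(headwords)
    return sorted(plausible_edits(word, valid, alphabet) & set(headwords))


headwords = ["vyutpatti", "pati", "patti", "gati"]  # toy stand-in for sanhw1.txt
print(suggest("vyutpati", headwords))
```

The bigram filter matters most when edits are chained to distance two, since it discards intermediate strings such as those still containing 'dp' before they multiply.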

@gasyoun
Member

gasyoun commented Aug 4, 2020

I think we have already discovered the way long ago.

And lost a few times as well. How about a call on 23rd of August?

@funderburkjim
Contributor Author

23rd of August tentatively ok with me.

@funderburkjim
Contributor Author

bigrams, trigrams

The 2grams and 3grams for headwords are in https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1 .

These are used in simple_search.php.

@gasyoun
Member

gasyoun commented Aug 5, 2020

The 2grams and 3grams for headwords are in https://github.com/sanskrit-lexicon/csl-apidev/tree/master/simple-search/ngram1 .

The last update was 2 years ago. Haven't we cleaned up any dirt in the meantime that could make the list shorter?

@gasyoun
Member

gasyoun commented Nov 13, 2020

vrisapha gives varṣapa instead of vṛṣabha

@gasyoun
Member

gasyoun commented Dec 16, 2020

taks

Jim made an initial capital letter possible; its capitalization is now ignored, finally!

@funderburkjim
Contributor Author

This capital-letter issue is only partially solved. I was going to work further on it this week, but
the flurry of comments by Dhaval and the sqlite3 bug have taken precedence.

@gasyoun
Member

gasyoun commented Dec 31, 2020

There is one field where simple search has not yet been used. As it is a separate library, it can be used for searching for Sanskrit words outside Cologne as well. One of the most notorious places for writing each and every Sanskrit word in 100 wrong ways is DLI, including DLI at archive.org.
For example, to find 2015.445538.SanskritHindiKoushAC45191966.pdf one needs to be aware of Koush for koṣa. @funderburkjim would you mind adding ou as an entry option for o? That is, letters that can be dropped without losing the word that was actually meant.

@gasyoun
Member

gasyoun commented Jan 19, 2021

If I enter vacakah I will never know vācaka exists. One can mix with h.

@gasyoun
Member

gasyoun commented Jan 19, 2021

If I enter gṛhatsha for gṛhastha I will get nothing. But I guess, @funderburkjim, our algo will not manage it anyway.

@gasyoun
Member

gasyoun commented Jan 19, 2021

If I enter false form yoginaḥ I would want to get yogin, makes sense?

@gasyoun
Member

gasyoun commented Jan 19, 2021

If I enter hariṇyagarbha instead of hiraṇyagarbha, I guess that is too hard for a non-Google suggest algo.

@funderburkjim
Contributor Author

All good examples of current limitations.

I still haven't had opportunity to work on simple search in nearly a month now. Too many strings pulling me in other directions.

@gasyoun
Member

gasyoun commented Jan 20, 2021

I still haven't had opportunity to work on simple search in nearly a month now. Too many strings pulling me in other directions.

I know. If my voice has any value, I would pause all the integrations of the corrigenda
for a month or so, until we finalize the API, and then move on with simple search.

@gasyoun
Member

gasyoun commented Jan 20, 2021

I was tracking a word from a book kūṭstha, but it should have been written as kūṭastha. So a vowel was missing in the middle of a word.

@gasyoun
Member

gasyoun commented Nov 29, 2021

praharana

Closeness in spelling to what was typed should weigh more in the sorting algorithm than pure frequency, @funderburkjim

bahu

Searching bahu does not show that there are results like bāhu, strange

@gasyoun
Member

gasyoun commented Feb 17, 2022

@funderburkjim If I look for sandhimant in MW simple, sandhimat will not be found

@gasyoun
Member

gasyoun commented Mar 6, 2023

@funderburkjim in addition to the 6 others, these 4 remain there as well. As per the UI I believe it's top-5. @drdhaval2785 agree?

@gasyoun
Member

gasyoun commented Sep 24, 2024

@funderburkjim
When I search for vraj I get vraja as well.
When I search for vraja I would want to see vraj as well, but I do not as of now.

@gasyoun
Member

gasyoun commented Sep 28, 2024

@funderburkjim
brahman is there, even bhrāmaṇa is, but there is no brāhmaṇa in MW, strange:
16 results: brahman vraṇa bhrama bhramaṇa bharaṇa varaṇa brahma bhrāmaṇa varaṇā bhrāma vraṇana vraṇaha vrāṇa vrahman varāṇa bharama

@gasyoun
Member

gasyoun commented Oct 6, 2024

@funderburkjim Cyrillic mode:
"сурья" or "суря" should find a way to "sūrya", but it does not show up:
сурья gives 12 results, but lacks the one we need: sur sura śūra sūri sūra śura sūrin śūr suri surī sūr sūrī

If we type "surya" with Latin letters, the needed one is second:
4 results: surya sūrya sūryā surīya


3 participants