Listing methods to identify errors #46

drdhaval2785 · 2014-12-05T16:40:33Z

We currently have the following methods in circulation for identifying data entry errors.

1 Alphabetic ordering issue (Sampada working on it)
2 Pattern mismatch based on CV pattern (Dhaval)
3 Fuzzy (Marcis)
4 Pattern mismatch based on string length.
5 Apply subanta and tiGanta generators to these methods - so that our tools are ready for application to description also.
6 listing out impossible letter combinations by Sanskrit grammar rules.
7 Taking English-Sanskrit dictionaries as base and clustering the Sanskrit words having same meaning. The word which is not repeated across dictionaries is suspect.
8 Search for a list of feminine words ending in 'a'
9 Listing out words which appear only in one dictionary after filtering out common differences like M, H at the end, corresponding nasal letters etc.

Let's start exploring other modes of coding interventions which may give us some more resources for error tracking.

4 Pattern mismatch based on string length.

Let me explain.
If MW has an entry Davala, we cut it into all possible 1-letter, 2-letter, ... 6-letter substrings:
D, a, v, a, l, a, Da, av, va, al, la, ..., Davala.
Then we compare the test dictionary against the combined set of such substrings.
If the test dictionary has Dabala because of a data entry error, it will be caught, because Daba is not a valid pattern in Sanskrit according to our base dictionary MW.
With our current pattern method such cases may be missed.
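
A minimal sketch of the substring idea described above, assuming headwords are plain SLP1 strings. The function names, the `max_n` cutoff, and the two-word base list are hypothetical stand-ins for illustration, not an existing script from this repository.

```python
def ngrams(word, max_n=None):
    """Return the set of all contiguous substrings of `word` up to max_n letters."""
    limit = len(word) if max_n is None else min(max_n, len(word))
    return {
        word[i:i + n]
        for n in range(1, limit + 1)
        for i in range(len(word) - n + 1)
    }


def build_known(base_headwords, max_n=4):
    """Collect every substring (up to max_n letters) seen in the base dictionary."""
    known = set()
    for headword in base_headwords:
        known |= ngrams(headword, max_n)
    return known


def unknown_ngrams(word, known, max_n=4):
    """List substrings of `word` that never occur in the base dictionary."""
    return sorted(ngrams(word, max_n) - known, key=lambda g: (len(g), g))


# Toy base list standing in for the MW headwords (assumption).
base = ["Davala", "bala"]
known = build_known(base)
suspects = unknown_ngrams("Dabala", known)
print(suspects)
```

With a real base dictionary the known set grows large, so capping `max_n` at 3 or 4 letters is probably a sensible trade-off between memory and sensitivity.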


gasyoun · 2014-12-05T17:03:04Z

Dhaval, I agree that shingles will help us (see http://digfor.blogspot.ru/2013/03/fruity-shingles.html). I will redo my fuzzy comparison as soon as I finish my SLP7 encoding. Fuzzy does not recognize capital letters, so I'm making my own modification for coding / decoding. The list is ready:

A ä
I ï
U ü
F ř
X ľ
E ë
O ö
M ň
H ĥ
K ķ
G ģ
N ŋ
C č
J ĵ
Y ñ
W ṭ
Q ţ
R ṇ
T ŧ
D đ
P þ
B ƅ
S š

It may turn out to be less fruitful than the CVC approach, so I'll wait for now.

funderburkjim · 2014-12-07T21:50:05Z

Regarding 4: Pattern matching on string length.

This is the topic of N-grams. Shingling and n-grams are similar, I think, after looking at @gasyoun 's reference. The main difference is that with n-grams you are looking at sequences of characters in a character string while with shingles you are looking at sequences of 'tokens' (e.g. of words).

@gasyoun Your 'fuzzy' approach is still 'fuzzy' to me, although I understand the basic 'edit distance' idea of string comparison. The approach might be more widely usable if, like the faultfinder tactic, it were developed into a program. Perhaps you and Dhaval are already developing such a program?
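
For reference, the basic edit distance idea mentioned here can be sketched in a few lines. This is the standard Levenshtein distance, not necessarily what the Excel add-on computes; `near_misses` and the toy base list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def near_misses(word, base_headwords, max_dist=1):
    """Base headwords that differ from `word` by 1..max_dist edits."""
    return [hw for hw in base_headwords
            if 0 < edit_distance(word, hw) <= max_dist]


print(near_misses("Dabala", ["Davala", "bala"]))
```

A headword of the test dictionary that is absent from the base dictionary but has a near miss there would be a candidate for manual review.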

gasyoun · 2014-12-07T22:03:03Z

@funderburkjim my fuzzy is an add-on to Excel (https://www.ablebits.com/excel-suite/) and will remain so.

drdhaval2785 · 2014-12-08T18:39:43Z

@gasyoun
Jim and I are interested in documentation of the logic and execution of your fuzzy method. Spend some time and create a readme and a video.

drdhaval2785 · 2014-12-08T18:42:06Z

5 When we extend any pattern-based approach to the description, the field will be different. We will have noun endings and verb endings too, which are not there in headwords. So we will have to figure out a way to tackle them.

drdhaval2785 · 2014-12-08T18:43:39Z

6 listing out impossible letter combinations by Sanskrit grammar rules

gasyoun · 2014-12-08T18:57:39Z

Yeah, 6 is my long-wanted idea of non-sandhi patterns.

drdhaval2785 · 2014-12-08T19:06:50Z

7 Taking English-Sanskrit dictionaries as base and clustering the Sanskrit words having same meaning. The word which is not repeated across dictionaries is suspect

drdhaval2785 · 2014-12-18T07:20:15Z

8 Search for a list of feminine words ending in 'a'

Details at #53 .

drdhaval2785 · 2015-04-13T18:58:42Z

9 Listing out words which appear only in one dictionary after filtering out common differences like M, H at the end, corresponding nasal letters etc.

These are mostly unique words which would require closer inspection.
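
A rough sketch of how method 9 might look, assuming SLP1 headwords. The normalization rules (homorganic nasal before a stop folded into M, final M or H dropped) are one reading of "common differences", and the sample headwords, including the invented form Pabla, are toy data.

```python
import re
from collections import defaultdict

# A class nasal followed by a stop of its own class is treated like anusvara.
_NASAL_BEFORE_STOP = re.compile(
    r"N(?=[kKgG])|Y(?=[cCjJ])|R(?=[wWqQ])|n(?=[tTdD])|m(?=[pPbB])"
)


def normalize(headword):
    """Fold nasal/anusvara variants together and drop a final M or H."""
    headword = _NASAL_BEFORE_STOP.sub("M", headword)
    return re.sub(r"[MH]$", "", headword)


def singletons(dictionaries):
    """Map each normalized form found in exactly one dictionary to its spellings."""
    seen_in = defaultdict(set)
    spellings = defaultdict(set)
    for name, headwords in dictionaries.items():
        for hw in headwords:
            key = normalize(hw)
            seen_in[key].add(name)
            spellings[key].add(hw)
    return {key: sorted(spellings[key])
            for key, names in seen_in.items() if len(names) == 1}


sample = {
    "MW": ["aNka", "rAma", "Pala"],
    "AP": ["aMka", "rAmaH", "PalaM", "Pabla"],
}
print(singletons(sample))
```

Only Pabla survives the filter in this toy run, since the other differences are the kind the normalization is meant to ignore.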

gasyoun · 2015-11-17T15:06:29Z

Let me speak out:
1 Alphabetic ordering issue - you would be good at coding it, but that would require understanding the ordering standards, and there are quite a few of them.

6 listing out impossible letter combinations by Sanskrit grammar rules - that is possible only for you, but it is still too far away; Dhaval's sandhi tool can only read, not generate a list of possible errors.

7 Taking English-Sanskrit dictionaries as base - indeed, something I've not even thought about. SKD, AP should get most benefit, as there are word endings incorporated in headwords.

8 Search for a list of feminine words ending in 'a' - partly done by Jim. Out of the 9 I would go with this one first, but here @funderburkjim's help is required, and maybe even the Lexnorm work done on MW.

9 filtering out common differences like M, H at the end - is the 2nd most wanted thing.

If only Jim would help make №8 possible and Dhaval would think about what is required for №9.

gasyoun · 2015-12-02T12:45:25Z

@drdhaval2785 There are two more which I hope could be productive.

#46

6 listing out impossible letter combinations by Sanskrit grammar rules
and
1 Alphabetic ordering issue
@funderburkjim is there any alphabetical misordering script you used before?
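
As an illustration of what such a misordering check could look like (not a script from this project), here is a sketch assuming SLP1 headwords and the common SLP1 collation with anusvara and visarga placed after the vowels. Individual dictionaries follow different conventions, so `SLP1_ORDER` would need adjusting per dictionary.

```python
# Assumed collation; it omits rarer letters such as Vedic L.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def sort_key(headword):
    """Rank letters by Sanskrit order; unknown characters sort last."""
    return [RANK.get(ch, len(RANK)) for ch in headword]


def out_of_order(headwords):
    """Return adjacent pairs where the later headword sorts before the earlier one."""
    return [
        (prev, cur)
        for prev, cur in zip(headwords, headwords[1:])
        if sort_key(cur) < sort_key(prev)
    ]


print(out_of_order(["aMSa", "akza", "agni", "aka"]))
```

A flagged pair is only a hint: either member may be the mistyped one, or the dictionary may legitimately deviate from the assumed order.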

As per
7 Taking English-Sanskrit dictionaries as base
It's too much effort for now. The idea is perfect, but I see too many lines of code compared to the number of wrong words we would locate. It will clean the English-Sanskrit dictionaries for sure, but that's not a top priority in my view.

gasyoun · 2016-10-01T05:41:12Z

@drdhaval2785 so you think listing out impossible letter combinations by Sanskrit grammar rules is over now?

drdhaval2785 mentioned this issue Apr 13, 2015

Review of Corrections/Changes to dictionaries #90

Closed

drdhaval2785 added the wontfix label Apr 16, 2015

drdhaval2785 added Documentation and removed wontfix labels Nov 17, 2015

drdhaval2785 mentioned this issue Dec 2, 2015

Todo list as of December 2015 #181

Open

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) sanskrit-lexicon/COLOGNE#325

Open

drdhaval2785 mentioned this issue Jan 13, 2021

Hunspell for Sanskrit? sanskrit-lexicon/COLOGNE#91

Open
