Listing methods to identify errors #46

Open
drdhaval2785 opened this issue Dec 5, 2014 · 13 comments

@drdhaval2785
Contributor

@funderburkjim and @gasyoun

We have the following methods in use or proposed for identifying data entry errors.

1 Alphabetic ordering issue (Sampada working on it)
2 Pattern mismatch based on CV pattern (Dhaval)
3 Fuzzy (Marcis)
4 Pattern mismatch based on string length.
5 Apply subanta and tiGanta generators to these methods, so that our tools are also ready to be applied to the description text.
6 listing out impossible letter combinations by Sanskrit grammar rules.
7 Taking English-Sanskrit dictionaries as base and clustering the Sanskrit words having same meaning. The word which is not repeated across dictionaries is suspect.
8 Search for a list of feminine words ending in 'a'
9 Listing out words which appear only in one dictionary after filtering out common differences like M, H at the end, corresponding nasal letters etc.

Let's start exploring other modes of coding interventions which may give us some more resources for error tracking.

4 Pattern mismatch based on string length.

Let me explain.
If MW has an entry Davala, we cut it into all possible substrings of 1 letter, 2 letters, ... up to 6 letters:
D, a, v, a, l, a, Da, av, va, al, la, ..., Davala.
Then we compare the test dictionary against this set of substrings.
If the test dictionary has Dabala because of a data entry error, it will be caught, because Daba is not a valid pattern in Sanskrit according to our base dictionary MW.
With our current pattern method, such cases may be missed.
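
The substring check described above can be sketched roughly as follows. This is only an illustration, not project code: the function names, the `n_max` cutoff and the one-word base list are assumptions, and a real run would load the MW headwords as the base.

```python
def ngrams(word, n_min=1, n_max=None):
    """Return the set of all substrings of `word` with length n_min..n_max."""
    n_max = len(word) if n_max is None else min(n_max, len(word))
    return {
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    }


def build_known(base_words, n_max=4):
    """Collect every substring (up to n_max letters) seen in the base dictionary."""
    known = set()
    for word in base_words:
        known |= ngrams(word, 1, n_max)
    return known


def unseen_ngrams(word, known, n_max=4):
    """Substrings of `word` that never occur in the base dictionary."""
    missing = ngrams(word, 1, n_max) - known
    return sorted(missing, key=lambda g: (len(g), g))


# Toy base dictionary (assumption): a single MW headword.
known = build_known(["Davala"])
print(unseen_ngrams("Dabala", known))
```

With a realistic base dictionary, single letters and most 2-letter substrings will all be known, so the longer substrings are what actually separate Dabala from Davala.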

@gasyoun
Member

gasyoun commented Dec 5, 2014

Dhaval, I agree that shingles will help us (see http://digfor.blogspot.ru/2013/03/fruity-shingles.html). I will redo my fuzzy comparison as soon as I finish my SLP7 encoding. Fuzzy does not distinguish capital letters from lowercase ones, so I am modifying the coding / decoding to work around that. The list is ready:

A ä
I ï
U ü
F ř
X ľ
E ë
O ö
M ň
H ĥ
K ķ
G ģ
N ŋ
C č
J ĵ
Y ñ
W ṭ
Q ţ
R ṇ
T ŧ
D đ
P þ
B ƅ
S š

This may be less fruitful than the CVC approach, so I'll wait for now.
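
For anyone who wants to try the same recoding outside Excel, here is a small sketch. The pairs are copied from the list above; the names `encode` and `decode` and the NFC normalization step are my own assumptions, not part of any existing SLP7 tool.

```python
import unicodedata

# Capital letter -> lowercase stand-in, copied from the list above.
_RAW_PAIRS = (
    "A ä I ï U ü F ř X ľ E ë O ö M ň H ĥ K ķ G ģ N ŋ "
    "C č J ĵ Y ñ W ṭ Q ţ R ṇ T ŧ D đ P þ B ƅ S š"
).split()

# NFC normalization guards against decomposed accents (assumption: every
# stand-in has a single-code-point composed form).
MAPPING = {
    _RAW_PAIRS[i]: unicodedata.normalize("NFC", _RAW_PAIRS[i + 1])
    for i in range(0, len(_RAW_PAIRS), 2)
}

_ENCODE = str.maketrans(MAPPING)
_DECODE = str.maketrans({v: k for k, v in MAPPING.items()})


def encode(word):
    """Replace each capital letter by its lowercase stand-in."""
    return word.translate(_ENCODE)


def decode(word):
    """Restore the capital letters from the stand-ins."""
    return word.translate(_DECODE)


coded = encode("Davala")
print(coded, decode(coded))
```

After encoding, a case-insensitive comparison can no longer confuse a capital with its lowercase counterpart, because the stand-ins contain no capital letters.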

@funderburkjim
Contributor

Regarding 4: Pattern matching on string length.

This is the topic of n-grams. Shingling and n-grams are similar, I think, after looking at @gasyoun's reference. The main difference is that with n-grams you are looking at sequences of characters in a character string, while with shingles you are looking at sequences of 'tokens' (e.g. of words).
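
To make the distinction concrete, here is a tiny sketch. The function names are mine, and splitting tokens on whitespace is a simplifying assumption.

```python
def char_ngrams(text, n):
    """All runs of n consecutive characters in a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_shingles(text, n):
    """All runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(char_ngrams("Davala", 3))
print(word_shingles("one two three four", 2))
```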

@gasyoun Your 'fuzzy' approach is still 'fuzzy' to me, although I understand the basic 'edit distance' idea of string comparison. The approach might be more widely usable if, like the faultfinder tactic, it were developed into a program. Perhaps you and Dhaval are already developing such a program?
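
For reference, the basic 'edit distance' idea can be sketched as below. This is the textbook Levenshtein distance, not a description of what the fuzzy tool actually does; `near_matches` and its `max_distance` threshold are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def near_matches(word, base_words, max_distance=1):
    """Base-dictionary words that are close to, but not identical with, `word`."""
    return [w for w in base_words if 0 < edit_distance(word, w) <= max_distance]


print(near_matches("Dabala", ["Davala", "rAma"]))
```

A test-dictionary headword that has a near match in the base dictionary but no exact match would be a candidate for manual checking.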

@gasyoun
Member

gasyoun commented Dec 7, 2014

@funderburkjim my fuzzy comparison uses an add-on to Excel (https://www.ablebits.com/excel-suite/), and it will stay that way.

@drdhaval2785
Contributor Author

@gasyoun
Jim and I are interested in documentation of the logic and execution of your fuzzy method. Please spend some time creating a readme and a video.

@drdhaval2785
Contributor Author

5 When we extend any pattern-based method to the description, the field will be different. It will contain noun endings and verb endings, which are not present in headwords. So we will have to figure out a way to handle them.

@drdhaval2785
Contributor Author

6 listing out impossible letter combinations by Sanskrit grammar rules
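
Once such a list exists, applying it is simple. A rough sketch follows; the two patterns in the example are placeholders for illustration only, not a verified list of impossible combinations, and the function name is hypothetical.

```python
def find_forbidden(words, forbidden):
    """Map each word to the forbidden letter combinations it contains."""
    hits = {}
    for word in words:
        found = [pattern for pattern in forbidden if pattern in word]
        if found:
            hits[word] = found
    return hits


# Placeholder patterns (assumption): a real list would come from the grammar rules.
PLACEHOLDER_PATTERNS = ["KK", "GG"]
print(find_forbidden(["duHKa", "duKKa"], PLACEHOLDER_PATTERNS))
```

The hard part is compiling the list itself, which this sketch does not attempt.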

@gasyoun
Member

gasyoun commented Dec 8, 2014

Yeah, 6 is my long-wanted idea of non-sandhi patterns.

@drdhaval2785
Contributor Author

7 Taking English-Sanskrit dictionaries as base and clustering the Sanskrit words having same meaning. The word which is not repeated across dictionaries is suspect
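
A rough sketch of the clustering idea, assuming each English-Sanskrit dictionary has already been reduced to (English meaning, Sanskrit word) pairs. The dictionary names and entries below are made up for illustration, and matching meanings by exact string equality is a strong simplification.

```python
from collections import defaultdict


def suspects_by_meaning(dictionaries):
    """Return (meaning, word, dictionary) for Sanskrit words that only one
    dictionary gives for a meaning that several dictionaries cover."""
    by_meaning = defaultdict(lambda: defaultdict(set))
    for name, entries in dictionaries.items():
        for english, sanskrit in entries:
            by_meaning[english][sanskrit].add(name)

    suspects = []
    for english, words in by_meaning.items():
        covering = set().union(*words.values())
        if len(covering) < 2:
            continue  # only one dictionary has this meaning: nothing to compare
        for sanskrit, names in words.items():
            if len(names) == 1:
                suspects.append((english, sanskrit, next(iter(names))))
    return sorted(suspects)


# Hypothetical toy data.
toy = {
    "dict_a": [("white", "Davala")],
    "dict_b": [("white", "Davala")],
    "dict_c": [("white", "Dabala")],
}
print(suspects_by_meaning(toy))
```

Genuine synonyms that only one dictionary lists would also be flagged, so the output is a list for inspection, not a list of errors.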

@drdhaval2785
Contributor Author

8 Search for a list of feminine words ending in 'a'

Details at #53 .
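
As a minimal sketch, assuming the headwords are in SLP1 (so short 'a' and long 'A' are distinct) and that gender is available as a simple marker. The (headword, gender) tuple format is a hypothetical stand-in for the real dictionary data.

```python
def feminine_ending_in_short_a(entries):
    """Headwords marked feminine that end in short 'a' (SLP1)."""
    return [
        headword
        for headword, gender in entries
        if gender == "f" and headword.endswith("a")
    ]


# Hypothetical toy entries; "mAla" marked feminine is the suspicious one.
entries = [("mAlA", "f"), ("mAla", "f"), ("rAma", "m")]
print(feminine_ending_in_short_a(entries))
```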

@drdhaval2785
Contributor Author

9 Listing out words which appear only in one dictionary after filtering out common differences like M, H at the end, corresponding nasal letters etc.

These are mostly unique words which would require closer inspection.
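
One possible shape for this filter, assuming SLP1 headwords. The normalization below covers only the two differences named above (final M / H, and a class nasal written instead of anusvara before a stop); the function names and the toy word lists are assumptions.

```python
import re
from collections import defaultdict

# SLP1 class nasal -> the stops of the same class.
_NASAL_CLASSES = {"N": "kKgG", "Y": "cCjJ", "R": "wWqQ", "n": "tTdD", "m": "pPbB"}


def normalize(word):
    """Remove spelling differences we do not want to count as distinct words."""
    word = re.sub(r"[MH]$", "", word)  # drop final anusvara / visarga
    for nasal, stops in _NASAL_CLASSES.items():
        # class nasal before a stop of its own class -> anusvara
        word = re.sub(f"{nasal}(?=[{stops}])", "M", word)
    return word


def words_in_one_dictionary(dictionaries):
    """Normalized forms attested in exactly one dictionary, with their sources."""
    seen = defaultdict(set)
    for name, words in dictionaries.items():
        for word in words:
            seen[normalize(word)].add((name, word))
    return {
        key: sorted(hits)
        for key, hits in seen.items()
        if len({name for name, _ in hits}) == 1
    }


toy = {"A": ["aNka", "rAmaH"], "B": ["aMka", "rAma", "Dabala"]}
print(words_in_one_dictionary(toy))
```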

@gasyoun
Member

gasyoun commented Nov 17, 2015

Let me speak out:
1 Alphabetic ordering issue - you would be good at coding it, but that requires understanding the ordering standards, and there are quite a few of them

6 listing out impossible letter combinations by Sanskrit grammar rules - only you could do that, but it is still far off; Dhaval's sandhi tool can only read, not generate a list of possible errors.

7 Taking English-Sanskrit dictionaries as base - indeed, something I've not even thought about. SKD, AP should get most benefit, as there are word endings incorporated in headwords.

8 Search for a list of feminine words ending in 'a' - partly done by Jim. Out of the 9, I would go with this one first, but @funderburkjim's help is required here, and maybe even the Lexnorm work done on MW.

9 filtering out common differences like M, H at the end - is the 2nd most wanted thing.

If only Jim would help make №8 possible and Dhaval would think about what is required for №9.

@gasyoun
Member

gasyoun commented Dec 2, 2015

@drdhaval2785 There are two more which I hope could be productive.

#46

6 listing out impossible letter combinations by Sanskrit grammar rules
and
1 Alphabetic ordering issue
@funderburkjim is there any alphabetical misordering script you used before?
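
In case it helps the discussion, a misordering check could look roughly like this. It assumes SLP1 headwords and one particular alphabet order (anusvara and visarga placed right after the vowels); dictionaries that follow a different ordering standard would need a different `SLP1_ORDER`. All names here are hypothetical.

```python
# Assumed collation order for SLP1 letters.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
_RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def sort_key(word):
    """Rank each letter by its position in the assumed alphabet;
    unknown characters sort after every known letter."""
    return [_RANK.get(ch, len(SLP1_ORDER)) for ch in word]


def misordered(headwords):
    """Adjacent headword pairs where the second sorts before the first."""
    return [
        (prev, cur)
        for prev, cur in zip(headwords, headwords[1:])
        if sort_key(prev) > sort_key(cur)
    ]


print(misordered(["akza", "agni", "aMSa"]))
```

A pair reported here is either a genuine misordering in the printed dictionary or a typing error in one of the two headwords, so each hit still needs a look at the scan.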

As per
7 Taking English-Sanskrit dictionaries as base
It's too much effort for now. The idea is perfect, but I see too many lines of code relative to the number of wrong words we would locate. It will certainly clean up the English-Sanskrit dictionaries, but that is not a top priority for me.

@gasyoun
Member

gasyoun commented Oct 1, 2016

@drdhaval2785 so you think listing out impossible letter combinations by Sanskrit grammar rules is over now?
