Listing methods to identify errors #46
Comments
Dhaval, I agree that shingles will help us (see http://digfor.blogspot.ru/2013/03/fruity-shingles.html). I will redo my fuzzy comparison as soon as I finish my SLP7 encoding. The fuzzy tool does not recognize capital letters, so I'm making my own modification for encoding / decoding. The list is ready:
It might be less fruitful than the CVC approach, so I'll wait for now.
Regarding 4: Pattern matching on string length. This is the topic of n-grams. Shingling and n-grams are similar, I think, after looking at @gasyoun's reference. The main difference is that with n-grams you are looking at sequences of characters in a character string, while with shingles you are looking at sequences of 'tokens' (e.g. words). @gasyoun Your 'fuzzy' approach is still 'fuzzy' to me, although I understand the basic 'edit distance' idea of string comparison. The approach might be more widely usable if, like the faultfinder tactic, it were developed into a program. Perhaps you and Dhaval are already developing such a program?
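To make the distinction concrete, here is a minimal Python sketch of character n-grams versus word shingles. The function names are hypothetical and not taken from any tool mentioned in this thread, and the Jaccard overlap is only one common way of comparing the resulting sets, added for illustration.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_shingles(text, k):
    """Return every contiguous run of k whitespace-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def jaccard(a, b):
    """Share of distinct items common to both collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Character bigrams of one headword.
print(char_ngrams("Davala", 2))
# Word-level shingles of a short phrase.
print(word_shingles("white as snow", 2))
# Bigram overlap between a headword and a mistyped variant.
print(jaccard(char_ngrams("Davala", 2), char_ngrams("Dabala", 2)))
```

The same windowing logic serves both cases; only the unit (character or token) changes.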
@funderburkjim my fuzzy comparison is an add-on to Excel (https://www.ablebits.com/excel-suite/) and will remain so.
@gasyoun
Yeah, 6 is my long-wanted idea of non-sandhi patterns.
Details at #53.
These are mostly unique words which would require closer inspection.
Let me speak out. 6, listing letter combinations that are impossible by Sanskrit grammar rules: that is possible only for you, but it is still too far away; Dhaval's sandhi tool can only read, not generate a list of possible errors. 7, taking English-Sanskrit dictionaries as a base: indeed, something I had not even thought about. SKD and AP should benefit most, as word endings are incorporated in their headwords. 8, searching for a list of feminine words ending in 'a': partly done by Jim. Out of the nine, I would go with this one first, but here @funderburkjim's help is required, and maybe even the Lexnorm work done on MW. 9, filtering out common differences like M, H at the end: this is the second most wanted thing. If only Jim would help make No. 8 possible and Dhaval would think about what is required for No. 9.
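One possible reading of item 9 is a normalization step applied before two headword lists are compared, so that a difference of a final M or H alone is not reported. The sketch below assumes an SLP1-style transliteration in which M and H are single trailing characters; both function names are hypothetical, and the thread does not specify which differences should be filtered.

```python
def strip_final_mh(headword):
    """Drop a single trailing M or H, if present (assumed normalization rule)."""
    if len(headword) > 1 and headword[-1] in "MH":
        return headword[:-1]
    return headword


def unmatched_headwords(test_words, base_words):
    """Return test headwords with no counterpart in the base list after normalization."""
    base = {strip_final_mh(w) for w in base_words}
    return [w for w in test_words if strip_final_mh(w) not in base]


# Toy data: 'rAmaH' differs from 'rAma' only by the final H and is filtered out,
# while 'Dabala' has no counterpart and is reported.
print(unmatched_headwords(["rAmaH", "Dabala"], ["rAma", "Davala"]))
```

Whatever remains after this filter is a shorter list for manual inspection.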
@drdhaval2785 There are two more which I hope could be productive.
As per
@drdhaval2785 so you think
@funderburkjim and @gasyoun
We currently have three methods in use to identify data entry errors.
Let's start exploring other modes of coding interventions which may give us some more resources for error tracking.
Let me explain.
If MW has an entry Davala, we cut it into all possible 1-letter, 2-letter, ... 6-letter combinations:
D, a, v, a, l, a, Da, av, va, al, la, ..., Davala.
Then we compare the test dictionary against these pattern combinations.
If the test dictionary has Dabala because of a data entry error, it will be caught, because Daba is not a valid pattern in Sanskrit according to our base dictionary MW.
With our current methods such cases may be missed.
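The idea above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the base list is a toy stand-in for the MW headwords, and the maximum length of 6 is taken from the example.

```python
def substrings(word, max_len=6):
    """All contiguous substrings of word that are 1 to max_len characters long."""
    return {word[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(word) - n + 1)}


def build_patterns(base_words, max_len=6):
    """Collect every substring seen in the base dictionary (assumed to be correct)."""
    patterns = set()
    for word in base_words:
        patterns |= substrings(word, max_len)
    return patterns


def suspicious(test_words, patterns, max_len=6):
    """Map each test word to its substrings never seen in the base dictionary."""
    report = {}
    for word in test_words:
        unseen = substrings(word, max_len) - patterns
        if unseen:
            # Shortest unseen patterns first: they localize the error best.
            report[word] = sorted(unseen, key=lambda s: (len(s), s))
    return report


base = ["Davala", "bala"]            # toy stand-in for the MW headwords
patterns = build_patterns(base)
for word, unseen in suspicious(["Davala", "Dabala"], patterns).items():
    print(word, unseen)
```

With a full base dictionary the pattern set gets large, so in practice one might keep only the shorter lengths, or report only the shortest unseen pattern per word.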