Pkg.add should display nearest names #616

Closed
s-celles opened this issue Aug 11, 2018 · 9 comments

Comments

@s-celles

s-celles commented Aug 11, 2018

Hello,

When running Pkg.add("NameOfPackage"), if NameOfPackage is not found, an error message such as

ERROR: The following package names could not be resolved:
 * NameOfPackage (not found in project, manifest or registry)
Please specify by known `name=uuid`.

is shown.

If a package name is not found, maybe Pkg could try to "help" users by listing the names of some packages whose names are close to what the user is looking for.

Computing the string similarity (using, for example, the Dice coefficient; see various implementations) between the user-provided package name and the name of each registered package could help.

Maybe the comparison should be done after upper-casing (or lower-casing) both names.
A threshold could probably be set.
The candidates would need to be sorted by descending coefficient, keeping only (for example) the five nearest names, but displaying them with their correct case.
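
The steps proposed above (case folding, a similarity threshold, sorting by descending coefficient, keeping the five nearest names) can be sketched as follows. This is only an illustrative Python sketch, not Pkg's implementation: the function names, the use of character bigrams, and the 0.3 threshold are all assumptions made for the example.

```python
from collections import Counter


def bigrams(s: str) -> Counter:
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Sorensen-Dice coefficient on character bigrams, case-insensitive."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:  # both strings too short to have any bigram
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def nearest_names(query, registered, limit=5, threshold=0.3):
    """Return up to `limit` registered names, best match first.

    `threshold` is an arbitrary cut-off chosen for this sketch.
    Names are compared case-insensitively but returned in original case.
    """
    scored = [(dice_coefficient(query, name), name) for name in registered]
    kept = [(score, name) for score, name in scored if score >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in kept[:limit]]


registry = ["Example", "DataFrames", "SMCExamples", "Plots"]  # made-up sample
print(nearest_names("Exampel", registry))
```

With this made-up registry, "Example" ranks ahead of "SMCExamples", and the two unrelated names fall below the threshold.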

Kind regards

@StefanKarpinski
Member

This is definitely a good feature to have. Isn't edit distance more commonly used for this?

@s-celles
Author

Several kinds of methods for measuring string similarity exist:

  • Levenshtein distance
  • Jaro distance
  • Jaro-Winkler distance
  • Dice coefficient
  • N-Gram similarity
  • Cosine similarity
  • Jaccard similarity
  • Longest common subsequence
  • Hamming distance

to name a few

https://en.wikipedia.org/wiki/String_metric gives a first idea of string metrics, but this paper probably gives a better overview:
https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/folien/SS13/DPDC/DPDC_12_Similarity.pdf

I personally used Dice coefficient in https://github.com/scls19fr/arduino_libraries_search and was quite happy with that choice.

I noticed that several Julia packages exist that could help calculate string similarity.

To name a few:

Maybe these contributors can give us some advice on choosing a "good distance" function for such a use case.

@StefanKarpinski
Member

StefanKarpinski commented Aug 13, 2018

I personally used Dice coefficient in https://github.com/scls19fr/arduino_libraries_search and was quite happy with that choice.

Can you put into words at all what about the Dice coefficient has made it work well for that? I'm not doubting, just wondering why it works better than any of the others. This is the first time I've heard of it for this, whereas Damerau–Levenshtein seems to be the go-to standard for correcting spelling errors.
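
For comparison, an edit distance of the kind mentioned here can be sketched as below. Note the assumption: this Python sketch implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), in which an insertion, deletion, substitution, or swap of two adjacent characters each costs 1. It is not the unrestricted Damerau-Levenshtein distance, and the function name is made up for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Insertions, deletions, substitutions, and swaps of two adjacent
    characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "el" <-> "le"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Exampel", "Example"))  # one adjacent swap
```

A typo such as "Exampel" for "Example" is a single adjacent swap, so it is at distance 1 here, while plain Levenshtein distance would count it as two substitutions.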

@s-celles
Author

Sorry, but I'm not skilled enough to be the Dice advocate.

I chose this distance for three reasons.

  1. A bad reason
    It was the first distance listed in the French Wikipedia article https://fr.wikipedia.org/wiki/Mesure_de_similarit%C3%A9
    (it's a bad reason because it is only the third one in the English article https://en.wikipedia.org/wiki/String_metric )

  2. Another (untested) reason
    The implementation looked simple, so I thought it should be quite fast
    (but I haven't run any benchmarks)

  3. A last reason: other people
    https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings Some SO contributors also suggest Dice.

But anyway, the choice of distance method could be user-defined
(especially since we don't know much about which distance is best for this use case, both in terms of the speed of computing alternative package names and in terms of their relevance)
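
One way to picture a user-defined distance is to pass the similarity function as a parameter. The Python sketch below is hypothetical (neither `suggest` nor its parameters correspond to anything in Pkg); the default metric uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List

# Any function mapping two strings to a score, higher meaning "closer".
Similarity = Callable[[str, str], float]


def ratio_similarity(a: str, b: str) -> float:
    """Default metric for this sketch: case-insensitive difflib ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest(query: str,
            names: Iterable[str],
            similarity: Similarity = ratio_similarity,
            limit: int = 5) -> List[str]:
    """Rank `names` by the caller-supplied similarity, best first."""
    ranked = sorted(names, key=lambda name: similarity(query, name),
                    reverse=True)
    return ranked[:limit]


names = ["Plots", "Example", "DataFrames"]  # made-up sample registry
print(suggest("Exampel", names, limit=1))
```

Swapping in another metric only requires passing a different function as `similarity`, which would also make it easy to benchmark candidates against each other for speed and relevance.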

@s-celles
Author

s-celles commented Jan 5, 2019

This may be a bit outside the scope of this issue, but a Pkg.search / Pkg.find command might also be considered (searching by nearest package name or by description).

@matthieugomez

Julia code for the Damerau-Levenshtein distance can be found in the StringDistances package.

@cdluminate

I second the request for a Pkg.search feature.

@KristofferC
Member

(TimerOutputs) pkg> add Exampel
    Updating registry at `~/.julia/registries/General.toml`
ERROR: The following package names could not be resolved:
 * Exampel (not found in project, manifest or registry)
   Suggestions: Example SMCExamples SIIPExamples DashTextareaAutocomplete ExactOptimalTransport

@waldyrious

For the record, this was implemented in #2985.
