Cleanup entry "Move DOIs from note and URL field to DOI field and remove http prefix" incorrectly recognizes URLs ending with "2010/stuff" as DOIs #6880

Closed
1 task done
JasonGross opened this issue Sep 6, 2020 · 7 comments · Fixed by #6920
Labels
good first issue An issue intended for project-newcomers. Varies in difficulty. [outdated] type: enhancement

Comments

@JasonGross

JabRef version 5.2--2020-09-06--c0b139a on Windows 10 10.0 amd64, Java 14.0.2

Steps to reproduce the behavior:

  1. Save the file
@Misc{TrustedSlind,
  author   = {Konrad Slind},
  title    = {Trusted Extensions of Interactive Theorem Provers: Workshop Summary},
  date     = {2010-08},
  location = {Cambridge, England},
  url      = {http://www.cs.utexas.edu/users/kaufmann/itp-trusted-extensions-aug-2010/summary/summary.pdf},
}

as a .bib file.

  2. Open this file in JabRef
  3. Click on the one entry to select it
  4. Click Quality -> Cleanup entries / Alt+F8
  5. Ensure that only the first item ("Move DOIs from note and URL field to DOI field and remove http prefix") is checked
  6. Click OK
  7. Double-click on the entry and click "BibTeX source"

Note that the new source is

@Misc{TrustedSlind,
  author   = {Konrad Slind},
  title    = {Trusted Extensions of Interactive Theorem Provers: Workshop Summary},
  date     = {2010-08},
  doi      = {10/summary},
  location = {Cambridge, England},
}

This URL is not a DOI link, though! Presumably this is because the matcher code at

// Regex
// (see http://www.doi.org/doi_handbook/2_Numbering.html)
private static final String DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "(?:\\.[0-9]+)+" // registrant codes
+ "[/:%]" // divider
+ "(?:.+)" // suffix alphanumeric string
+ ")"; // end group \1
private static final String FIND_DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "(?:\\.[0-9]+)+" // registrant codes
+ "[/:]" // divider
+ "(?:[^\\s]+)" // suffix alphanumeric without space
+ ")"; // end group \1
// Regex (Short DOI)
private static final String SHORT_DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "[/:%]" // divider
+ "[a-zA-Z0-9]+"
+ ")"; // end group \1
private static final String FIND_SHORT_DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "[/:]" // divider
+ "[a-zA-Z0-9]+"
+ "(?:[^\\s]+)" // suffix alphanumeric without space
+ ")"; // end group \1
private static final String HTTP_EXP = "https?://[^\\s]+?" + DOI_EXP;
private static final String SHORT_DOI_HTTP_EXP = "https?://[^\\s]+?" + SHORT_DOI_EXP;
// Pattern
private static final Pattern EXACT_DOI_PATT = Pattern.compile("^(?:https?://[^\\s]+?)?" + DOI_EXP + "$", Pattern.CASE_INSENSITIVE);
private static final Pattern DOI_PATT = Pattern.compile("(?:https?://[^\\s]+?)?" + FIND_DOI_EXP, Pattern.CASE_INSENSITIVE);
// Pattern (short DOI)
private static final Pattern EXACT_SHORT_DOI_PATT = Pattern.compile("^(?:https?://[^\\s]+?)?" + SHORT_DOI_EXP, Pattern.CASE_INSENSITIVE);
private static final Pattern SHORT_DOI_PATT = Pattern.compile("(?:https?://[^\\s]+?)?" + FIND_SHORT_DOI_EXP, Pattern.CASE_INSENSITIVE);

considers any non-space text that starts with http:// or https://, followed by 10/, followed by any non-space text, to be a DOI. This is absurd. The character immediately preceding the 10, doi:, or urn: should at the very least be required to be a URL separator character such as /, :, ?, &, or =.
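
To make the false positive concrete, here is a minimal sketch that reuses the EXACT_SHORT_DOI_PATT expression quoted above (which has no trailing "$") on the URL from this report. The class name, the extract helper, and the use of find() are assumptions for illustration; the issue does not say which pattern or which matcher call the cleanup action actually uses.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regular expression is taken from the quoted code.
class ShortDoiFalsePositiveDemo {

    // Same expression as EXACT_SHORT_DOI_PATT above: optional http(s) prefix,
    // optional "urn:" / "doi:", then "10", a divider, and alphanumerics.
    static final Pattern EXACT_SHORT_DOI_PATT = Pattern.compile(
            "^(?:https?://[^\\s]+?)?" + "(?:urn:)?(?:doi:)?(10[/:%][a-zA-Z0-9]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the pattern is applied with find(), so a prefix match is enough.
    static Optional<String> extract(String text) {
        Matcher matcher = EXACT_SHORT_DOI_PATT.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "http://www.cs.utexas.edu/users/kaufmann/"
                + "itp-trusted-extensions-aug-2010/summary/summary.pdf";
        // The lazy prefix swallows everything up to "...aug-20", so "10/summary" is captured.
        System.out.println(extract(url)); // prints Optional[10/summary]
    }
}
```

Under these assumptions the captured group is exactly the bogus value that ended up in the doi field.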

@Siedlerchr Siedlerchr added [outdated] type: enhancement good first issue An issue intended for project-newcomers. Varies in difficulty. labels Sep 7, 2020
@PremKolar
Contributor

Can I please do this?
Looks like fun! :)
Please!

@Siedlerchr
Member

@PremKolar Sure, go ahead!

@PremKolar
Contributor

This is not as straightforward as I thought.
The problem is with the short DOIs. These can look like https://doi.org/d8dn

I don't think there is a way to safely detect these in a URL or in some other field. My only idea was to not delete the entry in the respective original field when a short DOI is found, so that the information is not lost in case of ambiguity. But this would inevitably put wrong data in the DOI field sometimes, e.g. when the url field is https://www.abc.de/10/abcd or when the note field reads 01/10/2012.

Anyone willing to share their thoughts?

http://shortdoi.org/

@JasonGross
Author

https://doi.org/d8dn

This one isn't matched because there's no 10, though, right?

I think that detecting what comes before the 10 and ensuring that it's a valid separator would already be a great improvement.
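
A rough sketch of that separator check, assuming a lookbehind is acceptable here. The pattern and the class name are hypothetical and are not JabRef's code; the separator set is the one suggested in the issue description (/, :, ?, &, =).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JabRef.
class SeparatorAwareShortDoiFinder {

    // "urn:", "doi:" or "10" must sit at the start of the text
    // or directly after one of the separators / : ? & =
    static final Pattern SHORT_DOI_AFTER_SEPARATOR = Pattern.compile(
            "(?:^|(?<=[/:?&=]))(?:urn:)?(?:doi:)?(10[/:][a-zA-Z0-9]+)",
            Pattern.CASE_INSENSITIVE);

    static Optional<String> find(String text) {
        Matcher matcher = SHORT_DOI_AFTER_SEPARATOR.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // "2010/summary" is no longer picked up: the "10" is preceded by "0", not a separator.
        System.out.println(find("http://www.cs.utexas.edu/users/kaufmann/"
                + "itp-trusted-extensions-aug-2010/summary/summary.pdf"));
        // A short DOI in its own path segment is still found.
        System.out.println(find("https://doi.org/10/d8dn"));
    }
}
```

This only removes matches like "2010/summary". Inputs such as https://www.abc.de/10/abcd or 01/10/2012 still match, so the ambiguity discussed earlier in the thread remains.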

Another option is to query doi validity (I think there's already something like this in automatically searching for dois for an entry). If the matched doi isn't valid (I don't think 10/summary is, for example), then it shouldn't move it to the doi field.

@PremKolar
Contributor

https://doi.org/d8dn
This one isn't matched because there's no 10, though, right?

Exactly! That's the second problem.

OK, yes, validating the DOI is of course the obvious solution to this problem. Thanks for the idea!
I have quite a busy week ahead, but I should find some time by the end of the week! :)

@Siedlerchr
Member

Please keep in mind that the cleanup actions can be executed for all entries in your library. So if you have thousands of entries, you would generate thousands of requests to the DOI resolver.

@PremKolar
Contributor

right..
I will test scalability and limit the validations to ambiguous cases only!
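
One possible shape for this, as a sketch only: validate just the ambiguous candidates and cache the answers, so that a cleanup over a large library does not repeat lookups. Everything here is hypothetical. The class name, the rule that a bare "10/..." or "10:..." candidate counts as ambiguous, and the remoteCheck predicate (a stand-in for whatever resolver lookup would be used) are assumptions, not JabRef APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper, not part of JabRef.
class AmbiguousDoiValidator {

    // Stand-in for a lookup against a DOI resolver; injected so no network code is assumed.
    private final Predicate<String> remoteCheck;
    // Remembers earlier answers so each distinct candidate is checked at most once.
    private final Map<String, Boolean> cache = new HashMap<>();

    AmbiguousDoiValidator(Predicate<String> remoteCheck) {
        this.remoteCheck = remoteCheck;
    }

    // Assumption: candidates without a registrant code ("10/..." or "10:...") are the
    // ambiguous ones; candidates like "10.1000/182" are accepted without a lookup.
    static boolean isAmbiguous(String candidate) {
        return candidate.startsWith("10/") || candidate.startsWith("10:");
    }

    boolean accept(String candidate) {
        if (!isAmbiguous(candidate)) {
            return true;
        }
        return cache.computeIfAbsent(candidate, remoteCheck::test);
    }
}
```

With a stub predicate in place of the real lookup, repeated calls to accept("10/summary") trigger a single check, and accept("10.1000/182") triggers none.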

PremKolar added a commit to PremKolar/jabref that referenced this issue Sep 17, 2020