HTTP 500 for processCitationList on non-breaking whitespace string #849

Open
bnewbold opened this issue Nov 3, 2021 · 2 comments
Labels
bug From Hemiptera and especially its suborder Heteroptera

Comments

@bnewbold
Contributor

bnewbold commented Nov 3, 2021

This is a pretty obscure corner case (parsing a mangled string), so may not be a priority to reproduce and fix. But it did come up in real use of GROBID and resulted in an internal error (HTTP 500) instead of a 4xx or 2xx status code.

The source of this citation string is the Crossref DOI reference metadata for the DOI 10.5817/cz.muni.m210-9541-2019. The JSON metadata can be fetched from https://api.crossref.org/v1/works/http://dx.doi.org/10.5817/cz.muni.m210-9541-2019. In the references[] array, the reference with key ref127 has the following JSON structure:

{
  "key": "ref127",
  "unstructured": "               "
}

If parsed into Python and printed:

{'key': 'ref127', 'unstructured': '\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0'}

it is easier to see that this is not a simple space: it is the Unicode character U+00A0, "NO-BREAK SPACE".
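
A quick way to confirm what the string contains is to look up each character's Unicode name. The snippet below is only an illustrative check using the Python standard library, with the dict copied from the printed output above:

```python
import unicodedata

# The reference entry as printed above.
ref = {'key': 'ref127', 'unstructured': '\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0'}
text = ref['unstructured']

# Distinct characters in the string, by code point and by Unicode name.
codepoints = {f"U+{ord(ch):04X}" for ch in text}
names = {unicodedata.name(ch) for ch in text}

# Python treats U+00A0 as whitespace, so the string is blank after stripping,
# even though it contains no ASCII space (U+0020) at all.
is_blank = text.strip() == ""
has_ascii_space = " " in text

print(codepoints, names, is_blank, has_ascii_space)
```
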

When this string is submitted as a citation to processCitationList, GROBID returns a 500 error. The stack trace looks like:

Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ERROR [2021-11-03 06:57:32,569] org.grobid.service.process.GrobidRestProcessString: An unexpected exception occurs.
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! java.lang.NullPointerException: null
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at org.grobid.core.data.BiblioItem.cleanTitles(BiblioItem.java:1784)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at org.grobid.core.engines.CitationParser.processingLayoutTokenMultiple(CitationParser.java:175)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at org.grobid.core.engines.CitationParser.processingStringMultiple(CitationParser.java:92)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at org.grobid.core.engines.Engine.processRawReferences(Engine.java:168)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at org.grobid.service.process.GrobidRestProcessString.processCitationList(GrobidRestProcessString.java:316)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at org.grobid.service.GrobidRestService.processCitationListReturnXml_post(GrobidRestService.java:581)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Nov 03 06:57:32 wbgrp-svc096.us.archive.org GROBID[400404]: ! at java.lang.reflect.Method.invoke(Method.java:498)
[...]

As a workaround, clients can simply avoid submitting whitespace-only strings for parsing.
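
A minimal sketch of that client-side check, assuming the client is written in Python. The helper name `is_parseable_citation` is hypothetical and is not part of GROBID or any client library:

```python
from typing import Optional


def is_parseable_citation(raw: Optional[str]) -> bool:
    """Return True if the string has at least one non-whitespace character.

    str.strip() removes Unicode whitespace, including U+00A0 (NO-BREAK SPACE),
    so strings made only of such characters are rejected as well.
    """
    return bool(raw and raw.strip())


candidates = [
    "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113",
    "\xa0\xa0\xa0",  # no-break spaces only
    " ",             # a single ASCII space
    "",              # empty string
]

# Only strings with real content would be sent to the server.
to_submit = [c for c in candidates if is_parseable_citation(c)]
print(to_submit)
```
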

kermitt2 added the bug label Nov 6, 2021
@kermitt2
Owner

kermitt2 commented Nov 6, 2021

Thank you @bnewbold, intere-string!

Let's add a few more well-formedness tests for the input.

@bnewbold
Contributor Author

bnewbold commented Nov 8, 2021

There was at least one NullPointerException in cleanTitles, but after handling that I think there is an underlying issue: how to handle one or more all-whitespace reference strings submitted in a batch that also contains non-whitespace reference strings. For example, one string which is just a space character (" ") followed by 5 normal reference strings, submitted to processCitationList.

I haven't fully debugged, but it seems like in some situations empty citations in such a list are returned as a stub/empty biblioItem, and in some situations there is an HTTP 500. Here is an example curl command that fails with an HTTP 500 error:

curl -X POST -H "Accept: " -d "citations=Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" -d "citations=Groff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" -d "citations= " https://grobid.qa.fatcat.wiki/api/processCitationList
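
Until batches like this are handled server-side, a client could drop the blank entries before submitting and put placeholders back afterwards, so results still line up with the input positions. This is a rough sketch only: `parse_batch_skipping_blanks` and `parse_fn` are hypothetical names, `parse_fn` stands in for whatever code actually POSTs to processCitationList, and the sketch assumes it returns one result per submitted string, in order.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def parse_batch_skipping_blanks(
    citations: Sequence[str],
    parse_fn: Callable[[List[str]], List[T]],
) -> List[Optional[T]]:
    """Parse only the non-blank citations and return results aligned to the input.

    Positions that held an empty or all-whitespace string come back as None.
    """
    kept_indexes = [i for i, c in enumerate(citations) if c and c.strip()]
    kept = [citations[i] for i in kept_indexes]

    # Skip the request entirely when nothing parseable is left.
    parsed = parse_fn(kept) if kept else []
    if len(parsed) != len(kept):
        raise ValueError("parse_fn must return one result per submitted string")

    results: List[Optional[T]] = [None] * len(citations)
    for i, item in zip(kept_indexes, parsed):
        results[i] = item
    return results


# Demo with a fake parser in place of a real HTTP call.
batch = [
    "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113",
    "Groff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113",
    " ",
]
calls: List[List[str]] = []


def fake_parse(strings: List[str]) -> List[str]:
    calls.append(list(strings))
    return [f"parsed:{s}" for s in strings]


results = parse_batch_skipping_blanks(batch, fake_parse)
print(results)
```

Returning None for the blank positions is one possible convention; a stub/empty item would work equally well as long as the client treats it consistently.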

bnewbold added a commit to internetarchive/sandcrawler that referenced this issue Nov 16, 2021