expose AltoPdf as REST API #552

jorgeveamurguia · 2020-03-02T23:02:46Z

Returns XML with the text content and all images in base64 format.

The XML is formatted as:
"+ content +""+ base64imagesxml +"
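As an illustration of the idea described above (ALTO text content plus base64-encoded images in one XML payload), here is a minimal Python sketch. It is not the code from this pull request: the element names `response`, `content`, `images`, and `image`, and the function name `build_response`, are hypothetical and chosen only for the example.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Mapping


def build_response(alto_xml: str, images: Mapping[str, bytes]) -> str:
    """Wrap ALTO content and base64-encoded images in a single XML document.

    All element names used here are hypothetical, not taken from the PR.
    """
    root = ET.Element("response")

    # The ALTO content is itself XML, so parse and nest it instead of
    # concatenating strings; this keeps the result well-formed.
    content = ET.SubElement(root, "content")
    content.append(ET.fromstring(alto_xml))

    # Each image is stored as base64 text under its file name.
    images_el = ET.SubElement(root, "images")
    for name, data in images.items():
        image = ET.SubElement(images_el, "image", {"name": name})
        image.text = base64.b64encode(data).decode("ascii")

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up inputs, for illustration only.
    print(build_response("<alto><String CONTENT='hi'/></alto>",
                         {"image-1.png": b"\x89PNG"}))
```

Building the tree with an XML library rather than string concatenation avoids escaping problems if the content or file names contain characters such as `<` or `&`.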


coveralls · 2020-03-02T23:12:08Z

Coverage decreased (-0.06%) to 37.742% when pulling 1460d16 on jveamurguia:master into b2bd435 on kermitt2:master.

lfoppiano · 2020-03-06T00:37:25Z

I'm not sure we want to expose an internal / transport format at a high level such as the API. You could write a separate service that uses grobid-core as a dependency and exposes the information you need, or use the information at the programmatic level.

The ALTO format chained with images in base64 format seems, IMHO, a bit of an ad-hoc solution.

Could you add some justification of why you would need such a service in grobid?

jorgeveamurguia · 2020-03-09T17:33:03Z

You're right. It is possible that it is poorly implemented, and no new service is necessary. It would be better to add a parameter that optionally returns the images linked to the document.

I am using Grobid to extract text from PDF documents, but I would also like to extract the images. So I implemented a new method and exposed it to return the images (in base64 format).

kermitt2 · 2020-03-09T20:20:49Z

Hello @jorgeveamurguia

You can get the embedded images converted into .png by using the batch command. For instance:

> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-SNAPSHOT-onejar.jar -gH grobid-home -dIn ~/test/in0/ -dOut ~/test/out0 -exe processFullText

There is also a web service that does the same and returns everything in one big zip file, processFulltextAssetDocument. It is still usable but deprecated.

But if you are simply interested in raw text and images, you would be better off using the command line with https://github.com/kermitt2/pdfalto

jorgeveamurguia · 2020-06-10T10:15:13Z

Hello @kermitt2
I want to extract both images and text, and I want the text to be related to the images.
The position of the text and the position of each image within the text are important to me.
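To make the positional requirement concrete, here is a small Python sketch that reads an ALTO document and lists text strings and illustrations with their bounding boxes. It assumes the input follows the ALTO schema, where `String` and `Illustration` elements carry `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes; the sample document and the function names are made up for the example, and actual pdfalto output may contain additional elements and attributes.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace, since ALTO versions use different namespace URIs."""
    return tag.rsplit("}", 1)[-1]


def layout_items(alto_xml: str) -> list:
    """Return text strings and illustrations in document order with their boxes."""
    items = []
    for el in ET.fromstring(alto_xml).iter():
        kind = local(el.tag)
        if kind not in ("String", "Illustration"):
            continue
        items.append({
            "kind": kind,
            "text": el.get("CONTENT"),  # None for illustrations
            "x": float(el.get("HPOS", 0)),
            "y": float(el.get("VPOS", 0)),
            "width": float(el.get("WIDTH", 0)),
            "height": float(el.get("HEIGHT", 0)),
        })
    return items


# Made-up sample, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock><TextLine>
      <String CONTENT="Figure" HPOS="72.0" VPOS="100.0" WIDTH="30.0" HEIGHT="10.0"/>
    </TextLine></TextBlock>
    <Illustration HPOS="72.0" VPOS="120.0" WIDTH="200.0" HEIGHT="150.0"/>
  </PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for item in layout_items(SAMPLE):
        print(item)
```

With coordinates for both kinds of element, relating a piece of text to an image becomes a comparison of boxes on the same page, for example finding the text line closest to an illustration.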

lfoppiano · 2020-08-12T01:28:43Z

I'm closing this, as it has been documented here

expose a REST API from altoPdf to return xml with text content and al…

1460d16

…l images in base64 format

Downchuck mentioned this pull request Mar 5, 2020

[WIP] Docx support #515

Draft

lfoppiano added the "question" label Apr 6, 2020

lfoppiano closed this Aug 12, 2020
