expose AltoPdf as REST API #552

Closed
wants to merge 1 commit into from


jorgeveamurguia

Return XML with the text content and all images in Base64 format.

The XML is formatted as:
"+ content +""+ base64imagesxml +"
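A minimal sketch of what such a combined payload could look like, assuming the text content is an ALTO XML string and the images are available as raw bytes. The wrapper element names (`document`, `content`, `images`, `image`) and the function `build_payload` are hypothetical, since the tags used in the PR are not shown above.

```python
import base64
import xml.etree.ElementTree as ET


def build_payload(alto_xml: str, images: dict) -> str:
    """Wrap ALTO text and Base64-encoded images in a single XML document.

    All wrapper element names here are hypothetical, chosen for illustration.
    `images` maps a file name to the raw image bytes.
    """
    root = ET.Element("document")

    # Embed the ALTO document as a child of <content>.
    content = ET.SubElement(root, "content")
    content.append(ET.fromstring(alto_xml))

    # Add one <image> element per image, with the bytes encoded as Base64 text.
    images_el = ET.SubElement(root, "images")
    for name, data in images.items():
        image_el = ET.SubElement(images_el, "image", {"name": name})
        image_el.text = base64.b64encode(data).decode("ascii")

    return ET.tostring(root, encoding="unicode")


payload = build_payload("<alto><Layout/></alto>", {"image-1.png": b"\x89PNG"})
print(payload)
```

Base64 keeps the payload valid XML but inflates each image by roughly a third, which is one reason a zip of separate files can be preferable for large documents.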

@coveralls

Coverage Status

Coverage decreased (-0.06%) to 37.742% when pulling 1460d16 on jveamurguia:master into b2bd435 on kermitt2:master.

@Downchuck Downchuck mentioned this pull request Mar 5, 2020
@lfoppiano
Collaborator

I'm not sure we want to expose an internal/transport format at a high level such as the API. You could write a separate service that uses grobid-core as a dependency and exposes the information you need, or use the information at the programmatic level.

The ALTO format chained with images in Base64 format seems, IMHO, a bit of an ad hoc solution.

Could you add some justification for why you would need such a service in Grobid?

@jorgeveamurguia
Author

It may well be poorly implemented. You're right: no new service is necessary. It would be better to add a parameter that optionally returns the images linked to the document.

I am using Grobid to extract text from PDF documents, but I would also like to extract the images. So I implemented a new method and exposed it to return the images (in Base64 format).

@kermitt2
Owner

kermitt2 commented Mar 9, 2020

Hello @jorgeveamurguia

You can get the embedded images converted into .png by using the batch command. For instance:

> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-SNAPSHOT-onejar.jar -gH grobid-home -dIn ~/test/in0/ -dOut ~/test/out0 -exe processFullText 

Screenshot from 2020-03-09 21-09-50

There is also a web service that does the same and returns everything in one big zip file, processFulltextAssetDocument. It is still usable but deprecated.

But if you are simply interested in raw text and images, you would be better off using the command line with https://github.com/kermitt2/pdfalto

@lfoppiano lfoppiano added the question There's no such thing as a stupid question label Apr 6, 2020
@jorgeveamurguia
Author

Hello @kermitt2
I want to extract both images and text, and I want the text to be related to the images.
The position of the text and the position of each image within the text are important to me.
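A minimal sketch of how text could be related to images by position, assuming the extraction produces ALTO XML in which `TextBlock` and `Illustration` elements carry the standard `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The sample document is hand-written for illustration and may differ from what pdfalto actually emits; the helper names and the nearest-block heuristic are assumptions, not part of Grobid.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO sample (hypothetical), not real pdfalto output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="Page1" WIDTH="595" HEIGHT="842">
      <PrintSpace>
        <TextBlock ID="b1" HPOS="72" VPOS="100" WIDTH="450" HEIGHT="40">
          <TextLine><String CONTENT="Introduction"/><String CONTENT="text"/></TextLine>
        </TextBlock>
        <Illustration ID="i1" HPOS="72" VPOS="160" WIDTH="300" HEIGHT="200" FILEID="image-1.png"/>
        <TextBlock ID="b2" HPOS="72" VPOS="370" WIDTH="450" HEIGHT="20">
          <TextLine><String CONTENT="Figure"/><String CONTENT="1"/></TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local(tag: str) -> str:
    """Return the element name without its XML namespace."""
    return tag.rsplit("}", 1)[-1]


def box(el):
    """Return (left, top, right, bottom) from the ALTO position attributes."""
    x, y = float(el.get("HPOS")), float(el.get("VPOS"))
    return x, y, x + float(el.get("WIDTH")), y + float(el.get("HEIGHT"))


def vertical_gap(a, b) -> float:
    """Vertical distance between two boxes; 0 if they overlap vertically."""
    return max(b[1] - a[3], a[1] - b[3], 0.0)


def nearest_text(alto_xml: str) -> dict:
    """Map each Illustration ID to the text of the vertically closest TextBlock on its page."""
    root = ET.fromstring(alto_xml)
    result = {}
    for page in (e for e in root.iter() if local(e.tag) == "Page"):
        blocks = [e for e in page.iter() if local(e.tag) == "TextBlock"]
        if not blocks:
            continue
        for ill in (e for e in page.iter() if local(e.tag) == "Illustration"):
            closest = min(blocks, key=lambda b: vertical_gap(box(ill), box(b)))
            words = [s.get("CONTENT") for s in closest.iter() if local(s.tag) == "String"]
            result[ill.get("ID")] = " ".join(words)
    return result


print(nearest_text(SAMPLE_ALTO))  # {'i1': 'Figure 1'}
```

The vertical-gap heuristic is deliberately simple. A real layout with multiple columns would likely need horizontal overlap checks as well.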

@lfoppiano
Collaborator

I'm closing this, as it has been documented here

@lfoppiano lfoppiano closed this Aug 12, 2020
4 participants