Inquiry for using the PDFScraper #7

summywong-developer · 2022-02-17T09:04:26Z

summywong-developer
Feb 17, 2022

Hi erikkastelec,
Could you please teach me how to use PDFScraper on Mac?

I am currently involved in technical research and need this software for the data extraction steps. Please let me know if you have time for a call via Skype, Slack, or Discord.


erikkastelec · 2022-02-22T05:09:43Z

erikkastelec
Feb 22, 2022
Maintainer

Hi,

I am not available for one-on-one consultations about the project, but I will be more than happy to answer any specific questions you have about it.
If you tell me more about what kind of data you want to extract and in what format it will be used after extraction, I can point you in the right direction.

P.S. Sorry for the late reply

0 replies

summywong-developer · 2022-02-22T06:27:23Z

summywong-developer
Feb 22, 2022
Author

Hi, thanks for your reply. I want to crawl data from websites and PDF files and export it to Excel format for analysis. I have tried Octoparse, Web Scraper from Google, the Google Vision API, and also Python. Do you have any suggestions, such as Python or R libraries that run on both Mac and Windows?

P.S. It's okay. I know not everyone keeps track of GitHub.


1 reply

erikkastelec Feb 22, 2022
Maintainer

For extraction of text data from the PDF I used pdfminer.six. If you need to extract data from tables, then camelot would be a better choice.

Both of the libraries are well documented (I only have Slovene documentation for mine), but you can still take a look at how I used them in my library.

If you have PDF documents that are in "image" form (you can't copy and paste from them), then I suggest you use Tesseract to convert them into an editable (text-based) PDF format.
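
As a rough illustration of the two libraries mentioned above, here is a minimal sketch rather than the way PDFScraper itself is written. It assumes pdfminer.six and camelot are installed; the helper names (`extract_text`, `extract_tables`, `text_to_rows`, `write_csv`) are made up for this example, and the CSV output is just one simple way to get the data into Excel.

```python
import csv


def extract_text(pdf_path):
    """Return all text from a text-based PDF using pdfminer.six."""
    # Imported lazily so the other helpers work without pdfminer.six installed.
    from pdfminer.high_level import extract_text as pdfminer_extract_text

    return pdfminer_extract_text(str(pdf_path))


def extract_tables(pdf_path, pages="1"):
    """Return the tables camelot finds on the given pages as pandas DataFrames."""
    # Imported lazily for the same reason as above.
    import camelot

    tables = camelot.read_pdf(str(pdf_path), pages=pages)
    return [table.df for table in tables]


def text_to_rows(text):
    """Turn extracted text into one single-cell row per non-empty line."""
    return [[line.strip()] for line in text.splitlines() if line.strip()]


def write_csv(rows, out_path):
    """Write rows to a CSV file, which Excel can open directly."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

With a hypothetical file `report.pdf`, `write_csv(text_to_rows(extract_text("report.pdf")), "report.csv")` would dump the text line by line, while each DataFrame returned by `extract_tables("report.pdf")` can be saved with its own `to_csv` method. Neither helper will find anything in a scanned, image-only PDF, which is where the OCR step comes in.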

Answer selected by erikkastelec
