Inquiry for using the PDFScraper #7
-
Hi erikkastelec, I am currently doing technical research and need to use this software for the data extraction steps. Please let me know if you have time for a call via Skype, Slack, or Discord.
Replies: 2 comments 1 reply
-
Hi, I am not available for one-on-one consultations about the project, but I will be more than happy to answer any specific questions you have about it. P.S. Sorry for the late reply.
-
Hi,
Thanks for your reply.
I want to crawl data from websites and PDF files and export it to Excel format for analysis.
I have tried Octoparse, a web scraper from Google, the Google Vision API, and also Python.
Do you have any suggestions, such as Python or R libraries that run on both Mac and Windows?
P.S. It's okay. I know not everyone keeps track of GitHub.
On Tuesday, February 22, 2022, Erik Kastelec wrote:
Hi,
I am not available for one on one consultations about the project, but I will be more than happy to answer any specific questions you have about it.
If you tell me more about what kind of data you want to extract and in what format it will be used after extraction, I can point you in the right direction.
P.S. Sorry for the late reply
For extracting text data from PDFs I used pdfminer.six. If you need to extract data from tables, then camelot would be a better choice.
Both of the libraries are well documented (I only have Slovene documentation for mine), but you can still take a look at how I used them in my library.
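To make the suggestion above concrete, here is a minimal Python sketch, not how PDFScraper itself does it. It assumes `pdfminer.six` and `camelot-py` are installed; the helper names and the file names in the comments are hypothetical. It writes a CSV file, which Excel opens directly, rather than a native `.xlsx` file.

```python
import csv


def clean_lines(text):
    """Strip whitespace and drop blank lines from extracted text."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def extract_text_lines(pdf_path):
    """Return the non-empty text lines of a PDF that has a text layer."""
    # Imported lazily so the pure helpers work without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return clean_lines(extract_text(pdf_path))


def extract_tables(pdf_path, pages="1"):
    """Return each table camelot detects on the given pages as a DataFrame."""
    import camelot  # installed as the camelot-py package

    return [table.df for table in camelot.read_pdf(pdf_path, pages=pages)]


def write_lines_csv(lines, csv_path):
    """Write one text line per row to a CSV file that Excel can open."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["line_number", "text"])
        writer.writerows(enumerate(lines, start=1))


# Hypothetical usage, assuming a file named report.pdf exists:
#   write_lines_csv(extract_text_lines("report.pdf"), "report_text.csv")
#   for i, frame in enumerate(extract_tables("report.pdf", pages="1-3")):
#       frame.to_csv(f"report_table_{i}.csv", index=False)
```

Both libraries are pure Python packages, so the same script should run on Mac and Windows, although camelot may need extra system dependencies depending on the version.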
If you have PDF documents that are in "image" form (you can't copy and paste from them), then I suggest you use Tesseract to convert them into editable PDF format.
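As a rough sketch of that OCR step, the Tesseract command-line tool can be called from Python. This assumes the `tesseract` executable is installed and on your PATH, and that each scanned page has already been exported as an image, since Tesseract reads images rather than PDFs. The function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the CLI call: tesseract <image> <output base> -l <lang> pdf."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_image_to_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends the .pdf extension to the output base itself.
    return f"{output_base}.pdf"


# Hypothetical usage, assuming page1.png is a scanned page:
#   searchable_pdf = ocr_image_to_pdf("page1.png", "page1")
```

The resulting PDF has a text layer, so it can then be passed to a text extractor such as pdfminer.six.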