PDF to text/HTML using the selenium-webdriver.

Superscanner/pdf-parser

PDF to text

Extracts text and TextPosition data from a PDF using pdf-dist.

Example.pdf outputs:

[
    {
        "page": 1,
        "text": "This is page 1",
        "TextPosition":
        [
            {
                "height": 35,
                "str": "This is page 1",
                "transform":
                [
                    35,
                    0,
                    0,
                    35,
                    112.7861,
                    76.5766
                ],
                "width": 219.83
            }
        ]
    },
    {
        "page": 2,
        "text": "This is page 2",
        "TextPosition":
        [
            {
                "height": 35,
                "str": "This is page 2",
                "transform":
                [
                    35,
                    0,
                    0,
                    35,
                    116.4326,
                    76.5766
                ],
                "width": 219.83
            }
        ]
    },
    {
        "page": 3,
        "text": "This is page 3",
        "TextPosition":
        [
            {
                "height": 35,
                "str": "This is page 3",
                "transform":
                [
                    35,
                    0,
                    0,
                    35,
                    114.7783,
                    76.5766
                ],
                "width": 219.83
            }
        ]
    }
]
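The output above can be post-processed with a few lines of JavaScript. The following is a minimal sketch, not part of this repository: `flattenTextPositions` is a hypothetical helper name, and it assumes that `transform` is a six-number matrix whose last two entries are the x and y position of the string, as is the case for PDF.js text items.

```javascript
// Hypothetical helper: flatten the per-page output into one list of positioned strings.
// Assumption: transform = [a, b, c, d, x, y], where x and y are the translation
// (position) of the text item on the page.
function flattenTextPositions(pages) {
  const items = [];
  for (const page of pages) {
    // Pages without a TextPosition array are treated as empty.
    for (const item of page.TextPosition || []) {
      const x = item.transform[4];
      const y = item.transform[5];
      items.push({
        page: page.page,
        str: item.str,
        x,
        y,
        width: item.width,
        height: item.height,
      });
    }
  }
  return items;
}

// First page of the Example.pdf output shown above.
const exampleOutput = [
  {
    page: 1,
    text: 'This is page 1',
    TextPosition: [
      {
        height: 35,
        str: 'This is page 1',
        transform: [35, 0, 0, 35, 112.7861, 76.5766],
        width: 219.83,
      },
    ],
  },
];

console.log(flattenTextPositions(exampleOutput));
```

A flat list like this is easier to sort or filter by position than the nested per-page structure.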

Description

PDF files can become extremely complex: there are different versions and formats, and sometimes a large number of nested elements.

To circumvent this problem, the JavaScript from a browser's PDF viewer is used to extract the data.

Getting Started

  • See: Installing

Installing

Clone this repository:

git clone git@github.com:zimonh/pdf-to-text.git

Install npm packages:

npm install

Executing program

First run in debug mode. Find a suitable PDF and run:

node app.js file='https://anywhere.com/book-article-or-whatever.pdf' debug=true

Now remove the debug option:

node app.js file='https://anywhere.com/book-article-or-whatever.pdf'

You can also return a specific page with the 'page' option:

node app.js file='https://anywhere.com/book-article-or-whatever.pdf' page=42

You can also access local files:

node app.js file='File:///Users/Me/Desktop/book-article-or-whatever.pdf'
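The commands above pass options as key=value pairs. How app.js reads them is not shown here, so the following is only a sketch of one way such arguments could be parsed. `parseArgs` is a hypothetical name, and the type conversions for `debug` and `page` are assumptions.

```javascript
// Hypothetical helper: turn arguments such as
//   file=https://anywhere.com/doc.pdf debug=true page=42
// into an options object. This is not taken from app.js.
function parseArgs(argv) {
  const raw = {};
  for (const arg of argv) {
    const eq = arg.indexOf('=');
    if (eq === -1) continue; // skip anything that is not key=value
    const key = arg.slice(0, eq);
    // Split on the first '=' only, so URLs with query strings stay intact.
    raw[key] = arg.slice(eq + 1);
  }
  return {
    file: raw.file,
    debug: raw.debug === 'true',
    page: raw.page === undefined ? undefined : Number(raw.page),
  };
}

// process.argv.slice(2) drops the "node" and script path entries.
console.log(parseArgs(process.argv.slice(2)));
```

The shell removes the single quotes around the file value before Node sees it, so the parser receives `file=https://...` as one argument.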

Known issues

  • Your file does not exist. (Try copying the path from your browser's address bar.)
  • An output of undefined means the requested page does not exist. View the PDF in a browser and check the page number carefully; sometimes several page numbers are visible on screen at once.
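For the first issue, a local path can be checked and converted to a file URL before it is passed to the file option. This is a sketch using Node's built-in `fs`, `path` and `url` modules; `localPdfUrl` is a hypothetical helper and not part of this repository.

```javascript
const fs = require('fs');
const path = require('path');
const { pathToFileURL } = require('url');

// Hypothetical helper: build a file:// URL for a local PDF,
// failing early with a readable message if the path does not exist.
function localPdfUrl(localPath) {
  const absolute = path.resolve(localPath);
  if (!fs.existsSync(absolute)) {
    throw new Error(`File not found: ${absolute}`);
  }
  // pathToFileURL percent-encodes spaces and other special characters.
  return pathToFileURL(absolute).href;
}

// Example call (the path is a placeholder):
// localPdfUrl('/Users/Me/Desktop/book-article-or-whatever.pdf')
```

The returned string can then be used as the value of the file option.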

Special thanks

Nickmanbear

Authors

ZIMONH www.zimonh.at
