PDF to text

Converts a PDF to text and TextPosition data using pdfjs-dist.

Example.pdf outputs:

[
    {
        "page": 1,
        "text": "This is page 1",
        "TextPosition":
        [
            {
                "height": 35,
                "str": "This is page 1",
                "transform":
                [
                    35,
                    0,
                    0,
                    35,
                    112.7861,
                    76.5766
                ],
                "width": 219.83
            }
        ]
    },
    {
        "page": 2,
        "text": "This is page 2",
        "TextPosition":
        [
            {
                "height": 35,
                "str": "This is page 2",
                "transform":
                [
                    35,
                    0,
                    0,
                    35,
                    116.4326,
                    76.5766
                ],
                "width": 219.83
            }
        ]
    },
    {
        "page": 3,
        "text": "This is page 3",
        "TextPosition":
        [
            {
                "height": 35,
                "str": "This is page 3",
                "transform":
                [
                    35,
                    0,
                    0,
                    35,
                    114.7783,
                    76.5766
                ],
                "width": 219.83
            }
        ]
    }
]
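Each TextPosition entry has the shape of a PDF.js text item, where `transform` is a six-element affine matrix `[a, b, c, d, e, f]` and the last two values give the text origin in PDF coordinates. The following is a minimal sketch under that assumption; `textOrigin` is a hypothetical helper and not part of app.js.

```javascript
// Hypothetical helper: read the origin and scale from a TextPosition entry.
// Assumes `transform` is a PDF-style affine matrix [a, b, c, d, e, f].
// scaleX/scaleY are only meaningful for unrotated text (b and c equal to 0).
function textOrigin(item) {
  const [a, , , d, e, f] = item.transform;
  return { x: e, y: f, scaleX: a, scaleY: d };
}

// Entry copied from the Example.pdf output for page 1.
const sample = {
  height: 35,
  str: "This is page 1",
  transform: [35, 0, 0, 35, 112.7861, 76.5766],
  width: 219.83,
};

console.log(textOrigin(sample));
```

PDF coordinates usually start at the bottom-left corner of the page, so `y` may need flipping before it is used for on-screen layout.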

Description

PDF files can become extremely complex: there are different versions and formats, and sometimes a large number of nested elements.

To circumvent this problem, JavaScript from a browser's PDF viewer is used to extract the data.
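The example output resembles what PDF.js returns from `page.getTextContent()`: an object with an `items` array whose entries carry `str`, `transform`, `width`, and `height`. The following is a minimal sketch of turning such an object into one page record, assuming that shape. The name `toPageRecord` and the choice to join strings without a separator are assumptions, not taken from app.js.

```javascript
// Hypothetical helper: build one output record from a PDF.js-style
// text content object ({ items: [...] }).
function toPageRecord(pageNumber, textContent) {
  // Keep only the four fields that appear in the example output.
  const positions = textContent.items.map(({ str, transform, width, height }) => ({
    height,
    str,
    transform,
    width,
  }));
  return {
    page: pageNumber,
    // Assumption: the page text is the item strings joined with no separator.
    text: positions.map((p) => p.str).join(""),
    TextPosition: positions,
  };
}

// Stand-in for the result of page.getTextContent(); extra fields are dropped.
const fakeContent = {
  items: [
    {
      str: "This is page 1",
      transform: [35, 0, 0, 35, 112.7861, 76.5766],
      width: 219.83,
      height: 35,
      fontName: "example-font",
    },
  ],
};

console.log(JSON.stringify(toPageRecord(1, fakeContent), null, 4));
```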

Getting Started

See: Installing

Installing

Clone this repository:

git clone git@github.com:zimonh/pdf-to-text.git

Install npm packages:

npm install

Executing program

First run in debug mode. Find a suitable PDF and run:

node app.js file='https://anywhere.com/book-article-or-whatever.pdf' debug=true

Now remove the debug option:

node app.js file='https://anywhere.com/book-article-or-whatever.pdf'

You can also return a specific page with the 'page' option:

node app.js file='https://anywhere.com/book-article-or-whatever.pdf' page=42

You can also access local files:

node app.js file='File:///Users/Me/Desktop/book-article-or-whatever.pdf'
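The commands above pass options as `key=value` arguments. The following is a minimal sketch of how such arguments could be parsed in Node.js; `parseArgs` is a hypothetical function and does not reflect the actual logic in app.js.

```javascript
// Hypothetical parser for `key=value` command-line arguments.
function parseArgs(argv) {
  const options = {};
  for (const arg of argv) {
    // Split on the first "=" only, so URLs containing "=" stay intact.
    const idx = arg.indexOf("=");
    if (idx === -1) continue; // ignore anything that is not key=value
    options[arg.slice(0, idx)] = arg.slice(idx + 1);
  }
  return {
    file: options.file,
    debug: options.debug === "true",
    page: options.page === undefined ? undefined : Number(options.page),
  };
}

// In a real script the input would be process.argv.slice(2).
const parsed = parseArgs([
  "file=https://anywhere.com/book-article-or-whatever.pdf",
  "page=42",
]);
console.log(parsed);
```

Note that the shell removes the single quotes around the URL before Node.js sees the argument, so the parser does not need to strip them.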

Known issues

Your file does not exist. (Try copying the path from your browser.)
undefined = Your page does not exist. View the PDF in a browser and check the page number carefully; sometimes multiple page numbers are visible on screen.
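When post-processing the JSON output, a missing page can be reported explicitly rather than surfacing as `undefined`. The following is a minimal sketch, assuming the output is an array of records with a `page` field as in the Example.pdf output; `findPage` is a hypothetical helper.

```javascript
// Hypothetical lookup over the tool's JSON output.
function findPage(pages, pageNumber) {
  const match = pages.find((p) => p.page === pageNumber);
  if (!match) {
    const available = pages.map((p) => p.page).join(", ");
    throw new RangeError(
      `Page ${pageNumber} does not exist. Available pages: ${available}`
    );
  }
  return match;
}

// Trimmed-down records with the same shape as the Example.pdf output.
const pages = [
  { page: 1, text: "This is page 1" },
  { page: 2, text: "This is page 2" },
  { page: 3, text: "This is page 3" },
];

console.log(findPage(pages, 2).text);
```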

Special thanks

Nickmanbear

Authors

ZIMONH www.zimonh.at
