Basic conversion from PDF to TEI that tries to guess the structure of a text. Postprocessing required!
There are three basic ways to use this package. If you use oXygen, you can use the transformation scenario defined in the oXygen project (see below). Alternatively, you can use the Ant task defined in build.xml (see further below) or, as a last option, run the steps manually.
A general scenario is defined in pdf2tei.xpr. You may need to adjust the parameters, especially 'saxon' which contains the path to a JAR of the Saxon XSLT processor (e.g. saxon-he-10.5.jar, as is used in the example).
You can use a JAR from the oXygen directories, but not one of the patched ones such as oxygen-patched-saxon-9.jar (or similar). Alternatively, you can get the latest version from Saxonica (see https://www.saxonica.com/download/download_page.xml for a complete selection of the available editions) or the current version of the Home Edition directly from SourceForge (see https://sourceforge.net/projects/saxon/files/Saxon-HE/10/Java/ for the current line of Saxon 10).
With ant available on your path, you can call ant directly to run the predefined workflow in build.xml. You need to set the following parameters to the values for your situation:
name
: the base name to be used for the resulting TEI file and the directory below outDir

outDir
: path to the directory where the output is to be stored

pdf
: path to the PDF file to be processed

saxon
: path to a Saxon .jar (see the remarks in the previous section)
Example:
ant -Dname=pdftei -DoutDir=../output -Dpdf=../incoming/pdf-to-tei.pdf -Dsaxon=saxon-he-10.5.jar
- use pdftohtml -xml file.pdf to create a basic XML file
- apply pt1.xsl to pt4.xsl sequentially
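The manual steps can be chained in a small shell function. This is a minimal sketch and not part of the package. It assumes that pdftohtml and java are on your path, that pt1.xsl to pt4.xsl are in the current directory and need no parameters, and that each stylesheet reads the output of the previous one. The function name manual_convert and the intermediate file names (step1.xml and so on) are made up for illustration.

```shell
#!/bin/sh
# Sketch only: chain pdftohtml and the four stylesheets by hand.
# Assumptions (not confirmed by the package): pt1.xsl to pt4.xsl sit in the
# current directory, need no parameters, and each one reads the output
# of the previous step.

manual_convert() {
    pdf=$1      # path to the PDF, e.g. file.pdf
    saxon=$2    # path to a Saxon JAR, e.g. saxon-he-10.5.jar
    result=$3   # hypothetical name for the final file

    # pdftohtml is expected to write its XML next to the PDF
    # (file.pdf -> file.xml)
    pdftohtml -xml "$pdf" || return 1
    current="${pdf%.pdf}.xml"

    for n in 1 2 3 4; do
        next="step$n.xml"   # hypothetical intermediate file name
        java -jar "$saxon" -s:"$current" -xsl:"pt$n.xsl" -o:"$next" || return 1
        current=$next
    done

    mv "$current" "$result"
}
```

A call could look like `manual_convert file.pdf saxon-he-10.5.jar file-tei.xml`. The function stops at the first step that fails, so a broken intermediate file is not passed on to the next stylesheet.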
While these scripts try their best to guess a structure – headings, paragraphs – from the PDF, there are major limitations to this approach. Hence, the output is not valid TEI but must be postprocessed. We cannot, for instance, determine for certain whether a smaller passage is a footnote or a quotation without knowledge of the contents. Also, we can only assume that a page has a maximum of one line of heading and footer each. Pages with more than that will result in a wrong structure and possibly a column break.
To facilitate the postprocessing, values that were calculated during the transformation are retained in the result. This means that the dimensional attributes @left, @top, @size, @bottom, and @right are present on every line, and @height, @width, and @l (the most frequently used @left of all lines) on pb. Additionally, every tei:l consists of one or more tei:hi with layout information (most importantly @rendition, but also dimensional attributes).
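The retained attributes can help to spot candidate footnotes or quotations during postprocessing. As a rough sketch using only standard shell tools, the following counts how often each @size value occurs in a result file, so that rare, smaller sizes stand out. The function name size_histogram is made up, and the sketch assumes that attribute values are written with double quotes, as in size="12"; the actual value format may differ.

```shell
# Sketch: count how often each size="..." value occurs in a result file.
# The most frequent value is listed first and is probably the body text size.
# Assumes double-quoted attribute values; adjust the pattern if needed.
size_histogram() {
    grep -o ' size="[^"]*"' "$1" | sort | uniq -c | sort -rn
}
```

Called as `size_histogram result.xml`, this prints one line per distinct value with its count. For anything beyond a quick overview, an XPath query in oXygen or an additional XSLT step is likely the more robust choice.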
Some contributions to this software were created within the scope of a project funded by the German BMBF, project ID 16TOA015A.