how to improve PDF to HTML conversion #10

mrchristian · 2022-09-24T13:18:43Z

Currently the Semantic Climate project converts PDFs to HTML.

The content is the IPCC Climate report AR6, and we need to improve its markup for further semantic annotation, reuse, and presentation. From a typesetting perspective, and to free us from a destructive reliance on PDF (note that we can get PDF-like results in a non-destructive way using Vivliostyle), I (@mrchristian) would like to produce HTML that renders better in Vivliostyle than the current output.

The output needs improvement. Currently it contains a number of elements which may not be needed, e.g., page numbers, inline styles, etc.
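As an illustration of the kind of cleanup meant here, the following is a minimal sketch that removes inline `style` attributes from HTML using only the Python standard library. It is not the project's current tooling; the names `StyleStripper` and `strip_inline_styles` are hypothetical. Character entities are passed through unchanged, and page-number removal is not attempted because it depends on how the converter marks those elements.

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit HTML with every inline `style` attribute removed (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _attrs(self, attrs):
        kept = []
        for name, value in attrs:
            if name == "style":
                continue  # drop inline styling
            if value is None:
                kept.append(f" {name}")
            else:
                # The parser unescapes attribute values, so escape them again.
                kept.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(kept)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_inline_styles(html_text):
    """Return html_text with all `style` attributes removed."""
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(strip_inline_styles('<p style="color:red" class="a">x &amp; y</p>'))
```

A pass like this only handles styling noise; deciding which styles carry meaning (for example, bold indicated by colour) would need to happen before they are stripped.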

The objective would be to improve the output with tooling that can integrate with the current workflow.

The suggestion would be to create a way to evaluate the process by collating information on the issue:

1. Current tooling
2. Condition of the source PDFs
3. Problems with outputs
4. List of parts and markup that we need to retain their integrity
5. Define what we want in our target outputs
6. Do we want other output formats for richer markup and other interoperability?
7. List and evaluate tools
8. Consult experts in the field: pandoc, le-tex, fidus, vivlio, css-rocks, etc.

This research can be conducted in a wiki page on the Semantic Climate repository.

Here are sample files:

PDF source - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.pdf

HTML full text - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.html

Tasks

1. Link to current PDF to HTML tooling.
2. Consult Single Source Publishing Community https://github.com/singlesourcepub/community/discussions and others: le-tex, pandoc, css rocks?

petermr · 2022-09-24T13:31:33Z

Thanks very much Simon!

On Sat, Sep 24, 2022 at 2:18 PM Simon Worthington wrote:

> Currently the Semantic Climate project converts PDFs to HTML. The output needs improvement. Currently it contains a number of elements which may not be needed, e.g., page numbers, inline styles, etc.

`py4ami` code does remove these (by geometry), but yesterday it stopped working. This may be due to the `pdfminer` libraries, which I *think* are not loaded in `Colab` in the same way as on my machine.

The key things that any PDF tool must do (maybe include Phillip) are:

* preserve character entities
* preserve styles (mainly font family)
* preserve fill-color (often the only indication of bold)
* preserve coordinates (critical for sub/superscripts and math)
* preserve image-coordinates

(Very few out-of-the-box systems do all of these.)

pdfminer is hacky; it's partially abandonware. pdfplumber is layered on top. The colour is not transmitted well. I used to use PDFBox (Java), which is much better, but that's hard for an interactive session. It needs a person-month from someone who understands this.
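To show why preserved coordinates matter for sub/superscripts, here is a minimal sketch that rebuilds `<sub>`/`<sup>` markup for one text line from per-character positions and font sizes. It is not how `py4ami` works; the `Glyph` record, the function names, and the thresholds (`0.9` and `tol=0.15`) are assumptions chosen for illustration. In practice the records would be filled from whatever PDF library is used.

```python
from dataclasses import dataclass
from html import escape
from itertools import groupby
from statistics import median


@dataclass
class Glyph:
    """Hypothetical per-character record extracted from a PDF."""
    text: str
    x0: float    # left edge of the glyph box
    y0: float    # bottom edge, PDF coordinates (y grows upward)
    size: float  # font size in points


def classify_script(glyph, baseline_y, base_size, tol=0.15):
    """Return 'sup', 'sub' or 'normal' from vertical shift and size (assumed thresholds)."""
    shift = glyph.y0 - baseline_y
    smaller = glyph.size < 0.9 * base_size
    if smaller and shift > tol * base_size:
        return "sup"
    if smaller and shift < -tol * base_size:
        return "sub"
    return "normal"


def line_to_html(glyphs):
    """Convert the glyphs of one text line into HTML with <sub>/<sup> runs."""
    glyphs = sorted(glyphs, key=lambda g: g.x0)
    # Treat the largest font size on the line as body text and
    # the median bottom edge of those glyphs as the baseline.
    base_size = max(g.size for g in glyphs)
    baseline_y = median(g.y0 for g in glyphs if g.size == base_size)
    parts = []
    for kind, run in groupby(
        glyphs, key=lambda g: classify_script(g, baseline_y, base_size)
    ):
        text = escape("".join(g.text for g in run))
        parts.append(text if kind == "normal" else f"<{kind}>{text}</{kind}>")
    return "".join(parts)


if __name__ == "__main__":
    co2 = [Glyph("C", 0, 100, 10), Glyph("O", 7, 100, 10), Glyph("2", 14, 98, 6.5)]
    print(line_to_html(co2))
```

Without the `y0` and `size` values, the "2" in this example is indistinguishable from body text, which is why a converter that discards coordinates cannot recover chemical formulae or exponents.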

At the moment the only person who understands the whole PDF process is me. I am not aware of others who I can collaborate with. So I'd dearly love a volunteer.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr · 2024-01-31T14:30:43Z

Vivlio is now mainstream for our activities.

petermr closed this as completed Jan 31, 2024
