how to improve PDF to HTML conversion #10

Closed
mrchristian opened this issue Sep 24, 2022 · 2 comments

@mrchristian
Collaborator

mrchristian commented Sep 24, 2022

Currently the Semantic Climate project converts PDFs to HTML.

The content is the IPCC AR6 climate report, and we need to improve its markup for further semantic annotation, reuse, and presentation. From a typesetting perspective, this would also free us from a destructive reliance on PDF (note that we can get PDF-like results in a non-destructive way using Vivliostyle). Speaking for myself (@mrchristian), I would like to produce HTML that renders better in Vivliostyle than the current output does.

The output needs improvement. It currently contains a number of elements that may not be needed, e.g., page numbers and inline styles.
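As a rough illustration of the kind of cleanup meant here, the following is a minimal sketch that removes inline `style` attributes from converted HTML using only the Python standard library. It is not part of the project's current tooling; the names `StyleStripper` and `strip_inline_styles` are hypothetical. Page numbers are not handled, since removing them would need a rule specific to how the converter marks them up.

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit parsed HTML while dropping inline `style` attributes.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if name == "style":
                continue  # the one thing this sketch removes
            if value is None:
                parts.append(f" {name}")  # boolean attribute, e.g. `hidden`
            else:
                # The parser unescapes attribute values, so escape them again.
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_inline_styles(html):
    """Return `html` with every inline `style` attribute removed."""
    parser = StyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `strip_inline_styles('<p style="color:red" class="a">Hi</p>')` keeps the `class` attribute and the text but drops the `style` attribute. A step like this could run after the existing PDF to HTML conversion without changing the converter itself.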

The objective would be to improve the output with tooling that can integrate with the current workflow.

The suggestion would be to create a way to evaluate the process by collating information on the issue:

  1. Current tooling
  2. Condition of the source PDFs
  3. Problems with outputs
  4. List of parts and markup that we need to retain their integrity
  5. Define what we want in our target outputs
  6. Do we want other output formats for richer markup and other interoperability?
  7. List and evaluate tools
  8. Consult experts in the field: pandoc, le-tex, fidus, vivlio, css-rocks, etc

This research can be conducted in a wiki page on the Semantic Climate repository.

Here are sample files:

PDF source - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.pdf

HTML full text - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.html

Tasks

  1. Link to current PDF to HTML tooling.
  2. Consult Single Source Publishing Community https://github.com/singlesourcepub/community/discussions and others: le-tex, pandoc, css rocks?
@petermr
Owner

petermr commented Sep 24, 2022 via email

@petermr petermr closed this as completed Jan 31, 2024
@petermr
Owner

petermr commented Jan 31, 2024

Vivlio is now mainstream for our activities.
