how to improve PDF to HTML conversion #10
Comments
Thanks very much Simon!
On Sat, Sep 24, 2022 at 2:18 PM Simon Worthington ***@***.***> wrote:
Currently the Semantic Climate project converts PDFs to HTML.
The output needs improvement. Currently it contains a number of elements
which may not be needed, e.g., page numbers, inline styles, etc.
`py4ami` code does remove these (by geometry) but yesterday it stopped
working. This may be due to the `pdfminer` libraries, which I *think* are
not loaded in `Colab` in the same way as on my machine. The key things that any
PDF tool must do (maybe include Phillip) are:
* preserve character entities
* preserve styles (mainly font family)
* preserve fill-color (often the only indication of bold)
* preserve coordinates (critical for sub/superscripts and math)
* preserve image-coordinates
(Very few out-of-the-box systems do all of these)
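To make the list above concrete, here is a minimal sketch of geometry-based filtering. It is not the `py4ami` implementation and does not call `pdfminer`. The `Span` record, the function names, and the margin and lift thresholds are all hypothetical, chosen only to show why coordinates, font and fill colour have to survive extraction.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical record for one run of text (not a pdfminer type)."""
    text: str
    x0: float
    y0: float   # bottom edge in PDF points, origin at bottom-left
    x1: float
    y1: float   # top edge
    font: str   # font family, needed to recover styles
    fill: str   # fill colour, sometimes the only hint of emphasis


def drop_margin_spans(spans, page_height, margin=50.0):
    """Keep spans lying wholly inside the body area.

    Running heads and page numbers usually sit in the top or bottom
    margin, so they can be removed by position alone. The 50 pt margin
    is an assumed value and would need tuning per document.
    """
    return [s for s in spans
            if s.y0 >= margin and s.y1 <= page_height - margin]


def is_superscript(span, baseline_y0, body_height, min_lift=2.0):
    """Treat a span as superscript if it is smaller than the body text
    and its bottom edge is lifted above the line's baseline."""
    smaller = (span.y1 - span.y0) < body_height
    lifted = (span.y0 - baseline_y0) >= min_lift
    return smaller and lifted


# Illustrative data only: an A4-sized page (842 pt high).
page = [
    Span("8", 290, 20, 300, 30, "Arial", "#000000"),        # page number
    Span("emissions of CO", 72, 400, 200, 412, "Arial", "#000000"),
    Span("2", 200, 405, 204, 412, "Arial", "#000000"),      # raised, smaller
]
body = drop_margin_spans(page, page_height=842)
```

Without the coordinates, the page number and the raised "2" would be indistinguishable from ordinary body text, which is why the list treats them as critical.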
pdfminer is hacky - it's partially abandonware. pdfplumber is layered on
top. The colour is not transmitted well. I used to use PDFBox (Java) which
is much better. But that's hard for an interactive session. It needs a
person-month from someone who understands this.
The objective would be to improve the output with tooling that can
integrate with the current workflow.
The suggestion would be to create a way to evaluate the process by
collating information on the issue:
1. Current tooling
2. Condition of the source PDFs
3. Problems with outputs
4. List of parts and markup that we need to retain their integrity
5. Define what we want in our target outputs
6. Do we want other output formats for richer markup and other
interoperability?
7. List and evaluate tools
8. Consult experts in the field: pandoc, le-tex, fidus, vivlio,
css-rocks, etc
This research can be conducted in a wiki page on the Semantic Climate
repository.
At the moment the only person who understands the whole PDF process is me.
I am not aware of others who I can collaborate with. So I'd dearly love a
volunteer.
… Here are sample files:
PDF source - Chapter 8
https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.pdf
HTML full text - Chapter 8
https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.html
Tasks
1. Link to current tool
2. Consult Single Source Publishing Community
https://github.com/singlesourcepub/community/discussions and others
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Vivlio is now mainstream for our activities.
Currently the Semantic Climate project converts PDFs to HTML.
The content is the IPCC Climate report AR6, and we need to improve its markup for further semantic annotation, reuse, and presentation. From a typesetting perspective, and to free us from destructive reliance on PDF (note that we can get PDF-like results in a non-destructive way using Vivliostyle), I (@mrchristian) would like to produce HTML that renders in Vivliostyle better than this.
The output needs improvement. Currently it contains a number of elements which may not be needed, e.g., page numbers, inline styles, etc.
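As one possible post-processing step, the sketch below removes inline `style` attributes from converted HTML while passing character entities through untouched. It uses only Python's standard `html.parser`. The class and function names are hypothetical, and whether styles should be dropped or kept (for example to preserve font family) is still an open question in this thread.

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit HTML without style attributes (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=False so entities such as &#8322; are
        # reported separately and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if name == "style":
                continue  # drop inline styling only
            if value is None:
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        return "".join(" " + p for p in parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_styles(html_text):
    """Return html_text with every style attribute removed."""
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = ('<p style="font-family: Arial">'
          '<span class="s1" style="color:#000">CO&#8322; &amp; CH4</span></p>')
cleaned = strip_styles(sample)
```

This sketch ignores comments, doctype declarations and processing instructions, so it suits fragments better than whole documents. A variant could map each distinct style string to a CSS class instead of discarding it.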
The objective would be to improve the output with tooling that can integrate with the current workflow.
The suggestion would be to create a way to evaluate the process by collating information on the issue:
This research can be conducted in a wiki page on the Semantic Climate repository.
Here are sample files:
PDF source - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.pdf
HTML full text - Chapter 8 https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter08/fulltext.html
Tasks