
Add support for encoding detection when default encoding is not correct #956

Open

potaninmt opened this issue Dec 14, 2024 · 10 comments

Labels: document-reading (Related to reading documents), enhancement

Comments

@potaninmt

I noticed a bug when extracting text from PDFs that contain several different encodings. For example, one of the documents had to be converted from windows-1251 to windows-1252 for normal reading. Would it be possible to implement extraction so that the text comes out correctly despite the many different encodings inside the document? It is even possible for each token in a PDF to have its own encoding.

[image]

@BobLd
Collaborator

BobLd commented Dec 15, 2024

@potaninmt thanks for raising the issue, can you share a sample PDF? Thanks!

@potaninmt
Author

@BobLd Thanks, I'm attaching the file!
Example.pdf

@BobLd
Collaborator

BobLd commented Dec 15, 2024

@potaninmt Unfortunately, I believe this is not a PdfPig issue...

Firefox, Edge and Acrobat Reader are not able to copy the text properly (which is an indicator that the PDF document is not correctly built). In Firefox, copying the passage shown in the attached screenshot gives:

ñëîæíî ïîêàçàòü, ÷òî åñëè ðàññìîòðåòü áåñêîíå÷íóþ ïîëèíîìèàëüíóþ ñèñòåìó,
òî îíà ñîéäåòñÿ ê íåêîòîðîé êîíå÷

@BobLd
Collaborator

BobLd commented Dec 15, 2024

Your best chance of getting correct text extracted might be OCR. Or did you manage to play around with windows-1251 / windows-1252 to get the correct text after extraction?
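
For reference, a minimal sketch of that windows-1251 / windows-1252 round trip in C#. This is not PdfPig API; it assumes the extracted string really is windows-1251 byte values that were mis-decoded as windows-1252, as the sample above suggests:

```csharp
using System;
using System.Text;

class Win1251RoundTrip
{
    static void Main()
    {
        // On .NET Core / .NET 5+ the windows-125x code pages come from the
        // System.Text.Encoding.CodePages package and must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding win1252 = Encoding.GetEncoding("windows-1252");
        Encoding win1251 = Encoding.GetEncoding("windows-1251");

        string extracted = "ñëîæíî ïîêàçàòü";   // mojibake as copied from the PDF
        byte[] raw = win1252.GetBytes(extracted); // recover the original byte values
        string repaired = win1251.GetString(raw); // reinterpret them as windows-1251

        Console.WriteLine(repaired); // prints "сложно показать"
    }
}
```

Applied to the snippet above, this turns the garbled text back into readable Russian, but it only works if you already know which pair of encodings is involved.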

@potaninmt
Author

@BobLd
The thing is that there are a lot of broken PDFs like this on the internet, but there are existing solutions for fixing them; maybe it would be useful to implement one of them:
There is a service that can automatically fix this kind of encoding error: https://2cyr.com/decode/?lang=ru
There is also a project on GitHub that can automatically detect the encoding of a text file: https://github.com/yinyue200/ude?tab=readme-ov-file
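
For what it's worth, a minimal sketch of how such a detector might be wired up, assuming the fork keeps the upstream Ude API (namespace `Ude`, class `CharsetDetector` with `Feed`, `DataEnd` and `Charset`):

```csharp
using System;
using System.IO;

class DetectCharset
{
    static void Main(string[] args)
    {
        // Feed the detector the raw, undecoded bytes and read back its best guess.
        using (FileStream fs = File.OpenRead(args[0]))
        {
            var detector = new Ude.CharsetDetector();
            detector.Feed(fs);
            detector.DataEnd();

            if (detector.Charset != null)
                Console.WriteLine($"Detected {detector.Charset} (confidence {detector.Confidence})");
            else
                Console.WriteLine("Detection failed.");
        }
    }
}
```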

@potaninmt
Author

@BobLd
[image]

@BobLd
Collaborator

BobLd commented Dec 16, 2024

Hi @potaninmt, thanks a lot for pointing me to this library; I wasn't aware it even existed and it's extremely interesting.

I had a quick look, and the Ude library is based on the Mozilla Universal Charset Detector. All implementations of the MUCD I could find are under an MPL/GPL2/LGPL2 license, which is not really compatible with the PdfPig license.

One option would be to release a separate NuGet package under that license, keeping PdfPig's license untouched. It would also be possible to do our own implementation (trickier).

I'll have a look at the first option, hopefully in the short term. Happy for you to give any implementation advice.

Some references (mainly for me) I found on the topic:

@BobLd BobLd added enhancement document-reading Related to reading documents labels Dec 16, 2024
@BobLd BobLd changed the title Incorrect pdf reading with different encodings inside the document Add support for encoding detection when default encoding is not correct Dec 16, 2024
@Charltsing

https://github.com/CharsetDetector/UTF-unknown
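
This library exposes a similar detector behind static entry points. A minimal usage sketch, assuming the documented `UtfUnknown` API (`CharsetDetector.DetectFromFile` / `DetectFromBytes` / `DetectFromStream`):

```csharp
using System;
using UtfUnknown;

class DetectWithUtfUnknown
{
    static void Main(string[] args)
    {
        // The DetectionResult carries the most likely encoding plus a confidence score.
        DetectionResult result = CharsetDetector.DetectFromFile(args[0]);

        if (result.Detected != null)
            Console.WriteLine(
                $"Detected {result.Detected.EncodingName} (confidence {result.Detected.Confidence})");
    }
}
```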

@potaninmt
Author

@Charltsing Thank you!

@potaninmt
Author

@BobLd
Thank you for reading and responding! I think it would indeed be useful to correct the encodings in the PDF, considering that I've actually encountered quite a few problematic documents.
I don't have much experience with encodings, but I had the following algorithm idea:
Take a specific list of encodings:
UTF-8
Unicode
Windows-1251
Windows-1252
...
Then, by brute force (encode the string back into bytes with one encoding, then decode those bytes with another), pick the combination that yields the most plausible text. The number of combinations is small; for 4 encodings, for example, it is 16. A sketch of this idea is given below.
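
A hypothetical sketch of that brute-force idea (the class name and the plausibility heuristic below are made up for illustration; a real scorer would use character or word statistics, like the detectors linked above):

```csharp
using System;
using System.Linq;
using System.Text;

static class EncodingGuesser
{
    // Candidate list from the comment above ("Unicode" taken to mean UTF-16).
    static readonly string[] Candidates = { "utf-8", "utf-16", "windows-1251", "windows-1252" };

    // Placeholder heuristic: fraction of characters that look like ASCII or Cyrillic text.
    static double Plausibility(string text)
    {
        if (text.Length == 0) return 0;
        int ok = text.Count(c =>
            char.IsWhiteSpace(c) || char.IsDigit(c) ||
            (char.IsLetter(c) && (c < 0x80 || (c >= '\u0400' && c <= '\u04FF'))));
        return (double)ok / text.Length;
    }

    public static string BestGuess(string extracted)
    {
        // Needed on .NET Core / .NET 5+ for the windows-125x code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string best = extracted;
        double bestScore = Plausibility(extracted);

        foreach (string encodeAs in Candidates)
        foreach (string decodeAs in Candidates)
        {
            if (encodeAs == decodeAs) continue;

            // Turn the (possibly mis-decoded) string back into bytes with one encoding,
            // reinterpret those bytes with another, and keep the most plausible result.
            byte[] bytes = Encoding.GetEncoding(encodeAs).GetBytes(extracted);
            string candidate = Encoding.GetEncoding(decodeAs).GetString(bytes);

            double score = Plausibility(candidate);
            if (score > bestScore)
            {
                bestScore = score;
                best = candidate;
            }
        }

        return best;
    }
}
```

Something like `EncodingGuesser.BestGuess(pageText)` could then be run over suspicious extraction output, though in practice a proper detector will be far more reliable than this simple heuristic.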

Useful links:
https://en.wikipedia.org/wiki/Byte_order_mark
