
Add support for encoding detection when default encoding is not correct #956

Open

potaninmt opened this issue Dec 14, 2024 · 10 comments

Labels: document-reading (Related to reading documents), enhancement

Comments

@potaninmt

I noticed a bug when extracting text from PDFs that contain several different encodings. For example, one of the documents had to be converted from windows-1251 to windows-1252 for normal reading. Would it be possible to implement extraction so that the text comes out correctly despite the many different encodings inside the document? It is even possible for each token in a PDF to have its own encoding.

[image]

@BobLd
Collaborator

BobLd commented Dec 15, 2024

@potaninmt thanks for raising the issue, can you share a sample PDF? Thanks!

@potaninmt
Author

@BobLd Thanks, I'm attaching the file!
Example.pdf

@BobLd
Collaborator

BobLd commented Dec 15, 2024

@potaninmt Unfortunately, I believe this is not a PdfPig issue...

Firefox, Edge and Acrobat Reader are not able to copy the text properly (which is an indicator that the PDF document is not correctly built). In Firefox, copying the passage shown in the attached screenshot gives:

ñëîæíî ïîêàçàòü, ÷òî åñëè ðàññìîòðåòü áåñêîíå÷íóþ ïîëèíîìèàëüíóþ ñèñòåìó,
òî îíà ñîéäåòñÿ ê íåêîòîðîé êîíå÷

@BobLd
Collaborator

BobLd commented Dec 15, 2024

Your best chance of getting correct text extracted might be OCR. Or did you manage to play around with windows-1251 / windows-1252 to get the correct text after extraction?
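
For reference, a minimal sketch of that windows-1251 / windows-1252 round trip in C#. This is not PdfPig API; it assumes the extracted string really is windows-1251 byte values that were mis-decoded as windows-1252, as the sample above suggests:

```csharp
using System;
using System.Text;

class Win1251RoundTrip
{
    static void Main()
    {
        // On .NET Core / .NET 5+ the windows-125x code pages come from the
        // System.Text.Encoding.CodePages package and must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding win1252 = Encoding.GetEncoding("windows-1252");
        Encoding win1251 = Encoding.GetEncoding("windows-1251");

        string extracted = "ñëîæíî ïîêàçàòü";   // mojibake as copied from the PDF
        byte[] raw = win1252.GetBytes(extracted); // recover the original byte values
        string repaired = win1251.GetString(raw); // reinterpret them as windows-1251

        Console.WriteLine(repaired); // prints "сложно показать"
    }
}
```

Applied to the snippet above, this turns the garbled text back into readable Russian, but it only works if you already know which pair of encodings is involved.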

@potaninmt
Author

@BobLd
The thing is that there are a lot of broken PDFs like this on the internet, but there are existing solutions for fixing them; maybe it would be useful to implement one of them:
There is a service that can automatically fix this kind of encoding error: https://2cyr.com/decode/?lang=ru
There is also a project on GitHub that can automatically detect the encoding of a text file: https://github.com/yinyue200/ude?tab=readme-ov-file
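
For what it's worth, a minimal sketch of how such a detector might be wired up, assuming the fork keeps the upstream Ude API (namespace `Ude`, class `CharsetDetector` with `Feed`, `DataEnd` and `Charset`):

```csharp
using System;
using System.IO;

class DetectCharset
{
    static void Main(string[] args)
    {
        // Feed the detector the raw, undecoded bytes and read back its best guess.
        using (FileStream fs = File.OpenRead(args[0]))
        {
            var detector = new Ude.CharsetDetector();
            detector.Feed(fs);
            detector.DataEnd();

            if (detector.Charset != null)
                Console.WriteLine($"Detected {detector.Charset} (confidence {detector.Confidence})");
            else
                Console.WriteLine("Detection failed.");
        }
    }
}
```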

@potaninmt
Author

@BobLd
[image]

@BobLd
Collaborator

BobLd commented Dec 16, 2024

Hi @potaninmt, thanks a lot for pointing me to this library; I wasn't aware it even existed and it's extremely interesting.

I had a quick look, and the Ude library is based on the Mozilla Universal Charset Detector. All implementations of the MUCD I could find are under an MPL/GPL2/LGPL2 license, which is not really compatible with the PdfPig license.

One option would be to release a separate NuGet package under that license, keeping PdfPig's license untouched. It would also be possible to do our own implementation (trickier).

I'll have a look at the first option, hopefully in the short term. Happy for you to give any implementation advice.

Some references (mainly for me) I found on the topic:

@BobLd BobLd added enhancement document-reading Related to reading documents labels Dec 16, 2024
@BobLd BobLd changed the title Incorrect pdf reading with different encodings inside the document Add support for encoding detection when default encoding is not correct Dec 16, 2024
@Charltsing

https://github.com/CharsetDetector/UTF-unknown
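
This library exposes a similar detector behind static entry points. A minimal usage sketch, assuming the documented `UtfUnknown` API (`CharsetDetector.DetectFromFile` / `DetectFromBytes` / `DetectFromStream`):

```csharp
using System;
using UtfUnknown;

class DetectWithUtfUnknown
{
    static void Main(string[] args)
    {
        // The DetectionResult carries the most likely encoding plus a confidence score.
        DetectionResult result = CharsetDetector.DetectFromFile(args[0]);

        if (result.Detected != null)
            Console.WriteLine(
                $"Detected {result.Detected.EncodingName} (confidence {result.Detected.Confidence})");
    }
}
```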

@potaninmt
Author

@Charltsing Thank you!

@potaninmt
Author

@BobLd
Thank you for reading and responding! I think it would indeed be useful to correct the encodings in the PDF, considering that I've actually encountered quite a few problematic documents.
I don't have much experience with encodings, but I had the following algorithm idea:
Take a specific list of encodings:
UTF-8
Unicode
Windows-1251
Windows-1252
...
Then, by brute force (encode the string back into bytes with one encoding, then decode those bytes with another), pick the combination that yields the most plausible text. The number of combinations is small; for 4 encodings, for example, it is 16. A sketch of this idea is given below.
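
A hypothetical sketch of that brute-force idea (the class name and the plausibility heuristic below are made up for illustration; a real scorer would use character or word statistics, like the detectors linked above):

```csharp
using System;
using System.Linq;
using System.Text;

static class EncodingGuesser
{
    // Candidate list from the comment above ("Unicode" taken to mean UTF-16).
    static readonly string[] Candidates = { "utf-8", "utf-16", "windows-1251", "windows-1252" };

    // Placeholder heuristic: fraction of characters that look like ASCII or Cyrillic text.
    static double Plausibility(string text)
    {
        if (text.Length == 0) return 0;
        int ok = text.Count(c =>
            char.IsWhiteSpace(c) || char.IsDigit(c) ||
            (char.IsLetter(c) && (c < 0x80 || (c >= '\u0400' && c <= '\u04FF'))));
        return (double)ok / text.Length;
    }

    public static string BestGuess(string extracted)
    {
        // Needed on .NET Core / .NET 5+ for the windows-125x code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string best = extracted;
        double bestScore = Plausibility(extracted);

        foreach (string encodeAs in Candidates)
        foreach (string decodeAs in Candidates)
        {
            if (encodeAs == decodeAs) continue;

            // Turn the (possibly mis-decoded) string back into bytes with one encoding,
            // reinterpret those bytes with another, and keep the most plausible result.
            byte[] bytes = Encoding.GetEncoding(encodeAs).GetBytes(extracted);
            string candidate = Encoding.GetEncoding(decodeAs).GetString(bytes);

            double score = Plausibility(candidate);
            if (score > bestScore)
            {
                bestScore = score;
                best = candidate;
            }
        }

        return best;
    }
}
```

Something like `EncodingGuesser.BestGuess(pageText)` could then be run over suspicious extraction output, though in practice a proper detector will be far more reliable than this simple heuristic.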

Useful links:
https://en.wikipedia.org/wiki/Byte_order_mark
