Please add support for reading tables with Arabic fonts #141
Comments
Thanks for the detailed report @ZainRizvi!
This is a problem in the PDF itself: its ToUnicode map is incorrect. I tried copying and pasting the Arabic text from the PDF into a text editor and got boxes instead of Arabic characters. In the past, I've used OCR to get text out of such PDFs.
Can you give me an example of this and help me understand this better? Perhaps you mean "same column of two different rows"?
You are correct, Camelot sorts the characters just as they would appear in English text. This is a bug, let me work out a fix for this. |
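The ordering problem described above can be sketched in plain Python. This is only an illustration, not Camelot's or PDFMiner's actual code: the `chars` list and the `join_by_x` helper are hypothetical, and the sketch assumes each extracted character carries an x coordinate.

```python
# Hypothetical (character, x-position) pairs for one Arabic word.
# Arabic is written right to left, so the first logical letter has the largest x.
chars = [("ك", 40.0), ("ت", 30.0), ("ا", 20.0), ("ب", 10.0)]


def join_by_x(chars, rtl=False):
    """Join characters sorted by x position (descending x for right-to-left text)."""
    ordered = sorted(chars, key=lambda c: c[1], reverse=rtl)
    return "".join(ch for ch, _ in ordered)


ltr_result = join_by_x(chars)            # ascending x, as for English text
rtl_result = join_by_x(chars, rtl=True)  # descending x, logical order for Arabic
```

Sorting by ascending x, as one would for English, yields the letters in reverse logical order. Sorting by descending x restores the order in which the word is read.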
Sorry, yes, I meant same column of two different rows.
Thanks for looking at this bug so quickly!
On Sat, Oct 13, 2018, 2:57 AM Vinayak Mehta wrote:
The order of the letters has been flipped around. This is probably due to
the fact that Arabic reads from right to left. I suspect PDFminer gave
Camelot the letters in the "correct" left to right order, but Camelot, not
being aware that the letters should be read in the opposite order, flipped
the order around
You are correct, Camelot sorts the characters just as they would appear in
English text. This is a bug, let me work out a fix for this.
|
"This is a problem in the PDF itself, its ToUnicode map is incorrect. I tried copying and pasting the Arabic text from the PDF into a text editor and got boxes instead of Arabic characters. In the past, I've used OCR to get text out of such PDFs." Interesting. I'm not very familiar with the PDF format. If the ToUnicode map is incorrect for those characters, then how do PDF readers manage to render those characters correctly? Is there some custom font embedded in the PDF that describes how to convert each character? |
The ToUnicode map contains a mapping of each font glyph to a corresponding Unicode character. This mapping is broken in the PDF above. The PDF reader does not need it for rendering: it knows where to place each font glyph using the specified x,y coordinates. |
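A toy sketch of that distinction, in plain Python. The glyph IDs, coordinates, and both mapping tables are made up for illustration, and real PDFs store the ToUnicode map as a CMap stream rather than a dict. The point is only that drawing a page uses glyph IDs and positions, while text extraction depends on the glyph-to-Unicode table.

```python
# Hypothetical page content: (glyph_id, x, y). Rendering only needs these values
# plus the embedded font's glyph outlines, so it works even with a bad ToUnicode map.
page_glyphs = [(17, 100.0, 700.0), (42, 90.0, 700.0)]

# Hypothetical mapping tables from glyph ID to Unicode character.
correct_to_unicode = {17: "\u0633", 42: "\u0644"}  # Arabic seen, lam
broken_to_unicode = {17: "\uf011", 42: "\uf02a"}   # private-use code points


def extract_text(glyphs, to_unicode):
    """Map glyph IDs to text; glyphs missing from the table become U+FFFD."""
    return "".join(to_unicode.get(gid, "\ufffd") for gid, _, _ in glyphs)


good_text = extract_text(page_glyphs, correct_to_unicode)
bad_text = extract_text(page_glyphs, broken_to_unicode)
```

With the broken table, `bad_text` holds code points that most fonts cannot display, which is consistent with seeing boxes after copy and paste even though the page itself looks right.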
@ZainRizvi Can you extract the table from this PDF and tell me if the output is correct or not? From some visual pattern matching, I can tell that the text is extracted in the correct right-to-left reading order. Camelot uses text lines computed by PDFMiner and assigns them to specific cells. Even if PDFMiner creates text lines by combining individual characters in left-to-right order, the final result should be correct when read in right-to-left order. Correct me if I'm wrong here. |
Also, I added a test for the PDF mentioned in the comment above. Strangely, when I see this list in the terminal, it looks fine, but the order is messed up when viewing in VS Code or GitHub. |
Can you help me parse a file in a regional language like Marathi with Camelot? |
@kaneprajakta What is the problem that you're facing? Can you also post the code snippet that you're using? |
I can see that Arabic is still an issue as of version 0.7.2, although Camelot is doing a great job parsing PDF tables. Arabic text comes out backward (both the words in a phrase and the letters in a word). Here's a Colab notebook (test PDF from Tabula). |
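For anyone who needs a stopgap while this is open, one possible post-processing workaround is to reverse cells that are mostly right-to-left text. This is a rough sketch using only the Python standard library, not a Camelot feature; `is_rtl` and `fix_reversed_cell` are hypothetical helper names. It assumes the whole cell came out fully reversed, as described above.

```python
import unicodedata


def is_rtl(ch):
    """True for characters whose Unicode bidi class is right-to-left (R or AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def fix_reversed_cell(text):
    """Reverse a cell's characters if most of its letters are right-to-left."""
    letters = [ch for ch in text if ch.isalpha()]
    if letters and sum(is_rtl(ch) for ch in letters) > len(letters) / 2:
        return text[::-1]
    return text
```

This could be applied to each cell of an extracted table, for example with pandas' `DataFrame.applymap`. Note the limitation: a plain reversal also flips any numbers or Latin words embedded in the cell, so mixed-direction cells would need a proper bidirectional algorithm instead.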
Same issue as abedkhooli, using 0.7.2. |
Looks like an issue with the PDF itself. Both ال and لا could be mapped to ال in the PDF's ToUnicode map. |
I'm pretty sure that isn't the problem, since when I copy and paste from
the PDF it copies correctly. How can the ToUnicode mapping be viewed? Is
there any way to override the mapping?
On Tue, Aug 27, 2019, 16:39 Vinayak Mehta wrote:
Also, it parses ال and لا as ال, which is a mistake
Looks like an issue with the PDF itself. Both ال and لا could be mapped to
ال in the PDF's ToUnicode map.
|
A couple of us saw this issue in another PDF yesterday. Looking to dig into the pdfminer issue tracker for this. |
Is this issue solved? Weirdly, I am using version 0.7.3, but when I tried to read a procurement document that is in plain English, the words were read from right to left. I tried playing around with the shift_text parameter as described in the documentation, but it had no effect. |
I'm trying to work on a Hebrew document and I'm getting the same problem with the reversed text. Has this issue been resolved? |
Versions:
Linux-4.9.0-6-amd64-x86_64
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
NumPy 1.15.2
OpenCV 3.4.2
Hi,
Can you please support reading languages in Arabic fonts?
In particular, I'm trying to extract tables from this document (backup link since Scribd seems to be down right now).
Starting at page 6, the document presents lines of Arabic on the left column and lines of English in the right column. I used these commands to extract that text as a table:
However, the extracted table has two issues:
1. The (I think Unicode) characters for the Arabic text seem to have either been corrupted in the process or lost any mapping to the Arabic fonts. Even when I tried opening the file in Google Sheets and other editors and set the font to Arabic, the words would still not be displayed properly.
2. Text that should have been on the same row of two different rows is instead placed in two different columns of the same row. (This is a minor annoyance that I can easily work around, and perhaps your tool already has a fix for this that I haven't discovered.)
Below is the full output generated by the above command. Interestingly, in the beginning part of the file (line 3 and a bit of line 4) you can at least recognize the Arabic letters. However, again there are two issues: