Please add support for reading tables with Arabic fonts #141

ZainRizvi · 2018-10-12T22:28:32Z

Versions:
Linux-4.9.0-6-amd64-x86_64
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
NumPy 1.15.2
OpenCV 3.4.2

Hi,

Can you please support reading languages in Arabic fonts?

In particular, I'm trying to extract tables from this document (backup link since Scribd seems to be down right now).

Starting at page 6, the document presents lines of Arabic in the left column and lines of English in the right column. I used these commands to extract that text as a table:

import camelot

tables = camelot.read_pdf('quran.pdf', pages='6', columns=['240'])
tables.export('quran.csv', f='csv', compress=False)

However, the extracted table has two issues:

1. The (I think Unicode) characters for the Arabic text seem to have either been corrupted in the process or lost any mapping to the Arabic fonts. Even when I opened the file in Google Sheets and other editors and set the font to Arabic, the words were still not displayed properly.

2. Text that should have been on the same row of two different rows is instead placed in two different columns of the same row. (This is a minor annoyance that I can easily work around, and perhaps your tool already has a fix for this that I haven't discovered.)

Below is the full output generated by the above command. Interestingly, in the beginning part of the file (line 3 and a bit of line 4) you can at least recognize the Arabic letters. However, again there are two issues:

1. The order of the letters has been flipped around. This is probably due to the fact that Arabic reads from right to left. I suspect PDFminer gave Camelot the letters in the "correct" left to right order, but Camelot, not being aware that the letters should be read in the opposite order, flipped the order around.
2. The two lines of visible Arabic are both from the same word, the characters of which somehow got split into two different rows.

"ِ",""
"",""
"ِةِِ حـِتاَف",""
"ْلَا","al-Fātiḥah"
"َ",""
"",""
"","1.  1In the Name of Allah,"
"",""
"","the All-beneficent, the All-merciful."
"",""
"","2. All praise belongs to Allah,2"
"",""
"",""
"","Lord of all the worlds,"
"",""
"","3. the All-beneficent, the All-merciful,"
"",""
"","4. Master3 of the Day of Retribution."
"",""
"","5. You [alone] do we worship,"
"",""
"","and to You [alone] do we turn for help."
"",""
"","6. Guide us on the straight path,"
"",""
"","7. the path of those whom You have blessed4"
"",""
"","—such as5 have not incurred Your wrath,6"
"1 That is, ‘the opening’ sūrah. Another common name of the sūrah is ‘Sūrat al-Ḥamd, ’that is, the sūrah of",""
"the [Lord’s] praise.",""
"2 In Muslim parlance the phrase al-ḥamdu lillāh also signifies ‘thanks to Allah.’",""
"3 This is in accordance with the reading mālik yawm al-dīn, adopted by ‘Āṣim, al-Kisā’ī, Ya‘qūb al-Ḥaḍramī,",""
"and Khalaf. Other authorities of qirā’ah (the art of recitation of the Qur’ān) have read ‘malik yawm al-",""
"","dīn,’meaning ‘Sovereign of the Day of Retribution’(see Mu‘jam al-Qirā’āt al-Qur’āniyyah). Traditions ascribe"
"both readings to Imam Ja‘far al-Ṣādiq (‘a). See al-Qummī, al-‘Ayyāshī, Tafsīr al-Imām al-‘Askarī.",""
"4 For further Qur’ānic references to ‘those whom Allah has blessed,’see 4:69 and 19:58; see also 5:23, 110;",""
"12:6; 27:19; 28:17; 43:59; 48:2.",""
"5 This is in accordance with the qirā’ah of ‘Āṣim, ghayril-maghḍūbi, which appears in the Arabic text above.",""
"","However, in accordance with an alternative, and perhaps preferable, reading ghayral-maghḍūbi (attributed"

vinayak-mehta · 2018-10-13T09:57:08Z

Thanks for the detailed report @ZainRizvi!

"The (I think unicode) characters for the Arabic text seem to have either been corrupted in the process or they've lost any mapping to the Arabic fonts. Even when I tried opening up the file in Google Sheets & other editors and set the font to Arabic the words would still not be properly displayed"

This is a problem in the PDF itself: its ToUnicode map is incorrect. I tried copying and pasting the Arabic text from the PDF into a text editor and got boxes instead of Arabic characters. In the past, I've used OCR to get text out of such PDFs.

"Text that should have been on the same row of two different rows is instead placed in two different columns of the same row. (This is a minor annoyance that I can easily work around, and perhaps your tool already has a fix for this that I haven't discovered)"

Can you give me an example of this and help me understand this better? Perhaps you mean "same column of two different rows"?

"The order of the letters has been flipped around. This is probably due to the fact that Arabic reads from right to left. I suspect PDFminer gave Camelot the letters in the "correct" left to right order, but Camelot, not being aware that the letters should be read in the opposite order, flipped the order around"

You are correct. Camelot sorts the characters just as they would appear in English text. This is a bug; let me work out a fix for this.
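To make the effect concrete, here is a minimal sketch of what happens when positioned characters are joined in English-style (ascending x) order. It is not Camelot's or PDFMiner's actual code: the `(x, character)` tuples, the `join_chars` helper, and its `rtl` flag are all hypothetical and exist only for illustration.

```python
from typing import List, Tuple

# Hypothetical stand-in for a positioned character: (x_coordinate, character).
Char = Tuple[float, str]


def join_chars(chars: List[Char], rtl: bool = False) -> str:
    """Join characters sorted by x position (descending x when rtl is True)."""
    ordered = sorted(chars, key=lambda c: c[0], reverse=rtl)
    return "".join(ch for _, ch in ordered)


# An Arabic word as laid out on a page: the first logical letter
# sits at the largest x coordinate, the last letter at the smallest.
word = "فاتحة"
chars = [(100.0 - 10.0 * i, ch) for i, ch in enumerate(word)]

ltr_result = join_chars(chars)            # English-style sort: letters come out reversed
rtl_result = join_chars(chars, rtl=True)  # direction-aware sort: logical order restored
```

The sketch only covers a single run of right-to-left characters. Lines that mix Arabic with digits or Latin text would need proper bidirectional handling rather than a single sort direction.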

ZainRizvi · 2018-10-13T15:24:57Z

Sorry, yes, I meant same column of two different rows. Thanks for looking at this bug so quickly!

ZainRizvi · 2018-10-13T15:29:52Z

"This is a problem in the PDF itself, its ToUnicode map is incorrect. I tried copying and pasting the Arabic text from the PDF into a text editor and got boxes instead of Arabic characters. In the past, I've used OCR to get text out of such PDFs."

Interesting. I'm not very familiar with the PDF format. If the ToUnicode map is incorrect for those characters, then how do PDF readers manage to render those characters correctly? Is there some custom font embedded in the PDF that describes how to convert each character?

vinayak-mehta · 2018-10-13T18:25:40Z

The ToUnicode map contains a mapping of each font glyph to a corresponding unicode character. This mapping is broken in the PDF above. The PDF reader knows where to place each font glyph using the specified x,y coordinates.
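A toy model of that distinction, as a sketch only: rendering looks up glyph shapes in the embedded font and places them at coordinates, while text extraction goes through the ToUnicode map. The glyph IDs, coordinates, and dictionaries below are invented for illustration and do not reflect the internals of any real PDF or library.

```python
# Hypothetical glyph IDs and coordinates, for illustration only.
GLYPH_SHAPES = {17: "shape of ب", 42: "shape of س"}  # what the embedded font draws
GOOD_TO_UNICODE = {17: "\u0628", 42: "\u0633"}        # glyph ID -> correct Unicode text
BROKEN_TO_UNICODE = {17: "\uf011", 42: "\uf02a"}      # entries pointing at private-use code points

# Page content: (glyph ID, x, y) for each glyph drawn.
page = [(17, 300.0, 700.0), (42, 290.0, 700.0)]


def render(page):
    """Rendering needs only the glyph shapes and their coordinates."""
    return [(GLYPH_SHAPES[glyph_id], x, y) for glyph_id, x, y in page]


def extract_text(page, to_unicode):
    """Extraction depends entirely on the ToUnicode map."""
    return "".join(to_unicode.get(glyph_id, "\ufffd") for glyph_id, _, _ in page)


good_text = extract_text(page, GOOD_TO_UNICODE)      # readable Arabic letters
broken_text = extract_text(page, BROKEN_TO_UNICODE)  # private-use characters, shown as boxes
rendered = render(page)                              # unaffected by either map
```

In this model the page always draws correctly, which is why a viewer can display a PDF whose copied or extracted text is unreadable.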

vinayak-mehta · 2018-12-13T07:55:39Z

@ZainRizvi Can you extract the table from this PDF and tell me if the output is correct or not? From some visual pattern matching, I can tell that the text is extracted in the correct right-to-left reading order.

Camelot uses text lines computed by PDFMiner and assigns them to specific cells. Even if PDFMiner creates text lines by combining individual characters in left-to-right order, the final result should be correct when read in right-to-left order. Correct me if I'm wrong here.

vinayak-mehta · 2018-12-13T07:57:31Z

Also, I added a test for the PDF mentioned in the comment above. Strangely, when I view this list in the terminal, it looks fine, but the order is messed up when viewing it in VS Code or on GitHub.

kaneprajakta · 2019-02-16T07:01:15Z

Can you help me parse a file in a regional language like Marathi with Camelot?

vinayak-mehta · 2019-02-18T17:40:15Z

@kaneprajakta What is the problem that you're facing? Can you also post the code snippet that you're using?

abedkhooli · 2019-07-04T18:02:39Z

I can see that Arabic is still an issue as of version 0.7.2, although Camelot is doing a great job parsing PDF tables. Arabic text is backward (both the words in a phrase and the letters in a word). Here's a Colab notebook (test PDF from Tabula).
https://colab.research.google.com/drive/1gRrGs8P41CQRKHnRLo0z4Or8YtUS1K_V
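Until this is fixed upstream, one possible workaround is to post-process the extracted strings. The sketch below reverses any cell that contains strongly right-to-left characters, which undoes both the word order and the letter order at once. The helper names `looks_rtl` and `fix_cell` are hypothetical, and the approach is an assumption about what might help rather than something Camelot provides.

```python
import unicodedata


def looks_rtl(text: str) -> bool:
    """True if the text contains a strongly right-to-left character (bidi class R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def fix_cell(text: str) -> str:
    """Reverse a cell that came out backward; leave left-to-right cells unchanged."""
    return text[::-1] if looks_rtl(text) else text


logical = "بسم الله"
backward = logical[::-1]       # letters and word order both reversed
restored = fix_cell(backward)  # back to logical order
untouched = fix_cell("al-Fatihah")
```

This is deliberately naive. Cells that mix Arabic or Hebrew with numbers or Latin text would have those runs reversed as well, so a complete fix would need to follow the Unicode Bidirectional Algorithm. The function could be applied to each cell of an extracted table's DataFrame.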

alexzabbey · 2019-08-27T08:51:39Z

Same issue as abedkhooli, using 0.7.2.
Also, it parses ال and لا as ال, which is a mistake

vinayak-mehta · 2019-08-27T13:39:12Z

"Also, it parses ال and لا as ال, which is a mistake"

Looks like an issue with the PDF itself. Both ال and لا could be mapped to ال in the PDF's ToUnicode map.
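One way the two sequences could collide, offered as a guess rather than a diagnosis of this PDF, Camelot, or PDFMiner: لا is often drawn as a single lam-alef ligature glyph whose text is the two code points in a fixed order, while ال is drawn as two separate glyphs. If glyphs are emitted in left-to-right page order, both end up as the same string. The glyph lists below are hypothetical.

```python
LAM, ALEF = "\u0644", "\u0627"

# Each list holds glyphs in left-to-right page order; every entry is the
# text that glyph's (hypothetical) ToUnicode entry yields.
two_glyph_word = [LAM, ALEF]  # "ال": alef is drawn on the right, lam to its left
ligature_word = [LAM + ALEF]  # "لا": one lam-alef ligature glyph


def extract_ltr(glyphs):
    """Concatenate glyph text in left-to-right page order, ignoring script direction."""
    return "".join(glyphs)


from_two_glyphs = extract_ltr(two_glyph_word)
from_ligature = extract_ltr(ligature_word)
```

Both results are lam followed by alef, so the distinction between the two words is lost before any later reversal step could recover it. This would also be consistent with copy and paste from a viewer working correctly, since a viewer may order glyphs by script direction.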

alexzabbey · 2019-08-27T15:31:21Z

I'm pretty sure that isn't the problem, since when I copy and paste from the PDF it copies correctly. How can the ToUnicode mapping be viewed? Is there any way to override the mapping?

vinayak-mehta · 2019-10-15T05:17:52Z

A couple of us saw this issue in another PDF yesterday. Looking to dig into the pdfminer issue tracker for this.

PremChandran15 · 2019-10-30T16:56:15Z

Is this issue solved? Weirdly, I am using version 0.7.3, but when I tried to read a procurement document that is in plain English, the words get read from right to left. I tried playing around with the shift_text parameter as described in the documentation, but it had no effect.

TomerGadol · 2022-12-20T16:28:55Z

I'm trying to work on a Hebrew document and I'm getting the same problem with the reversed text. Has this issue been resolved?

vinayak-mehta added the bug label Oct 13, 2018

vinayak-mehta added this to the v0.5.0 milestone Dec 2, 2018

vinayak-mehta mentioned this issue Dec 12, 2018

Fix v0.5.0 bugs #227

Merged

vinayak-mehta removed this from the v0.5.0 milestone Dec 13, 2018

vinayak-mehta mentioned this issue Oct 15, 2019

Add support for reading tables with Arabic fonts camelot-dev/camelot#83

Open

vinayak-mehta closed this as completed Oct 15, 2019
