This repository has been archived by the owner on Jan 6, 2025. It is now read-only.

Please add support for reading tables with Arabic fonts #141

Closed
ZainRizvi opened this issue Oct 12, 2018 · 15 comments

@ZainRizvi

Versions:
Linux-4.9.0-6-amd64-x86_64
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
NumPy 1.15.2
OpenCV 3.4.2

Hi,

Can you please support reading languages in Arabic fonts?

In particular, I'm trying to extract tables from this document (backup link since Scribd seems to be down right now).

Starting at page 6, the document presents lines of Arabic in the left column and lines of English in the right column. I used these commands to extract that text as a table:

import camelot

# Page 6, with a column separator at x = 240
tables = camelot.read_pdf('quran.pdf', pages='6', columns=['240'])
tables.export('quran.csv', f='csv', compress=False)

However, the extracted table has two issues:

1. The (I think Unicode) characters for the Arabic text seem to have either been corrupted in the process or lost their mapping to the Arabic fonts. Even when I opened the file in Google Sheets and other editors and set the font to Arabic, the words were still not displayed properly.

2. Text that should have been on the same row of two different rows is instead placed in two different columns of the same row. (This is a minor annoyance that I can easily work around, and perhaps your tool already has a fix for this that I haven't discovered.)

Below is the full output generated by the above command. Interestingly, in the beginning part of the file (line 3 and a bit of line 4) you can at least recognize the Arabic letters. However, again there are two issues:

  1. The order of the letters has been flipped around. This is probably because Arabic reads from right to left. I suspect PDFMiner gave Camelot the letters in the "correct" left-to-right order, but Camelot, not being aware that the letters should be read in the opposite order, flipped the order around.
  2. The two lines of visible Arabic are both from the same word, the characters of which somehow got split into two different rows.
"ِ",""
"",""
"ِةِِ حـِتاَف",""
"ْلَا","al-Fātiḥah"
"َ",""
"",""
"","1.  1In the Name of Allah,"
"",""
"","the All-beneficent, the All-merciful."
"",""
"","2. All praise belongs to Allah,2"
"",""
"",""
"","Lord of all the worlds,"
"",""
"","3. the All-beneficent, the All-merciful,"
"",""
"","4. Master3 of the Day of Retribution."
"",""
"","5. You [alone] do we worship,"
"",""
"","and to You [alone] do we turn for help."
"",""
"","6. Guide us on the straight path,"
"",""
"","7. the path of those whom You have blessed4"
"",""
"","—such as5 have not incurred Your wrath,6"
"1 That is, ‘the opening’ sūrah. Another common name of the sūrah is ‘Sūrat al-Ḥamd, ’that is, the sūrah of",""
"the [Lord’s] praise.",""
"2 In Muslim parlance the phrase al-ḥamdu lillāh also signifies ‘thanks to Allah.’",""
"3 This is in accordance with the reading mālik yawm al-dīn, adopted by ‘Āṣim, al-Kisā’ī, Ya‘qūb al-Ḥaḍramī,",""
"and Khalaf. Other authorities of qirā’ah (the art of recitation of the Qur’ān) have read ‘malik yawm al-",""
"","dīn,’meaning ‘Sovereign of the Day of Retribution’(see Mu‘jam al-Qirā’āt al-Qur’āniyyah). Traditions ascribe"
"both readings to Imam Ja‘far al-Ṣādiq (‘a). See al-Qummī, al-‘Ayyāshī, Tafsīr al-Imām al-‘Askarī.",""
"4 For further Qur’ānic references to ‘those whom Allah has blessed,’see 4:69 and 19:58; see also 5:23, 110;",""
"12:6; 27:19; 28:17; 43:59; 48:2.",""
"5 This is in accordance with the qirā’ah of ‘Āṣim, ghayril-maghḍūbi, which appears in the Arabic text above.",""
"","However, in accordance with an alternative, and perhaps preferable, reading ghayral-maghḍūbi (attributed"
@vinayak-mehta
Contributor

Thanks for the detailed report @ZainRizvi!

  1. The (I think Unicode) characters for the Arabic text seem to have either been corrupted in the process or lost their mapping to the Arabic fonts. Even when I opened the file in Google Sheets and other editors and set the font to Arabic, the words were still not displayed properly.

This is a problem in the PDF itself: its ToUnicode map is incorrect. I tried copying and pasting the Arabic text from the PDF into a text editor and got boxes instead of Arabic characters. In the past, I've used OCR to get text out of such PDFs.

  2. Text that should have been on the same row of two different rows is instead placed in two different columns of the same row. (This is a minor annoyance that I can easily work around, and perhaps your tool already has a fix for this that I haven't discovered.)

Can you give me an example of this and help me understand this better? Perhaps you mean "same column of two different rows"?

The order of the letters has been flipped around. This is probably because Arabic reads from right to left. I suspect PDFMiner gave Camelot the letters in the "correct" left-to-right order, but Camelot, not being aware that the letters should be read in the opposite order, flipped the order around.

You are correct: Camelot sorts the characters just as they would appear in English text. This is a bug. Let me work out a fix for it.
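Until a fix lands, the flipped letters can be approximated away in post-processing. The following is a minimal sketch, not Camelot's or PDFMiner's behaviour: it assumes each extracted cell is a single run in visual (left-to-right) order and simply reverses cells that are mostly right-to-left. The helper names `is_mostly_rtl` and `visual_to_logical` are hypothetical, and cells that mix Arabic with digits or Latin text would need the full Unicode bidirectional algorithm instead.

```python
import unicodedata

# Bidirectional classes of strong right-to-left characters:
# "R" covers Hebrew-style letters, "AL" covers Arabic letters.
RTL_CLASSES = {"R", "AL"}


def is_mostly_rtl(text: str) -> bool:
    """Return True when strong RTL characters outnumber strong LTR ones."""
    rtl = sum(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
    ltr = sum(unicodedata.bidirectional(ch) == "L" for ch in text)
    return rtl > ltr


def visual_to_logical(text: str) -> str:
    """Reverse a visually ordered cell if it is predominantly RTL.

    This is a naive whole-string reversal (an assumption of this sketch);
    it leaves predominantly left-to-right cells untouched.
    """
    return text[::-1] if is_mostly_rtl(text) else text
```

Applied to every cell of the exported table, this would leave the English column alone and flip only the Arabic one.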

@ZainRizvi
Author

ZainRizvi commented Oct 13, 2018 via email

@ZainRizvi
Author

"This is a problem in the PDF itself: its ToUnicode map is incorrect. I tried copying and pasting the Arabic text from the PDF into a text editor and got boxes instead of Arabic characters. In the past, I've used OCR to get text out of such PDFs."

Interesting. I'm not very familiar with the PDF format. If the ToUnicode map is incorrect for those characters, then how do PDF readers manage to render them correctly? Is there some custom font embedded in the PDF that describes how to convert each character?

@vinayak-mehta
Contributor

The ToUnicode map contains a mapping of each font glyph to a corresponding Unicode character. This mapping is broken in the PDF above. The PDF reader does not need it for rendering: it draws each font glyph at the specified x,y coordinates. The map only matters when text is extracted, for example by copy and paste or by PDFMiner.
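To make the distinction concrete, here is a toy model, assuming made-up glyph codes, coordinates, and mappings (none of them come from the PDF in question). Rendering only needs the glyph code and its position, while text extraction goes through the mapping, so a broken map yields unusable text even though the page looks right.

```python
import unicodedata

# Hypothetical page content: (glyph code, x, y). A renderer draws the glyph
# shape for each code at the given position and never consults ToUnicode.
glyph_stream = [(17, 300.0, 700.0), (42, 290.0, 700.0)]

# Hypothetical mappings, for illustration only.
correct_to_unicode = {17: "\u0627", 42: "\u0644"}  # alef, lam
broken_to_unicode = {17: "\uf011", 42: "\uf02a"}   # private-use code points


def extract_text(glyphs, to_unicode):
    """Map each glyph code to text, as a text extractor would."""
    return "".join(to_unicode.get(code, "\ufffd") for code, _x, _y in glyphs)


good = extract_text(glyph_stream, correct_to_unicode)
bad = extract_text(glyph_stream, broken_to_unicode)

# Private-use characters (category "Co") have no standard meaning, so most
# fonts show them as boxes no matter which font is selected.
bad_is_private_use = all(unicodedata.category(ch) == "Co" for ch in bad)
```

In this model, changing the display font in a spreadsheet cannot help, because the extracted code points themselves are wrong.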

@vinayak-mehta vinayak-mehta added this to the v0.5.0 milestone Dec 2, 2018
@vinayak-mehta
Contributor

@ZainRizvi Can you extract the table from this PDF and tell me if the output is correct or not? From some visual pattern matching, I can tell that the text is extracted in the correct right-to-left reading order.

Camelot uses text lines computed by PDFMiner and assigns them to specific cells. Even if PDFMiner creates text lines by combining individual characters in left-to-right order, the final result should be correct when read in right-to-left order. Correct me if I'm wrong here.
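As a rough illustration of assigning text lines to cells, here is a sketch that places a text line into a column by comparing its horizontal midpoint with a list of column separators, such as the single separator at x = 240 used earlier in this thread. This is an assumption-laden simplification and not Camelot's actual implementation; `assign_column` and the sample coordinates are hypothetical.

```python
from bisect import bisect_right


def assign_column(x0: float, x1: float, separators: list[float]) -> int:
    """Return the index of the column containing the midpoint of x0..x1.

    `separators` must be sorted x-coordinates of the column boundaries.
    """
    midpoint = (x0 + x1) / 2
    return bisect_right(separators, midpoint)


separators = [240.0]
# Hypothetical text lines: (x0, x1, text).
text_lines = [(60.0, 200.0, "arabic line"), (260.0, 500.0, "english line")]

row = [""] * (len(separators) + 1)
for x0, x1, text in text_lines:
    row[assign_column(x0, x1, separators)] = text
```

Note that this step only decides which cell a line lands in. The order of characters inside each line is whatever the text-line builder produced.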

@vinayak-mehta
Contributor

Also, I added a test for the PDF mentioned in the comment above. Strangely, when I see this list in the terminal, it looks fine, but the order is messed up when viewing it in VS Code or GitHub.

@vinayak-mehta vinayak-mehta removed this from the v0.5.0 milestone Dec 13, 2018
@kaneprajakta

kaneprajakta commented Feb 16, 2019

Can you help me parse a file in a regional language such as Marathi with Camelot?

@vinayak-mehta
Contributor

@kaneprajakta What is the problem that you're facing? Can you also post the code snippet that you're using?

@abedkhooli

I can see that Arabic is still an issue as of version 0.7.2, although Camelot is doing a great job parsing PDF tables. Arabic text comes out backward (both the words in a phrase and the letters in a word). Here's a Colab notebook (test PDF from Tabula).
https://colab.research.google.com/drive/1gRrGs8P41CQRKHnRLo0z4Or8YtUS1K_V

@alexzabbey

Same issue as abedkhooli, using 0.7.2.
Also, it parses ال and لا as ال, which is a mistake

@vinayak-mehta
Contributor

Also, it parses ال and لا as ال, which is a mistake

Looks like an issue with the PDF itself. Both ال and لا could be mapped to ال in the PDF's ToUnicode map.
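One possible source of the confusion, offered as a guess rather than a diagnosis of this PDF: fonts often draw lam followed by alef as a single ligature glyph. Unicode's own decomposition of that ligature gives lam first and alef second, so a ToUnicode entry that lists the two letters in the opposite order would produce exactly this kind of swap. The standard library is enough to check the expected order:

```python
import unicodedata

LAM = "\u0644"
ALEF = "\u0627"
LAM_ALEF_LIGATURE = "\ufefb"  # a single presentation-form code point

ligature_name = unicodedata.name(LAM_ALEF_LIGATURE)

# NFKC normalization replaces the presentation form with its
# compatibility decomposition: lam followed by alef, in logical order.
decomposed = unicodedata.normalize("NFKC", LAM_ALEF_LIGATURE)
```

If extracted text contains such presentation-form code points, normalizing with NFKC is one way to get plain letters back. It cannot repair a map that already emits the letters in the wrong order.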

@alexzabbey

alexzabbey commented Aug 27, 2019 via email

@vinayak-mehta
Contributor

A couple of us saw this issue in another PDF yesterday. Looking to dig into the pdfminer issue tracker for this.

@PremChandran15

Is this issue solved? Weirdly, I am using version 0.7.3, but when I tried to read a procurement document that is in plain English, the words were read from right to left. I tried playing around with the shift_text parameter as described in the documentation, but it had no effect.

@TomerGadol

I'm trying to work on a Hebrew document and I'm getting the same problem with the reversed text. Has this issue been resolved?

8 participants
@alexzabbey @vinayak-mehta @ZainRizvi @abedkhooli @PremChandran15 @TomerGadol @kaneprajakta and others