You might find that extractText returns gibberish. Some PDFs use a custom character map (cmap) per font, or even per piece of text: ! = T, # = M, and so on. The code order is generated from the text in the PDF, and the map is then embedded in the PDF itself.
Here is a modified version of extractText. I don't have time to put in a PR, so I'm hoping someone finds it helpful and that it leads to proper support. Thanks to all the authors.
The code below admittedly needs a lot of refactoring, but hopefully it's good enough to understand.
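# These snippets assume the PyPDF2 1.x import paths.
from PyPDF2.generic import TextStringObject
from PyPDF2.pdf import ContentStream
from PyPDF2.utils import b_

# Crystal encodes fonts specially in some cases. Each font has a map from an encoded hex code to the hex code of the real character.
def buildCharMap(self, pdf, font_name="/a"):
    mapDict = {}
    # Note: the font is always looked up in the resources of page 0.
    cmap = pdf.getPage(0)["/Resources"]["/Font"][font_name]["/ToUnicode"].getData().decode('utf-8')
    start = 'beginbfrange'
    end = 'endbfrange'
    cmap = cmap[cmap.find(start) + len(start):cmap.find(end)]
    codes = cmap.strip().replace('<', '').replace('>', '').splitlines()
    for code in codes:
        l = code.split()
        if len(l) < 3:
            continue  # skip blank or malformed lines
        # Each entry is "first last destination"; expand the whole range.
        first, last, dest = (int(h, 16) for h in l[:3])
        for offset in range(last - first + 1):
            mapDict[chr(first + offset)] = chr(dest + offset)
    return mapDict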
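# Take encoded text, decode it with the active font's map, and append it.
def translate(self, pdf, operations, text_bits, op_pos, text):
    font_name = ""
    # Walk backwards to the most recent Tf operator, which selects the font.
    back = 1
    while op_pos - back > -1:
        operands, operator = operations[op_pos - back]
        if operator.decode('utf-8') == 'Tf':
            font_name = operands[0]
            break
        back += 1
    if not font_name:
        # No font was selected before this text, so there is no map to apply.
        text_bits.append(text)
        return
    mapDict = self.buildCharMap(pdf, font_name)
    text_bits.append(text.translate(str.maketrans(mapDict)))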
# Borrowed this core from PyPDF2. It didn't support encoded text, so I wrote support for that.
def extractText(self, page, pdf):
    text_bits = []
    content = page.getContents().getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, pdf)
    # Note: we check all strings are TextStringObjects. ByteStringObjects
    # are strings where the byte->string encoding was unknown, so adding
    # them to the text here would be gibberish.
    for op_pos, (operands, operator) in enumerate(content.operations):
        if operator == b_("Tj") or operator == b_("'"):
            candidates = [operands[0]]
        elif operator == b_('"'):
            candidates = [operands[2]]
        elif operator == b_("TJ"):
            # A TJ array mixes strings with spacing numbers; each string is decoded on its own.
            candidates = operands[0]
        else:
            continue  # T* and every other operator carry no text
        for text in candidates:
            if isinstance(text, TextStringObject):
                self.translate(pdf, content.operations, text_bits, op_pos, text)
    return text_bits
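To see the mapping step on its own, without PyPDF2 or a real file, here is a minimal sketch. The CMap fragment is made up for illustration (it is not taken from a real PDF), and the names `SAMPLE_CMAP`, `parse_to_unicode` and `decode` are hypothetical. Unlike the code above, it also reads `bfchar` entries, which is my assumption about what some PDFs will need. It does not handle the array form of `bfrange` or multi-byte codes.

```python
import re

# Made-up ToUnicode CMap fragment for illustration only.
SAMPLE_CMAP = """\
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfrange
<21> <21> <0054>
<23> <25> <004D>
endbfrange
1 beginbfchar
<26> <0021>
endbfchar
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Return {encoded char: decoded char} from bfrange and bfchar sections."""
    mapping = {}
    # bfrange entries look like: <first> <last> <unicode of first>
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(rf"{HEX}\s*{HEX}\s*{HEX}", section):
            lo, hi, dst = int(lo, 16), int(hi, 16), int(dst, 16)
            for offset in range(hi - lo + 1):
                mapping[chr(lo + offset)] = chr(dst + offset)
    # bfchar entries look like: <code> <unicode>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(rf"{HEX}\s*{HEX}", section):
            mapping[chr(int(src, 16))] = chr(int(dst, 16))
    return mapping


def decode(text, mapping):
    """Replace every encoded character with its mapped character in one pass."""
    return text.translate(str.maketrans(mapping))


mapping = parse_to_unicode(SAMPLE_CMAP)
print(decode("!#$%", mapping))  # prints TMNO
```

With this sample, `!` decodes to `T` and `#` to `M`, matching the example in the post. Because `str.translate` substitutes all characters in a single pass, a decoded `!` coming from `&` is not re-mapped to `T`.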
Converted from issue
This discussion was converted from issue #621 on April 09, 2022 10:12.