how to extract text when the pdf is using a custom cmap #711

twiggy · 2021-05-21T18:12:49Z

You might find that when you call extractText you get garbled output. Some PDFs use a custom character map (CMap) per font or per run of text: ! = T, # = M, and so on. The codes are assigned in the order the characters appear in the PDF's text, and the resulting map is embedded in the PDF itself.
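
To make the substitution concrete, here is a minimal standalone sketch that parses the bfrange section of a ToUnicode CMap and applies it to garbled text. The CMap fragment, the `parse_bfrange` name, and the sample string are made up for illustration. It assumes one `<low> <high> <dst>` entry per line and ignores bfchar sections and the array form of bfrange, so treat it as a starting point rather than a complete parser.

```python
import re

# Hypothetical ToUnicode CMap fragment. In a real PDF this text lives in the
# stream referenced by the font's /ToUnicode entry.
SAMPLE_CMAP = """\
1 begincodespacerange
<00> <FF>
endcodespacerange
4 beginbfrange
<21> <21> <0054>
<22> <22> <0065>
<23> <23> <004D>
<24> <26> <0061>
endbfrange
"""

# One bfrange entry: source range start, source range end, first target code.
ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>\s*$")


def parse_bfrange(cmap_text):
    """Return {encoded char: real char} for every simple bfrange entry."""
    mapping = {}
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for line in section.splitlines():
            match = ENTRY.match(line.strip())
            if not match:
                continue  # blank line or an entry form this sketch ignores
            low, high, dst = (int(group, 16) for group in match.groups())
            # Codes low..high map to dst, dst+1, ... in order.
            for offset in range(high - low + 1):
                mapping[chr(low + offset)] = chr(dst + offset)
    return mapping


char_map = parse_bfrange(SAMPLE_CMAP)
decoded = '!"#$%&'.translate(str.maketrans(char_map))
print(decoded)
```

With this sample map, ! decodes to T and # decodes to M, matching the example above.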

Here is a modified version of extractText. I don't have time to put in a PR, so I'm posting it here hoping someone finds it helpful and that it leads to proper support. Thanks to all the authors.

The code below admittedly needs a lot of refactoring, but hopefully it's good enough to understand.

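# Imports the methods below rely on (module paths assumed from PyPDF2 1.x).
from PyPDF2.generic import TextStringObject
from PyPDF2.pdf import ContentStream
from PyPDF2.utils import b_

# Crystal encodes fonts specially in some cases. Each font has a map from the
# hex code used in the content stream to the hex code of the real character.
    def buildCharMap(self, pdf, font_name="/a"):
        mapDict = {}
        # NOTE: the font is always looked up in the first page's resources.
        cmap = pdf.getPage(0)["/Resources"]["/Font"][font_name]["/ToUnicode"].getData().decode('utf-8')
        start = 'beginbfrange'
        end = 'endbfrange'
        # Only the first bfrange section is read; bfchar sections are ignored.
        cmap = cmap[cmap.find(start)+len(start):cmap.find(end)]
        codes = cmap.strip().replace('<', '').replace('>', '').splitlines()

        for code in codes:
            l = code.split()
            if len(l) != 3:
                continue  # skip blank lines and anything that is not a three-field entry
            # Each entry is "low high dst": codes low..high map to dst, dst+1, ...
            lo, hi, dst = (int(x, 16) for x in l)
            for offset in range(hi - lo + 1):
                mapDict[chr(lo + offset)] = chr(dst + offset)

        return mapDict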


    # Take encoded text, decode it with the active font's map, and append it.
    def translate(self, pdf, operations, text_bits, op_pos, text):
        # Walk backwards from the current operation to the most recent Tf
        # (set font) operator; its first operand is the font name.
        back = 1
        font_name = ""
        while op_pos - back > -1:
            operands, operator = operations[op_pos - back]
            if operator == b"Tf":
                font_name = operands[0]
                break
            back += 1
        mapDict = self.buildCharMap(pdf, font_name)

        text_bits.append(text.translate(str.maketrans(mapDict)))

    
    # Borrowed the core of this from PyPDF2. It didn't support encoded text, so I added support for that.
    def extractText(self, page, pdf):
        text_bits = []
        content = page.getContents().getObject()
        if not isinstance(content, ContentStream):
            content = ContentStream(content, pdf)
        # Note: we check all strings are TextStringObjects.  ByteStringObjects
        # are strings where the byte->string encoding was unknown, so adding
        # them to the text here would be gibberish.
        for op_pos, (operands, operator) in enumerate(content.operations):
            if operator in (b_("Tj"), b_("'")):
                pieces = [operands[0]]
            elif operator == b_('"'):
                # The string is the third operand, after word and character spacing.
                pieces = [operands[2]]
            elif operator == b_("TJ"):
                # Array of strings and kerning numbers; only the strings are kept.
                pieces = operands[0]
            else:
                continue
            # Build a fresh string per operator so text from earlier operators
            # is not carried over and decoded a second time.
            text = "".join(p for p in pieces if isinstance(p, TextStringObject))
            if text:
                self.translate(pdf, content.operations, text_bits, op_pos, text)
        return text_bits
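
One refactoring that might be worth trying: instead of scanning backwards for the last Tf on every text operator and rebuilding the map each time, track the current font in a single forward pass and build each font's translation table once. The sketch below is standalone and only illustrates the idea. `decode_operations`, the sample operations, and the sample maps are hypothetical, and plain `str` stands in for PyPDF2's TextStringObject (which is a `str` subclass).

```python
def decode_operations(operations, char_maps):
    """Decode text operators using the font that is active at each one.

    operations: list of (operands, operator) pairs, operator as bytes.
    char_maps:  {font name: {encoded char: real char}}, one map per font.
    """
    # Build each font's translation table once, up front.
    tables = {font: str.maketrans(mapping) for font, mapping in char_maps.items()}
    current_font = None
    text_bits = []
    for operands, operator in operations:
        if operator == b"Tf":
            current_font = operands[0]  # operands are [font name, size]
            continue
        if operator in (b"Tj", b"'"):
            pieces = [operands[0]]
        elif operator == b'"':
            pieces = [operands[2]]
        elif operator == b"TJ":
            pieces = operands[0]  # strings mixed with kerning numbers
        else:
            continue
        raw = "".join(p for p in pieces if isinstance(p, str))
        # Fonts without a known map fall back to the text unchanged.
        text_bits.append(raw.translate(tables.get(current_font, {})))
    return text_bits


# Made-up content stream: two fonts, each with its own substitution.
ops = [
    (["/a", 12], b"Tf"),
    (["!#"], b"Tj"),
    (["/b", 10], b"Tf"),
    ([["#", -120, "!"]], b"TJ"),
]
maps = {"/a": {"!": "T", "#": "M"}, "/b": {"!": "x", "#": "y"}}
bits = decode_operations(ops, maps)
print(bits)
```

In real use, the per-font maps would come from something like buildCharMap, called once per font name found in the page's resources.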

Replies: 0 comments
