#3 Using PdfReader causes a crash #2836

macdeport · 2024-09-06T08:44:51Z

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.6.9-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('cryptography', '41.0.7'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

	from pypdf import PdfReader
	
	reader = PdfReader(pdf_path)
	txt = ''
	for page in reader.pages:
		txt += page.extract_text() # <= Crash

Sorry I can't share this PDF with private information.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 209, in parse_encoding
    encoding[x] = adobe_glyphs[o]  # type: ignore
    ~~~~~~~~^^^
IndexError: list assignment index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/alain/Documents/Logiciels/Developpement/py-km-pathfinder-selection/pathfinder-selection-ocred-pdf-compress.py", line 1743, in <module>
    txt_in = pdf_text(fn_in) # <=
             ^^^^^^^^^^^^^^^
  File "/Users/alain/Documents/Logiciels/Developpement/py-km-pathfinder-selection/pathfinder-selection-ocred-pdf-compress.py", line 981, in pdf_text
    txt += page.extract_text() # <=
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 2102, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 1612, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 57, in build_char_map_from_dict
    encoding, space_code = parse_encoding(ft, space_code)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 211, in parse_encoding
    encoding[x] = o  # type: ignore
    ~~~~~~~~^^^
IndexError: list assignment index out of range
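
While the root cause is being investigated, it can help to find out which pages trigger the crash instead of aborting on the first one. The sketch below is not part of pypdf: `extract_text_per_page` is a hypothetical helper that only assumes each page object exposes `extract_text()`, as the pages of a `PdfReader` do in the snippet above.

```python
def extract_text_per_page(pages):
    """Return (text, failures); failures is a list of (page_index, exception)."""
    chunks = []
    failures = []
    for index, page in enumerate(pages):
        try:
            chunks.append(page.extract_text())
        except Exception as exc:  # broad on purpose: keep going and report later
            failures.append((index, exc))
    return "".join(chunks), failures


# Possible usage with pypdf (pdf_path is a placeholder for your own file):
#   from pypdf import PdfReader
#   text, failures = extract_text_per_page(PdfReader(pdf_path).pages)
#   for index, exc in failures:
#       print(f"page {index}: {type(exc).__name__}: {exc}")
```

This only narrows the problem down to specific pages; it does not fix the underlying font encoding issue.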

pubpub-zz · 2024-09-06T10:46:44Z

@macdeport
Can you make a test PDF with only one page, using page.remove_text()?

macdeport · 2024-09-06T11:25:34Z

fp='/Users/alain/Documents/Perso/Alain/SDC35rM/sdc35-24-4!4-240905.pdf'

#--------------------------
def pdf_text_test(pdf_path):
	"""
	
	(06/09/24 13:18:36)
	"""
	#https://pypdf.readthedocs.io/en/stable/
	#https://pypdf.readthedocs.io/en/stable/user/metadata.html
	from pypdf import PdfReader
	
	reader = PdfReader(pdf_path)
	#txt=''
	#for page in reader.pages:
	#	txt += page.extract_text() # <= PB Crash
	print(reader.pages[0])
	(reader.pages[0]).remove_text()
		
	return() # pdf_text()
#--------------------------

pdf_text_test(fp)

{'/Type': '/Page', '/Parent': IndirectObject(3, 0, 4337925520), '/Contents': IndirectObject(5, 0, 4337925520), '/MediaBox': [0, 0, 595, 841], '/Resources': {'/Font': {'/F00': IndirectObject(6, 0, 4337925520), '/F01': IndirectObject(8, 0, 4337925520), '/F02': IndirectObject(10, 0, 4337925520), '/F03': IndirectObject(12, 0, 4337925520)}, '/ProcSet': IndirectObject(15, 0, 4337925520)}}
Traceback (most recent call last):
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 26, in <module>
    pdf_text_test(fp)
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 21, in pdf_text_test
    (reader.pages[0]).remove_text()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PageObject' object has no attribute 'remove_text'

pubpub-zz · 2024-09-06T16:39:12Z

Oops: remove_text() applies to the whole PDF, so the code should look something like this (off the top of my head):

import pypdf
w = pypdf.PdfWriter()
w.append("original.pdf",[0])
w.remove_text()
w.write("test_file.pdf")

Check the file: it should no longer contain any sensitive data.

macdeport · 2024-09-06T18:23:37Z

Two pieces of good news:

remove_text() works perfectly: the private text has completely disappeared,
the attached file dumb_extract_text_crash.pdf continues to produce a crash despite the removal of the text.

dumb_extract_text_crash.pdf

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 26, 2024

BUG: cope with encoding with too many differences

29e690c

closes py-pdf#2836

pubpub-zz mentioned this issue Sep 26, 2024

BUG: cope with encoding with too many differences #2873

Merged

stefan6419846 closed this as completed in #2873 Sep 26, 2024

stefan6419846 closed this as completed in 3b89062 Sep 26, 2024
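
For readers wondering what "too many differences" can mean here: the traceback fails on `encoding[x] = ...` inside `parse_encoding`, which is consistent with a font /Differences array assigning glyph names past the end of a fixed-size encoding table. The following is a simplified illustration only, not pypdf's actual code and not the actual change made in #2873; the name `apply_differences`, the 256-entry table, and the `tolerant` flag are all assumptions made for this sketch.

```python
def apply_differences(differences, tolerant=False):
    """Apply a PDF-style /Differences array to a 256-entry encoding table.

    In a /Differences array, an integer sets the current character code and
    each following name is assigned to consecutive codes.
    """
    encoding = [""] * 256  # assumed size: one slot per single-byte code
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item  # an integer resets the current character code
            continue
        if tolerant and code >= len(encoding):
            continue  # malformed entry beyond the table: ignore it
        encoding[code] = item  # raises IndexError once code exceeds 255
        code += 1
    return encoding


# Ten glyph names starting at code 250 run past the end of the table.
too_many = [250] + [f"/g{i}" for i in range(10)]
try:
    apply_differences(too_many)
except IndexError as exc:
    print("strict:", exc)
print("tolerant:", apply_differences(too_many, tolerant=True)[250:])
```

Whether a tolerant reader should drop, clamp, or otherwise handle the extra entries is a design choice; the sketch simply drops them.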
