#3 Using PdfReader causes a crash #2836

Closed
macdeport opened this issue Sep 6, 2024 · 4 comments · Fixed by #2873
Labels
is-robustness-issue From a users perspective, this is about robustness workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments


macdeport commented Sep 6, 2024

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.6.9-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('cryptography', '41.0.7'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

	from pypdf import PdfReader
	
	# pdf_path: path to the PDF that triggers the crash (not shared)
	reader = PdfReader(pdf_path)
	txt = ''
	for page in reader.pages:
		txt += page.extract_text()  # <= Crash

Sorry I can't share this PDF with private information.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 209, in parse_encoding
    encoding[x] = adobe_glyphs[o]  # type: ignore
    ~~~~~~~~^^^
IndexError: list assignment index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/alain/Documents/Logiciels/Developpement/py-km-pathfinder-selection/pathfinder-selection-ocred-pdf-compress.py", line 1743, in <module>
    txt_in = pdf_text(fn_in) # <=
             ^^^^^^^^^^^^^^^
  File "/Users/alain/Documents/Logiciels/Developpement/py-km-pathfinder-selection/pathfinder-selection-ocred-pdf-compress.py", line 981, in pdf_text
    txt += page.extract_text() # <=
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 2102, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 1612, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 57, in build_char_map_from_dict
    encoding, space_code = parse_encoding(ft, space_code)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 211, in parse_encoding
    encoding[x] = o  # type: ignore
    ~~~~~~~~^^^
IndexError: list assignment index out of range
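
Both frames of the traceback fail on an assignment into `encoding` with an index past the end of the list. The sketch below is a simplified, hypothetical model of that failure mode and is not pypdf's actual code. It assumes the out-of-range index comes from a font encoding's `/Differences`-style array (an integer sets the current character code, each following name fills the next slot), which the traceback alone does not confirm. The function names `apply_differences` and `apply_differences_guarded` are made up for illustration.

```python
# Simplified, hypothetical model of the failure mode; NOT pypdf's actual code.

def apply_differences(encoding, differences):
    """Apply a /Differences-style list: an int sets the code, a str fills that slot."""
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item  # an integer resets the current character code
        else:
            encoding[code] = item  # raises IndexError when code >= len(encoding)
            code += 1
    return encoding


def apply_differences_guarded(encoding, differences):
    """Same walk, but record out-of-range codes instead of raising."""
    code = 0
    skipped = []
    for item in differences:
        if isinstance(item, int):
            code = item
            continue
        if 0 <= code < len(encoding):
            encoding[code] = item
        else:
            skipped.append((code, item))  # keep track of what was ignored
        code += 1
    return encoding, skipped
```

With a 256-entry table, `apply_differences([""] * 256, [300, "/A"])` raises the same kind of `IndexError` as in the traceback, while the guarded variant returns the table unchanged plus the skipped entry. Whether skipping is the right recovery is a design decision for the library; this only illustrates the shape of the problem.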

@stefan6419846 added the workflow-text-extraction, is-uncaught-exception, and is-robustness-issue labels, then removed is-uncaught-exception (reserved for issues caused by broken PDF documents that cannot be recovered), on Sep 6, 2024
@pubpub-zz
Collaborator

@macdeport
Can you make a test PDF with only one page, using page.remove_text()?

@macdeport
Author

fp='/Users/alain/Documents/Perso/Alain/SDC35rM/sdc35-24-4!4-240905.pdf'

#--------------------------
def pdf_text_test(pdf_path):
	"""
	
	(06/09/24 13:18:36)
	"""
	#https://pypdf.readthedocs.io/en/stable/
	#https://pypdf.readthedocs.io/en/stable/user/metadata.html
	from pypdf import PdfReader
	
	reader = PdfReader(pdf_path)
	#txt=''
	#for page in reader.pages:
	#	txt += page.extract_text() # <= PB Crash
	print(reader.pages[0])
	(reader.pages[0]).remove_text()
		
	return() # pdf_text()
#--------------------------

pdf_text_test(fp)
{'/Type': '/Page', '/Parent': IndirectObject(3, 0, 4337925520), '/Contents': IndirectObject(5, 0, 4337925520), '/MediaBox': [0, 0, 595, 841], '/Resources': {'/Font': {'/F00': IndirectObject(6, 0, 4337925520), '/F01': IndirectObject(8, 0, 4337925520), '/F02': IndirectObject(10, 0, 4337925520), '/F03': IndirectObject(12, 0, 4337925520)}, '/ProcSet': IndirectObject(15, 0, 4337925520)}}
Traceback (most recent call last):
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 26, in <module>
    pdf_text_test(fp)
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 21, in pdf_text_test
    (reader.pages[0]).remove_text()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PageObject' object has no attribute 'remove_text'


@pubpub-zz
Collaborator

Oops: remove_text() applies to the whole PDF, so the code should look something like this (off the top of my head):

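from pypdf import PdfWriter

writer = PdfWriter()
writer.append("original.pdf", pages=[0])  # copy only the first page
writer.remove_text()  # strip the text from everything in the writer
writer.write("test_file.pdf")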

Check the file: it should no longer contain any sensitive data.

@macdeport
Author

Two pieces of good news:

  • remove_text() works perfectly: the private text has completely disappeared,
  • the attached file dumb_extract_text_crash.pdf still triggers the crash even with the text removed.

dumb_extract_text_crash.pdf
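
For anyone who wants to confirm which pages of a file trigger the crash, here is a minimal diagnostic sketch. The helper names `find_failing_pages` and `check_pdf` are hypothetical and not part of pypdf; the only pypdf calls used are `PdfReader(...)`, `.pages`, and `page.extract_text()`, as in the report above. The `except` is deliberately broad because this is a diagnostic, not production error handling.

```python
def find_failing_pages(pages):
    """Return (page_index, exception) pairs for pages whose extract_text() raises."""
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: we only want to locate failures
            failures.append((index, exc))
    return failures


def check_pdf(pdf_path):
    """Run the check on a real file (pypdf is imported lazily, only when called)."""
    from pypdf import PdfReader

    return find_failing_pages(PdfReader(pdf_path).pages)
```

Calling `check_pdf("dumb_extract_text_crash.pdf")` on an affected pypdf version should list the page index together with the exception that was raised, and an empty list would mean no page failed.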


3 participants