IndexError: string index out of range when writing a PDF with set fields #1713

Closed
cmin764 opened this issue Mar 15, 2023 · 13 comments
Labels
workflow-forms From a users perspective, forms is the affected feature/workflow

Comments

@cmin764

cmin764 commented Mar 15, 2023

When filling in fields with values (on the correct page), I get the traceback below when trying to save the resulting PDF to an output file.

Environment

On Mac M1 with Python 3.9.13 and pypdf 3.5.2.

Code + PDF

This is the library wrapping code I use:

        self.ctx.switch_to_pdf(source_path)
        reader = self.active_pdf_document.reader
        if "/AcroForm" in reader.trailer["/Root"]:
            reader.trailer["/Root"]["/AcroForm"].update(
                {
                    pypdf.generic.NameObject(
                        "/NeedAppearances"
                    ): pypdf.generic.BooleanObject(True)
                }
            )
        writer = pypdf.PdfWriter()
        if use_appearances_writer:
            writer = self._set_need_appearances_writer(writer)

        if newvals:
            self.logger.debug("Updating form fields with provided values for all pages")
            updated_fields = newvals
        elif self.active_pdf_document.fields:
            self.logger.debug("Updating form fields with PDF values for all pages")
            updated_fields = {
                k: v["value"] or ""
                for (k, v) in self.active_pdf_document.fields.items()
            }
        else:
            self.logger.debug("No values available for updating the form fields")
            updated_fields = {}

        for page in reader.pages:
            if updated_fields:
                try:
                    writer.update_page_form_field_values(page, fields=updated_fields)
                except Exception as exc:  # pylint: disable=W0703
                    self.logger.warning(repr(exc))
            writer.add_page(page)

        if output_path is None:
            output_path = self.active_pdf_document.path
        with open(output_path, "wb") as stream:
            writer.write(stream)

I'll share the problematic PDF file once I've confirmed that it can be made public.

Traceback

This is the complete Traceback I see:

  File "/Users/cmin/Repos/rpaframework/packages/pdf/src/RPA/PDF/keywords/model.py", line 839, in save_field_values
    writer.write(stream)
  File "/Users/cmin/Repos/rpaframework/packages/pdf/.venv/lib/python3.9/site-packages/pypdf/_writer.py", line 1117, in write
    self.write_stream(stream)
  File "/Users/cmin/Repos/rpaframework/packages/pdf/.venv/lib/python3.9/site-packages/pypdf/_writer.py", line 1089, in write_stream
    object_positions = self._write_header(stream)
  File "/Users/cmin/Repos/rpaframework/packages/pdf/.venv/lib/python3.9/site-packages/pypdf/_writer.py", line 1143, in _write_header
    obj.write_to_stream(stream, key)
  File "/Users/cmin/Repos/rpaframework/packages/pdf/.venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 346, in write_to_stream
    value.write_to_stream(stream, encryption_key)
  File "/Users/cmin/Repos/rpaframework/packages/pdf/.venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 617, in write_to_stream
    stream.write(self.renumber())  # b_(renumber(self)))
  File "/Users/cmin/Repos/rpaframework/packages/pdf/.venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 626, in renumber
    out = self[0].encode("utf-8")
IndexError: string index out of range

MartinThoma added the workflow-forms label on Mar 15, 2023

@pubpub-zz
Collaborator

@cmin764
If you prefer, you can send the file to @MartinThoma at info@martin-thoma.de.

@pubpub-zz
Collaborator

@cmin764
Can you provide a minimal code example, please?

@Plouc314

@cmin764
I had the same problem and in my case the error was caused by setting a checkbox to "".
Here is an example with test.pdf:
This runs fine:

from pypdf import PdfReader, PdfWriter

filepath = "test.pdf"

reader = PdfReader(filepath)
writer = PdfWriter(clone_from=reader)

writer.update_page_form_field_values(
    writer.pages[0],
    {
        "CAC M": "/Oui",
    },
)

writer.write("out.pdf")

while this crashes:

from pypdf import PdfReader, PdfWriter

filepath = "test.pdf"

reader = PdfReader(filepath)
writer = PdfWriter(clone_from=reader)

writer.update_page_form_field_values(
    writer.pages[0],
    {
        "CAC M": "",
    },
)

writer.write("out.pdf")

Here is the traceback:

Traceback (most recent call last):
  File "/home/plouc314/Documents/househub/househub-backend/src/foo.py", line 15, in <module>
    writer.write("out.pdf")
  File "/home/plouc314/Documents/househub/househub-backend/.venv/lib/python3.11/site-packages/pypdf/_writer.py", line 1149, in write
    self.write_stream(stream)
  File "/home/plouc314/Documents/househub/househub-backend/.venv/lib/python3.11/site-packages/pypdf/_writer.py", line 1121, in write_stream
    object_positions = self._write_header(stream)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/plouc314/Documents/househub/househub-backend/.venv/lib/python3.11/site-packages/pypdf/_writer.py", line 1175, in _write_header
    obj.write_to_stream(stream, key)
  File "/home/plouc314/Documents/househub/househub-backend/.venv/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 346, in write_to_stream
    value.write_to_stream(stream, encryption_key)
  File "/home/plouc314/Documents/househub/househub-backend/.venv/lib/python3.11/site-packages/pypdf/generic/_base.py", line 617, in write_to_stream
    stream.write(self.renumber())  # b_(renumber(self)))
                 ^^^^^^^^^^^^^^^
  File "/home/plouc314/Documents/househub/househub-backend/.venv/lib/python3.11/site-packages/pypdf/generic/_base.py", line 626, in renumber
    out = self[0].encode("utf-8")
          ~~~~^^^
IndexError: string index out of range

So I'm guessing your problem might come from k: v["value"] or "".

(On Ubuntu 20.04.5, Python 3.11.2, pypdf 3.7.0)
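
One way to avoid passing empty values in the first place is to drop them before calling update_page_form_field_values. A minimal sketch, assuming the wrapper's field mapping has the shape used in the original post (each entry exposing a "value" key); the helper name values_to_write is hypothetical and not part of pypdf:

```python
from typing import Any, Dict, Mapping


def values_to_write(fields: Mapping[str, Mapping[str, Any]]) -> Dict[str, Any]:
    """Return only the fields whose stored value is non-empty.

    Assumption: each entry looks like {"value": ...}, as in the wrapper code
    from the original post. Fields holding None or "" are skipped, not cleared.
    """
    return {
        name: info["value"]
        for name, info in fields.items()
        if info.get("value")
    }


# Example data, made up for illustration.
fields = {
    "CAC M": {"value": None},  # unset checkbox: skipped
    "Name": {"value": "Ann"},  # text field with a value: kept
}
print(values_to_write(fields))
```

Note that this leaves skipped fields as they are in the source PDF; it does not reset them.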

@cmin764
Author

cmin764 commented Apr 10, 2023

Yes, that's right! I think you cracked the case, and I'll come back to it as soon as it is reported again on our side.

By the way, what would the author suggest as the default value when the user does not provide one and we still have to create the final output PDF? Should it be None instead of an empty string, or maybe the previously set object (like the value extracted from /V)?

@pubpub-zz
Collaborator

The value of /V for a checkbox shall be a NameObject:
[image]

As written, the value shall be a NameObject (a string starting with /), not "".

@pubpub-zz
Collaborator

I propose closing this issue. @MartinThoma, do you think it should be converted into a discussion?

@cmin764
Author

cmin764 commented Apr 12, 2023

That sounds good. Still, wouldn't it be better to handle this in the pypdf package? Users usually expect to pass string values when setting fields, and for most PDFs setting an empty string just works (or worked in previous versions, so this is more a matter of backwards-compatible behavior).

@pubpub-zz
Collaborator

The problem is not pypdf but the PDF standard: you can write invalid data, but if viewers do not interpret it correctly, it is useless.
You should have a look at
https://stackoverflow.com/questions/66527035/do-checkboxes-default-to-unchecked-in-a-pdf
You can try deleting (or not creating) "/V"; it may work.

@pubpub-zz
Collaborator

I've spent some time studying fields and have more details:
a checkbox that is not checked always uses "/Off".
So your code should read:

writer.update_page_form_field_values(
    writer.pages[0],
    {
        "CAC M": "/Off",
    },
)
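
To apply this without touching every call site, empty checkbox values can be mapped to "/Off" before the update. A rough sketch: the helper name and the idea of passing in the checkbox names explicitly are assumptions for illustration, not pypdf API.

```python
from typing import Any, Dict, Iterable, Mapping

OFF_STATE = "/Off"  # state name for an unchecked box, per the comment above


def normalize_checkbox_values(
    values: Mapping[str, Any], checkbox_names: Iterable[str]
) -> Dict[str, Any]:
    """Replace empty values of the named checkboxes with "/Off".

    Other fields (e.g. text fields, where "" is a valid value) are untouched.
    """
    normalized = dict(values)
    for name in checkbox_names:
        if name in normalized and not normalized[name]:
            normalized[name] = OFF_STATE
    return normalized


# "Comment" is a made-up text field used only to show it is left alone.
cleaned = normalize_checkbox_values({"CAC M": "", "Comment": ""}, ["CAC M"])
print(cleaned)
```

The result can then be passed as the fields argument of update_page_form_field_values.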

@pubpub-zz
Collaborator

The PR I've prepared should help you: get_fields will now return the acceptable states in a "/States" virtual entry (applicable to checkboxes, radio buttons, and lists).

@pubpub-zz
Collaborator

get_fields()[field]["_States_"] is implemented in #1864; we can close this issue.
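
With the accepted states available, a value can be checked before it is written. A sketch under assumptions: the key is a parameter because this thread spells it two ways ("/States" and "_States_"), so inspect what your pypdf version actually returns; check_state is a hypothetical helper that works on a plain dict shaped like the get_fields() result.

```python
from typing import Any, Mapping


def check_state(
    fields: Mapping[str, Mapping[str, Any]],
    name: str,
    value: str,
    states_key: str = "/_States_",  # assumed spelling; verify on your version
) -> None:
    """Raise ValueError if value is not one of the field's accepted states.

    Fields without a states entry (e.g. plain text fields) accept any value.
    """
    states = fields[name].get(states_key)
    if states is not None and value not in states:
        raise ValueError(
            f"{value!r} is not valid for {name!r}; expected one of {list(states)}"
        )


# Stand-in for get_fields() output; the states listed here are made up.
fields = {"CAC M": {"/_States_": ["/Oui", "/Off"]}, "Name": {}}
check_state(fields, "CAC M", "/Off")  # accepted
check_state(fields, "Name", "")       # no states entry: accepted
```

Passing "" for "CAC M" would raise ValueError here, which surfaces the mistake before writer.write() fails with the IndexError.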

@cmin764
Author

cmin764 commented Aug 9, 2023

@pubpub-zz I know this is closed, but I have a question about the changes above: I noticed that with the latest version (3.15.0) there is no /AcroForm dict in the /Root trailer anymore, which makes setting fields in older PDFs challenging:

[ WARN ] PyPdfError('No /AcroForm dictionary in PdfWriter Object')

How do you recommend retrieving and setting fields in PDFs now? Any links to relevant docs, code changes, or examples would be greatly appreciated!

@pubpub-zz
Collaborator

To copy the /AcroForm entries, you have to either use the clone_from parameter when calling the PdfWriter constructor or use the append() function.
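
Applied to the code from the original post, which builds an empty PdfWriter() and adds pages one by one, this could look like the sketch below. writer_with_form is a hypothetical helper; the writer_cls parameter exists only so the sketch can be exercised without a real PDF, and it defaults to pypdf's PdfWriter.

```python
def writer_with_form(reader, writer_cls=None):
    """Build a writer via append(), which per the reply above also copies /AcroForm.

    reader: an open pypdf.PdfReader.
    writer_cls: class to instantiate; defaults to pypdf.PdfWriter.
    """
    if writer_cls is None:
        from pypdf import PdfWriter  # imported lazily; needed only for the default

        writer_cls = PdfWriter
    writer = writer_cls()
    writer.append(reader)  # instead of looping over reader.pages with add_page()
    return writer
```

The alternative from the same reply is PdfWriter(clone_from=reader), as in the test.pdf example earlier in the thread, which then updates fields on writer.pages.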

4 participants