Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868

Open
2 of 3 tasks
maximrobi opened this issue Dec 13, 2021 · 32 comments
Open
2 of 3 tasks
Labels
Bug Docs IO Excel read_excel, to_excel

Comments

@maximrobi
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

excel_path = r"D:\my_test.xlsm"
with pd.ExcelWriter(excel_path, mode='a') as writer:
    # df.to_excel(writer, sheet_name='Sheet1')
    pass

df_excel = pd.read_excel(excel_path, 'Sheet1')

Issue Description

The specified Excel file gets corrupted, despite the append mode. Changing its extension to .xlsx can retrieve the raw data from it, but the macro is forever gone.

In my example I used with and had only a pass statement inside it, yet the file became useless.

Excel_macro_corruption.zip

Q1: Can we know for sure that the default engine is used for .xlsm files?
Q2: Should I contact OpenPyxl instead of you with this issue?

Expected Behavior

As I've read since my first encounter of this bug, ExcelWriter doesn't support macros, but a warning of some kind would be nice before wiping out a carefully written VBA code from the Excel file.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.3.4
numpy : 1.17.3
pytz : 2021.3
dateutil : 2.8.2
pip : 19.3.1
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@maximrobi maximrobi added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 13, 2021
@twoertwein
Copy link
Member

Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?

I assume you still lose your macro with the non-empty with-block?

If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.

If the engine does not support some excel features, these features will probably get lost.

@twoertwein twoertwein added the IO Excel read_excel, to_excel label Dec 13, 2021
@twoertwein
Copy link
Member

It would be good to emphasize that features not supported by the underlying engine (which I think is always openpyxl for mode="a") are lost in the mode section of ExcelWriter:
image

@maximrobi
Copy link
Author

Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?

I assume you still lose your macro with the non-empty with-block?

If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.

If the engine does not support some excel features, these features will probably get lost.

My main problem is, that after executing the code (which should do nothing..) my file looks intact with .xlsm extension, but I can't open it since it got corrupted somehow. I need to manually change the extension to .xlsx to be able to at least read the info in the cells - but this way the macro is well and truly lost. I will try to replicate the same behavior with openpyxl tomorrow and get back to you.

@maximrobi
Copy link
Author

Does it really choose a default engine based on the mode? I read that it's based on the filetype/extension:
image

Based on this, my .xlsm shouldn't even work - since it's not supported by any of the engines. I have no idea what happens when I execute the code. Can I debug somehow to see which engine is getting selected by default?

@twoertwein
Copy link
Member

Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).

The engine is inferred by extension and based on the content. You could add a breakpoint in your empty with block and check writer.engine

@rhshadrach
Copy link
Member

Or just print:

with pd.ExcelWriter('test.xlsx', mode='a', engine='openpyxl') as writer:
    print(writer.engine)

@rhshadrach
Copy link
Member

@maximrobi - can you post a file example before the corruption occurs?

@maximrobi
Copy link
Author

maximrobi commented Dec 16, 2021

@maximrobi - can you post a file example before the corruption occurs?

@rhshadrach You have it in the zip file under Issue Description. The my_test - backup.xlsm should work 100%, just change the name to what the script is expecting to find.

@maximrobi
Copy link
Author

maximrobi commented Dec 16, 2021

Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).

The engine is inferred by extension and based on the content. You could add a breakpoint in your empty with block and check writer.engine

@twoertwein Thanks for the info! I managed to debug the with statement and I can confirm that it is using openpyxl for my .xlsm file.

@simonjayhawkins simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label Dec 17, 2021
@rhshadrach
Copy link
Member

rhshadrach commented Dec 18, 2021

I can't duplicate on Linux. I wonder if the code in OpenpyxlWriter.save is causing the issue. Upon the context manager exiting, we run

        self.book.save(self.handles.handle)
        if "r+" in self.mode and not isinstance(self.handles.handle, mmap.mmap):
            # truncate file to the written content
            self.handles.handle.truncate()

I haven't been able to discern what this truncate function is doing.

@twoertwein
Copy link
Member

We always write the workbook back, even if no changes are made. That's how we lose features that are not supported by an engine.

I haven't been able to discern what this truncate function is doing.

The truncate shouldn't cause an issue here (it is actually needed). It is there in case the original file is larger than the new file (for example, if unsupported features are dropped by the engine). Otherwise, we would simply overwrite the first n-bits of the file and leave the remaining bits untouched, truncate removes those remaining bits.

@maximrobi
Copy link
Author

maximrobi commented Dec 22, 2021

we lose features that are not supported by an engine.

Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it? Maybe create a copy before manipulating it. I remember seeing some kind of warning for the Data Validation logic inside the same Excel document, but I couldn't replicate it since then.

@maximrobi
Copy link
Author

maximrobi commented Dec 22, 2021

I can't duplicate on Linux. I wonder if the code in OpenpyxlWriter.save is causing the issue.

Would this mean that this is a Windows-related bug?
What was your end-result after running my code: a non-corrupt Excel file with .xlsm extension? Did it keep the macro function intact inside it?

@rhshadrach
Copy link
Member

@maximrobi - I got a valid Excel file without the macro.

Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it?

That is currently the behavior: we raise if the type is not able to be determined and the engine is not explicitly specified. The issue here is that pandas determines the file is an xlsx because it is a zip file with the content "xl/workbook.xml". Is the conclusion here incorrect?

@rhshadrach
Copy link
Member

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

@maximrobi
Copy link
Author

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

I tried this and it results in the same corrupt file :(

@maximrobi
Copy link
Author

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

I'm not really familiar what the difference between .xlsx and .xlsm is on a lower level, but you are correct in that they both have workbooks. Theoretically if you open an .xlsm file az zip, you should find the macro inside the vbaproject.bin file.

@AlejoB13C
Copy link

The most I could do with this was to modify pandas.io.excel._openpyxl.py.OpenpyxlWriter.__init__.book (line 65) to directly add "keep_vba = True" since this line does not include any motor_kwargs. And from what I tried this doesn't affect .xlsx files

However to append data to existing sheet, you should do something like:

path = './file.xlsm'
sheet_name = 'sheet_name'
adf = pd.read_excel(path, sheet_name=sheet_name, engine='openpyxl')
wb = load_workbook(path, keep_vba=True)

with pd.ExcelWriter(path, engine = 'openpyxl') as writer:
    writer.book = wb
    writer.sheets = {n: wb[n] for n in wb.sheetnames}
    df.to_excel(writer, sheet_name = sheetname, index=False, header=False, startrow=len(adf)+1)

or modify again _openpyxl.py to add something like an "append" option to def write_cells()

@ryanBeattie
Copy link

@maximrobi I have had the same issue. I tried each solution in this thread and others found on stack overflow https://stackoverflow.com/questions/28169731/write-pandas-dataframe-to-xlsm-file-excel-with-macros-enabled. I have found a workaround using xlwings to save the file as a xlsm file after the data frame has been written to it.

`
import os

import xlswriter
import pandas as pd
from xlwings import Book

df = pd.DataFrame(columns=['one', 'two'], data=[[1,2], [3,4]])

filename = 'TEST.xlsx'
filename_macro = 'TEST.xlsm'

with pd.ExcelWriter(filename, engine='xlsxwriter') as writer:
df.to_excel(writer)
writer.save()

book = Book(filename)
book.save(filename_macro)
book.close()

os.remove(filename)
os.system("TASKKILL /F /IM Excel.exe")
`

The first part of the code is standard writing to an xlsx file in pandas. The second part is using the xlwings Book class constructor to open the xlsx file inside your directory and then saving it with the xlsm file extension.
In order to remove the redundant xlsx file, using os module to delete it from the system.

There is a minor bug that when the Book() constructor is called and then closed, the excel application opens and doesnt terminate. It stays open as a minimised window, hence the os module is required again to terminate the process using the system function.

I am not very experienced and still at university so maybe my implementation is messy or unnecessarily complicated, but it worked for me after hours of headache so I hope it helps.

I also did not add in that I create the workbook excel file using xlsxwriter first. This may be redundant when using the pandas module however.

@AlejoB13C
Copy link

AlejoB13C commented Jan 21, 2022

What I do to create a valid .xlsm file directly in python is also with openpyxl
I do something like

from openpyxl import Workbook, load_workbook

path = './book.xlsm'
wb = Workbook()
wb.save(path)

wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()

The first part with Workbook() creates the file, but it is corrupted, however, when you open it with load_workbook and the keep_vba=True arg, the file is magically fixed and ready to be manipulated.

@tehunter
Copy link
Contributor

tehunter commented Oct 4, 2022

@maximrobi Is this resolved?

I just ran into the issue initially before setting the keep_vba parameter in engine_kwargs. Everything seems to work properly now (this is on Windows btw)

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the book with openpyxl.load_workbook in pre-1.5 to using the file copy + mode="a" approach in post-1.5 as discussed in #48780

@maximrobi
Copy link
Author

What I do to create a valid .xlsm file directly in python is also with openpyxl I do something like

from openpyxl import Workbook, load_workbook

path = './book.xlsm'
wb = Workbook()
wb.save(path)

wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()

The first part with Workbook() creates the file, but it is corrupted, however, when you open it with load_workbook and the keep_vba=True arg, the file is magically fixed and ready to be manipulated.

Wow, maybe I should try this at my end as well - although I already changed my workstation and maybe a newer version of Pandas already has a fix for my problem. I'll let you know if I manage to get any results!

@maximrobi
Copy link
Author

@maximrobi Is this resolved?

I just ran into the issue initially before setting the keep_vba parameter in engine_kwargs. Everything seems to work properly now (this is on Windows btw)

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?
In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the book with openpyxl.load_workbook in pre-1.5 to using the file copy + mode="a" approach in post-1.5 as discussed in #48780

Not sure what a resolution would be to this issue, maybe a confirmation from someone that a warning was implemented and nobody loses 2 days' worth of scripting in VBA due to a mini-automatization script in Python :))
[and I learned that you never run a script with original input of a file..]

So no, as far as I know, this issue is still active. I will try it once again on my new machine and latest pandas version to see how it behaves.

@eldarmammadov
Copy link

eldarmammadov commented Nov 26, 2022

I wonder how issue has been resolved? I previously resolved same issue by installing pandas 1.4.4 version. Now, with v1.5.2 gives same corrupted excel file error
Magic line in v1.5.2
with pd.ExcelWriter(new_path03xl, engine='openpyxl', mode='r+', if_sheet_exists='overlay',engine_kwargs={'keep_vba': True}) as writer:
book = load_workbook(new_path03xl, keep_vba=True)

@sistaninia
Copy link

Do not save the writer, close it:

writer.close()

@ejabu
Copy link

ejabu commented Sep 14, 2023

Worked for me with below code

with pd.ExcelWriter(output_path) as writer:
        df2.to_excel(writer, sheet_name="v2", index=False)
        for sheet_name in ["v2"]:
            ws = writer.sheets[sheet_name]
            ws.column_dimensions['A'].width = 247 / 6
            ws.column_dimensions['B'].width = 78 / 6
            ws.column_dimensions['C'].width = 99 / 6
            ws.column_dimensions['D'].width = 208 / 6
            ws.column_dimensions['E'].width = 140 / 6
            ws.column_dimensions['F'].width = 200 / 6
            ws.column_dimensions['G'].width = 126 / 6
        # writer.close() # this line is the root cause for me

@iiosenov
Copy link

iiosenov commented Oct 4, 2023

Passing the keep_vba param in combination with openpyxl indeed solved the problem.

with pd.ExcelWriter(path, engine = 'openpyxl', engine_kwargs = {'keep_vba': True}, mode = 'a', if_sheet_exists = 'replace') as writer:
    df.to_excel(writer, sheet_name='df')

@rhshadrach
Copy link
Member

Thanks @ejabu and @iiosenov. I think the main issue has been solved by #52214 - namely users can now pass keep_vba=True if they desire. But this seems to result in a corrupted file if not passed.

I'm wondering what the negative side effects are of having keep_vba=True as the default.

@maximrobi
Copy link
Author

maximrobi commented Jan 16, 2024

Hi @rhshadrach,
and thank you for the summary! Yes, the issue is solved with the latest releases available in PyCharm (checked on my new workstation with Python3.10, Pandas 2.1.4 and Openpyxl 3.1.2).

Since I also tested it with the original versions from this ticket and it still results in a corrupt Excel document, is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.

P.S: The ticket you linked mainly talks about .xls support - but in my case I only had problems due to the added macro (meaning .xlsm), everything worked okay on .xlsx documents. I double checked it right now - not sure if it makes a difference, but I felt a need to specify.

@rhshadrach
Copy link
Member

is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.

pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.

@maximrobi
Copy link
Author

maximrobi commented Jan 17, 2024

pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.

Understood. Then I think we can close this issue - and as a solution I will either create a copy of my input documents before running ANYTHING on them, or just use the newer Pandas versions in the future.
Thanks again!

@maximrobi maximrobi closed this as not planned Won't fix, can't repro, duplicate, stale Jan 19, 2024
@rhshadrach
Copy link
Member

I still think we should explore having keep_vba default to True. I'd like to keep this issue open until that determination happens.

@rhshadrach rhshadrach reopened this Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Docs IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests