BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868

maximrobi · 2021-12-13T19:17:26Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

excel_path = r"D:\my_test.xlsm"
with pd.ExcelWriter(excel_path, mode='a') as writer:
    # df.to_excel(writer, sheet_name='Sheet1')
    pass

df_excel = pd.read_excel(excel_path, 'Sheet1')

Issue Description

The specified Excel file gets corrupted, despite the append mode. Changing its extension to .xlsx can retrieve the raw data from it, but the macro is forever gone.

In my example I used with and had only a pass statement inside it, yet the file became useless.

Excel_macro_corruption.zip

Q1: Can we know for sure that the default engine is used for .xlsm files?
Q2: Should I contact OpenPyxl instead of you with this issue?

Expected Behavior

As I've read since my first encounter of this bug, ExcelWriter doesn't support macros, but a warning of some kind would be nice before wiping out a carefully written VBA code from the Excel file.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.3.4
numpy : 1.17.3
pytz : 2021.3
dateutil : 2.8.2
pip : 19.3.1
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

twoertwein · 2021-12-13T19:45:19Z

Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?

I assume you still lose your macro with the non-empty with-block?

If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.

If the engine does not support some excel features, these features will probably get lost.

twoertwein · 2021-12-14T18:00:19Z

It would be good to emphasize that features not supported by the underlying engine (which I think is always openpyxl for mode="a") are lost in the mode section of ExcelWriter:

maximrobi · 2021-12-14T22:01:49Z

Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?

I assume you still lose your macro with the non-empty with-block?

If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.

If the engine does not support some excel features, these features will probably get lost.

My main problem is, that after executing the code (which should do nothing..) my file looks intact with .xlsm extension, but I can't open it since it got corrupted somehow. I need to manually change the extension to .xlsx to be able to at least read the info in the cells - but this way the macro is well and truly lost. I will try to replicate the same behavior with openpyxl tomorrow and get back to you.

maximrobi · 2021-12-14T22:07:22Z

Does it really choose a default engine based on the mode? I read that it's based on the filetype/extension:

Based on this, my .xlsm shouldn't even work - since it's not supported by any of the engines. I have no idea what happens when I execute the code. Can I debug somehow to see which engine is getting selected by default?

twoertwein · 2021-12-14T23:48:59Z

Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).

The engine is inferred by extension and based on the content. You could add a breakpoint in your empty with block and check writer.engine

rhshadrach · 2021-12-16T03:46:34Z

Or just print:

with pd.ExcelWriter('test.xlsx', mode='a', engine='openpyxl') as writer:
    print(writer.engine)

rhshadrach · 2021-12-16T03:47:27Z

@maximrobi - can you post a file example before the corruption occurs?

maximrobi · 2021-12-16T09:28:55Z

@maximrobi - can you post a file example before the corruption occurs?

@rhshadrach You have it in the zip file under Issue Description. The my_test - backup.xlsm should work 100%, just change the name to what the script is expecting to find.

maximrobi · 2021-12-16T09:31:24Z

Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).

The engine is inferred by extension and based on the content. You could add a breakpoint in your empty with block and check writer.engine

@twoertwein Thanks for the info! I managed to debug the with statement and I can confirm that it is using openpyxl for my .xlsm file.

rhshadrach · 2021-12-18T03:29:45Z

I can't duplicate on Linux. I wonder if the code in OpenpyxlWriter.save is causing the issue. Upon the context manager exiting, we run

        self.book.save(self.handles.handle)
        if "r+" in self.mode and not isinstance(self.handles.handle, mmap.mmap):
            # truncate file to the written content
            self.handles.handle.truncate()

I haven't been able to discern what this truncate function is doing.

twoertwein · 2021-12-18T03:40:27Z

We always write the workbook back, even if no changes are made. That's how we lose features that are not supported by an engine.

I haven't been able to discern what this truncate function is doing.

The truncate shouldn't cause an issue here (it is actually needed). It is there in case the original file is larger than the new file (for example, if unsupported features are dropped by the engine). Otherwise, we would simply overwrite the first n-bits of the file and leave the remaining bits untouched, truncate removes those remaining bits.

maximrobi · 2021-12-22T01:10:26Z

we lose features that are not supported by an engine.

Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it? Maybe create a copy before manipulating it. I remember seeing some kind of warning for the Data Validation logic inside the same Excel document, but I couldn't replicate it since then.

maximrobi · 2021-12-22T01:15:37Z

I can't duplicate on Linux. I wonder if the code in OpenpyxlWriter.save is causing the issue.

Would this mean that this is a Windows-related bug?
What was your end-result after running my code: a non-corrupt Excel file with .xlsm extension? Did it keep the macro function intact inside it?

rhshadrach · 2021-12-31T13:49:41Z

@maximrobi - I got a valid Excel file without the macro.

Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it?

That is currently the behavior: we raise if the type is not able to be determined and the engine is not explicitly specified. The issue here is that pandas determines the file is an xlsx because it is a zip file with the content "xl/workbook.xml". Is the conclusion here incorrect?

rhshadrach · 2021-12-31T14:10:28Z

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

maximrobi · 2022-01-06T14:18:08Z

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:
with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass
What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

I tried this and it results in the same corrupt file :(

maximrobi · 2022-01-06T14:27:47Z

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:
with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass
What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

I'm not really familiar what the difference between .xlsx and .xlsm is on a lower level, but you are correct in that they both have workbooks. Theoretically if you open an .xlsm file az zip, you should find the macro inside the vbaproject.bin file.

AlejoB13C · 2022-01-15T06:45:54Z

The most I could do with this was to modify pandas.io.excel._openpyxl.py.OpenpyxlWriter.__init__.book (line 65) to directly add "keep_vba = True" since this line does not include any motor_kwargs. And from what I tried this doesn't affect .xlsx files

However to append data to existing sheet, you should do something like:

path = './file.xlsm'
sheet_name = 'sheet_name'
adf = pd.read_excel(path, sheet_name=sheet_name, engine='openpyxl')
wb = load_workbook(path, keep_vba=True)

with pd.ExcelWriter(path, engine = 'openpyxl') as writer:
    writer.book = wb
    writer.sheets = {n: wb[n] for n in wb.sheetnames}
    df.to_excel(writer, sheet_name = sheetname, index=False, header=False, startrow=len(adf)+1)

or modify again _openpyxl.py to add something like an "append" option to def write_cells()

ryanBeattie · 2022-01-21T02:51:32Z

@maximrobi I have had the same issue. I tried each solution in this thread and others found on stack overflow https://stackoverflow.com/questions/28169731/write-pandas-dataframe-to-xlsm-file-excel-with-macros-enabled. I have found a workaround using xlwings to save the file as a xlsm file after the data frame has been written to it.

`
import os

import xlswriter
import pandas as pd
from xlwings import Book

df = pd.DataFrame(columns=['one', 'two'], data=[[1,2], [3,4]])

filename = 'TEST.xlsx'
filename_macro = 'TEST.xlsm'

with pd.ExcelWriter(filename, engine='xlsxwriter') as writer:
df.to_excel(writer)
writer.save()

book = Book(filename)
book.save(filename_macro)
book.close()

os.remove(filename)
os.system("TASKKILL /F /IM Excel.exe")
`

The first part of the code is standard writing to an xlsx file in pandas. The second part is using the xlwings Book class constructor to open the xlsx file inside your directory and then saving it with the xlsm file extension.
In order to remove the redundant xlsx file, using os module to delete it from the system.

There is a minor bug that when the Book() constructor is called and then closed, the excel application opens and doesnt terminate. It stays open as a minimised window, hence the os module is required again to terminate the process using the system function.

I am not very experienced and still at university so maybe my implementation is messy or unnecessarily complicated, but it worked for me after hours of headache so I hope it helps.

I also did not add in that I create the workbook excel file using xlsxwriter first. This may be redundant when using the pandas module however.

AlejoB13C · 2022-01-21T06:53:18Z

What I do to create a valid .xlsm file directly in python is also with openpyxl
I do something like

from openpyxl import Workbook, load_workbook

path = './book.xlsm'
wb = Workbook()
wb.save(path)

wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()

The first part with Workbook() creates the file, but it is corrupted, however, when you open it with load_workbook and the keep_vba=True arg, the file is magically fixed and ready to be manipulated.

tehunter · 2022-10-04T13:01:43Z

@maximrobi Is this resolved?

I just ran into the issue initially before setting the keep_vba parameter in engine_kwargs. Everything seems to work properly now (this is on Windows btw)

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:
with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass
What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the book with openpyxl.load_workbook in pre-1.5 to using the file copy + mode="a" approach in post-1.5 as discussed in #48780

maximrobi · 2022-10-12T17:06:59Z

What I do to create a valid .xlsm file directly in python is also with openpyxl I do something like
from openpyxl import Workbook, load_workbook

path = './book.xlsm'
wb = Workbook()
wb.save(path)

wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()
The first part with Workbook() creates the file, but it is corrupted, however, when you open it with load_workbook and the keep_vba=True arg, the file is magically fixed and ready to be manipulated.

Wow, maybe I should try this at my end as well - although I already changed my workstation and maybe a newer version of Pandas already has a fix for my problem. I'll let you know if I manage to get any results!

maximrobi · 2022-10-12T17:10:41Z

@maximrobi Is this resolved?

I just ran into the issue initially before setting the keep_vba parameter in engine_kwargs. Everything seems to work properly now (this is on Windows btw)
I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:
with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass
What happens when you run this on Windows @maximrobi?
In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.
+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the book with openpyxl.load_workbook in pre-1.5 to using the file copy + mode="a" approach in post-1.5 as discussed in #48780

Not sure what a resolution would be to this issue, maybe a confirmation from someone that a warning was implemented and nobody loses 2 days' worth of scripting in VBA due to a mini-automatization script in Python :))
[and I learned that you never run a script with original input of a file..]

So no, as far as I know, this issue is still active. I will try it once again on my new machine and latest pandas version to see how it behaves.

eldarmammadov · 2022-11-26T17:44:14Z

I wonder how issue has been resolved? I previously resolved same issue by installing pandas 1.4.4 version. Now, with v1.5.2 gives same corrupted excel file error
Magic line in v1.5.2
with pd.ExcelWriter(new_path03xl, engine='openpyxl', mode='r+', if_sheet_exists='overlay',engine_kwargs={'keep_vba': True}) as writer:
book = load_workbook(new_path03xl, keep_vba=True)

sistaninia · 2022-11-28T17:04:25Z

Do not save the writer, close it:

writer.close()

ejabu · 2023-09-14T01:15:01Z

Worked for me with below code

with pd.ExcelWriter(output_path) as writer:
        df2.to_excel(writer, sheet_name="v2", index=False)
        for sheet_name in ["v2"]:
            ws = writer.sheets[sheet_name]
            ws.column_dimensions['A'].width = 247 / 6
            ws.column_dimensions['B'].width = 78 / 6
            ws.column_dimensions['C'].width = 99 / 6
            ws.column_dimensions['D'].width = 208 / 6
            ws.column_dimensions['E'].width = 140 / 6
            ws.column_dimensions['F'].width = 200 / 6
            ws.column_dimensions['G'].width = 126 / 6
        # writer.close() # this line is the root cause for me

iiosenov · 2023-10-04T13:01:07Z

Passing the keep_vba param in combination with openpyxl indeed solved the problem.

with pd.ExcelWriter(path, engine = 'openpyxl', engine_kwargs = {'keep_vba': True}, mode = 'a', if_sheet_exists = 'replace') as writer:
    df.to_excel(writer, sheet_name='df')

rhshadrach · 2023-10-07T15:49:53Z

Thanks @ejabu and @iiosenov. I think the main issue has been solved by #52214 - namely users can now pass keep_vba=True if they desire. But this seems to result in a corrupted file if not passed.

I'm wondering what the negative side effects are of having keep_vba=True as the default.

maximrobi · 2024-01-16T14:25:54Z

Hi @rhshadrach,
and thank you for the summary! Yes, the issue is solved with the latest releases available in PyCharm (checked on my new workstation with Python3.10, Pandas 2.1.4 and Openpyxl 3.1.2).

Since I also tested it with the original versions from this ticket and it still results in a corrupt Excel document, is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.

P.S: The ticket you linked mainly talks about .xls support - but in my case I only had problems due to the added macro (meaning .xlsm), everything worked okay on .xlsx documents. I double checked it right now - not sure if it makes a difference, but I felt a need to specify.

rhshadrach · 2024-01-16T21:54:30Z

is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.

pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.

maximrobi · 2024-01-17T11:05:35Z

pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.

Understood. Then I think we can close this issue - and as a solution I will either create a copy of my input documents before running ANYTHING on them, or just use the newer Pandas versions in the future.
Thanks again!

rhshadrach · 2024-01-19T21:14:40Z

I still think we should explore having keep_vba default to True. I'd like to keep this issue open until that determination happens.

maximrobi added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 13, 2021

twoertwein added the IO Excel read_excel, to_excel label Dec 13, 2021

twoertwein added the Docs label Dec 14, 2021

simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label Dec 17, 2021

twoertwein mentioned this issue Jan 23, 2022

ENH: Support for Excel features when writing #45572

Closed

maximrobi closed this as completed Jan 19, 2024

maximrobi closed this as not planned Won't fix, can't repro, duplicate, stale Jan 19, 2024

rhshadrach reopened this Jan 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868

BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868

maximrobi commented Dec 13, 2021

INSTALLED VERSIONS

twoertwein commented Dec 13, 2021

twoertwein commented Dec 14, 2021

maximrobi commented Dec 14, 2021

maximrobi commented Dec 14, 2021

twoertwein commented Dec 14, 2021

rhshadrach commented Dec 16, 2021

rhshadrach commented Dec 16, 2021

maximrobi commented Dec 16, 2021 •

edited

Loading

maximrobi commented Dec 16, 2021 •

edited

Loading

rhshadrach commented Dec 18, 2021 •

edited

Loading

twoertwein commented Dec 18, 2021

maximrobi commented Dec 22, 2021 •

edited

Loading

maximrobi commented Dec 22, 2021 •

edited

Loading

rhshadrach commented Dec 31, 2021

rhshadrach commented Dec 31, 2021

maximrobi commented Jan 6, 2022

maximrobi commented Jan 6, 2022

AlejoB13C commented Jan 15, 2022

ryanBeattie commented Jan 21, 2022

AlejoB13C commented Jan 21, 2022 •

edited

Loading

tehunter commented Oct 4, 2022 •

edited

Loading

maximrobi commented Oct 12, 2022

maximrobi commented Oct 12, 2022

eldarmammadov commented Nov 26, 2022 •

edited

Loading

sistaninia commented Nov 28, 2022

ejabu commented Sep 14, 2023

iiosenov commented Oct 4, 2023

rhshadrach commented Oct 7, 2023

maximrobi commented Jan 16, 2024 •

edited

Loading

rhshadrach commented Jan 16, 2024

maximrobi commented Jan 17, 2024 •

edited

Loading

rhshadrach commented Jan 19, 2024

BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868

BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868

Comments

maximrobi commented Dec 13, 2021

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

twoertwein commented Dec 13, 2021

twoertwein commented Dec 14, 2021

maximrobi commented Dec 14, 2021

maximrobi commented Dec 14, 2021

twoertwein commented Dec 14, 2021

rhshadrach commented Dec 16, 2021

rhshadrach commented Dec 16, 2021

maximrobi commented Dec 16, 2021 • edited Loading

maximrobi commented Dec 16, 2021 • edited Loading

rhshadrach commented Dec 18, 2021 • edited Loading

twoertwein commented Dec 18, 2021

maximrobi commented Dec 22, 2021 • edited Loading

maximrobi commented Dec 22, 2021 • edited Loading

rhshadrach commented Dec 31, 2021

rhshadrach commented Dec 31, 2021

maximrobi commented Jan 6, 2022

maximrobi commented Jan 6, 2022

AlejoB13C commented Jan 15, 2022

ryanBeattie commented Jan 21, 2022

AlejoB13C commented Jan 21, 2022 • edited Loading

tehunter commented Oct 4, 2022 • edited Loading

maximrobi commented Oct 12, 2022

maximrobi commented Oct 12, 2022

eldarmammadov commented Nov 26, 2022 • edited Loading

sistaninia commented Nov 28, 2022

ejabu commented Sep 14, 2023

iiosenov commented Oct 4, 2023

rhshadrach commented Oct 7, 2023

maximrobi commented Jan 16, 2024 • edited Loading

rhshadrach commented Jan 16, 2024

maximrobi commented Jan 17, 2024 • edited Loading

rhshadrach commented Jan 19, 2024

maximrobi commented Dec 16, 2021 •

edited

Loading

maximrobi commented Dec 16, 2021 •

edited

Loading

rhshadrach commented Dec 18, 2021 •

edited

Loading

maximrobi commented Dec 22, 2021 •

edited

Loading

maximrobi commented Dec 22, 2021 •

edited

Loading

AlejoB13C commented Jan 21, 2022 •

edited

Loading

tehunter commented Oct 4, 2022 •

edited

Loading

eldarmammadov commented Nov 26, 2022 •

edited

Loading

maximrobi commented Jan 16, 2024 •

edited

Loading

maximrobi commented Jan 17, 2024 •

edited

Loading