Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: consider deprecating convert_floats from read_excel #41127

Closed
ahawryluk opened this issue Apr 23, 2021 · 5 comments · Fixed by #41176
Closed

CLN: consider deprecating convert_floats from read_excel #41127

ahawryluk opened this issue Apr 23, 2021 · 5 comments · Fixed by #41176
Assignees
Labels
Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel
Milestone

Comments

@ahawryluk
Copy link
Contributor

As the docs explain, the binary spreadsheet formats (xls, xlsb) store all numbers as floats, so by default pandas tries to convert floats to integers if it doesn’t lose information (1.0 --> 1). "You can pass ``convert_float=False` to disable this behavior, which may give a slight performance improvement." I tested this on four file types for a spreadsheet of ~440,000 cells, and recorded the best times out of 10 repetitions:

File type convert_floats=True convert_floats=False speed up
xls 1.081 1.036 4.2%
xlsb 3.413 3.357 1.6%
ods 27.798 27.770 0.1%
xlsx 5.182 5.189 -0.1%

convert_floats was probably written for the benefit of .xls files, but the benefit is minor. The .xlsx files even have a slight penalty because openpyxl already converts floats to int where possible, and so pandas converts them back to float if convert_floats=False.

Since .xlsx files are now the most common spreadsheet format (citation: google search), and convert_floats only exists for performance, is it time to remove convert_floats? The spreadsheet engines would keep the behaviour of convert_floats=True and the argument would be deprecated. This change would simplify all four engines, and if anybody really needs their ints as floats, they can always specify a dtype. Note: this possible deprecation came up in #8212 (comment) before dtype was finalized in read_excel.

I can work on this if the community likes the idea.

@jreback
Copy link
Contributor

jreback commented Apr 23, 2021

+1 on depreciating

you would make the default None and warn if people are setting it

@phofl phofl added Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel labels Apr 23, 2021
@phofl
Copy link
Member

phofl commented Apr 23, 2021

+1 too

@ahawryluk
Copy link
Contributor Author

take

@jreback jreback added this to the 1.3 milestone Apr 30, 2021
@italo-turing
Copy link

I have a .xlsx file that contains a column of mostly strings but with the occasional number. I want those numbers to be read as shown in Excel (integers as integers, floats as floats). Before this deprecation, read_excel read those numbers as integers when appropriate, but now they are always read as floats. Specifying {col: str} doesn't help; the numbers still get parsed as floats. So, for example, a cell showing 121 in Excel is read as 121.0.

My current solution is to manually iterate through that column later and figure those cases out. Is there a better alternative here?

@ahawryluk
Copy link
Contributor Author

Hi @italo-turing I think you've already found the best alternative if you're constrained to .xlsx as input. If you're free to change your input file to .csv then the mixed-type (object) column will load with both ints and floats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants