CLN: consider deprecating convert_floats from read_excel #41127

ahawryluk · 2021-04-23T20:57:21Z

As the docs explain, the binary spreadsheet formats (xls, xlsb) store all numbers as floats, so by default pandas converts floats to integers when doing so loses no information (1.0 --> 1). "You can pass ``convert_float=False`` to disable this behavior, which may give a slight performance improvement." I tested this on four file types with a spreadsheet of ~440,000 cells and recorded the best time out of 10 repetitions:

File type	convert_float=True	convert_float=False	speedup
xls	1.081	1.036	4.2%
xlsb	3.413	3.357	1.6%
ods	27.798	27.770	0.1%
xlsx	5.182	5.189	-0.1%
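
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a best-of-10 timing harness using only the standard library. The helper names `best_of` and `speedup_percent` and the stand-in `workload` are hypothetical; the original benchmark script is not part of this issue. In a real run the callable would wrap a `pd.read_excel(...)` call. The speedup column above is consistent with measuring the saving relative to the convert_float=True time, which is what the sketch assumes.

```python
import timeit


def best_of(func, repeats=10):
    """Return the fastest wall-clock time in seconds over `repeats` single runs."""
    # number=1 times one full call per measurement; min() discards noisy runs.
    return min(timeit.repeat(func, number=1, repeat=repeats))


def speedup_percent(baseline, candidate):
    """Time saved by `candidate` relative to `baseline`, in percent."""
    return (baseline - candidate) / baseline * 100


def workload():
    # Hypothetical stand-in; replace with the read_excel call being timed.
    sum(range(10_000))


elapsed = best_of(workload)
print(f"best of 10: {elapsed:.6f} s")

# Reproduce the xls row of the table from its two timings.
print(round(speedup_percent(1.081, 1.036), 1))
```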

convert_float was probably written for the benefit of .xls files, but the benefit is minor. The .xlsx files even pay a slight penalty: openpyxl already converts floats to int where possible, so pandas has to convert them back to float when convert_float=False.
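
To make the two directions concrete, here is a rough sketch of the per-cell logic described above. It is an illustration only, not the actual pandas reader code; `normalize_cell` is a hypothetical name, and the assumption is that a cell arrives either as a float (xls, xlsb) or as an int that the engine has already narrowed (openpyxl).

```python
def normalize_cell(value, convert_float=True):
    """Illustrative per-cell handling for numeric spreadsheet values."""
    if convert_float:
        # Narrow whole-number floats to int only when nothing is lost.
        if isinstance(value, float) and value.is_integer():
            return int(value)
    elif isinstance(value, int) and not isinstance(value, bool):
        # An engine that already produced ints forces the extra cast back.
        return float(value)
    return value


cells = [1.0, 2.5, "text", 121]
print([normalize_cell(c) for c in cells])
print([normalize_cell(c, convert_float=False) for c in cells])
```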

Since .xlsx files are now the most common spreadsheet format (citation: google search), and convert_float only exists for performance, is it time to remove it? The spreadsheet engines would keep the behaviour of convert_float=True and the argument would be deprecated. This change would simplify all four engines, and anybody who really needs their ints as floats can always specify a dtype. Note: this possible deprecation came up in #8212 (comment) before dtype was finalized in read_excel.

I can work on this if the community likes the idea.


jreback · 2021-04-23T21:01:57Z

+1 on deprecating

you would make the default None and warn if people are setting it
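
A minimal sketch of that pattern, assuming a `None` sentinel default and a `FutureWarning`. The function name `read_excel_sketch` and the warning text are hypothetical and not taken from the pandas source.

```python
import warnings


def read_excel_sketch(path, convert_float=None):
    """Hypothetical stand-in for a reader with a deprecated keyword."""
    if convert_float is None:
        # Argument not passed: keep the old default silently.
        convert_float = True
    else:
        # Any explicit value, True or False, triggers the warning.
        warnings.warn(
            "convert_float is deprecated and will be removed in a future version",
            FutureWarning,
            stacklevel=2,
        )
    return convert_float


read_excel_sketch("book.xlsx")                       # no warning
read_excel_sketch("book.xlsx", convert_float=False)  # emits FutureWarning
```

Using `None` rather than `True` as the default is what lets the function tell "not passed" apart from "explicitly passed True".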

phofl · 2021-04-23T22:17:56Z

+1 too

ahawryluk · 2021-04-25T20:31:42Z

take

italo-turing · 2021-07-13T18:51:24Z

I have a .xlsx file that contains a column of mostly strings but with the occasional number. I want those numbers to be read as shown in Excel (integers as integers, floats as floats). Before this deprecation, read_excel read those numbers as integers when appropriate, but now they are always read as floats. Specifying {col: str} doesn't help; the numbers still get parsed as floats. So, for example, a cell showing 121 in Excel is read as 121.0.

My current solution is to manually iterate through that column later and figure those cases out. Is there a better alternative here?
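
For reference, the manual pass described above could look roughly like this. It is a sketch under the assumption that the column arrives as a mix of strings and floats; `restore_int` is a hypothetical helper name. On a pandas object column the same function could be applied with `Series.map`.

```python
def restore_int(value):
    """Turn whole-number floats back into ints; leave everything else alone."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


# Hypothetical column contents: mostly strings, occasional numbers.
column = ["abc", 121.0, "def", 3.75, 7.0]
cleaned = [restore_int(v) for v in column]
print(cleaned)  # ['abc', 121, 'def', 3.75, 7]
```

Note that this cannot tell a cell Excel displays as 121 from one it displays as 121.0, since both arrive as the same float.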

ahawryluk · 2021-07-23T22:10:45Z

Hi @italo-turing, I think you've already found the best alternative if you're constrained to .xlsx as input. If you're free to change your input file to .csv, then the mixed-type (object) column will load with both ints and floats.

phofl added the Deprecate (Functionality to remove in pandas) and IO Excel (read_excel, to_excel) labels Apr 23, 2021

github-actions bot assigned ahawryluk Apr 25, 2021

ahawryluk mentioned this issue Apr 27, 2021

DEPR: Deprecate convert_float #41176

Merged

4 tasks

jreback added this to the 1.3 milestone Apr 30, 2021

jreback closed this as completed in #41176 May 26, 2021

smcclure17 mentioned this issue Dec 16, 2023

Update to Python 3.10 and bump necessary dependencies act-now-coalition/covid-data-model#1392

Merged
