
PERF: Construction of a DatetimeIndex from a list of Timestamp with timezone #49048

Closed
2 of 3 tasks
bernardkaplan opened this issue Oct 11, 2022 · 14 comments · Fixed by #51247
Labels
Needs Triage (not yet reviewed by a pandas team member), Performance (memory or execution speed), Timezones (timezone data dtype), Windows (Windows OS)

Comments

@bernardkaplan

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The construction of a DatetimeIndex from a list of Timestamp with timezone appears to be slow when pandas 1.5 is installed in a virtual environment (venv) on Windows.

Here is a minimal script that runs slowly on my machine:

# Pandas 1.5.0
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 1.38 sec per loop

Remarks:

  • Pandas 1.4.4 works fine on the same machine (see below)
  • There is no issue on linux
  • There is no issue if the timezone is UTC or None
  • There is no issue when pandas 1.5 is installed in a conda environment on the same machine
  • Here is the script I used to install the virtual environment
    python -m venv venv
    .\venv\Scripts\activate
    pip install pandas

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Prior Performance

# Pandas 1.4.4
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 2.25 msec per loop
# :0: UserWarning: The test results are likely unreliable. The worst time (176 msec) was more than four times slower than the best time (2.25 msec).
@bernardkaplan added the Needs Triage and Performance labels on Oct 11, 2022
@phofl added the Timezones and Window labels on Oct 11, 2022
@jbrockmendel
Member

I'm seeing about 2.5ms on main.

@bernardkaplan
Author

I installed pandas from source (version 1.6.0.dev0+319.g98323eec0e) and the issue still persists (almost 2 seconds per loop).

INSTALLED VERSIONS

commit : 98323ee
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 1.6.0.dev0+319.g98323eec0e
numpy : 1.23.4
pytz : 2022.4
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.2.2
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@mroeschke added the Windows label and removed the Window label on Oct 13, 2022
@borlaugen

This has a very high impact on our code. Some of the code we develop has gone from 100 seconds to 4000 seconds just by upgrading from pandas 1.4.4 to >= 1.5.

Our profiling shows that the function \pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns) spends 18 seconds on average in 1.5 but only 0.003 seconds in 1.4.4.

I too can confirm that this is not an issue on Linux (Debian) but it is on Windows 11.

The issue also seems to be present in non-virtual environments.

@jbrockmendel
Member

Our profiling shows that the function \pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns) spends 18 seconds on average in 1.5 but only 0.003 seconds in 1.4.4.

Do the profiling results point to anything more specific?

@borlaugen

borlaugen commented Oct 21, 2022

After looking more closely at the results, it looks like nt.stat is called roughly 10 000 times more often in 1.5 than in 1.4 for one of our tests. Below is an excerpt of cProfile's output sorted on total time. This is the same run where I got the 18 seconds on average per call for _sequence_to_dt64ns. I am still a newbie at reading cProfile's results, so let me know if there is anything I could do to provide additional information.

ncalls      tottime   percall  cumtime   percall  filename:lineno(function)
18213004    732.81    0        732.81    0        {built-in method nt.stat}
91057598    696.116   0        1002.256  0        <frozen importlib._bootstrap_external>:96(_path_join)
18211525    460.503   0        2272.45   0        <frozen importlib._bootstrap_external>:1536(find_spec)
1821172     113.012   0        2417.21   0.001    <frozen importlib._bootstrap_external>:1399(_get_spec)
91057598    94.61     0        119.267   0        <frozen importlib._bootstrap_external>:119(<listcomp>)
1821172     77.473    0        2549.822  0.001    <frozen importlib._bootstrap>:921(_find_spec)
273173609   55.334    0        55.334    0        {method 'endswith' of 'str' objects}
194868241   48.031    0        48.031    0        {method 'startswith' of 'str' objects}
274996319   43.611    0        43.611    0        {method 'rstrip' of 'str' objects}

edit: tried to make the table pretty

@jbrockmendel
Member

Can you use .sort_stats("cumtime") before printing the profiler results? We also need more lines of output, as what you've posted doesn't look like it's even getting to the pandas code.
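For reference, sorting stdlib profiler output by cumulative time looks like this (a minimal, self-contained sketch; the profiled function is a stand-in workload, not the pandas code from the thread):

```python
import cProfile
import io
import pstats

def build_index():
    # Stand-in workload for the DatetimeIndex construction being profiled.
    return sum(range(1000))

profiler = cProfile.Profile()
profiler.enable()
build_index()
profiler.disable()

# Note the pstats method is sort_stats (plural); "cumulative" sorts by the
# cumtime column, which is what surfaces callers like _sequence_to_dt64ns.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print("cumtime" in report)
```

Sorting by cumtime groups the expensive call trees together, which is usually easier to read than tottime when the cost is buried in nested calls.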

@borlaugen

Attached is the entire output from two different runs with 1.4.4 and 1.5.1 respectively.

profiler_1_5_github.txt
profiler_1_4_github.txt

@jbrockmendel
Member

@pganssle It looks like zoneinfo's load_tzdata is taking something like 93% of the runtime here. Any idea what's going on?

@pganssle
Contributor

pganssle commented Oct 21, 2022

@jbrockmendel Are you doing something weird to load the ZoneInfo objects? I don't really understand why load_tzdata is being called 118k times, since you should be hitting the cache after the first one.

Do you see the same issue if you add TZ=zoneinfo.ZoneInfo("Europe/Paris") to the setup code? This might be an artifact of the way you are doing the benchmark.

Edit: Looking at this more closely, I don't actually see why zoneinfo is getting called at all. From what I can tell, the slow MWE code exclusively uses pytz, so presumably something is calling into zoneinfo when it doesn't need to. The repeated calls to load_tzdata make me think that it never actually succeeds in constructing a ZoneInfo object, so it's probably a side effect of some "check if this is a ZI object" code that isn't caching properly.

@bernardkaplan What happens if you install the tzdata Python package before running the tests? Does that fix it?

@borlaugen

@pganssle, I installed tzdata and now the run times are normal. See the attached cProfile output.

profiler_1_5_tzdata.txt

@stalehuse

python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"

The code above speeds up from 2 seconds to 3 milliseconds after installing tzdata.
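A quick way to confirm whether the tzdata package is present in a given environment (a hypothetical helper for illustration; not code from the thread):

```python
import importlib.util

def has_tzdata() -> bool:
    # zoneinfo falls back to the pip-installable tzdata package when no
    # system time zone database exists (as on Windows), so its absence
    # there is what triggers the repeated failing lookups.
    return importlib.util.find_spec("tzdata") is not None

print(has_tzdata())
```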

@pganssle
Contributor

pganssle commented Oct 21, 2022

OK, I think the problem is probably this: if you don't have zoneinfo data, it tries to construct ZoneInfo("UTC") every time it's called, and fails only after searching a bunch of places looking for time zone data. It's not wrong to do this, since it's possible that you will install some time zone data after the application has started, but in the median case where tzdata is not installed, that's going to be expensive.

You can probably get closer to ideal by doing something like this (the line starting if tz is None is the part that is changed):

cdef inline bint is_utc_zoneinfo(tzinfo tz):
    # Workaround for cases with missing tzdata
    #  https://github.com/pandas-dev/pandas/pull/46425#discussion_r830633025
    if tz is None or zoneinfo is None or not isinstance(tz, ZoneInfo):
        return False

    global utc_zoneinfo
    if utc_zoneinfo is None:
        try:
            utc_zoneinfo = ZoneInfo("UTC")
        except zoneinfo.ZoneInfoNotFoundError:
            return False
        # Warn if tzdata is too old, even if there is a system tzdata to alert
        # users about the mismatch between local/system tzdata
        import_optional_dependency("tzdata", errors="warn", min_version="2022.1")

    return tz is utc_zoneinfo

ZoneInfo("UTC") will always be an instance of zoneinfo.ZoneInfo, and there's a very slim chance that someone will pass you a zoneinfo.ZoneInfo object on a system that has no UTC TZif file. Those people will keep the performance regression, and if that's a real problem then you may need to decide between the two very uncommon use cases of "no UTC time zone data but actively using zoneinfo.ZoneInfo" and "wants to be able to install time zone data while the program is running and have pandas still recognize a UTC object".

You can also make this a bit more targeted by moving the isinstance check to: if utc_zoneinfo is None and isinstance(tz, ZoneInfo), if you feel that having the isinstance check slows down the case where the user actually has passed ZoneInfo("UTC") enough that it's preferable to put it under the guard where we are trying to populate utc_zoneinfo.

Current users can install tzdata as a workaround until the bug is fixed.
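The guard described above can be illustrated in pure Python (a hypothetical standalone analogue of the Cython helper, not the actual pandas code; it assumes the stdlib zoneinfo module is importable):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

_utc_zoneinfo = None

def is_utc_zoneinfo(tz) -> bool:
    # The isinstance guard is the key change: non-ZoneInfo tzinfo objects
    # (e.g. pytz zones, as in the slow MWE) return immediately, so a
    # missing tzdata database is never repeatedly searched on their behalf.
    global _utc_zoneinfo
    if tz is None or not isinstance(tz, ZoneInfo):
        return False
    if _utc_zoneinfo is None:
        try:
            _utc_zoneinfo = ZoneInfo("UTC")
        except ZoneInfoNotFoundError:
            return False
    return tz is _utc_zoneinfo
```

Because zoneinfo caches instances by key, repeated ZoneInfo("UTC") calls return the same object, which is what makes the identity comparison at the end reliable.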

@bernardkaplan
Author

@bernardkaplan What happens if you install the tzdata Python package before running the tests? Does that fix it?

I confirm that installing tzdata solved the performance issue.

@MarcoGorelli
Member

Anyone know how to reproduce this on Windows on main?

I followed the contributing guide for Windows (which I hadn't done before; I had always used Linux), then ran pip uninstall tzdata -y. But then I see:

(.venv) PS C:\Users\User\pandas-dev> python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
5 loops, best of 5: 1.83 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (123 msec) was more than four times slower than the best time (1.83 msec).
