
PERF: Construction of a DatetimeIndex from a list of Timestamp with timezone #49048

Closed
2 of 3 tasks
bernardkaplan opened this issue Oct 11, 2022 · 14 comments · Fixed by #51247
Labels
Needs Triage (not yet reviewed by a pandas team member), Performance (memory or execution speed), Timezones (timezone data dtype), Windows (Windows OS)

Comments

@bernardkaplan

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The construction of a DatetimeIndex from a list of Timestamp with timezone appears to be slow when pandas 1.5 is installed in a virtual environment (venv) on Windows.

Here is a minimal script that runs slowly on my machine:

# Pandas 1.5.0
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 1.38 sec per loop

Remarks:

  • Pandas 1.4.4 works fine on the same machine (see below)
  • There is no issue on linux
  • There is no issue if the timezone is UTC or None
  • There is no issue when pandas 1.5 is installed in a conda environment on the same machine
  • Here is the script I used to install the virtual environment
    python -m venv venv
    .\venv\Scripts\activate
    pip install pandas

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Prior Performance

# Pandas 1.4.4
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 2.25 msec per loop
# :0: UserWarning: The test results are likely unreliable. The worst time (176 msec) was more than four times slower than the best time (2.25 msec).
@bernardkaplan added the Needs Triage and Performance labels on Oct 11, 2022
@phofl added the Timezones and Window labels on Oct 11, 2022
@jbrockmendel
Member

I'm seeing about 2.5ms on main.

@bernardkaplan
Author

I installed pandas from source (version 1.6.0.dev0+319.g98323eec0e) and the issue still persists (almost 2 seconds per loop).

INSTALLED VERSIONS

commit : 98323ee
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 1.6.0.dev0+319.g98323eec0e
numpy : 1.23.4
pytz : 2022.4
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.2.2
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@mroeschke added the Windows label and removed the Window label on Oct 13, 2022
@borlaugen

This has a very high impact on our code. Some of the code we develop has gone from 100 seconds to 4000 seconds just by upgrading from pandas 1.4.4 to >= 1.5.

Our profiling shows that the function \pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns) spends 18 seconds on average in 1.5 but only 0.003 seconds in 1.4.4.

I too can confirm that this is not an issue on Linux (Debian) but it is on Windows 11.

The issue also seems to be present in non-virtual environments.

@jbrockmendel
Member

Our profiling shows that the function \pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns) spends 18 seconds on average in 1.5 but only 0.003 seconds in 1.4.4.

Do the profiling results point to anything more specific?

@borlaugen

borlaugen commented Oct 21, 2022

After looking more closely at the results, it looks like nt.stat is called roughly 10 000 times more often in 1.5 than in 1.4 for one of our tests. Below is an excerpt of cProfile's output sorted on total time. This is the same run where I got the 18 seconds on average per call for _sequence_to_dt64ns. I am still a newbie at reading cProfile's results, so let me know if there is anything I could do to provide additional information.

ncalls      tottime   percall  cumtime   percall  filename:lineno(function)
18213004    732.81    0        732.81    0        {built-in method nt.stat}
91057598    696.116   0        1002.256  0        <frozen importlib._bootstrap_external>:96(_path_join)
18211525    460.503   0        2272.45   0        <frozen importlib._bootstrap_external>:1536(find_spec)
1821172     113.012   0        2417.21   0.001    <frozen importlib._bootstrap_external>:1399(_get_spec)
91057598    94.61     0        119.267   0        <frozen importlib._bootstrap_external>:119(<listcomp>)
1821172     77.473    0        2549.822  0.001    <frozen importlib._bootstrap>:921(_find_spec)
273173609   55.334    0        55.334    0        {method 'endswith' of 'str' objects}
194868241   48.031    0        48.031    0        {method 'startswith' of 'str' objects}
274996319   43.611    0        43.611    0        {method 'rstrip' of 'str' objects}

edit: tried to make the table pretty

@jbrockmendel
Member

Can you use .sort_stats("cumtime") before printing the profiler results? We also need more lines of output, as what you've posted doesn't look like it's even getting to the pandas code.
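For reference, sorting stdlib profiler output by cumulative time looks like this (a minimal, self-contained sketch; the profiled function is a stand-in workload, not the pandas code from the thread):

```python
import cProfile
import io
import pstats

def build_index():
    # Stand-in workload for the DatetimeIndex construction being profiled.
    return sum(range(1000))

profiler = cProfile.Profile()
profiler.enable()
build_index()
profiler.disable()

# Note the pstats method is sort_stats (plural); "cumulative" sorts by the
# cumtime column, which is what surfaces callers like _sequence_to_dt64ns.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print("cumtime" in report)
```

Sorting by cumtime groups the expensive call trees together, which is usually easier to read than tottime when the cost is buried in nested calls.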

@borlaugen

Attached is the entire output from two different runs with 1.4.4 and 1.5.1 respectively.

profiler_1_5_github.txt
profiler_1_4_github.txt

@jbrockmendel
Member

@pganssle It looks like zoneinfo's load_tzdata is taking something like 93% of the runtime here. Any idea what's going on?

@pganssle
Contributor

pganssle commented Oct 21, 2022

@jbrockmendel Are you doing something weird to load the ZoneInfo objects? I don't really understand why load_tzdata is being called 118k times, since you should be hitting the cache after the first one.

Do you see the same issue if you add TZ=zoneinfo.ZoneInfo("Europe/Paris") to the setup code? This might be an artifact of the way you are doing the benchmark.

Edit: Looking at this more closely, I don't actually see why zoneinfo is getting called at all. From what I can tell, the slow MWE code exclusively uses pytz, so presumably something is calling into zoneinfo when it doesn't need to. The repeated calls to load_tzdata make me think that it never actually succeeds in constructing a ZoneInfo object, so it's probably a side effect of some "check if this is a ZI object" code that isn't caching properly.

@bernardkaplan What happens if you install the tzdata Python package before running the tests? Does that fix it?

@borlaugen

@pganssle, I installed tzdata and now the run times are normal. See the attached cProfile output.

profiler_1_5_tzdata.txt

@stalehuse

python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"

The code above speeds up from 2 seconds to 3 milliseconds after installing tzdata.
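A quick way to confirm whether the tzdata package is present in a given environment (a hypothetical helper for illustration; not code from the thread):

```python
import importlib.util

def has_tzdata() -> bool:
    # zoneinfo falls back to the pip-installable tzdata package when no
    # system time zone database exists (as on Windows), so its absence
    # there is what triggers the repeated failing lookups.
    return importlib.util.find_spec("tzdata") is not None

print(has_tzdata())
```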

@pganssle
Contributor

pganssle commented Oct 21, 2022

OK, I think the problem is probably this: if you don't have zoneinfo data, it tries to construct ZoneInfo("UTC") every time it's called, and fails only after searching a bunch of places looking for time zone data. It's not wrong to do this, since it's possible that you will install some time zone data after the application has started, but in the median case where tzdata is not installed, that's going to be expensive.

You can probably get closer to ideal by doing something like this (the line starting if tz is None is the part that is changed):

cdef inline bint is_utc_zoneinfo(tzinfo tz):
    # Workaround for cases with missing tzdata
    #  https://github.com/pandas-dev/pandas/pull/46425#discussion_r830633025
    if tz is None or zoneinfo is None or not isinstance(tz, ZoneInfo):
        return False

    global utc_zoneinfo
    if utc_zoneinfo is None:
        try:
            utc_zoneinfo = ZoneInfo("UTC")
        except zoneinfo.ZoneInfoNotFoundError:
            return False
        # Warn if tzdata is too old, even if there is a system tzdata to alert
        # users about the mismatch between local/system tzdata
        import_optional_dependency("tzdata", errors="warn", min_version="2022.1")

    return tz is utc_zoneinfo

ZoneInfo("UTC") will always be an instance of zoneinfo.ZoneInfo, and there's a very slim chance that someone will pass you a zoneinfo.ZoneInfo object on a system that has no UTC TZif file. Those people will keep the performance regression, and if that's a real problem then you may need to decide between the two very uncommon use cases of "no UTC time zone data but actively using zoneinfo.ZoneInfo" and "wants to be able to install time zone data while the program is running and have pandas still recognize a UTC object".

You can also make this a bit more targeted by moving the isinstance check to: if utc_zoneinfo is None and isinstance(tz, ZoneInfo), if you feel that having the isinstance check slows down the case where the user actually has passed ZoneInfo("UTC") enough that it's preferable to put it under the guard where we are trying to populate utc_zoneinfo.

Current users can install tzdata as a workaround until the bug is fixed.
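The guard described above can be illustrated in pure Python (a hypothetical standalone analogue of the Cython helper, not the actual pandas code; it assumes the stdlib zoneinfo module is importable):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

_utc_zoneinfo = None

def is_utc_zoneinfo(tz) -> bool:
    # The isinstance guard is the key change: non-ZoneInfo tzinfo objects
    # (e.g. pytz zones, as in the slow MWE) return immediately, so a
    # missing tzdata database is never repeatedly searched on their behalf.
    global _utc_zoneinfo
    if tz is None or not isinstance(tz, ZoneInfo):
        return False
    if _utc_zoneinfo is None:
        try:
            _utc_zoneinfo = ZoneInfo("UTC")
        except ZoneInfoNotFoundError:
            return False
    return tz is _utc_zoneinfo
```

Because zoneinfo caches instances by key, repeated ZoneInfo("UTC") calls return the same object, which is what makes the identity comparison at the end reliable.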

@bernardkaplan
Author

@bernardkaplan What happens if you install the tzdata Python package before running the tests? Does that fix it?

I confirm that installing tzdata solved the performance issue.

@MarcoGorelli
Member

Anyone know how to reproduce this on Windows on main?

I followed the contributing guide for Windows (which I hadn't done before; I had always used Linux), then ran pip uninstall tzdata -y. But then I see:

(.venv) PS C:\Users\User\pandas-dev> python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
5 loops, best of 5: 1.83 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (123 msec) was more than four times slower than the best time (1.83 msec).
