Left join on column with timezone aware index with unexpected behavior #26335

P-Tillmann · 2019-05-10T12:07:23Z

I stumbled across some unexpected behavior when performing a left join on a specified column with time zone aware columns.

# Your code here
df1 = pd.DataFrame({
    'date': pd.date_range(start='2018-01-01', periods=5, tz='America/Chicago'),
    'vals': list('abcde')}
)

df2 = pd.DataFrame({
    'date': pd.date_range(start='2018-01-03', periods=5, tz='America/Chicago'),
    'vals_2': list('tuvwx')}
)
df1.join(df2.set_index('date'), on='date')

                       date vals vals_2
0 2018-01-01 00:00:00-06:00    a    NaN
1 2018-01-02 00:00:00-06:00    b    NaN
2 2018-01-03 00:00:00-06:00    c    NaN
3 2018-01-04 00:00:00-06:00    d    NaN
4 2018-01-05 00:00:00-06:00    e    NaN

When i was expecting

                       date vals vals_2
0 2018-01-01 00:00:00-06:00    a    NaN
1 2018-01-02 00:00:00-06:00    b    NaN
2 2018-01-03 00:00:00-06:00    c      t
3 2018-01-04 00:00:00-06:00    d      u
4 2018-01-05 00:00:00-06:00    e      v

In PR #25260 the test case was specified with all NaN in vals_2 as expected. I don't understand why considering how merge on two columns or join on two indicies work:

df1.set_index('date').join(df2.set_index('date')).reset_index()
pd.merge(df1, df2, on='date', how='left')

Both yield the expected behavior.
Is there something I'm missing or is this inconsistency a bug?

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-48-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.1.1
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: None
IPython: 7.2.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.6.1
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

jschendel · 2019-05-10T22:04:14Z

Yes, it's a bit strange that this is inconsistent with identical DataFrames that are tz naive:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+520.g2ef50aea04'

In [2]: df1 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-01', periods=5, tz='America/Chicago'),
   ...:     'vals': list('abcde')})
   ...:
   ...: df2 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-03', periods=5, tz='America/Chicago'),
   ...:     'vals_2': list('tuvwx')})
   ...:
   ...: df1.join(df2.set_index('date'), on='date')
Out[2]:
                       date vals vals_2
0 2018-01-01 00:00:00-06:00    a    NaN
1 2018-01-02 00:00:00-06:00    b    NaN
2 2018-01-03 00:00:00-06:00    c    NaN
3 2018-01-04 00:00:00-06:00    d    NaN
4 2018-01-05 00:00:00-06:00    e    NaN

In [3]: df1 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-01', periods=5),
   ...:     'vals': list('abcde')})
   ...:
   ...: df2 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-03', periods=5),
   ...:     'vals_2': list('tuvwx')})
   ...:
   ...: df1.join(df2.set_index('date'), on='date')
Out[3]:
        date vals vals_2
0 2018-01-01    a    NaN
1 2018-01-02    b    NaN
2 2018-01-03    c      t
3 2018-01-04    d      u
4 2018-01-05    e      v

cc @mroeschke

jschendel added the Timezones Timezone data dtype label May 10, 2019

mroeschke mentioned this issue May 13, 2019

BUG: DataFrame.join on tz aware index and column #26362

Merged

4 tasks

jreback added this to the 0.25.0 milestone May 14, 2019

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 14, 2019

jreback closed this as completed in #26362 May 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Left join on column with timezone aware index with unexpected behavior #26335

Left join on column with timezone aware index with unexpected behavior #26335

P-Tillmann commented May 10, 2019

INSTALLED VERSIONS

jschendel commented May 10, 2019

Left join on column with timezone aware index with unexpected behavior #26335

Left join on column with timezone aware index with unexpected behavior #26335

Comments

P-Tillmann commented May 10, 2019

Output of pd.show_versions()

INSTALLED VERSIONS

jschendel commented May 10, 2019

Output of `pd.show_versions()`