Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Left join on column with timezone aware index with unexpected behavior #26335

Closed
P-Tillmann opened this issue May 10, 2019 · 1 comment · Fixed by #26362
Closed

Left join on column with timezone aware index with unexpected behavior #26335

P-Tillmann opened this issue May 10, 2019 · 1 comment · Fixed by #26362
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Milestone

Comments

@P-Tillmann
Copy link
Contributor

I stumbled across some unexpected behavior when performing a left join on a specified column with time zone aware columns.

# Your code here
df1 = pd.DataFrame({
    'date': pd.date_range(start='2018-01-01', periods=5, tz='America/Chicago'),
    'vals': list('abcde')}
)

df2 = pd.DataFrame({
    'date': pd.date_range(start='2018-01-03', periods=5, tz='America/Chicago'),
    'vals_2': list('tuvwx')}
)
df1.join(df2.set_index('date'), on='date')
                       date vals vals_2
0 2018-01-01 00:00:00-06:00    a    NaN
1 2018-01-02 00:00:00-06:00    b    NaN
2 2018-01-03 00:00:00-06:00    c    NaN
3 2018-01-04 00:00:00-06:00    d    NaN
4 2018-01-05 00:00:00-06:00    e    NaN

When i was expecting

                       date vals vals_2
0 2018-01-01 00:00:00-06:00    a    NaN
1 2018-01-02 00:00:00-06:00    b    NaN
2 2018-01-03 00:00:00-06:00    c      t
3 2018-01-04 00:00:00-06:00    d      u
4 2018-01-05 00:00:00-06:00    e      v

In PR #25260 the test case was specified with all NaN in vals_2 as expected. I don't understand why considering how merge on two columns or join on two indicies work:

df1.set_index('date').join(df2.set_index('date')).reset_index()
pd.merge(df1, df2, on='date', how='left')

Both yield the expected behavior.
Is there something I'm missing or is this inconsistency a bug?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-48-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.1.1
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: None
IPython: 7.2.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.6.1
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jschendel
Copy link
Member

Yes, it's a bit strange that this is inconsistent with identical DataFrames that are tz naive:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+520.g2ef50aea04'

In [2]: df1 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-01', periods=5, tz='America/Chicago'),
   ...:     'vals': list('abcde')})
   ...:
   ...: df2 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-03', periods=5, tz='America/Chicago'),
   ...:     'vals_2': list('tuvwx')})
   ...:
   ...: df1.join(df2.set_index('date'), on='date')
Out[2]:
                       date vals vals_2
0 2018-01-01 00:00:00-06:00    a    NaN
1 2018-01-02 00:00:00-06:00    b    NaN
2 2018-01-03 00:00:00-06:00    c    NaN
3 2018-01-04 00:00:00-06:00    d    NaN
4 2018-01-05 00:00:00-06:00    e    NaN

In [3]: df1 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-01', periods=5),
   ...:     'vals': list('abcde')})
   ...:
   ...: df2 = pd.DataFrame({
   ...:     'date': pd.date_range(start='2018-01-03', periods=5),
   ...:     'vals_2': list('tuvwx')})
   ...:
   ...: df1.join(df2.set_index('date'), on='date')
Out[3]:
        date vals vals_2
0 2018-01-01    a    NaN
1 2018-01-02    b    NaN
2 2018-01-03    c      t
3 2018-01-04    d      u
4 2018-01-05    e      v

cc @mroeschke

@jschendel jschendel added the Timezones Timezone data dtype label May 10, 2019
@jreback jreback added this to the 0.25.0 milestone May 14, 2019
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants