Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG:Sanity check on merge parameters for correct exception #26824 #26855

Closed
wants to merge 28 commits into from

Conversation

harshit-py
Copy link

@harshit-py harshit-py commented Jun 14, 2019

@simonjayhawkins simonjayhawkins added the Error Reporting Incorrect or improved errors from pandas label Jun 14, 2019
@codecov
Copy link

codecov bot commented Jun 14, 2019

Codecov Report

Merging #26855 into master will decrease coverage by 0.01%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26855      +/-   ##
==========================================
- Coverage   91.88%   91.87%   -0.02%     
==========================================
  Files         179      179              
  Lines       50696    50700       +4     
==========================================
- Hits        46581    46579       -2     
- Misses       4115     4121       +6
Flag Coverage Δ
#multiple 90.46% <50%> (-0.01%) ⬇️
#single 41.08% <0%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.23% <50%> (-0.25%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 430f0fd...c3064e6. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 14, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@45ea267). Click here to learn what that means.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #26855   +/-   ##
=========================================
  Coverage          ?   41.09%           
=========================================
  Files             ?      180           
  Lines             ?    50747           
  Branches          ?        0           
=========================================
  Hits              ?    20856           
  Misses            ?    29891           
  Partials          ?        0
Flag Coverage Δ
#single 41.09% <0%> (?)
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 9.34% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45ea267...e4e486c. Read the comment docs.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this? Typically the first thing we ask for as part of any PR

@harshit-py
Copy link
Author

Can you add a test for this? Typically the first thing we ask for as part of any PR

Sure, just to confirm this test would go in pandas/tests/reshape/merge/test_merge.py?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Can you also add a whatsnew note for v0.25 describing that this now raises a ValueError when both arguments are not provided?



@pytest.mark.parametrize('merge_type', ['left_on', 'right_on'])
def test_merge_correct_exception(merge_type):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just call this test_missing_on_raises

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

'C': ['L', 'M', 'N', 'O', 'P', 'Q']
})
msg = 'both left_on and right_on should be passed'
if merge_type == 'left_on':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than the if condition since you are parmatrizing this argument (which is nice btw) you can send it through as kwargs. So:

kwargs = {merge_type: 'A'}
with pytest.raises(ValueError, match=msg):
    pd.merge(df1, df2, how='left', **kwargs)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will do

@WillAyd
Copy link
Member

WillAyd commented Jun 15, 2019

@simonjayhawkins if you care to take a look

@WillAyd WillAyd added this to the 0.25.0 milestone Jun 15, 2019
@@ -1089,13 +1089,19 @@ def _validate_specification(self):
raise ValueError('len(left_on) must equal the number '
'of levels in the index of "right"')
self.right_on = [None] * n
if not self.right_on:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we instead try to set

self.left_on = self.left_on or [None] * len( self.right_on)
and same for right_on

before these if clauses; that way don’t need to add additional checks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

elif self.right_on is not None:
n = len(self.right_on)
if self.left_index:
if len(self.right_on) != self.left.index.nlevels:
raise ValueError('len(right_on) must equal the number '
'of levels in the index of "left"')
self.left_on = [None] * n
self.left_on = self.left_on or [None] * len(self.right_on)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don’t need the lines above (not both cases)
also can use n

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I get it. The previous line needs to go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try it - i think it’s unnecessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u can also combine the condition in 1095 and 1096 i think as well

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left_index and right_index are False by default (in the _MergeOperation instantiation), so both left_on and right_on would need to be defined out of the if clause at 1095 and 1087 (correct me if I'm wrong)

Copy link
Contributor

@jreback jreback Jun 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok see if u can simplify things with passing tests

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separately, can you create an issue (PR would also be fantastic), do factor out the validation / error checking tests on parameters out of test_merge.py to test_merge_validation.py

@@ -1089,13 +1089,15 @@ def _validate_specification(self):
raise ValueError('len(left_on) must equal the number '
'of levels in the index of "right"')
self.right_on = [None] * n
self.right_on = self.right_on or [None] * (n + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to add 1 this seems odd

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, the thinking here was to not add additional check and force the cases where either of left_on or right_on is absent to the if clause at 1099.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

but i think this is actually changing things slightly

maybe let’s go back to your original patch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but i think this is actually changing things slightly

additional message checks maybe should be added to the following tests as a precursor. xref #23922

pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(ValueError):
pandas/tests/reshape/merge/test_merge_asof.py:        with pytest.raises(ValueError):
pandas/tests/reshape/merge/test_multi.py:        with pytest.raises(ValueError):
pandas/tests/reshape/merge/test_multi.py:        with pytest.raises(ValueError):

also the following tests perhaps should be moved to pandas/tests/reshape/merge/test_merge.py ( could potentially add additional parametrisation to these tests rather than creating new)

def test_join_on_fails_with_different_right_index(self):
df = DataFrame({'a': np.random.choice(['m', 'f'], size=3),
'b': np.random.randn(3)})
df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10),
'b': np.random.randn(10)},
index=tm.makeCustomIndex(10, 2))
msg = (r'len\(left_on\) must equal the number of levels in the index'
' of "right"')
with pytest.raises(ValueError, match=msg):
merge(df, df2, left_on='a', right_index=True)
def test_join_on_fails_with_different_left_index(self):
df = DataFrame({'a': np.random.choice(['m', 'f'], size=3),
'b': np.random.randn(3)},
index=tm.makeCustomIndex(3, 2))
df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10),
'b': np.random.randn(10)})
msg = (r'len\(right_on\) must equal the number of levels in the index'
' of "left"')
with pytest.raises(ValueError, match=msg):
merge(df, df2, right_on='b', left_index=True)
def test_join_on_fails_with_different_column_counts(self):
df = DataFrame({'a': np.random.choice(['m', 'f'], size=3),
'b': np.random.randn(3)})
df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10),
'b': np.random.randn(10)},
index=tm.makeCustomIndex(10, 2))
msg = r"len\(right_on\) must equal len\(left_on\)"
with pytest.raises(ValueError, match=msg):
merge(df, df2, right_on='a', left_on=['a', 'b'])
@pytest.mark.parametrize("wrong_type", [2, 'str', None, np.array([0, 1])])
def test_join_on_fails_with_wrong_object_type(self, wrong_type):
# GH12081 - original issue
# GH21220 - merging of Series and DataFrame is now allowed
# Edited test to remove the Series object from test parameters
df = DataFrame({'a': [1, 1]})
msg = ("Can only merge Series or DataFrame objects, a {} was passed"
.format(str(type(wrong_type))))
with pytest.raises(TypeError, match=msg):
merge(wrong_type, df, left_on='a', right_on='a')
with pytest.raises(TypeError, match=msg):
merge(df, wrong_type, left_on='a', right_on='a')

@pytest.mark.parametrize('merge_type', ['left_on', 'right_on'])
def test_missing_on_raises(merge_type):
# GH26824
df1 = DataFrame({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call these left & right

@@ -1763,3 +1763,20 @@ def test_merge_equal_cat_dtypes2():

# Categorical is unordered, so don't check ordering.
tm.assert_frame_equal(result, expected, check_categorical=False)


@pytest.mark.parametrize('merge_type', ['left_on', 'right_on'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this near the test_validation test

@simonjayhawkins
Copy link
Member

I think the has no len() part of the message in issue #26824 could be a red herring here.

An exception with the len(right_on) must equal len(left_on) message is raised when the length of the right_on and left_on parameters is not the same, eg.

pd.merge(left, right, left_on=['key1'], right_on=['key1', 'key2']) # ValueError: len(right_on) must equal len(left_on)

In cases such as this issue where neither right_on or right_index parameters are passed, we should expect to see:

pd.merge(left, right, left_on=['key1']) # MergeError: Must pass right_on or right_index=True

So it could be that, this exception is not being raised, since in #26824, as well as not passing either right_on or right_index, the example does not have the key passed in left_on.

If we add the right_on to the issue:

pd.merge(df2, df1, how='left', left_on='A', right_on='boro') # KeyError: 'A'

we get the more informative KeyError

So probably this should also be the exception expected in this scenarios even when neither right_on or right_index parameters are passed

@@ -99,6 +99,7 @@ Other Enhancements
- Error message for missing required imports now includes the original import error's text (:issue:`23868`)
- :class:`DatetimeIndex` and :class:`TimedeltaIndex` now have a ``mean`` method (:issue:`24757`)
- :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`)
- :func:`pandas.merge` now raises ``ValueError`` when either of ``left_on`` or ``right_on`` is not provided (:issue:`26855`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate slightly; this only matters if on itself is not provided. Do we have a check if on is provide and either of left_on or right_on is not None?

@pep8speaks
Copy link

pep8speaks commented Jun 19, 2019

Hello @harshit-py! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-28 14:17:00 UTC

@harshit-py
Copy link
Author

@jreback Hey I think I messed this up while trying to resolve conflicts in my PR. Can I open a separate PR for this?

@TomAugspurger
Copy link
Contributor

@harshit-py I made harshit-py#1. I think if you merge that, it'll merge cleanly against master.

diff --git a/doc/source/whatsnew/v0.25.0.rst b/doc/source/whatsnew/v0.25.0.rst
index a58cdc8c9..01aa9ad97 100644
--- a/doc/source/whatsnew/v0.25.0.rst
+++ b/doc/source/whatsnew/v0.25.0.rst
@@ -74,7 +74,6 @@ a dict to a Series groupby aggregation (:ref:`whatsnew_0200.api_breaking.depreca
 
 See :ref:`groupby.aggregate.named` for more.
 
-
 .. _whatsnew_0250.enhancements.multi_index_repr:
 
 Better repr for MultiIndex
@@ -105,7 +104,6 @@ than :attr:`options.display.max_seq_items` (default: 100 items). Horizontally,
 the output will truncate, if it's wider than :attr:`options.display.width`
 (default: 80 characters).
 
-
 .. _whatsnew_0250.enhancements.other:
 
 Other Enhancements
@@ -131,6 +129,7 @@ Other Enhancements
 - Error message for missing required imports now includes the original import error's text (:issue:`23868`)
 - :class:`DatetimeIndex` and :class:`TimedeltaIndex` now have a ``mean`` method (:issue:`24757`)
 - :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`)
+- :func:`pandas.merge` now raises ``ValueError`` when either of ``left_on`` or ``right_on`` is not provided (:issue:`26855`)
 - Added support for reading SPSS .sav files using :func:`read_spss` (:issue:`26537`)
 - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
 
@@ -538,6 +537,7 @@ Other Deprecations
 - :meth:`Timedelta.resolution` is deprecated and replaced with :meth:`Timedelta.resolution_string`.  In a future version, :meth:`Timedelta.resolution` will be changed to behave like the standard library :attr:`timedelta.resolution` (:issue:`21344`)
 - :meth:`Series.to_sparse`, :meth:`DataFrame.to_sparse`, :meth:`Series.to_dense` and :meth:`DataFrame.to_dense` are deprecated and will be removed in a future version. (:issue:`26557`).
 
+
 .. _whatsnew_0250.prior_deprecations:
 
 Removal of prior version deprecations/changes
diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py
index d21ad58e7..decc21ed4 100644
--- a/pandas/core/reshape/merge.py
+++ b/pandas/core/reshape/merge.py
@@ -1091,6 +1091,9 @@ class _MergeOperation:
                     raise ValueError('len(left_on) must equal the number '
                                      'of levels in the index of "right"')
                 self.right_on = [None] * n
+            if not self.right_on:
+                raise ValueError('both "right_on" and "left_on" '
+                                 'should be passed')
         elif self.right_on is not None:
             n = len(self.right_on)
             if self.left_index:
@@ -1098,6 +1101,9 @@ class _MergeOperation:
                     raise ValueError('len(right_on) must equal the number '
                                      'of levels in the index of "left"')
                 self.left_on = [None] * n
+            if not self.left_on:
+                raise ValueError('both "right_on" and "left_on" '
+                                 'should be passed')
         if len(self.right_on) != len(self.left_on):
             raise ValueError("len(right_on) must equal len(left_on)")
 
diff --git a/pandas/tests/reshape/merge/test_merge.py b/pandas/tests/reshape/merge/test_merge.py
index 8eb414155..940e4d09a 100644
--- a/pandas/tests/reshape/merge/test_merge.py
+++ b/pandas/tests/reshape/merge/test_merge.py
@@ -1026,6 +1026,22 @@ class TestMerge:
         result = merge(left, right, on=['a', 'b'], validate='1:1')
         assert_frame_equal(result, expected_multi)
 
+    @pytest.mark.parametrize('merge_type', ['left_on', 'right_on'])
+    def test_missing_on_raises(self, merge_type):
+        # GH26824
+        left = DataFrame({
+            'A': [1, 2, 3, 4, 5, 6],
+            'B': ['P', 'Q', 'R', 'S', 'T', 'U']
+        })
+        right = DataFrame({
+            'A': [1, 2, 4, 5, 7, 8],
+            'C': ['L', 'M', 'N', 'O', 'P', 'Q']
+        })
+        msg = 'must equal'
+        kwargs = {merge_type: 'A'}
+        with pytest.raises(ValueError, match=msg):
+            pd.merge(left, right, how='left', **kwargs)
+
     def test_merge_two_empty_df_no_division_error(self):
         # GH17776, PR #17846
         a = pd.DataFrame({'a': [], 'b': [], 'c': []})
@@ -1763,3 +1779,20 @@ def test_merge_equal_cat_dtypes2():
 
     # Categorical is unordered, so don't check ordering.
     tm.assert_frame_equal(result, expected, check_categorical=False)
+
+
+@pytest.mark.parametrize('merge_type', ['left_on', 'right_on'])
+def test_missing_on_raises(merge_type):
+    # GH26824
+    df1 = DataFrame({
+        'A': [1, 2, 3, 4, 5, 6],
+        'B': ['P', 'Q', 'R', 'S', 'T', 'U']
+    })
+    df2 = DataFrame({
+        'A': [1, 2, 4, 5, 7, 8],
+        'C': ['L', 'M', 'N', 'O', 'P', 'Q']
+    })
+    msg = 'must equal'
+    kwargs = {merge_type: 'A'}
+    with pytest.raises(ValueError, match=msg):
+        pd.merge(df1, df2, how='left', **kwargs)

@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

lgtm. @WillAyd if any other comments; merge on green.

@jreback jreback removed this from the 0.25.0 milestone Jun 28, 2019
@TomAugspurger
Copy link
Contributor

Looks like CI is failing here.

@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

@harshit-py can you merge master and move the release note to 1.0

@WillAyd
Copy link
Member

WillAyd commented Sep 20, 2019

Nice PR but CI is all red and I think this has gone stale. Closing for now but @harshit-py please ping if you'd like to pick back up

@WillAyd WillAyd closed this Sep 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Left join without passing right_on raises TypeError instead of intended ValueError
6 participants