-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG:Sanity check on merge parameters for correct exception #26824 #26855
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26855 +/- ##
==========================================
- Coverage 91.88% 91.87% -0.02%
==========================================
Files 179 179
Lines 50696 50700 +4
==========================================
- Hits 46581 46579 -2
- Misses 4115 4121 +6
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26855 +/- ##
=========================================
Coverage ? 41.09%
=========================================
Files ? 180
Lines ? 50747
Branches ? 0
=========================================
Hits ? 20856
Misses ? 29891
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for this? Typically the first thing we ask for as part of any PR
Sure, just to confirm this test would go in pandas/tests/reshape/merge/test_merge.py? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Can you also add a whatsnew note for v0.25 describing that this now raises a ValueError when both arguments are not provided?
|
||
|
||
@pytest.mark.parametrize('merge_type', ['left_on', 'right_on']) | ||
def test_merge_correct_exception(merge_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just call this test_missing_on_raises
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
'C': ['L', 'M', 'N', 'O', 'P', 'Q'] | ||
}) | ||
msg = 'both left_on and right_on should be passed' | ||
if merge_type == 'left_on': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than the if condition since you are parmatrizing this argument (which is nice btw) you can send it through as kwargs. So:
kwargs = {merge_type: 'A'}
with pytest.raises(ValueError, match=msg):
pd.merge(df1, df2, how='left', **kwargs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, will do
@simonjayhawkins if you care to take a look |
@@ -1089,13 +1089,19 @@ def _validate_specification(self): | |||
raise ValueError('len(left_on) must equal the number ' | |||
'of levels in the index of "right"') | |||
self.right_on = [None] * n | |||
if not self.right_on: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we instead try to set
self.left_on = self.left_on or [None] * len( self.right_on)
and same for right_on
before these if clauses; that way don’t need to add additional checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
pandas/core/reshape/merge.py
Outdated
elif self.right_on is not None: | ||
n = len(self.right_on) | ||
if self.left_index: | ||
if len(self.right_on) != self.left.index.nlevels: | ||
raise ValueError('len(right_on) must equal the number ' | ||
'of levels in the index of "left"') | ||
self.left_on = [None] * n | ||
self.left_on = self.left_on or [None] * len(self.right_on) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don’t need the lines above (not both cases)
also can use n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I get it. The previous line needs to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try it - i think it’s unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u can also combine the condition in 1095 and 1096 i think as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left_index
and right_index
are False
by default (in the _MergeOperation instantiation), so both left_on and right_on would need to be defined out of the if clause at 1095 and 1087 (correct me if I'm wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok see if u can simplify things with passing tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separately, can you create an issue (PR would also be fantastic), do factor out the validation / error checking tests on parameters out of test_merge.py to test_merge_validation.py
pandas/core/reshape/merge.py
Outdated
@@ -1089,13 +1089,15 @@ def _validate_specification(self): | |||
raise ValueError('len(left_on) must equal the number ' | |||
'of levels in the index of "right"') | |||
self.right_on = [None] * n | |||
self.right_on = self.right_on or [None] * (n + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need to add 1 this seems odd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, the thinking here was to not add additional check and force the cases where either of left_on or right_on is absent to the if clause at 1099.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
but i think this is actually changing things slightly
maybe let’s go back to your original patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but i think this is actually changing things slightly
additional message checks maybe should be added to the following tests as a precursor. xref #23922
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(MergeError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(ValueError):
pandas/tests/reshape/merge/test_merge_asof.py: with pytest.raises(ValueError):
pandas/tests/reshape/merge/test_multi.py: with pytest.raises(ValueError):
pandas/tests/reshape/merge/test_multi.py: with pytest.raises(ValueError):
also the following tests perhaps should be moved to pandas/tests/reshape/merge/test_merge.py ( could potentially add additional parametrisation to these tests rather than creating new)
pandas/pandas/tests/reshape/merge/test_join.py
Lines 202 to 247 in a738223
def test_join_on_fails_with_different_right_index(self): | |
df = DataFrame({'a': np.random.choice(['m', 'f'], size=3), | |
'b': np.random.randn(3)}) | |
df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10), | |
'b': np.random.randn(10)}, | |
index=tm.makeCustomIndex(10, 2)) | |
msg = (r'len\(left_on\) must equal the number of levels in the index' | |
' of "right"') | |
with pytest.raises(ValueError, match=msg): | |
merge(df, df2, left_on='a', right_index=True) | |
def test_join_on_fails_with_different_left_index(self): | |
df = DataFrame({'a': np.random.choice(['m', 'f'], size=3), | |
'b': np.random.randn(3)}, | |
index=tm.makeCustomIndex(3, 2)) | |
df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10), | |
'b': np.random.randn(10)}) | |
msg = (r'len\(right_on\) must equal the number of levels in the index' | |
' of "left"') | |
with pytest.raises(ValueError, match=msg): | |
merge(df, df2, right_on='b', left_index=True) | |
def test_join_on_fails_with_different_column_counts(self): | |
df = DataFrame({'a': np.random.choice(['m', 'f'], size=3), | |
'b': np.random.randn(3)}) | |
df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10), | |
'b': np.random.randn(10)}, | |
index=tm.makeCustomIndex(10, 2)) | |
msg = r"len\(right_on\) must equal len\(left_on\)" | |
with pytest.raises(ValueError, match=msg): | |
merge(df, df2, right_on='a', left_on=['a', 'b']) | |
@pytest.mark.parametrize("wrong_type", [2, 'str', None, np.array([0, 1])]) | |
def test_join_on_fails_with_wrong_object_type(self, wrong_type): | |
# GH12081 - original issue | |
# GH21220 - merging of Series and DataFrame is now allowed | |
# Edited test to remove the Series object from test parameters | |
df = DataFrame({'a': [1, 1]}) | |
msg = ("Can only merge Series or DataFrame objects, a {} was passed" | |
.format(str(type(wrong_type)))) | |
with pytest.raises(TypeError, match=msg): | |
merge(wrong_type, df, left_on='a', right_on='a') | |
with pytest.raises(TypeError, match=msg): | |
merge(df, wrong_type, left_on='a', right_on='a') |
@pytest.mark.parametrize('merge_type', ['left_on', 'right_on']) | ||
def test_missing_on_raises(merge_type): | ||
# GH26824 | ||
df1 = DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call these left & right
@@ -1763,3 +1763,20 @@ def test_merge_equal_cat_dtypes2(): | |||
|
|||
# Categorical is unordered, so don't check ordering. | |||
tm.assert_frame_equal(result, expected, check_categorical=False) | |||
|
|||
|
|||
@pytest.mark.parametrize('merge_type', ['left_on', 'right_on']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this near the test_validation test
I think the An exception with the pd.merge(left, right, left_on=['key1'], right_on=['key1', 'key2']) # ValueError: len(right_on) must equal len(left_on) In cases such as this issue where neither pd.merge(left, right, left_on=['key1']) # MergeError: Must pass right_on or right_index=True So it could be that, this exception is not being raised, since in #26824, as well as not passing either If we add the right_on to the issue:
we get the more informative So probably this should also be the exception expected in this scenarios even when neither |
@@ -99,6 +99,7 @@ Other Enhancements | |||
- Error message for missing required imports now includes the original import error's text (:issue:`23868`) | |||
- :class:`DatetimeIndex` and :class:`TimedeltaIndex` now have a ``mean`` method (:issue:`24757`) | |||
- :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`) | |||
- :func:`pandas.merge` now raises ``ValueError`` when either of ``left_on`` or ``right_on`` is not provided (:issue:`26855`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate slightly; this only matters if on
itself is not provided. Do we have a check if on
is provide and either of left_on
or right_on
is not None
?
This reverts commit 78f7f117aaaae5db03f2bf62f32f884412109899.
This reverts commit 0cb8843.
Hello @harshit-py! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-06-28 14:17:00 UTC |
@jreback Hey I think I messed this up while trying to resolve conflicts in my PR. Can I open a separate PR for this? |
@harshit-py I made harshit-py#1. I think if you merge that, it'll merge cleanly against master. diff --git a/doc/source/whatsnew/v0.25.0.rst b/doc/source/whatsnew/v0.25.0.rst
index a58cdc8c9..01aa9ad97 100644
--- a/doc/source/whatsnew/v0.25.0.rst
+++ b/doc/source/whatsnew/v0.25.0.rst
@@ -74,7 +74,6 @@ a dict to a Series groupby aggregation (:ref:`whatsnew_0200.api_breaking.depreca
See :ref:`groupby.aggregate.named` for more.
-
.. _whatsnew_0250.enhancements.multi_index_repr:
Better repr for MultiIndex
@@ -105,7 +104,6 @@ than :attr:`options.display.max_seq_items` (default: 100 items). Horizontally,
the output will truncate, if it's wider than :attr:`options.display.width`
(default: 80 characters).
-
.. _whatsnew_0250.enhancements.other:
Other Enhancements
@@ -131,6 +129,7 @@ Other Enhancements
- Error message for missing required imports now includes the original import error's text (:issue:`23868`)
- :class:`DatetimeIndex` and :class:`TimedeltaIndex` now have a ``mean`` method (:issue:`24757`)
- :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`)
+- :func:`pandas.merge` now raises ``ValueError`` when either of ``left_on`` or ``right_on`` is not provided (:issue:`26855`)
- Added support for reading SPSS .sav files using :func:`read_spss` (:issue:`26537`)
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
@@ -538,6 +537,7 @@ Other Deprecations
- :meth:`Timedelta.resolution` is deprecated and replaced with :meth:`Timedelta.resolution_string`. In a future version, :meth:`Timedelta.resolution` will be changed to behave like the standard library :attr:`timedelta.resolution` (:issue:`21344`)
- :meth:`Series.to_sparse`, :meth:`DataFrame.to_sparse`, :meth:`Series.to_dense` and :meth:`DataFrame.to_dense` are deprecated and will be removed in a future version. (:issue:`26557`).
+
.. _whatsnew_0250.prior_deprecations:
Removal of prior version deprecations/changes
diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py
index d21ad58e7..decc21ed4 100644
--- a/pandas/core/reshape/merge.py
+++ b/pandas/core/reshape/merge.py
@@ -1091,6 +1091,9 @@ class _MergeOperation:
raise ValueError('len(left_on) must equal the number '
'of levels in the index of "right"')
self.right_on = [None] * n
+ if not self.right_on:
+ raise ValueError('both "right_on" and "left_on" '
+ 'should be passed')
elif self.right_on is not None:
n = len(self.right_on)
if self.left_index:
@@ -1098,6 +1101,9 @@ class _MergeOperation:
raise ValueError('len(right_on) must equal the number '
'of levels in the index of "left"')
self.left_on = [None] * n
+ if not self.left_on:
+ raise ValueError('both "right_on" and "left_on" '
+ 'should be passed')
if len(self.right_on) != len(self.left_on):
raise ValueError("len(right_on) must equal len(left_on)")
diff --git a/pandas/tests/reshape/merge/test_merge.py b/pandas/tests/reshape/merge/test_merge.py
index 8eb414155..940e4d09a 100644
--- a/pandas/tests/reshape/merge/test_merge.py
+++ b/pandas/tests/reshape/merge/test_merge.py
@@ -1026,6 +1026,22 @@ class TestMerge:
result = merge(left, right, on=['a', 'b'], validate='1:1')
assert_frame_equal(result, expected_multi)
+ @pytest.mark.parametrize('merge_type', ['left_on', 'right_on'])
+ def test_missing_on_raises(self, merge_type):
+ # GH26824
+ left = DataFrame({
+ 'A': [1, 2, 3, 4, 5, 6],
+ 'B': ['P', 'Q', 'R', 'S', 'T', 'U']
+ })
+ right = DataFrame({
+ 'A': [1, 2, 4, 5, 7, 8],
+ 'C': ['L', 'M', 'N', 'O', 'P', 'Q']
+ })
+ msg = 'must equal'
+ kwargs = {merge_type: 'A'}
+ with pytest.raises(ValueError, match=msg):
+ pd.merge(left, right, how='left', **kwargs)
+
def test_merge_two_empty_df_no_division_error(self):
# GH17776, PR #17846
a = pd.DataFrame({'a': [], 'b': [], 'c': []})
@@ -1763,3 +1779,20 @@ def test_merge_equal_cat_dtypes2():
# Categorical is unordered, so don't check ordering.
tm.assert_frame_equal(result, expected, check_categorical=False)
+
+
+@pytest.mark.parametrize('merge_type', ['left_on', 'right_on'])
+def test_missing_on_raises(merge_type):
+ # GH26824
+ df1 = DataFrame({
+ 'A': [1, 2, 3, 4, 5, 6],
+ 'B': ['P', 'Q', 'R', 'S', 'T', 'U']
+ })
+ df2 = DataFrame({
+ 'A': [1, 2, 4, 5, 7, 8],
+ 'C': ['L', 'M', 'N', 'O', 'P', 'Q']
+ })
+ msg = 'must equal'
+ kwargs = {merge_type: 'A'}
+ with pytest.raises(ValueError, match=msg):
+ pd.merge(df1, df2, how='left', **kwargs) |
lgtm. @WillAyd if any other comments; merge on green. |
Looks like CI is failing here. |
@harshit-py can you merge master and move the release note to 1.0 |
Nice PR but CI is all red and I think this has gone stale. Closing for now but @harshit-py please ping if you'd like to pick back up |
right_on
raises TypeError instead of intended ValueError #26824git diff upstream/master -u -- "*.py" | flake8 --diff