Gh 36562 typeerror comparison not supported between float and str #37096

Merged
merged 23 commits into from
Nov 4, 2020

Conversation

ssche
Contributor

@ssche ssche commented Oct 13, 2020

ssche added 4 commits October 13, 2020 20:44
* Use special sorting comparator for tuple arrays which can be created when combine_first is called on DataFrames with MultiIndex which contain nan and string values
…r-comparison-not-supported-between-float-and-str

# Conflicts:
#	doc/source/whatsnew/v1.2.0.rst
Member

@arw2019 arw2019 left a comment

Thanks @ssche for working on this!

Some comments. In the code they're almost all cosmetic. In the combine_first test you want to be using pandas._testing methods

pandas/core/algorithms.py Outdated Show resolved Hide resolved
pandas/core/algorithms.py Outdated Show resolved Hide resolved
# unorderable in py3 if mixed str/int
ordered = sort_mixed(values)
elif not ext_arr and values.size and isinstance(values[0], tuple):
# 1-D arrays with tuples of potentially mixed type (solves GH36562)
Member

I would skip this comment.
In general we don't leave references to issues in the code; git blame keeps track of that.
You already have a comment above in sort_tuples that explains what it does, so I think there is no need to repeat it here.

pandas/tests/indexing/multiindex/test_multiindex.py Outdated Show resolved Hide resolved
pandas/tests/indexing/multiindex/test_multiindex.py Outdated Show resolved Hide resolved
y = values[index_y]
if x == y:
return 0
len_x = len(x)
Member

I wouldn't explicitly assign these, just use them directly whenever they're needed
it's an O(1) lookup so this doesn't save any time but makes code a bit more verbose

return -1
if i >= len_y:
return +1
x_i_na = isna(x[i])
Member

x_is_na and y_is_na

x_i_na = isna(x[i])
y_i_na = isna(y[i])
# values are the same -> resolve tie with next element
if (x_i_na and y_i_na) or (x[i] == y[i]):
Member

why do we need (x_i_na and y_i_na)? Could one of them be null?

Contributor Author

equality check - both of the values are the same (both na or both the same non-na value).
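
The need for the explicit both-NA check can be illustrated with a small standalone sketch. It uses only the standard library; `same_element` is a hypothetical helper written for this illustration, not the code from the PR, and it only recognises float NaN (pandas' `isna` covers more missing-value types).

```python
import math

nan = float("nan")


def same_element(a, b):
    """Treat two missing values as equal; otherwise fall back to ==."""
    a_is_na = isinstance(a, float) and math.isnan(a)
    b_is_na = isinstance(b, float) and math.isnan(b)
    return (a_is_na and b_is_na) or a == b


print(nan == nan)              # False: NaN never compares equal to itself
print(same_element(nan, nan))  # True: both missing counts as a tie
print(same_element("a", nan))  # False: only one side is missing
print(same_element("a", "a"))  # True
```

Without the first half of the condition, two tuples that both hold NaN at the same position would never be treated as tied at that position.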

# values are the same -> resolve tie with next element
if (x_i_na and y_i_na) or (x[i] == y[i]):
continue
# check for nan values (sort nan to the end which is consistent
Member

could do this before checking for equality?

Member

I'd say no need to comment about consistency with numpy in the code (but thanks for noting it here!)

Contributor Author

could do this before checking for equality?

sure, but why?

return 0
len_x = len(x)
len_y = len(y)
for i in range(max(len_x, len_y)):
Member

generally speaking I wonder if the control flow can be simplified a bit here. you have a number of ifs but only two possible outputs

Contributor Author

3 outcomes (=,<,>), but in combination with nan, they increase.

there are 9 ifs of which 2 are not relevant (shortcut before the loop, fall-through after the loop).

7 ifs remaining:

  1. equality case (continue)
  2. greater than case
  3. smaller than case
  4. x is nan, but y is not
  5. y is nan, but x is not
  6. xvec is smaller than yvec
  7. yvec is smaller than xvec

I could merge some of the cases (with same return value), but that would compromise readability and require nan checks to be done earlier.
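
The seven cases listed above can be sketched as a three-way comparator. This is a minimal standalone illustration, not the code from the PR: it compares tuples directly rather than indices into `values`, `_is_na` and `compare_tuples` are hypothetical names, and only float NaN is treated as missing. Elements that are neither equal nor NaN are assumed to be mutually comparable; a str/int mix in the same position would still raise.

```python
import math
from functools import cmp_to_key


def _is_na(value):
    return isinstance(value, float) and math.isnan(value)


def compare_tuples(x, y):
    """Three-way comparison of two tuples; NaN sorts after any other value."""
    for a, b in zip(x, y):
        a_na, b_na = _is_na(a), _is_na(b)
        if (a_na and b_na) or a == b:
            continue  # tie at this position, look at the next one
        if a_na:
            return 1  # x holds NaN here, so x sorts after y
        if b_na:
            return -1  # y holds NaN here, so x sorts before y
        return -1 if a < b else 1
    # every shared position ties: the shorter tuple sorts first
    return (len(x) > len(y)) - (len(x) < len(y))


nan = float("nan")
values = [("b", 1), (nan, 6), ("a", 4), ("a", 1), ("b",)]
result = sorted(values, key=cmp_to_key(compare_tuples))
print(result)  # [('a', 1), ('a', 4), ('b',), ('b', 1), (nan, 6)]
```

Keeping each case as its own branch mirrors the readability argument made above; the cost is that the comparator runs in Python for every pair of elements compared.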

Contributor Author

ssche commented Oct 14, 2020

@arw2019 comments addressed (most of them), please review.

def cmp_func(index_x, index_y):
x = values[index_x]
y = values[index_y]
# shortcut loop in case both tuples are the same
Contributor

@jreback jreback Oct 14, 2020

you don't need to do any of this

see safe_sort

Contributor Author

@ssche ssche Oct 14, 2020

what are you saying? what isn't needed?

this PR is changing safe_sort to accommodate mixed-type tuples

Contributor

and how does this not do it now!

iirc this is adding a lot of non-performant code for the purpose of fixing the error message?

Contributor Author

and how does this not do it now!

now a TypeError will be raised in df.combine_first when the df contains a MultiIndex with nan and string values (sort_mixed, which is used now, fails in that case).

iirc this is adding a lot of non performing code for the purpose of fixing the error message?

in the event of all other (more performant) sorters failing, this (unarguably slower) sorter will be used. this should not compromise any other use case's performance, but at least makes the code in the ticket description succeed (slower, but at least not failing).
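
The failure and the fallback idea described here can be reproduced without pandas. The sketch below is an illustration under stated assumptions: `na_last_key` and `sort_with_fallback` are hypothetical helpers, not pandas functions, and the slow path uses a sort key rather than the comparator from the PR.

```python
import math

nan = float("nan")
values = [("b", 1), (nan, 6), ("a", 4)]

# The plain ordering fails as soon as a str has to be compared with NaN.
try:
    sorted(values)
    failed = False
except TypeError as exc:
    failed = True
    print(f"plain sort failed: {exc}")


def na_last_key(tup):
    # Replace each element with an (is_missing, value) pair so that NaN is
    # never compared against a str; the flag alone decides that comparison.
    return tuple(
        (1, 0) if isinstance(v, float) and math.isnan(v) else (0, v)
        for v in tup
    )


def sort_with_fallback(items, slow_key):
    """Try the fast built-in ordering first; use slow_key only if it fails."""
    try:
        return sorted(items)
    except TypeError:
        return sorted(items, key=slow_key)


print(sort_with_fallback(values, na_last_key))
```

Inputs that the fast path can already order never reach the slower key, which is the property argued for in this comment.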

Contributor Author

@jreback I moved the slow sort_tuples method so it really isn't called when any of the other (faster) sorters can work (2677166#diff-c8f3ad29eaf121537b999e88e9117f3e3702d0b818a67516da25093fe2890ce8R2114). please have another look and provide feedback.

@ssche ssche requested a review from jreback October 16, 2020 06:05
doc/source/whatsnew/v1.2.0.rst Outdated Show resolved Hide resolved
@@ -2055,13 +2056,52 @@ def sort_mixed(values):
strs = np.sort(values[str_pos])
return np.concatenate([nums, np.asarray(strs, dtype=object)])

def sort_tuples(values):
# sorts tuples with mixed values. can handle nan vs string comparisons.
def cmp_func(index_x, index_y):
Contributor

yes let's change the name and please add typing for cmp_func and sort_tuples. (sort_mixed if you can as well :->)

@@ -2055,20 +2056,63 @@ def sort_mixed(values):
strs = np.sort(values[str_pos])
return np.concatenate([nums, np.asarray(strs, dtype=object)])

def sort_tuples(values):
Contributor

Can you simply remove the NaNs from values, sort the rest, and add them back at the end? Then you only need to fix up sort_mixed. We do this in a number of places, e.g. something like

In [204]: arr = np.array(['foo', 3, np.nan], dtype=object)

In [205]: mask = pd.notna(arr)

In [206]: arr.take(np.arange(len(arr))[mask])
Out[206]: array(['foo', 3], dtype=object)
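
The suggestion can be sketched for a flat sequence in plain Python: split off the missing values, order numbers before strings as the `sort_mixed` lines in the diff do, and append the missing values last. `sort_mixed_na_last` is a hypothetical name used only for this illustration, and only float NaN is treated as missing.

```python
import math


def sort_mixed_na_last(values):
    """Sort numbers, then strings, then missing values (illustrative helper)."""

    def is_na(v):
        return isinstance(v, float) and math.isnan(v)

    nans = [v for v in values if is_na(v)]
    rest = [v for v in values if not is_na(v)]
    nums = sorted(v for v in rest if not isinstance(v, str))
    strs = sorted(v for v in rest if isinstance(v, str))
    return nums + strs + nans


result = sort_mixed_na_last(["foo", 3, float("nan"), "bar", 1])
print(result)  # [1, 3, 'bar', 'foo', nan]
```

This works element by element on a flat sequence; it does not by itself handle an array whose elements are tuples.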

Contributor Author

@ssche ssche Oct 17, 2020

I started off with trying to fix sort_mixed and removed nans as you suggested and would have loved to use it instead. however the structure of the array doesn't allow for that (we are talking about a 1-d array of tuples, i.e. dtype object, not an N-d array which would allow for all those vector operations to be applicable).

here's the data of values that gets passed in to sort_tuples() when running the test I added:

In[2]: values
Out[2]:
array([('b', 1), ('b', 2), ('c', 3), ('a', 4), ('b', 5), (nan, 6),
       ('a', 1), ('c', 1), ('d', 1)], dtype=object)
In[3]: values[0]
Out[3]: ('b', 1)
In[4]: values.shape
Out[4]: (9,)

I could convert values to an N-d array (and use sort_mixed), but that solution would come with its own overhead costs...

In[5]: np.asarray(list(values))
Out[5]: 
array([['b', '1'],
       ['b', '2'],
       ['c', '3'],
       ['a', '4'],
       ['b', '5'],
       ['nan', '6'],
       ['a', '1'],
       ['c', '1'],
       ['d', '1']], dtype='<U3')

your call.

Contributor Author

any feedback on the above, @jreback?

Contributor

i care about code complexity here, this is adding a lot, try this

@ssche ssche requested a review from jreback October 21, 2020 22:58
@jreback jreback added Numeric Operations Arithmetic, Comparison, and Logical operations Dtype Conversions Unexpected or buggy dtype conversions MultiIndex labels Oct 26, 2020
ssche added 4 commits October 31, 2020 16:31
* Extract column arrays and use pandas' internal functions to obtain index which sorts the array of tuples
* Add function annotations to document expected argument types of `sort_tuples()`
…r-comparison-not-supported-between-float-and-str

# Conflicts:
#	pandas/tests/indexing/multiindex/test_multiindex.py
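
The column-wise approach named in the first commit message above can be sketched without pandas. This is only an illustration of the idea, assuming equal-length tuples; the merged change relies on pandas' own internal helpers, which are not reproduced here, and `argsort_tuples` is a hypothetical name.

```python
import math


def argsort_tuples(values):
    """Return the positions that sort tuples column by column, NaN last."""

    def is_na(v):
        return isinstance(v, float) and math.isnan(v)

    columns = list(zip(*values))  # row-wise tuples -> one sequence per column
    codes = []
    for column in columns:
        uniques = sorted({v for v in column if not is_na(v)})
        rank = {v: i for i, v in enumerate(uniques)}
        # NaN receives a code larger than any real value in the column
        codes.append([len(uniques) if is_na(v) else rank[v] for v in column])
    # compare rows by their integer codes, first column most significant
    return sorted(range(len(values)), key=lambda i: tuple(c[i] for c in codes))


nan = float("nan")
values = [("b", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5), (nan, 6),
          ("a", 1), ("c", 1), ("d", 1)]
order = argsort_tuples(values)
result = [values[i] for i in order]
print(order)  # [6, 3, 0, 1, 4, 7, 2, 8, 5]
```

Because each column is ranked separately, a str is never compared with NaN, and the final ordering only compares integers.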
pandas/core/algorithms.py Outdated Show resolved Hide resolved
pandas/tests/indexing/multiindex/test_multiindex.py Outdated Show resolved Hide resolved
pandas/core/algorithms.py Outdated Show resolved Hide resolved
@jreback jreback added this to the 1.2 milestone Oct 31, 2020
ssche added 2 commits November 1, 2020 07:44
* factor out inner functions
* relocated test case
* simplified try/except
…r-comparison-not-supported-between-float-and-str
@ssche ssche requested a review from jreback November 1, 2020 10:40
…r-comparison-not-supported-between-float-and-str

# Conflicts:
#	doc/source/whatsnew/v1.2.0.rst
#	pandas/tests/frame/methods/test_combine_first.py

pep8speaks commented Nov 3, 2020

Hello @ssche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-03 23:54:13 UTC

@jreback jreback merged commit 83c2e65 into pandas-dev:master Nov 4, 2020
Contributor

jreback commented Nov 4, 2020

thanks @ssche very nice!

@ssche ssche deleted the gh-36562-typeerror-comparison-not-supported-between-float-and-str branch November 4, 2020 22:07
jreback added a commit that referenced this pull request Nov 13, 2020
… (#37655)

nans (can't use `np.sort` as it may fail when str and nan are mixed in a
column as types cannot be compared).
"""
from pandas.core.internals.construction import to_arrays
Member

@ssche two completely contradictory questions/requests

  1. is there a reasonable way to do this without relying on core.internals?
  2. could pd.MultiIndex.from_tuples de-duplicate some code by using to_arrays?

Contributor Author

is there a reasonable way to do this without relying on core.internals?

My first attempt (1320ff1) wasn't using core.internals, as far as I recall, but it was deemed too complex.

could pd.MultiIndex.from_tuples de-duplicate some code by using to_arrays?

Maybe, what are you trying to achieve?

Member

Maybe, what are you trying to achieve?

Just simplification/de-duplication

Contributor Author

I don't know if this is still relevant (I forgot to respond), but I would argue that from_arrays should be preferred (with from_tuples delegating to from_arrays), as from_arrays can preserve type information better because it accepts data in columnar format (instead of row-wise, as from_tuples does).

Anyway, I hope you could proceed with your much appreciated simplification/de-dupe efforts...

Labels
Dtype Conversions Unexpected or buggy dtype conversions MultiIndex Numeric Operations Arithmetic, Comparison, and Logical operations
6 participants