Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: Index.append() refactoring #16236

Merged
merged 5 commits into from
Aug 22, 2017
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion pandas/core/dtypes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
from .generic import (ABCCategorical, ABCPeriodIndex,
ABCDatetimeIndex, ABCSeries,
ABCSparseArray, ABCSparseSeries, ABCCategoricalIndex,
ABCIndexClass)
ABCIndexClass, ABCRangeIndex)
from .inference import is_string_like
from .inference import * # noqa

Expand Down Expand Up @@ -220,6 +220,10 @@ def is_categorical(arr):
return isinstance(arr, ABCCategorical) or is_categorical_dtype(arr)


def is_range(arr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not similar to the other methods, which detect a type. a RangeIndex is not a type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only called in get_dtype_kinds, see below, so I could have just used isinstance(arr, ABCRangeIndex). But I tried to be coherent with is_categorical and is_sparse... what exactly is a "type"?!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are conflating a type with an Index. is_categorical will detect a categorical type which could be a Categorical (or the dtype=='category'), a CategoricalIndex happens to be of this type as well.

However RangeIndex is simply an Index subclassing Int64Index. its not a type (its dtype is int64). types can be the dtype of an Index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [3]: pd.Series([1,2,3]).to_sparse().dtype
Out[3]: dtype('int64')

... still,

In [4]: _concat.get_dtype_kinds([pd.Series([1,2,3]).to_sparse()])
Out[4]: {'sparse'}

... so the type returned by get_dtype_kinds is already not the dtype, and not even the dtype.kind.

But anyway, I don't particularly care about changing get_dtype_kinds. We just need a method which can tell us whether two indexes can be natively concatenated: this is currently get_dtype_kinds, but I can write a new one if you prefer.

return isinstance(arr, ABCRangeIndex)


def is_datetimetz(arr):
"""
Check whether an array-like is a datetime array-like with a timezone
Expand Down
38 changes: 38 additions & 0 deletions pandas/core/dtypes/concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
is_object_dtype,
is_bool_dtype,
is_dtype_equal,
is_range,
_NS_DTYPE,
_TD_DTYPE)
from pandas.core.dtypes.generic import (
Expand Down Expand Up @@ -45,6 +46,8 @@ def get_dtype_kinds(l):
# if to_concat contains different tz,
# the result must be object dtype
typ = str(arr.dtype)
elif is_range(arr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very strange. why are you adding this here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are directly calling the routine (and you don't handle typ='range' anywhere), so not sure this is even hit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required so that Index._concat distinguishes between RangeIndex and Int64Index - as they should not be treated as equal (e.g. when appending). The range is indeed an arbitrary label.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I see, then you should be explicit about testing what you need, e.g. ABCInt64Inde,
(rather than adding a helper function which serves no other purpose)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index._concat has no idea of the specific types of indexes, and rightly so... it uses get_dtype_kinds just to test whether different types of indexes are being concatenated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply check isinstance(arr, ABCRangeIndex), you are special casing this so I don't find this a problem. We don't have a special 'type' for this index so is_int64_dtype would not work here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with using isinstance(arr, ABCRangeIndex) in get_dtype_kinds... but I'm not special casing Index._concat: I'm only overwriting RangeIndex._append_same_dtype. So I need get_dtype_kinds to (also) distinguish RangeIndex. Isn't the point of get_dtype_kinds precisely to distinguish stuff which can be concatenated "natively" together?!

Copy link
Member Author

@toobaz toobaz Aug 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a special 'type' for this index so is_int64_dtype would not work here.

I had missed this, and hence maybe our misunderstanding: pd.RangeIndex(3).dtype is dtype('int64')... which makes sense, but is not what we want _concat.get_dtype_kinds to consider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, if it is for a single call, I would just do the isinstance(arr, ABCRangeIndex) here instead of defining the new function

typ = 'range'
elif is_datetime64_dtype(dtype):
typ = 'datetime'
elif is_timedelta64_dtype(dtype):
Expand Down Expand Up @@ -559,3 +562,38 @@ def convert_sparse(x, axis):
# coerce to object if needed
result = result.astype('object')
return result


def _concat_indexes_same_dtype_rangeindex(indexes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe _concat_rangeindex_same_dtype ? (little bit shorter and also clear I think)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about what this is doing / guarantees and example would be nice as well.

start = step = next = None

for obj in indexes:
if not len(obj):
continue

if start is None:
# This is set by the first non-empty index
start = obj._start
if step is None and len(obj) > 1:
step = obj._step
elif step is None:
# First non-empty index had only one element
if obj._start == start:
return _concat_index_asobject(indexes)
step = obj._start - start

non_consecutive = ((step != obj._step and len(obj) > 1) or
(next is not None and obj._start != next))
if non_consecutive:
# Not nice... but currently what happens in NumericIndex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment needed?

Copy link
Member Author

@toobaz toobaz Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it as a reminder: I would have liked to use

return Int64Index._append_same_dtype([ix.astype(int) for ix in indexes])

... but then, numeric indexes currently do not special-case _append_same_dtype, so we end up calling _concat_index_asobject anyway.

But I can remove it, and just take a note of this TODO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, OK to keep it then, but maybe make it a bit more informative (or remove the 'not nice ', just say that it is what is used by Int64Index._append_same_dtype)

(I also don't think it would give that much of gain for making a special cased one of integers)

return _concat_index_asobject(indexes)

if step is not None:
next = obj[-1] + step

if start is None:
start = obj._start
step = obj._step
stop = obj._stop if next is None else next
return indexes[0].__class__(start, stop, step)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ugly... but the only alternative I see (an import inside the function) is uglier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its ok here

Copy link
Member

@jorisvandenbossche jorisvandenbossche Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative would be is to return the (start, stop, step) and do the construction in RangeIndex._append_same_dtype (then also the rename there is not needed)

Ah no, this wouldn't play nice with the case when no range index is returned, ignore this

7 changes: 3 additions & 4 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1741,10 +1741,9 @@ def append(self, other):
names = set([obj.name for obj in to_concat])
name = None if len(names) > 1 else self.name

if self.is_categorical():
# if calling index is category, don't check dtype of others
from pandas.core.indexes.category import CategoricalIndex
return CategoricalIndex._append_same_dtype(self, to_concat, name)
return self._concat(to_concat, name)

def _concat(self, to_concat, name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you call this _append ? (then it is more in line with _append_same_dtype)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it would make more sense to change _append_same_dtype to _concat_same_dtype (also in IntervalIndex, DatetimeIndex, CategoryIndex), since it already disregards self (it is conceptually a @classmethod). Shall I proceed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it is only use by append, I prefer using append in the name, but no strong feelings

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that it's currently used only by append, but usually you expect x.append(y) to concatenate x to y or to elements of y; instead this only concatenates elements of y. So since you don't object I will go with my proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead this only concatenates elements of y

in the end it is used to concatenate both y to x, just that this is passed like that in append to this helper function. So it is still only used for append.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is still only used for append.

Sure, I don't object to that. We can agree it is a concat operation used to implement appending: the switch happens when append(self, other) does to_concat = [self] + list(other).


typs = _concat.get_dtype_kinds(to_concat)

Expand Down
4 changes: 4 additions & 0 deletions pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -633,6 +633,10 @@ def insert(self, loc, item):
codes = np.concatenate((codes[:loc], code, codes[loc:]))
return self._create_from_codes(codes)

def _concat(self, to_concat, name):
# if calling index is category, don't check dtype of others
return CategoricalIndex._append_same_dtype(self, to_concat, name)

def _append_same_dtype(self, to_concat, name):
"""
Concatenate to_concat which has the same class
Expand Down
60 changes: 4 additions & 56 deletions pandas/core/indexes/range.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
from pandas.compat.numpy import function as nv
from pandas.core.indexes.base import Index, _index_shared_docs
from pandas.util._decorators import Appender, cache_readonly
import pandas.core.dtypes.concat as _concat
import pandas.core.indexes.base as ibase

from pandas.core.indexes.numeric import Int64Index
Expand Down Expand Up @@ -443,62 +444,9 @@ def join(self, other, how='left', level=None, return_indexers=False,
return super(RangeIndex, self).join(other, how, level, return_indexers,
sort)

def append(self, other):
"""
Append a collection of Index options together

Parameters
----------
other : Index or list/tuple of indices

Returns
-------
appended : RangeIndex if all indexes are consecutive RangeIndexes,
otherwise Int64Index or Index
"""

to_concat = [self]

if isinstance(other, (list, tuple)):
to_concat = to_concat + list(other)
else:
to_concat.append(other)

if not all([isinstance(i, RangeIndex) for i in to_concat]):
return super(RangeIndex, self).append(other)

start = step = next = None

for obj in to_concat:
if not len(obj):
continue

if start is None:
# This is set by the first non-empty index
start = obj._start
if step is None and len(obj) > 1:
step = obj._step
elif step is None:
# First non-empty index had only one element
if obj._start == start:
return super(RangeIndex, self).append(other)
step = obj._start - start

non_consecutive = ((step != obj._step and len(obj) > 1) or
(next is not None and obj._start != next))
if non_consecutive:
return super(RangeIndex, self).append(other)

if step is not None:
next = obj[-1] + step

if start is None:
start = obj._start
step = obj._step
stop = obj._stop if next is None else next
names = set([obj.name for obj in to_concat])
name = None if len(names) > 1 else self.name
return RangeIndex(start, stop, step, name=name)
def _append_same_dtype(self, indexes, name):
return _concat._concat_indexes_same_dtype_rangeindex(indexes
).rename(name)

def __len__(self):
"""
Expand Down