ENH: Intervalindex
closes #7640
closes #8625

reprise of #8707

Author: Jeff Reback <jeff@reback.net>
Author: Stephan Hoyer <shoyer@climate.com>

Closes #15309 from jreback/intervalindex and squashes the following commits:

11ab1e1 [Jeff Reback] merge conflicts
834df76 [Jeff Reback] more docs
fbc1cf8 [Jeff Reback] doc example and bug
7577335 [Jeff Reback] fixup on merge of changes in algorithms.py
3a3e02e [Jeff Reback] sorting example
4333937 [Jeff Reback] api-types test fixing
f0e3ad2 [Jeff Reback] pep
b2d26eb [Jeff Reback] more docs
e5f8082 [Jeff Reback] allow pd.cut to take an IntervalIndex for bins
4a5ebea [Jeff Reback] more tests & fixes for non-unique / overlaps rename _is_contained_in -> contains add sorting test
340c98b [Jeff Reback] CLN/COMPAT: IntervalIndex
74162aa [Stephan Hoyer] API/ENH: IntervalIndex
jreback committed Apr 14, 2017
1 parent 3fde134 commit 9991579
Showing 54 changed files with 4,195 additions and 504 deletions.
20 changes: 20 additions & 0 deletions asv_bench/benchmarks/indexing.py
@@ -226,6 +226,26 @@ def time_is_monotonic(self):
        self.miint.is_monotonic


class IntervalIndexing(object):
    goal_time = 0.2

    def setup(self):
        self.monotonic = Series(np.arange(1000000),
                                index=IntervalIndex.from_breaks(np.arange(1000001)))

    def time_getitem_scalar(self):
        self.monotonic[80000]

    def time_loc_scalar(self):
        self.monotonic.loc[80000]

    def time_getitem_list(self):
        self.monotonic[80000:]

    def time_loc_list(self):
        self.monotonic.loc[80000:]


class PanelIndexing(object):
    goal_time = 0.2

33 changes: 33 additions & 0 deletions doc/source/advanced.rst
@@ -850,6 +850,39 @@ Of course if you need integer based selection, then use ``iloc``
   dfir.iloc[0:5]

.. _indexing.intervallindex:

IntervalIndex
~~~~~~~~~~~~~

.. versionadded:: 0.20.0

.. warning::

   These indexing behaviors are provisional and may change in a future version of pandas.

.. ipython:: python

   df = pd.DataFrame({'A': [1, 2, 3, 4]},
                     index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
   df

Label based indexing via ``.loc`` along the edges of an interval works as you would expect,
selecting that particular interval.

.. ipython:: python

   df.loc[2]
   df.loc[[2, 3]]

If you select a label *contained* within an interval, this will also select the interval.

.. ipython:: python

   df.loc[2.5]
   df.loc[[2.5, 3.5]]

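The lookup rule described above (a label selects the interval that contains it) can be modelled in plain Python. This is only a sketch of the semantics for right-closed intervals built from sorted breaks, which is the default for ``IntervalIndex.from_breaks``. The helper name ``locate`` is hypothetical, and this is not a description of how pandas implements the lookup.

```python
from bisect import bisect_left


def locate(breaks, label):
    """Return the position of the interval (breaks[i], breaks[i + 1]]
    that contains ``label``, or None if no interval contains it.

    Hypothetical helper; assumes ``breaks`` is sorted and strictly increasing.
    """
    # bisect_left finds the first break >= label, which is the right edge
    # of the containing interval when intervals are closed on the right.
    pos = bisect_left(breaks, label) - 1
    if 0 <= pos < len(breaks) - 1:
        return pos
    return None


breaks = [0, 1, 2, 3, 4]   # intervals (0, 1], (1, 2], (2, 3], (3, 4]

edge = locate(breaks, 2)       # an edge label falls in (1, 2]
inner = locate(breaks, 2.5)    # an interior label falls in (2, 3]
outside = locate(breaks, 0)    # 0 is excluded because the left side is open
```

Under these assumptions an edge label and an interior label are resolved by the same rule, which is why ``df.loc[2]`` and ``df.loc[2.5]`` can both select a row.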
Miscellaneous indexing FAQ
--------------------------
21 changes: 21 additions & 0 deletions doc/source/api.rst
@@ -1405,6 +1405,27 @@ Categorical Components
   CategoricalIndex.as_ordered
   CategoricalIndex.as_unordered

.. _api.intervalindex:

IntervalIndex
-------------

.. autosummary::
   :toctree: generated/

   IntervalIndex

IntervalIndex Components
~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: generated/

   IntervalIndex.from_arrays
   IntervalIndex.from_tuples
   IntervalIndex.from_breaks
   IntervalIndex.from_intervals

.. _api.multiindex:

MultiIndex
10 changes: 9 additions & 1 deletion doc/source/reshaping.rst
@@ -517,7 +517,15 @@ Alternatively we can specify custom bin-edges:

.. ipython:: python

   pd.cut(ages, bins=[0, 18, 35, 70])
   c = pd.cut(ages, bins=[0, 18, 35, 70])
   c

.. versionadded:: 0.20.0

If the ``bins`` keyword is an ``IntervalIndex``, then these will be
used to bin the passed data.

.. ipython:: python

   pd.cut([25, 20, 50], bins=c.categories)
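
As a rough illustration of what binning against a fixed set of intervals means, the sketch below assigns each value to the right-closed bin that contains it and yields ``None`` (standing in for ``NaN``) when no bin matches. The function ``assign_bins`` and the tuple representation of the bins are assumptions made for this example and are not part of pandas.

```python
def assign_bins(values, bins):
    """Map each value to the (left, right] bin containing it, else None.

    Hypothetical helper; ``bins`` is a list of (left, right) tuples.
    """
    result = []
    for value in values:
        match = None
        for left, right in bins:
            # Right-closed: the left edge is excluded, the right edge included.
            if left < value <= right:
                match = (left, right)
                break
        result.append(match)
    return result


bins = [(0, 18), (18, 35), (35, 70)]   # same edges as bins=[0, 18, 35, 70]
binned = assign_bins([25, 20, 50], bins)
missing = assign_bins([0, 80], bins)   # values that fall outside every bin
```

The point of passing existing bins is that new data is classified against exactly the same edges instead of having edges recomputed from the new values.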


.. _reshaping.dummies:
58 changes: 58 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -13,6 +13,7 @@ Highlights include:
- ``Panel`` has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_panel>`
- Improved user API when accessing levels in ``.groupby()``, see :ref:`here <whatsnew_0200.enhancements.groupby_access>`
- Improved support for UInt64 dtypes, see :ref:`here <whatsnew_0200.enhancements.uint64_support>`
- Addition of an ``IntervalIndex`` and ``Interval`` scalar type, see :ref:`here <whatsnew_0200.enhancements.intervalindex>`
- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
- Window Binary Corr/Cov operations return a MultiIndexed ``DataFrame`` rather than a ``Panel``, as ``Panel`` is now deprecated, see :ref:`here <whatsnew_0200.api_breaking.rolling_pairwise>`
- Support for S3 handling now uses ``s3fs``, see :ref:`here <whatsnew_0200.api_breaking.s3>`
@@ -314,6 +315,63 @@ To convert a ``SparseDataFrame`` back to sparse SciPy matrix in COO format, you

sdf.to_coo()

.. _whatsnew_0200.enhancements.intervalindex:

IntervalIndex
^^^^^^^^^^^^^

pandas has gained an ``IntervalIndex`` with its own dtype, ``interval``, as well as the ``Interval`` scalar type. These allow first-class support for interval
notation, specifically as a return type for the categories in ``pd.cut`` and ``pd.qcut``. The ``IntervalIndex`` also allows some unique indexing, see the
:ref:`docs <indexing.intervallindex>`. (:issue:`7640`, :issue:`8625`)

Previous behavior:

.. code-block:: ipython

   In [2]: pd.cut(range(3), 2)
   Out[2]:
   [(-0.002, 1], (-0.002, 1], (1, 2]]
   Categories (2, object): [(-0.002, 1] < (1, 2]]

   # the returned categories are strings, representing Intervals
   In [3]: pd.cut(range(3), 2).categories
   Out[3]: Index(['(-0.002, 1]', '(1, 2]'], dtype='object')

New behavior:

.. ipython:: python

   c = pd.cut(range(4), bins=2)
   c
   c.categories

Furthermore, this allows one to bin *other* data with these same bins. ``NaN`` represents a missing
value similar to other dtypes.

.. ipython:: python

   pd.cut([0, 3, 1, 1], bins=c.categories)

These can also be used in ``Series`` and ``DataFrame``, and indexed.

.. ipython:: python

   df = pd.DataFrame({'A': range(4),
                      'B': pd.cut([0, 3, 1, 1], bins=c.categories)}
                     ).set_index('B')

Selecting a specific interval

.. ipython:: python

   df.loc[pd.Interval(1.5, 3.0)]

Selecting via a scalar value that is contained in the intervals.

.. ipython:: python

   df.loc[0]
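
To make the interval notation used in this section concrete, here is a minimal pure-Python model of an interval scalar with a ``closed`` side, a containment check, and the ``(left, right]`` string form. The class ``SimpleInterval`` is a hypothetical stand-in written for illustration; it is not the pandas ``Interval`` class and makes no claim about its internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleInterval:
    """Hypothetical stand-in for an interval scalar; not the pandas class."""
    left: float
    right: float
    closed: str = "right"   # one of "left", "right", "both", "neither"

    def __contains__(self, value):
        # A closed side includes its endpoint, an open side excludes it.
        if self.closed in ("left", "both"):
            left_ok = self.left <= value
        else:
            left_ok = self.left < value
        if self.closed in ("right", "both"):
            right_ok = value <= self.right
        else:
            right_ok = value < self.right
        return left_ok and right_ok

    def __str__(self):
        open_br = "[" if self.closed in ("left", "both") else "("
        close_br = "]" if self.closed in ("right", "both") else ")"
        return f"{open_br}{self.left}, {self.right}{close_br}"


iv = SimpleInterval(1.5, 3.0)
text = str(iv)
```

Because the dataclass is frozen and compares by value, two instances with the same endpoints and ``closed`` side are equal and hashable, which is the property that lets an interval serve as an index label in this model.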

.. _whatsnew_0200.enhancements.other:

Other Enhancements
1 change: 0 additions & 1 deletion pandas/_libs/hashtable.pyx
@@ -41,7 +41,6 @@ cdef extern from "Python.h":

cdef size_t _INIT_VEC_CAP = 128


include "hashtable_class_helper.pxi"
include "hashtable_func_helper.pxi"
