ENH: Intervalindex

closes #7640
closes #8625

reprise of #8707

Author: Jeff Reback <jeff@reback.net>
Author: Stephan Hoyer <shoyer@climate.com>

Closes #15309 from jreback/intervalindex and squashes the following commits:

11ab1e1 [Jeff Reback] merge conflicts
834df76 [Jeff Reback] more docs
fbc1cf8 [Jeff Reback] doc example and bug
7577335 [Jeff Reback] fixup on merge of changes in algorithms.py
3a3e02e [Jeff Reback] sorting example
4333937 [Jeff Reback] api-types test fixing
f0e3ad2 [Jeff Reback] pep
b2d26eb [Jeff Reback] more docs
e5f8082 [Jeff Reback] allow pd.cut to take an IntervalIndex for bins
4a5ebea [Jeff Reback] more tests & fixes for non-unique / overlaps rename _is_contained_in -> contains add sorting test
340c98b [Jeff Reback] CLN/COMPAT: IntervalIndex
74162aa [Stephan Hoyer] API/ENH: IntervalIndex
pandas-dev · Apr 14, 2017 · 9991579
1 parent 3fde134
Showing 54 changed files with 4,195 additions and 504 deletions.
diff --git a/asv_bench/benchmarks/indexing.py b/asv_bench/benchmarks/indexing.py
@@ -226,6 +226,26 @@ def time_is_monotonic(self):
         self.miint.is_monotonic
 
 
+class IntervalIndexing(object):
+    goal_time = 0.2
+
+    def setup(self):
+        self.monotonic = Series(np.arange(1000000),
+                                index=IntervalIndex.from_breaks(np.arange(1000001)))
+
+    def time_getitem_scalar(self):
+        self.monotonic[80000]
+
+    def time_loc_scalar(self):
+        self.monotonic.loc[80000]
+
+    def time_getitem_list(self):
+        self.monotonic[80000:]
+
+    def time_loc_list(self):
+        self.monotonic.loc[80000:]
+
+
 class PanelIndexing(object):
     goal_time = 0.2
 

diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
@@ -850,6 +850,39 @@ Of course if you need integer based selection, then use ``iloc``
 
    dfir.iloc[0:5]
 
+.. _indexing.intervallindex:
+
+IntervalIndex
+~~~~~~~~~~~~~
+
+.. versionadded:: 0.20.0
+
+.. warning::
+
+   These indexing behaviors are provisional and may change in a future version of pandas.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': [1, 2, 3, 4]},
+                      index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
+   df
+
+Label based indexing via ``.loc`` along the edges of an interval works as you would expect,
+selecting that particular interval.
+
+.. ipython:: python
+
+   df.loc[2]
+   df.loc[[2, 3]]
+
+If you select a label *contained* within an interval, this will also select the interval.
+
+.. ipython:: python
+
+   df.loc[2.5]
+   df.loc[[2.5, 3.5]]
+
+
 Miscellaneous indexing FAQ
 --------------------------
 

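The lookup rule documented in the hunk above can be modeled in plain Python. This is a minimal sketch, not the pandas implementation: it assumes sorted, non-overlapping, right-closed intervals built from break points (the default produced by ``from_breaks``), and ``locate`` is a hypothetical helper name introduced only for illustration.

```python
from bisect import bisect_left


def locate(breaks, point):
    """Return the position of the right-closed interval
    (breaks[i], breaks[i + 1]] that contains `point`.

    Raises KeyError when no interval contains the point.
    """
    # bisect_left finds the first break >= point; the interval that
    # ends at that break is the one containing the point.
    pos = bisect_left(breaks, point)
    if pos == 0 or pos == len(breaks):
        raise KeyError(point)
    return pos - 1


breaks = [0, 1, 2, 3, 4]  # intervals (0, 1], (1, 2], (2, 3], (3, 4]
values = [1, 2, 3, 4]     # column "A" from the documentation example

on_edge = values[locate(breaks, 2)]     # 2 is the right edge of (1, 2]
interior = values[locate(breaks, 2.5)]  # 2.5 lies inside (2, 3]
print(on_edge, interior)
```

Under these assumptions a label on a right edge and a label strictly inside an interval both resolve to exactly one interval, while the leftmost break (0 here) belongs to no interval because the intervals are open on the left.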
diff --git a/doc/source/api.rst b/doc/source/api.rst
@@ -1405,6 +1405,27 @@ Categorical Components
    CategoricalIndex.as_ordered
    CategoricalIndex.as_unordered
 
+.. _api.intervalindex:
+
+IntervalIndex
+-------------
+
+.. autosummary::
+   :toctree: generated/
+
+   IntervalIndex
+
+IntervalIndex Components
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: generated/
+
+   IntervalIndex.from_arrays
+   IntervalIndex.from_tuples
+   IntervalIndex.from_breaks
+   IntervalIndex.from_intervals
+
 .. _api.multiindex:
 
 MultiIndex

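As a rough illustration of how three of the constructors listed above relate, the sketch below builds the same three intervals from break points, from separate left/right arrays, and from tuples. It assumes a pandas version that provides these class methods with the default ``closed='right'``; the variable names are illustrative only. ``from_intervals`` is left out of the sketch.

```python
import pandas as pd

# Three ways to describe the intervals (0, 1], (1, 2], (2, 3].
by_breaks = pd.IntervalIndex.from_breaks([0, 1, 2, 3])
by_arrays = pd.IntervalIndex.from_arrays([0, 1, 2], [1, 2, 3])
by_tuples = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (2, 3)])

# All three should describe the same index.
same = by_breaks.equals(by_arrays) and by_breaks.equals(by_tuples)
print(by_breaks)
print(same)
```

``from_breaks`` is the most compact form when the intervals are contiguous; ``from_arrays`` and ``from_tuples`` can also describe intervals with gaps between them.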
diff --git a/doc/source/reshaping.rst b/doc/source/reshaping.rst
@@ -517,7 +517,17 @@ Alternatively we can specify custom bin-edges:
 
 .. ipython:: python
 
-   pd.cut(ages, bins=[0, 18, 35, 70])
+   c = pd.cut(ages, bins=[0, 18, 35, 70])
+   c
+
+.. versionadded:: 0.20.0
+
+If the ``bins`` keyword is an ``IntervalIndex``, then its intervals will be
+used to bin the passed data.
+
+.. ipython:: python
+
+   pd.cut([25, 20, 50], bins=c.categories)
 
 
 .. _reshaping.dummies:

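A small self-contained sketch of binning with an ``IntervalIndex``, assuming a pandas version that accepts one for ``bins``. The bin edges match the documentation example; the extra value 80 is an addition made here to show what happens to data outside every interval.

```python
import pandas as pd

# Same edges as the documentation example: (0, 18], (18, 35], (35, 70].
bins = pd.IntervalIndex.from_breaks([0, 18, 35, 70])

# 80 falls outside every interval, so it is binned as missing (NaN).
binned = pd.cut([25, 20, 50, 80], bins=bins)
print(binned)

# Integer codes index into the categories; -1 marks a missing value.
codes = binned.codes.tolist()
print(codes)
```

Passing the intervals explicitly, rather than a number of bins, keeps the bin boundaries identical across different data sets.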
diff --git a/doc/source/whatsnew/v0.20.0.txt b/doc/source/whatsnew/v0.20.0.txt
@@ -13,6 +13,7 @@ Highlights include:
 - ``Panel`` has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_panel>`
 - Improved user API when accessing levels in ``.groupby()``, see :ref:`here <whatsnew_0200.enhancements.groupby_access>`
 - Improved support for UInt64 dtypes, see :ref:`here <whatsnew_0200.enhancements.uint64_support>`
+- Addition of an ``IntervalIndex`` and ``Interval`` scalar type, see :ref:`here <whatsnew_0200.enhancements.intervalindex>`
 - A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
 - Window Binary Corr/Cov operations return a MultiIndexed ``DataFrame`` rather than a ``Panel``, as ``Panel`` is now deprecated, see :ref:`here <whatsnew_0200.api_breaking.rolling_pairwise>`
 - Support for S3 handling now uses ``s3fs``, see :ref:`here <whatsnew_0200.api_breaking.s3>`
@@ -314,6 +315,63 @@ To convert a ``SparseDataFrame`` back to sparse SciPy matrix in COO format, you
 
    sdf.to_coo()
 
+.. _whatsnew_0200.enhancements.intervalindex:
+
+IntervalIndex
+^^^^^^^^^^^^^
+
+pandas has gained an ``IntervalIndex`` with its own dtype, ``interval``, as well as the ``Interval`` scalar type. These allow first-class support for interval
+notation, specifically as a return type for the categories in ``pd.cut`` and ``pd.qcut``. The ``IntervalIndex`` also supports some unique indexing behavior; see the
+:ref:`docs <indexing.intervallindex>`. (:issue:`7640`, :issue:`8625`)
+
+Previous behavior:
+
+.. code-block:: ipython
+
+   In [2]: pd.cut(range(3), 2)
+   Out[2]:
+   [(-0.002, 1], (-0.002, 1], (1, 2]]
+   Categories (2, object): [(-0.002, 1] < (1, 2]]
+
+   # the returned categories are strings, representing Intervals
+   In [3]: pd.cut(range(3), 2).categories
+   Out[3]: Index(['(-0.002, 1]', '(1, 2]'], dtype='object')
+
+New behavior:
+
+.. ipython:: python
+
+   c = pd.cut(range(4), bins=2)
+   c
+   c.categories
+
+Furthermore, this allows one to bin *other* data with these same bins. ``NaN`` represents a missing
+value similar to other dtypes.
+
+.. ipython:: python
+
+   pd.cut([0, 3, 1, 1], bins=c.categories)
+
+These can also be used in a ``Series`` or ``DataFrame``, and indexed.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': range(4),
+                      'B': pd.cut([0, 3, 1, 1], bins=c.categories)}
+                    ).set_index('B')
+
+Selecting a specific interval.
+
+.. ipython:: python
+
+   df.loc[pd.Interval(1.5, 3.0)]
+
+Selecting via a scalar value that is contained in the intervals.
+
+.. ipython:: python
+
+   df.loc[0]
+
 .. _whatsnew_0200.enhancements.other:
 
 Other Enhancements

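The ``Interval`` scalar mentioned in the release note can be explored on its own. A minimal sketch, assuming the default ``closed='right'``; the endpoints are taken from the ``df.loc[pd.Interval(1.5, 3.0)]`` example above.

```python
import pandas as pd

iv = pd.Interval(1.5, 3.0)

# By default the interval is closed on the right: (1.5, 3.0].
print(iv.closed)

# Membership respects the open left and closed right endpoints.
inside = 2 in iv
right_edge = 3.0 in iv
left_edge = 1.5 in iv
print(inside, right_edge, left_edge)
```

This membership rule is what makes scalar lookups such as ``df.loc[0]`` meaningful: a scalar selects the rows whose interval contains it.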
diff --git a/pandas/_libs/hashtable.pyx b/pandas/_libs/hashtable.pyx
@@ -41,7 +41,6 @@ cdef extern from "Python.h":
 
 cdef size_t _INIT_VEC_CAP = 128
 
-
 include "hashtable_class_helper.pxi"
 include "hashtable_func_helper.pxi"