API: Public data for Series and Index: .array and .to_numpy() (pandas…
TomAugspurger authored and Pingviinituutti committed Feb 28, 2019
1 parent 3a6991c commit 87a5385
Showing 22 changed files with 501 additions and 55 deletions.
31 changes: 29 additions & 2 deletions doc/source/10min.rst
Original file line number Diff line number Diff line change
@@ -113,13 +113,40 @@ Here is how to view the top and bottom rows of the frame:
df.head()
df.tail(3)
Display the index, columns, and the underlying NumPy data:
Display the index and columns:

.. ipython:: python
df.index
df.columns
df.values
:meth:`DataFrame.to_numpy` gives a NumPy representation of the underlying data.
Note that this can be an expensive operation when your :class:`DataFrame` has
columns with different data types, which comes down to a fundamental difference
between pandas and NumPy: **NumPy arrays have one dtype for the entire array,
while pandas DataFrames have one dtype per column**. When you call
:meth:`DataFrame.to_numpy`, pandas will find the NumPy dtype that can hold *all*
of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.
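As a small illustration (a sketch, not part of the original docs): mixed numeric dtypes upcast to a common numeric dtype, while adding a string column forces the result to ``object``.

```python
import pandas as pd

# Mixed numeric dtypes upcast to the common float64 dtype.
numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(numeric.to_numpy().dtype)  # float64

# A string column forces everything to object,
# boxing each value as a Python object.
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(mixed.to_numpy().dtype)  # object
```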

For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.

.. ipython:: python
df.to_numpy()
For ``df2``, the :class:`DataFrame` with multiple dtypes,
:meth:`DataFrame.to_numpy` is relatively expensive.

.. ipython:: python
df2.to_numpy()
.. note::

:meth:`DataFrame.to_numpy` does *not* include the index or column
labels in the output.
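A quick sketch of what the note means (illustrative, not from the original diff): the returned ndarray carries only the values, so the labels must be kept separately if you need them.

```python
import pandas as pd

s = pd.Series([10, 20], index=["a", "b"])
arr = s.to_numpy()
print(arr)  # [10 20] -- the "a"/"b" labels are gone
```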

:func:`~DataFrame.describe` shows a quick statistic summary of your data:

2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -188,7 +188,7 @@ highly performant. If you want to see only the used levels, you can use the

.. ipython:: python
df[['foo', 'qux']].columns.values
df[['foo', 'qux']].columns.to_numpy()
# for a specific level
df[['foo', 'qux']].columns.get_level_values(0)
104 changes: 80 additions & 24 deletions doc/source/basics.rst
@@ -46,8 +46,8 @@ of elements to display is five, but you may pass a custom number.
.. _basics.attrs:

Attributes and the raw ndarray(s)
---------------------------------
Attributes and Underlying Data
------------------------------

pandas objects have a number of attributes enabling you to access the metadata

@@ -65,14 +65,43 @@ Note, **these attributes can be safely assigned to**!
df.columns = [x.lower() for x in df.columns]
df
To get the actual data inside a data structure, one need only access the
**values** property:
Pandas objects (:class:`Index`, :class:`Series`, :class:`DataFrame`) can be
thought of as containers for arrays, which hold the actual data and do the
actual computation. For many types, the underlying array is a
:class:`numpy.ndarray`. However, pandas and 3rd party libraries may *extend*
NumPy's type system to add support for custom arrays
(see :ref:`basics.dtypes`).

To get the actual data inside a :class:`Index` or :class:`Series`, use
the **array** property

.. ipython:: python
s.array
s.index.array
Depending on the data type (see :ref:`basics.dtypes`), :attr:`~Series.array` will
be either a NumPy array or an :ref:`ExtensionArray <extending.extension-type>`.
If you know you need a NumPy array, use :meth:`~Series.to_numpy`
or :meth:`numpy.asarray`.

.. ipython:: python
s.values
df.values
wp.values
s.to_numpy()
np.asarray(s)
For Series and Indexes backed by NumPy arrays (like we have here), this will
be the same as :attr:`~Series.array`. When the Series or Index is backed by
a :class:`~pandas.api.extension.ExtensionArray`, :meth:`~Series.to_numpy`
may involve copying data and coercing values.
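To make the distinction concrete, here is a small sketch using a categorical Series (an extension type); this example is illustrative and not part of the original diff:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# .array returns the actual ExtensionArray backing the Series...
print(type(s.array))       # a Categorical, not an ndarray

# ...while .to_numpy() always materializes a NumPy array, here by
# boxing each category value as a Python object.
print(s.to_numpy().dtype)  # object
```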

Getting the "raw data" inside a :class:`DataFrame` is possibly a bit more
complex. When your ``DataFrame`` only has a single data type for all the
columns, :attr:`DataFrame.to_numpy` will return the underlying data:

.. ipython:: python
df.to_numpy()
If a DataFrame or Panel contains homogeneously-typed data, the ndarray can
actually be modified in-place, and the changes will be reflected in the data
@@ -87,6 +116,21 @@ unlike the axis labels, cannot be assigned to.
strings are involved, the result will be of object dtype. If there are only
floats and integers, the resulting array will be of float dtype.

In the past, pandas recommended :attr:`Series.values` or :attr:`DataFrame.values`
for extracting the data from a Series or DataFrame. You'll still find references
to these in old code bases and online. Going forward, we recommend avoiding
``.values`` and using ``.array`` or ``.to_numpy()``. ``.values`` has the following
drawbacks:

1. When your Series contains an :ref:`extension type <extending.extension-type>`, it's
unclear whether :attr:`Series.values` returns a NumPy array or the extension array.
:attr:`Series.array` will always return the actual array backing the Series,
while :meth:`Series.to_numpy` will always return a NumPy array.
2. When your DataFrame contains a mixture of data types, :attr:`DataFrame.values` may
involve copying data and coercing values to a common dtype, a relatively expensive
operation. :meth:`DataFrame.to_numpy`, being a method, makes it clearer that the
returned NumPy array may not be a view on the same data in the DataFrame.
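A short illustration of drawback 2 (a sketch, not from the original docs): with mixed dtypes the returned array is a new, coerced copy, so writing to it does not touch the DataFrame.

```python
import pandas as pd

# For mixed dtypes, the returned array is a coerced copy:
mixed = pd.DataFrame({"ints": [1, 2], "floats": [1.5, 2.5]})
arr = mixed.to_numpy()
arr[0, 0] = 99.0
print(mixed.loc[0, "ints"])  # still 1 -- writing to the copy leaves the frame alone
```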

.. _basics.accelerate:

Accelerated operations
@@ -541,7 +585,7 @@ will exclude NAs on Series input by default:
.. ipython:: python
np.mean(df['one'])
np.mean(df['one'].values)
np.mean(df['one'].to_numpy())
:meth:`Series.nunique` will return the number of unique non-NA values in a
Series:
@@ -839,7 +883,7 @@ Series operation on each column or row:
tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
index=pd.date_range('1/1/2000', periods=10))
tsdf.values[3:7] = np.nan
tsdf.iloc[3:7] = np.nan
.. ipython:: python
@@ -1875,17 +1919,29 @@ dtypes
------

For the most part, pandas uses NumPy arrays and dtypes for Series or individual
columns of a DataFrame. The main types allowed in pandas objects are ``float``,
``int``, ``bool``, and ``datetime64[ns]`` (note that NumPy does not support
timezone-aware datetimes).

In addition to NumPy's types, pandas :ref:`extends <extending.extension-types>`
NumPy's type-system for a few cases.

* :ref:`Categorical <categorical>`
* :ref:`Datetime with Timezone <timeseries.timezone_series>`
* :ref:`Period <timeseries.periods>`
* :ref:`Interval <indexing.intervallindex>`
columns of a DataFrame. NumPy provides support for ``float``,
``int``, ``bool``, ``timedelta64[ns]`` and ``datetime64[ns]`` (note that NumPy
does not support timezone-aware datetimes).

Pandas and third-party libraries *extend* NumPy's type system in a few places.
This section describes the extensions pandas has made internally.
See :ref:`extending.extension-types` for how to write your own extension that
works with pandas. See :ref:`ecosystem.extensions` for a list of third-party
libraries that have implemented an extension.

The following table lists all of pandas extension types. See the respective
documentation sections for more on each type.

=================== ========================= ================== ============================= =============================
Kind of Data        Data Type                 Scalar             Array                         Documentation
=================== ========================= ================== ============================= =============================
tz-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :class:`arrays.DatetimeArray` :ref:`timeseries.timezone`
Categorical         :class:`CategoricalDtype` (none)             :class:`Categorical`          :ref:`categorical`
period (time spans) :class:`PeriodDtype`      :class:`Period`    :class:`arrays.PeriodArray`   :ref:`timeseries.periods`
sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse`
intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na`
=================== ========================= ================== ============================= =============================
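A quick way to see a couple of these types in action (an illustrative sketch): :func:`pandas.array` accepts an extension dtype and returns the matching array class.

```python
import pandas as pd

# Nullable integers: missing values are supported without casting to float.
ints = pd.array([1, 2, None], dtype="Int64")
print(type(ints).__name__)  # IntegerArray
print(ints.dtype)           # Int64

# Categoricals:
cats = pd.array(["a", "b", "a"], dtype="category")
print(type(cats).__name__)  # Categorical
```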

Pandas uses the ``object`` dtype for storing strings.

@@ -1983,13 +2039,13 @@ from the current type (e.g. ``int`` to ``float``).
df3
df3.dtypes
The ``values`` attribute on a DataFrame return the *lower-common-denominator* of the dtypes, meaning
:meth:`DataFrame.to_numpy` will return the *lower-common-denominator* of the dtypes, meaning
the dtype that can accommodate **ALL** of the types in the resulting homogeneous dtyped NumPy array. This can
force some *upcasting*.

.. ipython:: python
df3.values.dtype
df3.to_numpy().dtype
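A self-contained sketch of this upcasting (``df3`` above comes from earlier in the docs, so a fresh frame is used here):

```python
import pandas as pd

df = pd.DataFrame({"ints": [1, 2, 3], "floats": [0.5, 1.5, 2.5]})
arr = df.to_numpy()

# int64 and float64 upcast to float64, the narrowest dtype that can
# hold both columns without losing information.
print(arr.dtype)  # float64
```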
astype
~~~~~~
@@ -2211,11 +2267,11 @@ dtypes:
'float64': np.arange(4.0, 7.0),
'bool1': [True, False, True],
'bool2': [False, True, False],
'dates': pd.date_range('now', periods=3).values,
'dates': pd.date_range('now', periods=3),
'category': pd.Series(list("ABC")).astype('category')})
df['tdeltas'] = df.dates.diff()
df['uint64'] = np.arange(3, 6).astype('u8')
df['other_dates'] = pd.date_range('20130101', periods=3).values
df['other_dates'] = pd.date_range('20130101', periods=3)
df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')
df
4 changes: 2 additions & 2 deletions doc/source/categorical.rst
@@ -178,7 +178,7 @@ are consistent among all columns.

To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
categories for each column, the ``categories`` parameter can be determined programmatically by
``categories = pd.unique(df.values.ravel())``.
``categories = pd.unique(df.to_numpy().ravel())``.
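As a sketch of that recipe (the ``df`` here is a made-up example, not one from the original docs):

```python
import pandas as pd

df = pd.DataFrame({"A": list("abc"), "B": list("bcd")})

# Collect every label in the whole frame, then use it as the
# shared category set for each column.
categories = pd.unique(df.to_numpy().ravel())
df_cat = df.apply(lambda col: col.astype(
    pd.api.types.CategoricalDtype(categories=categories)))
print(df_cat["A"].cat.categories.tolist())  # ['a', 'b', 'c', 'd']
```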

If you already have ``codes`` and ``categories``, you can use the
:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
@@ -955,7 +955,7 @@ Use ``.astype`` or ``union_categoricals`` to get ``category`` result.
pd.concat([s1, s3])
pd.concat([s1, s3]).astype('category')
union_categoricals([s1.values, s3.values])
union_categoricals([s1.array, s3.array])
Following table summarizes the results of ``Categoricals`` related concatenations.
40 changes: 39 additions & 1 deletion doc/source/dsintro.rst
@@ -137,7 +137,43 @@ However, operations such as slicing will also slice the index.
s[[4, 3, 1]]
np.exp(s)
We will address array-based indexing in a separate :ref:`section <indexing>`.
.. note::

We will address array-based indexing like ``s[[4, 3, 1]]``
in a separate :ref:`section <indexing>`.

Like a NumPy array, a pandas Series has a :attr:`~Series.dtype`.

.. ipython:: python
s.dtype
This is often a NumPy dtype. However, pandas and 3rd-party libraries
extend NumPy's type system in a few places, in which case the dtype would
be a :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`basics.dtypes`
for more.

If you need the actual array backing a ``Series``, use :attr:`Series.array`.

.. ipython:: python
s.array
Again, this is often a NumPy array, but may instead be a
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
Accessing the array can be useful when you need to do some operation without the
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).
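A small sketch (not part of the original diff) of why dropping the index matters: Series arithmetic aligns on labels, while array arithmetic is purely positional.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Series arithmetic aligns on the index labels first:
aligned = s1 + s2
print(aligned["a"])  # 31: the two values labeled "a" were matched up

# Operating on the underlying arrays is purely positional:
positional = s1.to_numpy() + s2.to_numpy()
print(positional)    # [11 22 33]
```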

While Series is ndarray-like, if you need an *actual* ndarray, then use
:meth:`Series.to_numpy`.

.. ipython:: python
s.to_numpy()
Even if the Series is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
:meth:`Series.to_numpy` will return a NumPy ndarray.

Series is dict-like
~~~~~~~~~~~~~~~~~~~
@@ -617,6 +653,8 @@ slicing, see the :ref:`section on indexing <indexing>`. We will address the
fundamentals of reindexing / conforming to new sets of labels in the
:ref:`section on reindexing <basics.reindexing>`.

.. _dsintro.alignment:

Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8 changes: 5 additions & 3 deletions doc/source/enhancingperf.rst
@@ -221,7 +221,7 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra

You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
to a Cython function. Instead pass the actual ``ndarray`` using the
``.values`` attribute of the ``Series``. The reason is that the Cython
:meth:`Series.to_numpy` method. The reason is that the Cython
definition is specific to an ndarray and not the passed ``Series``.

So, do not do this:
@@ -230,11 +230,13 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra
apply_integrate_f(df['a'], df['b'], df['N'])
But rather, use ``.values`` to get the underlying ``ndarray``:
But rather, use :meth:`Series.to_numpy` to get the underlying ``ndarray``:

.. code-block:: python
apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
apply_integrate_f(df['a'].to_numpy(),
df['b'].to_numpy(),
df['N'].to_numpy())
.. note::

2 changes: 1 addition & 1 deletion doc/source/extending.rst
@@ -186,7 +186,7 @@ Instead, you should detect these cases and return ``NotImplemented``.
When pandas encounters an operation like ``op(Series, ExtensionArray)``, pandas
will

1. unbox the array from the ``Series`` (roughly ``Series.values``)
1. unbox the array from the ``Series`` (``Series.array``)
2. call ``result = op(values, ExtensionArray)``
3. re-box the result in a ``Series``
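Schematically, the three steps look like the following simplified sketch (this is an illustration, not pandas' actual dispatch implementation):

```python
import operator

import pandas as pd

def dispatch(op, series, other):
    # 1. unbox the array from the Series
    values = series.array
    # 2. perform the operation on the raw arrays
    result = op(values, other)
    # 3. re-box the result, re-attaching the original index
    return pd.Series(result, index=series.index)

s = pd.Series([1, 2, 3])
out = dispatch(operator.add, s, s.array)
print(out.tolist())
```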

2 changes: 1 addition & 1 deletion doc/source/indexing.rst
@@ -190,7 +190,7 @@ columns.

.. ipython:: python
df.loc[:,['B', 'A']] = df[['A', 'B']].values
df.loc[:,['B', 'A']] = df[['A', 'B']].to_numpy()
df[['A', 'B']]
2 changes: 1 addition & 1 deletion doc/source/missing_data.rst
@@ -678,7 +678,7 @@ Replacing more than one value is possible by passing a list.

.. ipython:: python
df00 = df.values[0, 0]
df00 = df.iloc[0, 0]
df.replace([1.5, df00], [np.nan, 'a'])
df[1].dtype
14 changes: 7 additions & 7 deletions doc/source/reshaping.rst
@@ -27,12 +27,12 @@ Reshaping by pivoting DataFrame objects
tm.N = 3
def unpivot(frame):
N, K = frame.shape
data = {'value': frame.values.ravel('F'),
'variable': np.asarray(frame.columns).repeat(N),
'date': np.tile(np.asarray(frame.index), K)}
columns = ['date', 'variable', 'value']
return pd.DataFrame(data, columns=columns)
N, K = frame.shape
data = {'value': frame.to_numpy().ravel('F'),
'variable': np.asarray(frame.columns).repeat(N),
'date': np.tile(np.asarray(frame.index), K)}
columns = ['date', 'variable', 'value']
return pd.DataFrame(data, columns=columns)
df = unpivot(tm.makeTimeDataFrame())
@@ -54,7 +54,7 @@ For the curious here is how the above ``DataFrame`` was created:
def unpivot(frame):
N, K = frame.shape
data = {'value': frame.values.ravel('F'),
data = {'value': frame.to_numpy().ravel('F'),
'variable': np.asarray(frame.columns).repeat(N),
'date': np.tile(np.asarray(frame.index), K)}
return pd.DataFrame(data, columns=['date', 'variable', 'value'])
4 changes: 2 additions & 2 deletions doc/source/text.rst
@@ -317,8 +317,8 @@ All one-dimensional list-likes can be combined in a list-like container (includi
s
u
s.str.cat([u.values,
u.index.astype(str).values], na_rep='-')
s.str.cat([u.array,
u.index.astype(str).array], na_rep='-')
All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:
