API: Public data for Series and Index: .array and .to_numpy() (pandas…
TomAugspurger authored and Pingviinituutti committed Feb 28, 2019
1 parent 3a6991c commit 87a5385
Showing 22 changed files with 501 additions and 55 deletions.
31 changes: 29 additions & 2 deletions doc/source/10min.rst
Original file line number Diff line number Diff line change
@@ -113,13 +113,40 @@ Here is how to view the top and bottom rows of the frame:
df.head()
df.tail(3)
Display the index, columns, and the underlying NumPy data:
Display the index and columns:

.. ipython:: python
df.index
df.columns
df.values
:meth:`DataFrame.to_numpy` gives a NumPy representation of the underlying data.
Note that this can be an expensive operation when your :class:`DataFrame` has
columns with different data types, which comes down to a fundamental difference
between pandas and NumPy: **NumPy arrays have one dtype for the entire array,
while pandas DataFrames have one dtype per column**. When you call
:meth:`DataFrame.to_numpy`, pandas will find the NumPy dtype that can hold *all*
of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.
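As a small illustration (a sketch, not part of the original docs): mixed numeric dtypes upcast to a common numeric dtype, while adding a string column forces the result to ``object``.

```python
import pandas as pd

# Mixed numeric dtypes upcast to the common float64 dtype.
numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(numeric.to_numpy().dtype)  # float64

# A string column forces everything to object,
# boxing each value as a Python object.
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(mixed.to_numpy().dtype)  # object
```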

For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.

.. ipython:: python
df.to_numpy()
For ``df2``, the :class:`DataFrame` with multiple dtypes,
:meth:`DataFrame.to_numpy` is relatively expensive.

.. ipython:: python
df2.to_numpy()
.. note::

:meth:`DataFrame.to_numpy` does *not* include the index or column
labels in the output.
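A quick sketch of what the note means (illustrative, not from the original diff): the returned ndarray carries only the values, so the labels must be kept separately if you need them.

```python
import pandas as pd

s = pd.Series([10, 20], index=["a", "b"])
arr = s.to_numpy()
print(arr)  # [10 20] -- the "a"/"b" labels are gone
```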

:func:`~DataFrame.describe` shows a quick statistic summary of your data:

2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -188,7 +188,7 @@ highly performant. If you want to see only the used levels, you can use the

.. ipython:: python
df[['foo', 'qux']].columns.values
df[['foo', 'qux']].columns.to_numpy()
# for a specific level
df[['foo', 'qux']].columns.get_level_values(0)
104 changes: 80 additions & 24 deletions doc/source/basics.rst
@@ -46,8 +46,8 @@ of elements to display is five, but you may pass a custom number.
.. _basics.attrs:

Attributes and the raw ndarray(s)
---------------------------------
Attributes and Underlying Data
------------------------------

pandas objects have a number of attributes enabling you to access the metadata

@@ -65,14 +65,43 @@ Note, **these attributes can be safely assigned to**!
df.columns = [x.lower() for x in df.columns]
df
To get the actual data inside a data structure, one need only access the
**values** property:
Pandas objects (:class:`Index`, :class:`Series`, :class:`DataFrame`) can be
thought of as containers for arrays, which hold the actual data and do the
actual computation. For many types, the underlying array is a
:class:`numpy.ndarray`. However, pandas and 3rd party libraries may *extend*
NumPy's type system to add support for custom arrays
(see :ref:`basics.dtypes`).

To get the actual data inside a :class:`Index` or :class:`Series`, use
the **array** property

.. ipython:: python
s.array
s.index.array
Depending on the data type (see :ref:`basics.dtypes`), :attr:`~Series.array` will
be either a NumPy array or an :ref:`ExtensionArray <extending.extension-type>`.
If you know you need a NumPy array, use :meth:`~Series.to_numpy`
or :meth:`numpy.asarray`.

.. ipython:: python
s.values
df.values
wp.values
s.to_numpy()
np.asarray(s)
For Series and Indexes backed by NumPy arrays (like we have here), this will
be the same as :attr:`~Series.array`. When the Series or Index is backed by
a :class:`~pandas.api.extension.ExtensionArray`, :meth:`~Series.to_numpy`
may involve copying data and coercing values.
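To make the distinction concrete, here is a small sketch using a categorical Series (an extension type); this example is illustrative and not part of the original diff:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# .array returns the actual ExtensionArray backing the Series...
print(type(s.array))       # a Categorical, not an ndarray

# ...while .to_numpy() always materializes a NumPy array, here by
# boxing each category value as a Python object.
print(s.to_numpy().dtype)  # object
```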

Getting the "raw data" inside a :class:`DataFrame` is possibly a bit more
complex. When your ``DataFrame`` only has a single data type for all the
columns, :attr:`DataFrame.to_numpy` will return the underlying data:

.. ipython:: python
df.to_numpy()
If a DataFrame or Panel contains homogeneously-typed data, the ndarray can
actually be modified in-place, and the changes will be reflected in the data
@@ -87,6 +116,21 @@ unlike the axis labels, cannot be assigned to.
strings are involved, the result will be of object dtype. If there are only
floats and integers, the resulting array will be of float dtype.

In the past, pandas recommended :attr:`Series.values` or :attr:`DataFrame.values`
for extracting the data from a Series or DataFrame. You'll still find references
to these in old code bases and online. Going forward, we recommend avoiding
``.values`` and using ``.array`` or ``.to_numpy()``. ``.values`` has the following
drawbacks:

1. When your Series contains an :ref:`extension type <extending.extension-type>`, it's
unclear whether :attr:`Series.values` returns a NumPy array or the extension array.
:attr:`Series.array` will always return the actual array backing the Series,
while :meth:`Series.to_numpy` will always return a NumPy array.
2. When your DataFrame contains a mixture of data types, :attr:`DataFrame.values` may
involve copying data and coercing values to a common dtype, a relatively expensive
operation. :meth:`DataFrame.to_numpy`, being a method, makes it clearer that the
returned NumPy array may not be a view on the same data in the DataFrame.
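A short illustration of drawback 2 (a sketch, not from the original docs): with mixed dtypes the returned array is a new, coerced copy, so writing to it does not touch the DataFrame.

```python
import pandas as pd

# For mixed dtypes, the returned array is a coerced copy:
mixed = pd.DataFrame({"ints": [1, 2], "floats": [1.5, 2.5]})
arr = mixed.to_numpy()
arr[0, 0] = 99.0
print(mixed.loc[0, "ints"])  # still 1 -- writing to the copy leaves the frame alone
```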

.. _basics.accelerate:

Accelerated operations
@@ -541,7 +585,7 @@ will exclude NAs on Series input by default:
.. ipython:: python
np.mean(df['one'])
np.mean(df['one'].values)
np.mean(df['one'].to_numpy())
:meth:`Series.nunique` will return the number of unique non-NA values in a
Series:
@@ -839,7 +883,7 @@ Series operation on each column or row:
tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
index=pd.date_range('1/1/2000', periods=10))
tsdf.values[3:7] = np.nan
tsdf.iloc[3:7] = np.nan
.. ipython:: python
@@ -1875,17 +1919,29 @@ dtypes
------

For the most part, pandas uses NumPy arrays and dtypes for Series or individual
columns of a DataFrame. The main types allowed in pandas objects are ``float``,
``int``, ``bool``, and ``datetime64[ns]`` (note that NumPy does not support
timezone-aware datetimes).

In addition to NumPy's types, pandas :ref:`extends <extending.extension-types>`
NumPy's type-system for a few cases.

* :ref:`Categorical <categorical>`
* :ref:`Datetime with Timezone <timeseries.timezone_series>`
* :ref:`Period <timeseries.periods>`
* :ref:`Interval <indexing.intervallindex>`
columns of a DataFrame. NumPy provides support for ``float``,
``int``, ``bool``, ``timedelta64[ns]`` and ``datetime64[ns]`` (note that NumPy
does not support timezone-aware datetimes).

Pandas and third-party libraries *extend* NumPy's type system in a few places.
This section describes the extensions pandas has made internally.
See :ref:`extending.extension-types` for how to write your own extension that
works with pandas. See :ref:`ecosystem.extensions` for a list of third-party
libraries that have implemented an extension.

The following table lists all of pandas extension types. See the respective
documentation sections for more on each type.

=================== ========================= ================== ============================= =============================
Kind of Data        Data Type                 Scalar             Array                         Documentation
=================== ========================= ================== ============================= =============================
tz-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :class:`arrays.DatetimeArray` :ref:`timeseries.timezone`
Categorical         :class:`CategoricalDtype` (none)             :class:`Categorical`          :ref:`categorical`
period (time spans) :class:`PeriodDtype`      :class:`Period`    :class:`arrays.PeriodArray`   :ref:`timeseries.periods`
sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse`
intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na`
=================== ========================= ================== ============================= =============================
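A quick way to see a couple of these types in action (an illustrative sketch): :func:`pandas.array` accepts an extension dtype and returns the matching array class.

```python
import pandas as pd

# Nullable integers: missing values are supported without casting to float.
ints = pd.array([1, 2, None], dtype="Int64")
print(type(ints).__name__)  # IntegerArray
print(ints.dtype)           # Int64

# Categoricals:
cats = pd.array(["a", "b", "a"], dtype="category")
print(type(cats).__name__)  # Categorical
```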

Pandas uses the ``object`` dtype for storing strings.

@@ -1983,13 +2039,13 @@ from the current type (e.g. ``int`` to ``float``).
df3
df3.dtypes
The ``values`` attribute on a DataFrame return the *lower-common-denominator* of the dtypes, meaning
:meth:`DataFrame.to_numpy` will return the *lower-common-denominator* of the dtypes, meaning
the dtype that can accommodate **ALL** of the types in the resulting homogeneous dtyped NumPy array. This can
force some *upcasting*.

.. ipython:: python
df3.values.dtype
df3.to_numpy().dtype
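A self-contained sketch of this upcasting (``df3`` above comes from earlier in the docs, so a fresh frame is used here):

```python
import pandas as pd

df = pd.DataFrame({"ints": [1, 2, 3], "floats": [0.5, 1.5, 2.5]})
arr = df.to_numpy()

# int64 and float64 upcast to float64, the narrowest dtype that can
# hold both columns without losing information.
print(arr.dtype)  # float64
```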
astype
~~~~~~
@@ -2211,11 +2267,11 @@ dtypes:
'float64': np.arange(4.0, 7.0),
'bool1': [True, False, True],
'bool2': [False, True, False],
'dates': pd.date_range('now', periods=3).values,
'dates': pd.date_range('now', periods=3),
'category': pd.Series(list("ABC")).astype('category')})
df['tdeltas'] = df.dates.diff()
df['uint64'] = np.arange(3, 6).astype('u8')
df['other_dates'] = pd.date_range('20130101', periods=3).values
df['other_dates'] = pd.date_range('20130101', periods=3)
df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')
df
4 changes: 2 additions & 2 deletions doc/source/categorical.rst
@@ -178,7 +178,7 @@ are consistent among all columns.

To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
categories for each column, the ``categories`` parameter can be determined programmatically by
``categories = pd.unique(df.values.ravel())``.
``categories = pd.unique(df.to_numpy().ravel())``.
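As a sketch of that recipe (the ``df`` here is a made-up example, not one from the original docs):

```python
import pandas as pd

df = pd.DataFrame({"A": list("abc"), "B": list("bcd")})

# Collect every label in the whole frame, then use it as the
# shared category set for each column.
categories = pd.unique(df.to_numpy().ravel())
df_cat = df.apply(lambda col: col.astype(
    pd.api.types.CategoricalDtype(categories=categories)))
print(df_cat["A"].cat.categories.tolist())  # ['a', 'b', 'c', 'd']
```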

If you already have ``codes`` and ``categories``, you can use the
:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
@@ -955,7 +955,7 @@ Use ``.astype`` or ``union_categoricals`` to get ``category`` result.
pd.concat([s1, s3])
pd.concat([s1, s3]).astype('category')
union_categoricals([s1.values, s3.values])
union_categoricals([s1.array, s3.array])
Following table summarizes the results of ``Categoricals`` related concatenations.
40 changes: 39 additions & 1 deletion doc/source/dsintro.rst
@@ -137,7 +137,43 @@ However, operations such as slicing will also slice the index.
s[[4, 3, 1]]
np.exp(s)
We will address array-based indexing in a separate :ref:`section <indexing>`.
.. note::

We will address array-based indexing like ``s[[4, 3, 1]]``
in a separate :ref:`section <indexing>`.

Like a NumPy array, a pandas Series has a :attr:`~Series.dtype`.

.. ipython:: python
s.dtype
This is often a NumPy dtype. However, pandas and 3rd-party libraries
extend NumPy's type system in a few places, in which case the dtype would
be a :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`basics.dtypes`
for more.

If you need the actual array backing a ``Series``, use :attr:`Series.array`.

.. ipython:: python
s.array
Again, this is often a NumPy array, but may instead be a
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
Accessing the array can be useful when you need to do some operation without the
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).
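A small sketch (not part of the original diff) of why dropping the index matters: Series arithmetic aligns on labels, while array arithmetic is purely positional.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Series arithmetic aligns on the index labels first:
aligned = s1 + s2
print(aligned["a"])  # 31: the two values labeled "a" were matched up

# Operating on the underlying arrays is purely positional:
positional = s1.to_numpy() + s2.to_numpy()
print(positional)    # [11 22 33]
```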

While Series is ndarray-like, if you need an *actual* ndarray, then use
:meth:`Series.to_numpy`.

.. ipython:: python
s.to_numpy()
Even if the Series is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
:meth:`Series.to_numpy` will return a NumPy ndarray.

Series is dict-like
~~~~~~~~~~~~~~~~~~~
@@ -617,6 +653,8 @@ slicing, see the :ref:`section on indexing <indexing>`. We will address the
fundamentals of reindexing / conforming to new sets of labels in the
:ref:`section on reindexing <basics.reindexing>`.

.. _dsintro.alignment:

Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8 changes: 5 additions & 3 deletions doc/source/enhancingperf.rst
@@ -221,7 +221,7 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra

You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
to a Cython function. Instead pass the actual ``ndarray`` using the
``.values`` attribute of the ``Series``. The reason is that the Cython
:meth:`Series.to_numpy` method. The reason is that the Cython
definition is specific to an ndarray and not the passed ``Series``.

So, do not do this:
@@ -230,11 +230,13 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra
apply_integrate_f(df['a'], df['b'], df['N'])
But rather, use ``.values`` to get the underlying ``ndarray``:
But rather, use :meth:`Series.to_numpy` to get the underlying ``ndarray``:

.. code-block:: python
apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
apply_integrate_f(df['a'].to_numpy(),
df['b'].to_numpy(),
df['N'].to_numpy())
.. note::

2 changes: 1 addition & 1 deletion doc/source/extending.rst
@@ -186,7 +186,7 @@ Instead, you should detect these cases and return ``NotImplemented``.
When pandas encounters an operation like ``op(Series, ExtensionArray)``, pandas
will

1. unbox the array from the ``Series`` (roughly ``Series.values``)
1. unbox the array from the ``Series`` (``Series.array``)
2. call ``result = op(values, ExtensionArray)``
3. re-box the result in a ``Series``
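Schematically, the three steps look like the following simplified sketch (this is an illustration, not pandas' actual dispatch implementation):

```python
import operator

import pandas as pd

def dispatch(op, series, other):
    # 1. unbox the array from the Series
    values = series.array
    # 2. perform the operation on the raw arrays
    result = op(values, other)
    # 3. re-box the result, re-attaching the original index
    return pd.Series(result, index=series.index)

s = pd.Series([1, 2, 3])
out = dispatch(operator.add, s, s.array)
print(out.tolist())
```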

2 changes: 1 addition & 1 deletion doc/source/indexing.rst
@@ -190,7 +190,7 @@ columns.

.. ipython:: python
df.loc[:,['B', 'A']] = df[['A', 'B']].values
df.loc[:,['B', 'A']] = df[['A', 'B']].to_numpy()
df[['A', 'B']]
2 changes: 1 addition & 1 deletion doc/source/missing_data.rst
@@ -678,7 +678,7 @@ Replacing more than one value is possible by passing a list.

.. ipython:: python
df00 = df.values[0, 0]
df00 = df.iloc[0, 0]
df.replace([1.5, df00], [np.nan, 'a'])
df[1].dtype
14 changes: 7 additions & 7 deletions doc/source/reshaping.rst
@@ -27,12 +27,12 @@ Reshaping by pivoting DataFrame objects
tm.N = 3
def unpivot(frame):
N, K = frame.shape
data = {'value': frame.values.ravel('F'),
'variable': np.asarray(frame.columns).repeat(N),
'date': np.tile(np.asarray(frame.index), K)}
columns = ['date', 'variable', 'value']
return pd.DataFrame(data, columns=columns)
N, K = frame.shape
data = {'value': frame.to_numpy().ravel('F'),
'variable': np.asarray(frame.columns).repeat(N),
'date': np.tile(np.asarray(frame.index), K)}
columns = ['date', 'variable', 'value']
return pd.DataFrame(data, columns=columns)
df = unpivot(tm.makeTimeDataFrame())
@@ -54,7 +54,7 @@ For the curious here is how the above ``DataFrame`` was created:
def unpivot(frame):
N, K = frame.shape
data = {'value': frame.values.ravel('F'),
data = {'value': frame.to_numpy().ravel('F'),
'variable': np.asarray(frame.columns).repeat(N),
'date': np.tile(np.asarray(frame.index), K)}
return pd.DataFrame(data, columns=['date', 'variable', 'value'])
4 changes: 2 additions & 2 deletions doc/source/text.rst
@@ -317,8 +317,8 @@ All one-dimensional list-likes can be combined in a list-like container (includi
s
u
s.str.cat([u.values,
u.index.astype(str).values], na_rep='-')
s.str.cat([u.array,
u.index.astype(str).array], na_rep='-')
All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:
