Wainberg/ryp

ryp: R inside Python

ryp is a minimalist, powerful Python library for:

  • running R code inside Python
  • quickly transferring huge datasets between Python (NumPy/pandas/polars) and R without writing to disk
  • interactively working in both languages at the same time

ryp is an alternative to the widely used rpy2 library. Compared to rpy2, ryp provides:

  • increased stability
  • a much simpler API, with less of a learning curve
  • interactive printouts of R variables that match what you'd see in R
  • a full-featured R terminal inside Python for interactive work
  • inline plotting in Jupyter notebooks (requires the svglite R package)
  • much faster data conversion with Arrow (also provided by rpy2-arrow)
  • support for every NumPy, pandas and polars data type representable in base R, no matter how obscure
  • support for sparse arrays/matrices
  • recursive conversion of containers like R lists, Python tuples/lists/dicts, and S3/S4/R6 objects
  • full Windows support

ryp does the opposite of the reticulate R library, which runs Python inside R.

Installation

Install ryp via pip:

pip install ryp

conda:

conda install ryp

or mamba:

mamba install ryp

Or, install the development version via pip:

pip install git+https://github.com/Wainberg/ryp

ryp's only mandatory dependencies are:

  • Python 3.7+
  • R
  • the cffi Python package
  • the pyarrow Python package, which includes NumPy as a dependency
  • the arrow R library

R and the arrow R library are automatically installed when installing ryp via conda or mamba, but not via pip. ryp uses the R installation pointed to by the environment variable R_HOME, or if R_HOME is not defined or not a directory, by running R RHOME through subprocess.run().
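
The lookup order described above can be sketched in a few lines of standard-library Python. This is only an illustration of the rule, not ryp's actual code; the function name find_r_home is hypothetical.

```python
import os
import subprocess


def find_r_home():
    """Return the R home directory, or None if it cannot be determined."""
    r_home = os.environ.get('R_HOME')
    # Use R_HOME only when it is set and points at an existing directory.
    if r_home and os.path.isdir(r_home):
        return r_home
    try:
        # `R RHOME` prints the R home directory to standard output.
        result = subprocess.run(['R', 'RHOME'], capture_output=True,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # R is not on the PATH, or the command failed
    return result.stdout.strip() or None


print(find_r_home())
```

If this prints None, neither R_HOME nor the PATH leads to an R installation, and R probably needs to be installed or R_HOME set before using ryp.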

ryp also has several optional dependencies, which are not installed automatically with pip, conda or mamba. These are:

  • pandas, for format='pandas'
  • polars, for format='polars'
  • SciPy and the Matrix R library, for sparse matrices
  • the svglite R library, for inline plotting in Jupyter notebooks

Functionality

ryp consists of just four functions:

  1. r(R_code) runs a string of R code. r() with no arguments opens up an R terminal inside your Python terminal for interactive work.
  2. to_r(python_object, R_variable_name) converts a Python object into an R object named R_variable_name.
  3. to_py(R_statement) converts the R object produced by evaluating R_statement to Python. R_statement may be a single variable name, or a more complex code snippet that evaluates to the R object you'd like to convert.
  4. options(), for getting or setting ryp's configuration options.

r()

r(R_code: str = ...) -> None

r(R_code) runs a string of R code inside ryp's R interpreter, which is embedded inside Python. It can contain multiple statements separated by semicolons or newlines (e.g. within a triple-quoted Python string). It returns None; use to_py() instead if you would like to convert the result back to Python.

r() with no arguments opens up an R terminal inside your Python terminal for interactive debugging. Press Ctrl + D to exit back to the Python terminal. R variables defined from Python will be available in the R terminal, and variables defined in the R terminal will be available from Python once you exit:

>>> from ryp import r
>>> r('a = 1')
>>> r()
> a
[1] 1
> b <- 2
>
>>> r('b')
[1] 2

Note that the default value for R_code is the special sentinel value ... (Ellipsis) rather than None. This stops users from inadvertently opening the terminal when passing a variable that is supposed to be a string but is unexpectedly None.
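
The sentinel pattern can be illustrated without R at all. In the sketch below, run_r is a hypothetical stand-in for r() that returns a description instead of running anything; raising TypeError for non-string input is an assumption, not a statement about what ryp does.

```python
def run_r(R_code=...):
    """Hypothetical stand-in for r(): describes what would happen."""
    if R_code is ...:
        # Only a call with no argument at all reaches this branch.
        return 'open interactive terminal'
    if not isinstance(R_code, str):
        # An accidental None lands here instead of opening a terminal.
        raise TypeError(f'R_code must be a str, not {type(R_code).__name__}')
    return f'run: {R_code}'


print(run_r())         # open interactive terminal
print(run_r('a = 1'))  # run: a = 1
try:
    run_r(None)
except TypeError as error:
    print(error)       # R_code must be a str, not NoneType
```

Had the default been None, run_r(None) would have been indistinguishable from run_r().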

to_r()

to_r(python_object: object, R_variable_name: str, *, 
     format: Literal['keep', 'matrix', 'data.frame'] | None = None,
     rownames: object = None, colnames: object = None) -> None

to_r(python_object, R_variable_name) converts python_object to R, adding it to R's global namespace (globalenv) as a variable named R_variable_name.

If python_object is a container (list, tuple, or dict), to_r() recursively converts each element and creates an R named list (if python_object is a dict) or unnamed list (if python_object is a list or tuple).
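
To make the recursion concrete, here is a minimal pure-Python sketch that describes the R structure a nested container would map to, following the scalar rules in the conversion tables. The function r_structure is hypothetical and handles only a few scalar types; it does not call ryp or R.

```python
INT32_MAX = 2_147_483_647


def r_structure(obj):
    """Describe the R structure a Python object would map to."""
    if isinstance(obj, dict):
        if not all(isinstance(key, str) for key in obj):
            raise TypeError('all dict keys must be strings')
        return {'named list': {key: r_structure(value)
                               for key, value in obj.items()}}
    if isinstance(obj, (list, tuple)):
        return {'unnamed list': [r_structure(item) for item in obj]}
    if obj is None:
        return 'NULL'
    # bool is tested before int because bool is a subclass of int.
    if isinstance(obj, bool):
        return 'length-1 logical vector'
    if isinstance(obj, int):
        kind = 'integer' if abs(obj) <= INT32_MAX else 'bit64::integer64'
        return f'length-1 {kind} vector'
    if isinstance(obj, float):
        return 'length-1 numeric vector'
    if isinstance(obj, str):
        return 'length-1 character vector'
    raise TypeError(f'type not covered by this sketch: {type(obj).__name__}')


print(r_structure({'a': [1, 2.5], 'b': ('x', True)}))
```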

The format argument

By default (format='keep'), ryp converts polars and pandas DataFrames (and pandas MultiIndexes) into R data frames, and 2D NumPy arrays into R matrices. Specify format='matrix' to convert everything (even DataFrames) to R matrices (in which case all DataFrame columns must have the same type), and format='data.frame' to convert everything (even 2D NumPy arrays) to R data frames.

format must be None unless python_object is a DataFrame, MultiIndex or 2D NumPy array – or unless python_object is a list, tuple, or dict, in which case the format will apply recursively to any DataFrames, MultiIndexes or 2D NumPy arrays it contains.
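
The three format values amount to a small decision table. The sketch below encodes it in plain Python; r_target and its kind labels ('dataframe' for DataFrames and MultiIndexes, 'array2d' for 2D NumPy arrays) are hypothetical names chosen for illustration.

```python
def r_target(kind, format=None, default='keep'):
    """Return the R type ('matrix' or 'data.frame') for a 2D Python object."""
    if kind not in ('dataframe', 'array2d'):
        raise ValueError(f'unknown kind: {kind!r}')
    # format=None falls back to the configured default ('keep' out of the box).
    format = default if format is None else format
    if format == 'keep':
        return 'data.frame' if kind == 'dataframe' else 'matrix'
    if format in ('matrix', 'data.frame'):
        return format  # forces the same R type for every 2D input
    raise ValueError(f'unknown format: {format!r}')


for kind in ('dataframe', 'array2d'):
    for fmt in ('keep', 'matrix', 'data.frame'):
        print(kind, fmt, '->', r_target(kind, fmt))
```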

The rownames and colnames arguments

Since NumPy arrays, polars DataFrames and Series, and scipy sparse arrays and matrices lack row and column names, you can specify these separately via the rownames and/or colnames arguments, and they will be added to the converted R object. rownames and colnames can be lists, tuples, string Series, or categorical Series with string categories, and will be automatically converted to R character vectors.

rownames and colnames must match the length or shape[1], respectively, of the object being converted. The one exception is that rownames of any length may be added to a 0 × 0 polars DataFrame, since polars does not have the concept of an N × 0 DataFrame for nonzero N. (Dropping all the columns of a polars DataFrame always results in a 0 × 0 DataFrame, even if the original DataFrame had more than 0 rows.)

Because Python bool, int, float, and str convert to length-1 R vectors that support names, you can pass length-1 rownames when converting objects of these types. You can also pass rownames and/or colnames when python_object is a list, tuple, or dict, in which case row and column names will only be added to elements that support them. All elements that support rownames must have the same length as the rownames, and similarly for the colnames.

rownames cannot be specified if python_object is a pandas Series or DataFrame (since they already have rownames, i.e. an index), or bytes/bytearray (since these convert to raw vectors, which lack rownames). colnames cannot be specified unless python_object is a multidimensional NumPy array or scipy sparse array or matrix, or something that might contain one (list, tuple, or dict).
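
The length rules for rownames and colnames can be summarized as a small check. This is a sketch under stated assumptions: check_names and its empty_polars flag are hypothetical, and the object is represented only by its (rows, columns) shape.

```python
def check_names(shape, rownames=None, colnames=None, empty_polars=False):
    """Raise ValueError if the names do not fit an object of the given shape."""
    n_rows, n_cols = shape
    if rownames is not None and len(rownames) != n_rows:
        # Exception: a 0 x 0 polars DataFrame accepts rownames of any length.
        if not (empty_polars and shape == (0, 0)):
            raise ValueError(f'expected {n_rows} rownames, got {len(rownames)}')
    if colnames is not None and len(colnames) != n_cols:
        raise ValueError(f'expected {n_cols} colnames, got {len(colnames)}')


check_names((2, 3), rownames=['r1', 'r2'], colnames=['a', 'b', 'c'])  # fine
check_names((0, 0), rownames=['r1', 'r2'], empty_polars=True)         # fine
try:
    check_names((2, 3), colnames=['a', 'b'])
except ValueError as error:
    print(error)  # expected 3 colnames, got 2
```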

to_py()

to_py(R_statement: str, *,
      format: Literal['polars', 'pandas', 'pandas-pyarrow', 'numpy'] |
              dict[Literal['vector', 'matrix', 'data.frame'],
                   Literal['polars', 'pandas', 'pandas-pyarrow',
                           'numpy']] | None = None,
      index: str | Literal[False] | None = None,
      squeeze: bool | None = None) -> Any

to_py(R_statement) runs a single statement of R code (which can be as simple as a single variable name) and converts the resulting R object to Python.

If the object is a list/S3 object, S4 object, or environment/R6 object, it recursively converts each attribute/slot/field and returns a Python dict (or list, if the object is an unnamed list). For R6 objects, only public fields will be converted.

The format argument

By default, or when format='polars', R vectors will be converted to polars Series, and R data frames and matrices will be converted to polars DataFrames. You can change this by setting the format argument to 'pandas', 'pandas-pyarrow' (like 'pandas', but converting to pyarrow dtypes wherever possible) or 'numpy'. (You can also change the default format, e.g. with options(to_py_format='pandas').)

For finer-grained control, you can set format for only certain R variable types by specifying a dictionary with 'vector', 'matrix', and/or 'data.frame' as keys and 'polars', 'pandas', 'pandas-pyarrow' and/or 'numpy' as values.

format must be None when R_statement evaluates to NULL, when it evaluates to an array of 3 or more dimensions (these are always converted to NumPy arrays), or when the final result would be a Python scalar (see squeeze below).
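
One way to picture how a string or partial dictionary combines with the defaults is the merge below. It is a sketch, not ryp's implementation: resolve_format is a hypothetical helper, and skipping None values follows the rule stated for options(to_py_format=...).

```python
R_KINDS = ('vector', 'matrix', 'data.frame')
PY_FORMATS = ('polars', 'pandas', 'pandas-pyarrow', 'numpy')


def resolve_format(format=None, defaults=None):
    """Return a complete {R kind: Python format} mapping."""
    resolved = dict(defaults) if defaults else dict.fromkeys(R_KINDS, 'polars')
    if format is None:
        return resolved
    if isinstance(format, str):
        # A single string applies to vectors, matrices and data frames alike.
        format = dict.fromkeys(R_KINDS, format)
    for kind, value in format.items():
        if kind not in R_KINDS:
            raise ValueError(f'unknown R kind: {kind!r}')
        if value is None:
            continue  # leave this kind at its default
        if value not in PY_FORMATS:
            raise ValueError(f'unknown format: {value!r}')
        resolved[kind] = value
    return resolved


print(resolve_format({'vector': 'numpy'}))
# {'vector': 'numpy', 'matrix': 'polars', 'data.frame': 'polars'}
```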

The index argument

By default, the R object's names or rownames will become the index (for pandas) or the first column (for polars) of the output Python object, named 'index'. Set the index argument to a different string to change this name, or set index=False to not convert the names/rownames.

Note that for polars, the output will be a two-column DataFrame (not a Series!) when the input is an R vector, unless index=False.

When the output is a NumPy array, names and rownames will always be discarded, since numeric NumPy arrays cannot store string indexes except with the inefficient dtype=object.

index must be None when format='numpy', or when the final result would be a Python scalar (see squeeze below).

The squeeze argument

By default, length-1 R vectors, matrices and arrays will be converted to Python scalars instead of Python arrays, Series or DataFrames. Set squeeze=False to disable this special case. (R data frames are never converted to Python scalars even if squeeze=True.)

squeeze must be None unless the R object is a vector, matrix or array (raw vectors don't count, because they always convert to Python scalars).
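
The squeeze rule itself is simple enough to state in code. In this sketch, maybe_squeeze is a hypothetical helper and a plain Python list stands in for an R vector:

```python
def maybe_squeeze(values, squeeze=True):
    """Return the lone element of a length-1 list when squeeze is on."""
    if squeeze and len(values) == 1:
        return values[0]
    return values


print(maybe_squeeze([3.14]))                 # 3.14
print(maybe_squeeze([3.14], squeeze=False))  # [3.14]
print(maybe_squeeze([1, 2, 3]))              # [1, 2, 3]
```

Setting squeeze=False is useful when downstream code always expects an array-like result, regardless of length.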

options()

options(*, to_r_format=None, to_py_format=None, index=None, squeeze=None, 
        plot_width: int | float | None = None, 
        plot_height: int | float | None = None) -> None

options() gets or sets ryp's configuration settings:

  • to_r_format: the default value for the format parameter in to_r(); must be 'keep' (the default), 'matrix', or 'data.frame'.
  • to_py_format: the default value for the format parameter in to_py(); must be 'polars' (the default), 'pandas', 'pandas-pyarrow', 'numpy', or a dictionary with one of those four Python formats and/or None as values and 'vector', 'matrix' and/or 'data.frame' as keys. If certain keys are missing or have None as the format, their format is left unchanged.
  • index: the default value for the index parameter in to_py(); must be a string (default: 'index') or False.
  • squeeze: the default value for the squeeze parameter in to_py(); must be True (the default) or False.
  • plot_width: the width, in inches, of inline plots in Jupyter notebooks; must be a positive number. Defaults to 6.4 inches, to match Matplotlib's default.
  • plot_height: the height, in inches, of inline plots in Jupyter notebooks; must be a positive number. Defaults to 4.8 inches, to match Matplotlib's default.

For instance, to set pandas as the default format in to_py(), run options(to_py_format='pandas'). This leaves the other options unchanged.

options() with no arguments returns the current configuration options as a dictionary, with keys to_r_format, to_py_format, index, squeeze, plot_width, and plot_height.
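
Because options() both reports and sets the configuration, a temporary override can be built on top of it. The context manager below is a sketch: temporary_options is a hypothetical helper, and fake_options is a stand-in with the documented defaults so that the example runs without R. With ryp installed, you would pass ryp's own options function instead.

```python
from contextlib import contextmanager


@contextmanager
def temporary_options(options, **overrides):
    """Apply overrides inside a `with` block, then restore the old settings."""
    saved = options()      # no arguments: current settings as a dict
    options(**overrides)
    try:
        yield
    finally:
        options(**saved)   # put every setting back, even after an error


# Stand-in for ryp's options(), holding the documented default values.
_settings = {'to_r_format': 'keep', 'to_py_format': 'polars', 'index': 'index',
             'squeeze': True, 'plot_width': 6.4, 'plot_height': 4.8}


def fake_options(**changes):
    if not changes:
        return dict(_settings)
    # None means "leave unchanged", mirroring the None defaults in the signature.
    _settings.update({k: v for k, v in changes.items() if v is not None})


with temporary_options(fake_options, to_py_format='pandas'):
    print(fake_options()['to_py_format'])  # pandas
print(fake_options()['to_py_format'])      # polars
```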

For additional customization, users can specify ryp-specific settings in their .Rprofile:

if ("ryp" %in% commandArgs()) {
    # Custom settings for running R within ryp
} else {
    # Custom settings for native R
}

Conversion rules

Python to R (to_r())

| Python | R |
| --- | --- |
| None | NULL (if scalar) or NA (if inside NumPy, pandas or polars) |
| nan | NaN (if scalar or inside polars) or NA (if inside NumPy or pandas) |
| pd.NA | NA |
| pd.NaT, np.datetime64('NaT'), np.timedelta64('NaT') | NA |
| bool | length-1 logical vector |
| int | length-1 integer (if abs(x) <= 2_147_483_647) or bit64::integer64 vector |
| float | length-1 numeric vector |
| str | length-1 character vector |
| complex | length-1 complex vector |
| datetime.date | length-1 Date vector |
| datetime.datetime | length-1 POSIXct vector |
| datetime.timedelta | length-1 difftime(units='secs') vector |
| datetime.time (tzinfo must be None) | length-1 hms::hms vector |
| bytes, bytearray | raw vector |
| list, tuple | unnamed list |
| dict (all keys must be strings) | named list |
| polars Series, pandas Series*, pandas Index | vector |
| polars DataFrame, pandas DataFrame*, pandas MultiIndex | matrix (if format == 'matrix'; all columns must have same data type) or data.frame |
| 1D NumPy array | vector |
| 2D NumPy array | data.frame (if format == 'data.frame') or matrix |
| ≥ 3D NumPy array | array |
| 0D NumPy array (e.g. np.array(1)), NumPy generic (e.g. np.int32(1)) | length-1 vector |
| csr_array, csr_matrix | dgRMatrix (if int or float), lgRMatrix (if boolean), -- (if complex) |
| csc_array, csc_matrix | dgCMatrix (if int or float), lgCMatrix (if boolean), -- (if complex) |
| coo_array, coo_matrix | dgTMatrix (if int or float), lgTMatrix (if boolean), -- (if complex) |

NumPy data types

| Python | R |
| --- | --- |
| bool | logical |
| int8, uint8, int16, uint16, int32 | integer |
| uint32, uint64 | integer (if x <= 2_147_483_647) or numeric |
| int64 | integer (if abs(x) <= 2_147_483_647) or bit64::integer64 |
| float16, float32, float64, float128 | numeric (note: float128 loses precision) |
| complex64, complex128 | complex |
| bytes (e.g. 'S1') | -- |
| str/unicode (e.g. 'U1') | character |
| datetime64 | POSIXct |
| timedelta64 | difftime(units='secs') |
| void (unstructured) | raw |
| void (structured) | -- |
| object | depends on the contents |
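
The integer rows of this table can be read as a range check against R's 32-bit integer limit. The sketch below assumes the check is applied to all values of an array at once, which the table does not state explicitly; r_integer_type is a hypothetical helper that takes plain Python ints and a dtype name.

```python
INT32_MAX = 2_147_483_647


def r_integer_type(values, dtype):
    """Return the R type an integer array of the given NumPy dtype maps to."""
    if dtype in ('int8', 'uint8', 'int16', 'uint16', 'int32'):
        return 'integer'  # always fits in an R integer
    # Assumption: one out-of-range value decides the type of the whole array.
    fits = all(abs(value) <= INT32_MAX for value in values)
    if dtype == 'int64':
        return 'integer' if fits else 'bit64::integer64'
    if dtype in ('uint32', 'uint64'):
        return 'integer' if fits else 'numeric'
    raise ValueError(f'not an integer dtype: {dtype!r}')


print(r_integer_type([1, 2, 3], 'int64'))   # integer
print(r_integer_type([2**31], 'int64'))     # bit64::integer64
print(r_integer_type([2**32 - 1], 'uint32'))  # numeric
```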

pandas-specific data types

| Python | R |
| --- | --- |
| BooleanDtype | logical |
| Int8Dtype, UInt8Dtype, Int16Dtype, UInt16Dtype, Int32Dtype | integer |
| UInt32Dtype, UInt64Dtype | integer (if x <= 2_147_483_647) or numeric |
| Int64Dtype | integer (if abs(x) <= 2_147_483_647) or bit64::integer64 |
| Float32Dtype, Float64Dtype | numeric |
| StringDtype | character |
| CategoricalDtype(ordered=False) | unordered factor |
| CategoricalDtype(ordered=True) | ordered factor |
| DatetimeTZDtype, PeriodDtype | POSIXct |
| IntervalDtype, SparseDtype | -- |

pandas Arrow data types (pd.ArrowDtype)

| Python | R |
| --- | --- |
| pa.bool_ | logical |
| pa.int8, pa.uint8, pa.int16, pa.uint16, pa.int32 | integer |
| pa.uint32, pa.uint64 | integer (if x <= 2_147_483_647) or numeric |
| pa.int64 | integer (if abs(x) <= 2_147_483_647) or bit64::integer64 |
| pa.float32, pa.float64 | numeric |
| pa.string, pa.large_string | character |
| pa.date32 | Date |
| pa.date64, pa.timestamp | POSIXct |
| pa.duration | difftime(units='secs') |
| pa.time32, pa.time64 | hms::hms |
| pa.dictionary(any integer type, pa.string(), ordered=0) | unordered factor |
| pa.dictionary(any integer type, pa.string(), ordered=1) | ordered factor |
| pa.null() | vctrs::unspecified |

Polars data types

| Python | R |
| --- | --- |
| Boolean | logical |
| Int8, UInt8, Int16, UInt16, Int32 | integer |
| UInt32, UInt64 | integer (if x <= 2_147_483_647) or numeric |
| Int64 | integer (if abs(x) <= 2_147_483_647) or bit64::integer64 |
| Float32, Float64 | numeric |
| Date | Date |
| Datetime | POSIXct |
| Duration | difftime(units='secs') |
| Time | hms::hms |
| String | character |
| Categorical | unordered factor |
| Enum | ordered factor |
| Object | depends on the contents |
| Null | vctrs::unspecified |
| Binary, Decimal, List, Array | -- |

Notes

* For pandas Series and DataFrames, string indexes (and categorical indexes where the categories are strings) will be automatically converted to names/rownames. The default index (pd.RangeIndex(len(python_object))) will be ignored. All other indexes are disallowed.

Because R does not support POSIXct and Date matrices or arrays, dates and datetimes cannot be converted to R matrices or arrays.

For dtype=object and dtype=pl.Object, the output R type depends on the contents, e.g. 'character' if all elements are strings. Some additional notes on ryp's handling of object data types:

  • None, np.nan, pd.NA, pd.NaT, np.datetime64('NaT'), and np.timedelta64('NaT') are all treated as missing values – even for polars, where np.nan is ordinarily treated as a floating-point number rather than a missing value.
  • Length-0 and all-missing data will be converted to the vctrs::unspecified R type (vctrs is part of the tidyverse).
  • If the elements are objects with a mix of types (or datetimes with a mix of time zones), Arrow will generally cause the conversion to fail, though mixes of related types (e.g. int and float) will be automatically cast to the common supertype and succeed.
  • Conversion will also fail if the contents are objects that are not representable as R vector elements. This includes bytes/bytearray (which are only representable in R when scalar, as a raw vector) and Python containers (list, tuple, and dict).
  • pandas Timedelta objects will be rounded down to the nearest microsecond, following the behavior of Arrow.

R to Python (to_py())

| R | Python |
| --- | --- |
| NULL | None |
| NA | None (if scalar or format='polars'), None/nan/pd.NA/pd.NaT/np.datetime64('NaT', 'us')/np.timedelta64('NaT', 'ns')/etc. (if format='numpy', 'pandas' or 'pandas-pyarrow') |
| NaN | nan |
| length-1 vector, matrix or array, squeeze == True | scalar |
| vector or 1D array, format == 'numpy' | 1D NumPy array |
| vector or 1D array, format == 'pandas' or format == 'pandas-pyarrow' | pandas Series |
| vector or 1D array, format == 'polars' | polars Series (if index=False) or two-column DataFrame |
| matrix or data.frame, format == 'numpy' | 2D NumPy array |
| matrix or data.frame, format == 'pandas' or format == 'pandas-pyarrow' | pandas DataFrame |
| matrix or data.frame, format == 'polars' | polars DataFrame |
| ≥ 3D array | NumPy array |
| unnamed list | list |
| named list, S3 object, S4 object, environment, R6 object | dict |
| dgRMatrix | csr_array(dtype='int32') |
| dgCMatrix | csc_array(dtype='int32') |
| dgTMatrix | coo_array(dtype='int32') |
| lgRMatrix, ngRMatrix | csr_array(dtype=bool) |
| lgCMatrix, ngCMatrix | csc_array(dtype=bool) |
| lgTMatrix, ngTMatrix | coo_array(dtype=bool) |
| formula (~) | -- |

Data types

| R | Python scalar | NumPy | pandas | pandas-pyarrow | polars |
| --- | --- | --- | --- | --- | --- |
| logical | bool | bool | bool | ArrowDtype(pa.bool_()) | Boolean |
| integer | int | int32 | int32 | ArrowDtype(pa.int32()) | Int32 |
| bit64::integer64 | int | int64 | int64 | ArrowDtype(pa.int64()) | Int64 |
| numeric | float | float | float | ArrowDtype(pa.float64()) | Float64 |
| character | str | object (with str elements) | object (with str elements) | ArrowDtype(pa.string()) | String |
| complex | complex | complex128 | complex128 | complex128 | -- |
| raw | bytearray | -- | -- | -- | -- |
| unordered factor | str | object (with str elements) | CategoricalDtype(ordered=False) | ArrowDtype(pa.dictionary(pa.int8(), pa.string(), ordered=0)) | Categorical |
| ordered factor | str | object (with str elements) | CategoricalDtype(ordered=True) | ArrowDtype(pa.dictionary(pa.int8(), pa.string(), ordered=1)) | Enum |
| POSIXct without time zone | datetime.datetime* | datetime64[us]* | datetime64[us]* | ArrowDtype(pa.timestamp('us'))* | Datetime('us')* |
| POSIXct with time zone | datetime.datetime* | datetime64[us]* (time zone discarded) | DatetimeTZDtype('us', time_zone)* | ArrowDtype(pa.timestamp('us', time_zone))* | Datetime('us', time_zone)* |
| POSIXlt | dict of scalars | dict of NumPy arrays | dict of pandas Series | dict of pandas Series | dict of polars Series |
| Date | datetime.date | datetime64[D] | datetime64[ms] | ArrowDtype(pa.date32('day')) | Date |
| difftime | datetime.timedelta* | timedelta64[ns] | timedelta64[ns] | ArrowDtype(pa.duration('ns')) | Duration(time_unit='ns') |
| hms::hms | datetime.time* | object (with datetime.time elements)* | object (with datetime.time elements)* | ArrowDtype(pa.time64('ns'))* | Time |
| vctrs::unspecified | None | object (with None elements) | object (with None elements) | ArrowDtype(pa.null()) | Null |

* Due to the limitations of conversion with Arrow, POSIXct and hms::hms values are rounded down to the nearest microsecond when converting to Python, except for hms::hms when converting to polars. difftime values are also rounded down to the nearest microsecond, but only when converting to scalar datetime.timedelta values (which cannot represent nanoseconds).
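
To illustrate what rounding down to the nearest microsecond means for a nanosecond-resolution value, here is a small sketch. It interprets "rounded down" as flooring (toward negative infinity), which is an assumption for negative values; both helper names are hypothetical.

```python
from datetime import timedelta


def floor_to_microsecond(nanoseconds):
    """Drop any sub-microsecond remainder from an integer nanosecond count."""
    return nanoseconds // 1_000 * 1_000


def to_timedelta(nanoseconds):
    """Build a datetime.timedelta, which has microsecond resolution at best."""
    return timedelta(microseconds=nanoseconds // 1_000)


print(floor_to_microsecond(1_234_567_891))  # 1234567000
print(to_timedelta(1_500))                  # 0:00:00.000001
```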

Examples

  1. Apply R's scale() function to a pandas DataFrame:
import pandas as pd
from ryp import r, to_py, to_r
data = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 3, 4]})
to_r(data, 'data')
r('data')
#   a b
# 1 1 1
# 2 2 3
# 3 3 4

r('data <- scale(data)')  # scale the R data.frame; scale() returns a matrix
scaled_data = to_py('data', format='pandas')  # convert the scaled R matrix to Python
scaled_data
#      a         b
# 0 -1.0 -1.091089
# 1  0.0  0.218218
# 2  1.0  0.872872

Note: we could have just written to_py('scale(data)') instead of r('data <- scale(data)') followed by to_py('data'). We could also have run options(to_py_format='pandas') at the top, to avoid having to specify format='pandas' in each to_py() call.
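
As a sanity check on the numbers above: by default, R's scale() centers each column on its mean and divides by its sample standard deviation. The pure-Python sketch below reproduces both columns without R; scale_column is a hypothetical helper.

```python
from statistics import mean, stdev


def scale_column(values):
    """Center on the mean, then divide by the sample standard deviation."""
    center = mean(values)
    spread = stdev(values)  # n - 1 denominator, like R's sd()
    return [(value - center) / spread for value in values]


print([round(value, 6) for value in scale_column([1, 2, 3])])
# [-1.0, 0.0, 1.0]
print([round(value, 6) for value in scale_column([1, 3, 4])])
# [-1.091089, 0.218218, 0.872872]
```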

  2. Run a linear model on a polars DataFrame:
import polars as pl
from ryp import r, to_py, to_r
data = pl.DataFrame({'y': [7, 1, 2, 3, 6], 'x': [5, 2, 3, 2, 5]})
to_r(data, 'data')
r('model <- lm(y ~ x, data=data)')
coef = to_py('summary(model)$coefficients', index='variable')
p_value = coef.filter(variable='x').select('Pr(>|t|)')[0, 0]
p_value
# 0.02831035772841884

  3. Recursive conversion, showcasing all the keyword arguments of to_r() and to_py():
import numpy as np
from ryp import r, to_py, to_r
arrays = {'ints': np.array([[1, 2], [3, 4]]),
          'floats': np.array([[0.5, 1.5], [2.5, 3.5]])}
to_r(arrays, 'arrays', format='data.frame',
     rownames=['row1', 'row2'], colnames=['col1', 'col2'])
r('arrays')
# $ints
#      col1 col2
# row1    1    2
# row2    3    4
# 
# $floats
#      col1 col2
# row1  0.5  1.5
# row2  2.5  3.5
arrays = to_py('arrays', format='pandas', index='foo')
arrays['ints']
#       col1  col2
# foo
# row1     1     2
# row2     3     4
arrays['floats']
#       col1  col2
# foo
# row1   0.5   1.5
# row2   2.5   3.5