Include a performance vignette? #188
Since it may be difficult to run benchmarks on CI, I think we need to investigate how other repositories include benchmarks in their articles. |
May be relevant: https://github.com/pola-rs/tpch |
I was playing around a bit with tpch around August last year, but a lot of features were missing in r-polars back then. Some of the test datasets did not compile out of the box and required some manual fixing on my machine. Maybe tpch has become more ergonomic now. I think a tpch benchmark would be the ultimate confirmation that r-polars is on par with py-polars. |
@etiennebacher I'm very positive about this draft :) |
Maybe you can add something ultra simple but fun like this, and then point readers to the DuckDB benchmarks for more serious stuff. The idea would be to give readers an early "Wow!".
library(bench)
library(dplyr)
library(polars)
library(data.table)
N = 1e7
df = data.frame(matrix(runif(25 * N), nrow = N))
df$letters = sample(letters, N, replace = TRUE)
df_dt = data.table(df)
df_pl = pl$DataFrame(df)
# comparison
bench::mark(
"base" = by(df, df$letters, \(x) colMeans(x[, -26])),
"dplyr" = df %>% group_by(letters) %>% summarise_all(mean),
"data.table" = df_dt[, lapply(.SD, mean), by = "letters"],
"polars" = df_pl$groupby("letters")$mean(),
check = FALSE,
relative = TRUE
)
# expression min median `itr/sec` mem_alloc `gc/sec`
# <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 base 16.1 15.6 1 8241. Inf
# 2 dplyr 9.66 9.35 1.67 3826. Inf
# 3 data.table 2.29 2.26 6.92 634. NaN
# 4 polars 1 1 15.5 1 NaN |
I didn't want to spam everyone's inbox (so let me know if others would like to join too), but I invited @vincentarelbundock and @etiennebacher to a private repo that houses (an adapted subset of) benchmarks that I keep for myself on some common data tasks across a variety of languages and libraries. Feel free to poke around etc. I also have timings for larger datasets, but this ends up being a bottleneck for some languages (cough Stata cough). |
AFAIK r-polars does not support streaming yet. See the py-polars handbook for some simple examples. tl;dr just end your query with |
Thanks all, FYI if you're interested feel free to push changes directly to this PR |
I am reluctant to make comparisons with |
Well, one goal here is obviously to convince R users that it's worth it for them to try out
No, I think those should be added too if it's easy. I would also be curious to know what people think the benefits of |
For example, when I tried Acero and duckdb in my environment, I got the following results.
library(bench)
library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
library(duckdb)
#> Loading required package: DBI
#>
#> Attaching package: 'duckdb'
#> The following object is masked from 'package:dplyr':
#>
#> sql
library(polars)
N = 1e7
set.seed(1)
df = data.frame(matrix(runif(25 * N), nrow = N))
df$letters = sample(letters, N, replace = TRUE)
df_dt = data.table(df)
at = as_arrow_table(df)
df_pl = pl$DataFrame(df)
con = DBI::dbConnect(duckdb::duckdb(), ":memory:")
duckdb_register(con, "df", df)
# comparison
bench::mark(
"data.table" = df_dt[, lapply(.SD, mean), by = "letters"],
"Acero" = at |> group_by(letters) |> summarise(across(!letters, ~ mean(.x, na.rm = TRUE))) |> compute(),
"duckdb" = duckdb::sql("FROM df SELECT letters, avg(COLUMNS(x -> NOT suffix(x, 'letters'))) GROUP BY letters", con),
"polars" = df_pl$groupby("letters")$mean(),
check = FALSE,
relative = TRUE
)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 data.table 6.68 6.29 1 35566. NaN
#> 2 Acero 1.39 1.31 4.81 821. Inf
#> 3 duckdb 1.71 1.61 3.89 1 NaN
#> 4 polars 1 1 6.29 56.2 NaN
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.0 (2023-04-21)
#> os Ubuntu 22.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Etc/UTC
#> date 2023-05-05
#> pandoc 3.1.2 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> arrow * 11.0.0.3 2023-03-08 [1] RSPM
#> assertthat 0.2.1 2019-03-21 [1] RSPM
#> bench * 1.1.2 2021-11-30 [1] RSPM
#> bit 4.0.5 2022-11-15 [1] RSPM (R 4.3.0)
#> bit64 4.0.5 2020-08-30 [1] RSPM (R 4.3.0)
#> cli 3.6.1 2023-03-23 [1] RSPM
#> data.table * 1.14.8 2023-02-17 [1] RSPM
#> DBI * 1.1.3 2022-06-18 [1] RSPM (R 4.3.0)
#> digest 0.6.31 2022-12-11 [1] RSPM
#> dplyr * 1.1.2 2023-04-20 [1] RSPM (R 4.3.0)
#> duckdb * 0.8.0 2023-05-05 [1] https://duckdb.r-universe.dev (R 4.3.0)
#> evaluate 0.20 2023-01-17 [1] RSPM
#> fansi 1.0.4 2023-01-22 [1] RSPM
#> fastmap 1.1.1 2023-02-24 [1] RSPM
#> fs 1.6.2 2023-04-25 [1] RSPM (R 4.3.0)
#> generics 0.1.3 2022-07-05 [1] RSPM (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] RSPM
#> htmltools 0.5.5 2023-03-23 [1] RSPM
#> knitr 1.42 2023-01-25 [1] RSPM
#> lifecycle 1.0.3 2022-10-07 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] RSPM
#> pillar 1.9.0 2023-03-22 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> polars * 0.6.0 2023-05-04 [1] local
#> profmem 0.6.0 2020-12-13 [1] RSPM
#> purrr 1.0.1 2023-01-10 [1] RSPM
#> R.cache 0.16.0 2022-07-21 [1] RSPM
#> R.methodsS3 1.8.2 2022-06-13 [1] RSPM
#> R.oo 1.25.0 2022-06-12 [1] RSPM
#> R.utils 2.12.2 2022-11-11 [1] RSPM
#> R6 2.5.1 2021-08-19 [1] RSPM
#> reprex 2.0.2 2022-08-17 [1] RSPM
#> rlang 1.1.0 2023-03-14 [1] RSPM
#> rmarkdown 2.21 2023-03-26 [1] RSPM
#> sessioninfo 1.2.2 2021-12-06 [1] RSPM
#> styler 1.9.1 2023-03-04 [1] RSPM
#> tibble 3.2.1 2023-03-20 [1] RSPM
#> tidyselect 1.2.0 2022-10-10 [1] RSPM (R 4.3.0)
#> utf8 1.2.3 2023-01-31 [1] RSPM
#> vctrs 0.6.2 2023-04-19 [1] RSPM
#> withr 2.5.0 2022-03-03 [1] RSPM
#> xfun 0.39 2023-04-20 [1] RSPM
#> yaml 2.3.7 2023-01-23 [1] RSPM
#>
#> [1] /usr/local/lib/R/site-library
#> [2] /usr/local/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Created on 2023-05-05 with reprex v2.0.2 |
Here is what I consider to be the rough pros and cons:
Acero
Pros
Cons
DuckDB
Pros
Cons
Polars
Pros
Cons
|
I really don't think it is a good idea to make a bare dplyr comparison here, because in my opinion dplyr users can achieve higher speeds by simply switching to data.table (via dtplyr) or Acero (via arrow) or duckdb (via dbplyr) as a backend. |
Why don't we just add a dtplyr example, then? One other benefit of polars, I think, is the parallelism in syntax across R, Python, and Rust, which facilitates multilingual projects and teams. |
I would recommend using asymptotic benchmarks, which means measuring time and memory for data size N values increasing on a log scale. I have a package, https://github.com/tdhock/atime, that makes this easy. These benchmarks are much more convincing than a single N, which can be misleading: which N is relevant to test may depend on the particular problem and hardware, so it is much more informative to see results for several N. |
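To illustrate the idea, here is a minimal base-R sketch (not the atime API) that times one grouped-mean expression at several N values spaced on a log scale. The helper name `time_grouped_mean`, the column count, and the chosen sizes are hypothetical and only for illustration; atime or bench::press automate this and also record memory use.

```r
# Hypothetical helper: time a grouped column mean for one data size N.
time_grouped_mean <- function(N, n_cols = 5) {
  set.seed(1)
  df <- data.frame(matrix(runif(n_cols * N), nrow = N))
  df$letters <- sample(letters, N, replace = TRUE)
  # Elapsed wall-clock time of a base-R grouped mean over all numeric columns.
  elapsed <- system.time(
    aggregate(. ~ letters, data = df, FUN = mean)
  )[["elapsed"]]
  data.frame(N = N, seconds = elapsed)
}

# Data sizes increasing on a log scale: 1e3, 1e4, 1e5.
sizes <- 10^(3:5)
timings <- do.call(rbind, lapply(sizes, time_grouped_mean))
print(timings)
```

Plotting `seconds` against `N` on log-log axes for each method being compared would then show how the timings scale, rather than a single snapshot at one N.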
I don't think we should include benchmarks with To me, the objective of this vignette is not to compare I think it's more important here to focus on how one can use @tdhock thanks for the link, I think |
yes, bench::press does something similar, and that is discussed in the Related work section of the atime README, https://github.com/tdhock/atime#related-work |
One other (tbc?) pro for Polars is that multithreading automatically works on macOS. I might be missing something about the R-universe build process, but enabling multithreading in other high-performance R libraries can be a bit of a pain. That's because these are C/C++ based and the OpenMP toolchain has to be installed and then linked to manually. (Basically, you have to specify a bunch of, e.g., C++ flags in your Makevars and then build from source instead of installing binaries.) |
Do you mention that SIMD is disabled in the R-universe builds? |
We do not :/ but should. I know of no benchmarks yet describing the performance difference. |
Can someone review the content of this vignette? I included the |
Thanks for working on this!
Sorry for the late review.
Thanks for the review @eitsupi |
@sorhawell @eitsupi @vincentarelbundock @grantmcdermott I started a vignette on polars performance, not to compare it to other packages but rather to present a few "good practices" (?) to use its full capabilities:
polars' built-in functions rather than passing R functions to a DataFrame
I'm still a beginner in polars and in data wrangling with larger-than-RAM data, so there might be some things to correct/complete here. Also, I mostly wrote this for me to have some explanations somewhere and because it might be useful if I end up teaching this, but it doesn't have to be included as a vignette.
What do you think about this?
Close #176