Include a performance vignette? #188

etiennebacher · 2023-05-01T15:33:13Z

@sorhawell @eitsupi @vincentarelbundock @grantmcdermott I started a vignette on polars performance, not to compare it to other packages but rather to present a few "good practices" (?) to use its full capabilities:

- lazy eval > eager eval because 1) it doesn't load data in memory and 2) it optimizes the queries under the hood before applying them to the data
- better to use polars' built-in functions rather than passing R functions to a DataFrame
- streaming data? It's something I've seen mentioned in the docs but I haven't explored it yet
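
To make the first point concrete, here is a minimal sketch of the same query written eagerly and lazily. It assumes the r-polars API as used elsewhere in this thread (`pl$DataFrame()` accepting a data.frame, as in version 0.6.x); newer releases may need a different constructor. The `mtcars` data and the chosen columns are only illustrative.

```r
library(polars)

# Assumption: pl$DataFrame() accepts an R data.frame, as in r-polars 0.6.x.
df <- pl$DataFrame(mtcars)

# Eager: each step runs immediately on the in-memory DataFrame.
eager <- df$filter(pl$col("cyl") == 4)$select(pl$col("mpg")$mean())

# Lazy: the steps are only recorded; nothing runs until $collect(),
# which gives polars the chance to optimise the whole query first.
lazy <- df$lazy()$filter(pl$col("cyl") == 4)$select(pl$col("mpg")$mean())$collect()

print(as.data.frame(eager))
print(as.data.frame(lazy))
```

Both versions should return the same single value; on a dataset this small the lazy form will not be measurably faster, so the sketch only shows the shape of the two styles.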

I'm still a beginner in polars and in data wrangling with larger-than-RAM data so there might be some things to correct/complete here. Also, I mostly wrote this for me to have some explanations somewhere and because it might be useful if I end up teaching this, but it doesn't have to be included as a vignette.

What do you think about this?

Close #176

eitsupi · 2023-05-01T15:38:51Z

Since it may be difficult to run benchmarks on CI, I think we need to investigate how other repositories include benchmarks in their articles.
I don't know of many examples, but https://github.com/tidyverse/vroom is one.

vincentarelbundock · 2023-05-02T01:46:35Z

May be relevant: https://github.com/pola-rs/tpch

sorhawell · 2023-05-02T11:26:48Z

> May be relevant: https://github.com/pola-rs/tpch

I was playing around a bit with tpch around August last year, but a lot of features were missing in r-polars back then. Some of the test datasets did not compile out of the box and required some hand-held fixing on my machine. Maybe tpch has become more ergonomic now.

I think a tpch benchmark would be the ultimate approval that r-polars is on par with py-polars.

sorhawell · 2023-05-02T12:38:03Z

@etiennebacher I'm very positive about this draft :)

vincentarelbundock · 2023-05-02T18:44:13Z

Maybe you can add something ultra simple but fun like this, and then point readers to the DuckDB benchmarks for more serious stuff. The idea would be to give readers an early "Wow!".

library(bench)
library(dplyr)
library(polars)
library(data.table)

N = 1e7
df = data.frame(matrix(runif(25 * N), nrow = N))
df$letters = sample(letters, N, replace = TRUE)

df_dt = data.table(df)
df_pl = pl$DataFrame(df)

# comparison
bench::mark(
    "base" = by(df, df$letters, \(x) colMeans(x[, -26])),
    "dplyr" = df %>% group_by(letters) %>% summarise_all(mean),
    "data.table" = df_dt[, lapply(.SD, mean), by = "letters"],
    "polars" = df_pl$groupby("letters")$mean(),
    check = FALSE,
    relative = TRUE
)
#   expression   min median `itr/sec` mem_alloc `gc/sec`
#   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
# 1 base       16.1   15.6       1        8241.      Inf
# 2 dplyr       9.66   9.35      1.67     3826.      Inf
# 3 data.table  2.29   2.26      6.92      634.      NaN
# 4 polars      1      1        15.5         1       NaN

grantmcdermott · 2023-05-02T22:27:38Z

I didn't want to spam everyone's inbox (so let me know if others would like to join too), but I invited @vincentarelbundock and @etiennebacher to a private repo that houses (an adapted subset of) benchmarks that I keep for myself on some common data tasks across a variety of languages and libraries. Feel free to poke around etc. I also have timings for larger datasets, but this ends up being a bottleneck for some languages (cough Stata cough).

grantmcdermott · 2023-05-02T22:37:46Z

> I'm still a beginner in polars and in data wrangling with larger-than-RAM data so there might be some things to correct/complete here.

AFAIK r-polars does not support streaming yet. See the py-polars handbook for some simple examples. tl;dr just end your query with collect(streaming=True).

etiennebacher · 2023-05-05T07:09:48Z

Thanks all, FYI if you're interested feel free to push changes directly to this PR

eitsupi · 2023-05-05T07:19:58Z

I am reluctant to make comparisons with dplyr or data.table here.
(Is there any reason to include dplyr and data.table but not Acero (arrow) and duckdb?)

vincentarelbundock · 2023-05-05T07:37:50Z

> I am reluctant to make comparisons with dplyr or data.table here.

Well, one goal here is obviously to convince R users that it's worth it for them to try out polars. One way to do that is to show that it'll be faster than what they currently use, and 98% of R users currently rely on base, dplyr, or data.table. So in pure "marketing" terms, it seems pretty important to have this there. And if we clearly note that these are not extensive rigorous benchmarks, and point to the DuckDB page, then it is "honest" marketing that we can feel good about.

> Is there any reason to include dplyr and data.table but not Acero (arrow) and duckdb?

No, I think those should be added too if it's easy. I would also be curious to know what people think the benefits of polars are over duckdb (assuming the performance is similar.)

eitsupi · 2023-05-05T08:10:02Z

For example, when I tried Acero and duckdb in my environment, I got the following results.
Note that duckdb converts its results to an R data.frame, so the comparison may be unfair: Acero and polars would incur an additional cost to convert their results to a data.frame.

library(bench)
library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
library(duckdb)
#> Loading required package: DBI
#>
#> Attaching package: 'duckdb'
#> The following object is masked from 'package:dplyr':
#>
#>     sql
library(polars)

N = 1e7
set.seed(1)
df = data.frame(matrix(runif(25 * N), nrow = N))
df$letters = sample(letters, N, replace = TRUE)

df_dt = data.table(df)
at = as_arrow_table(df)
df_pl = pl$DataFrame(df)

con = DBI::dbConnect(duckdb::duckdb(), ":memory:")
duckdb_register(con, "df", df)

# comparison
bench::mark(
    "data.table" = df_dt[, lapply(.SD, mean), by = "letters"],
    "Acero" = at |> group_by(letters) |> summarise(across(!letters, ~ mean(.x, na.rm = TRUE))) |> compute(),
    "duckdb" = duckdb::sql("FROM df SELECT letters, avg(COLUMNS(x -> NOT suffix(x, 'letters'))) GROUP BY letters", con),
    "polars" = df_pl$groupby("letters")$mean(),
    check = FALSE,
    relative = TRUE
)
#> # A tibble: 4 × 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 data.table  6.68   6.29      1      35566.       NaN
#> 2 Acero       1.39   1.31      4.81     821.       Inf
#> 3 duckdb      1.71   1.61      3.89       1        NaN
#> 4 polars      1      1         6.29      56.2      NaN

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21)
#>  os       Ubuntu 22.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2023-05-05
#>  pandoc   3.1.2 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 11.0.0.3 2023-03-08 [1] RSPM
#>  assertthat    0.2.1    2019-03-21 [1] RSPM
#>  bench       * 1.1.2    2021-11-30 [1] RSPM
#>  bit           4.0.5    2022-11-15 [1] RSPM (R 4.3.0)
#>  bit64         4.0.5    2020-08-30 [1] RSPM (R 4.3.0)
#>  cli           3.6.1    2023-03-23 [1] RSPM
#>  data.table  * 1.14.8   2023-02-17 [1] RSPM
#>  DBI         * 1.1.3    2022-06-18 [1] RSPM (R 4.3.0)
#>  digest        0.6.31   2022-12-11 [1] RSPM
#>  dplyr       * 1.1.2    2023-04-20 [1] RSPM (R 4.3.0)
#>  duckdb      * 0.8.0    2023-05-05 [1] https://duckdb.r-universe.dev (R 4.3.0)
#>  evaluate      0.20     2023-01-17 [1] RSPM
#>  fansi         1.0.4    2023-01-22 [1] RSPM
#>  fastmap       1.1.1    2023-02-24 [1] RSPM
#>  fs            1.6.2    2023-04-25 [1] RSPM (R 4.3.0)
#>  generics      0.1.3    2022-07-05 [1] RSPM (R 4.3.0)
#>  glue          1.6.2    2022-02-24 [1] RSPM
#>  htmltools     0.5.5    2023-03-23 [1] RSPM
#>  knitr         1.42     2023-01-25 [1] RSPM
#>  lifecycle     1.0.3    2022-10-07 [1] RSPM
#>  magrittr      2.0.3    2022-03-30 [1] RSPM
#>  pillar        1.9.0    2023-03-22 [1] RSPM
#>  pkgconfig     2.0.3    2019-09-22 [1] RSPM
#>  polars      * 0.6.0    2023-05-04 [1] local
#>  profmem       0.6.0    2020-12-13 [1] RSPM
#>  purrr         1.0.1    2023-01-10 [1] RSPM
#>  R.cache       0.16.0   2022-07-21 [1] RSPM
#>  R.methodsS3   1.8.2    2022-06-13 [1] RSPM
#>  R.oo          1.25.0   2022-06-12 [1] RSPM
#>  R.utils       2.12.2   2022-11-11 [1] RSPM
#>  R6            2.5.1    2021-08-19 [1] RSPM
#>  reprex        2.0.2    2022-08-17 [1] RSPM
#>  rlang         1.1.0    2023-03-14 [1] RSPM
#>  rmarkdown     2.21     2023-03-26 [1] RSPM
#>  sessioninfo   1.2.2    2021-12-06 [1] RSPM
#>  styler        1.9.1    2023-03-04 [1] RSPM
#>  tibble        3.2.1    2023-03-20 [1] RSPM
#>  tidyselect    1.2.0    2022-10-10 [1] RSPM (R 4.3.0)
#>  utf8          1.2.3    2023-01-31 [1] RSPM
#>  vctrs         0.6.2    2023-04-19 [1] RSPM
#>  withr         2.5.0    2022-03-03 [1] RSPM
#>  xfun          0.39     2023-04-20 [1] RSPM
#>  yaml          2.3.7    2023-01-23 [1] RSPM
#>
#>  [1] /usr/local/lib/R/site-library
#>  [2] /usr/local/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2023-05-05 with reprex v2.0.2

eitsupi · 2023-05-05T08:28:28Z

Here is what I consider to be the rough pros and cons:

Acero

Pros

In many cases, high performance can be achieved using the dplyr syntax as is.

Cons

Window functions are not supported.

DuckDB

Pros

Queries can be written in SQL and are highly portable.
dbplyr allows the dplyr syntax to be used almost verbatim.

Cons

It adds SQL-extended syntax such as COLUMNS and EXCLUDE, but does not support column selection or column renaming as flexibly as dplyr (or polars).

Polars

Pros

It is considered one of the fastest DataFrame libraries.

Cons

It is necessary to learn its own syntax.
There are many breaking changes.

eitsupi · 2023-05-05T08:41:36Z

I really don't think it is a good idea to make a bare dplyr comparison here, because in my opinion dplyr users can achieve higher speeds by simply switching to data.table (via dtplyr) or Acero (via arrow) or duckdb (via dbplyr) as a backend.
(Note that these different backends are described in dplyr's README)

vincentarelbundock · 2023-05-05T08:49:19Z

Why don't we just add a dtplyr example, then?

One other benefit of polars, I think, is the parallelism in syntax across R, Python, and Rust, which facilitates multilingual projects and teams.

tdhock · 2023-05-05T21:33:04Z

I would recommend using asymptotic benchmarks, which means measuring time and memory for data size N values increasing on a log scale. I have a package https://github.com/tdhock/atime that makes this easy. These benchmarks are much more convincing than a single N, which can be misleading: which N is relevant to test may depend on the particular problem and hardware, so it is much more informative and convincing to see results for several N.
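
As a rough illustration of the idea (not of the atime API), the sketch below times one grouped mean in base R for sizes N increasing on a log scale. The helper name `time_grouped_mean` and the columns `x` and `g` are made up for this example, and a single `system.time()` run per size is much cruder than what a dedicated benchmarking package would do.

```r
set.seed(1)

# Data sizes increasing on a log scale: 1e3, 1e4, 1e5, 1e6.
sizes <- 10^(3:6)

# Hypothetical helper: build a data frame of n rows and return the elapsed
# wall-clock seconds of one grouped-mean computation.
time_grouped_mean <- function(n) {
  df <- data.frame(x = runif(n), g = sample(letters, n, replace = TRUE))
  unname(system.time(tapply(df$x, df$g, mean))["elapsed"])
}

timings <- data.frame(
  N = sizes,
  seconds = vapply(sizes, time_grouped_mean, numeric(1))
)
print(timings)
```

Plotting `seconds` against `N` on log-log axes, one line per package, would then show how each approach scales rather than how it performs at one arbitrary size.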

etiennebacher · 2023-05-06T13:13:25Z

I don't think we should include benchmarks with data.table/dplyr/arrow, etc., or at least not in this vignette (so we could change its name).

To me, the objective of this vignette is not to compare polars to other packages or tools because our benchmarks will never be as comprehensive as those run by duckdb. Also, if we start doing this, then we must make a lot of choices (including those discussed above): should we count data reading in the timing? should we use keyed data.tables? should we compare to arrow, duckdb? how many observations should we keep? etc.

I think it's more important here to focus on how one can use polars' full capabilities, because it's not something that the average R user knows (e.g., I guess few R users know the difference between eager and lazy execution). If we want to advertise the speed of polars, couldn't we just take a graph from duckdb's benchmarks?

@tdhock thanks for the link, I think bench::press() does something similar?

tdhock · 2023-05-08T16:25:17Z

Yes, bench::press does something similar, and that is discussed in the Related work section of the atime README, https://github.com/tdhock/atime#related-work
It is also more flexible, because it can do a multi-dimensional grid search (not only over a single size N argument, as atime does). However, it cannot store results if check=FALSE, results must be equal if check=TRUE, and there is no easy way to specify a time limit that stops the benchmark at larger sizes (like the seconds.limit argument in atime).
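
For readers who have not used it, a minimal sketch of bench::press() over a single size parameter might look like the following. Base R functions stand in for the packages compared in this thread, and the column names `x` and `g` are made up for the example.

```r
library(bench)

# bench::press() reruns the braced block once for every value of N.
results <- bench::press(
  N = 10^(3:5),
  {
    df <- data.frame(x = runif(N), g = sample(letters, N, replace = TRUE))
    bench::mark(
      tapply = tapply(df$x, df$g, mean),
      aggregate = aggregate(x ~ g, data = df, FUN = mean),
      check = FALSE  # the two expressions return differently shaped results
    )
  }
)

# One row per combination of expression and N.
print(results[, c("expression", "N", "median")])
```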

grantmcdermott · 2023-05-10T14:02:35Z

One other (tbc?) Pro for Polars is that multithreading automatically works on macOS.

I might be missing something about the R-universe build process, but enabling multithreading in other high performance R libraries can be a bit of a pain. That's because these are C/C++ based and the OpenMP toolchain has to be installed and then linked to manually. (Basically, you have to specify a bunch of, e.g., C++ flags in your Makevars and then build from source instead of installing binaries.)

eitsupi · 2023-05-10T14:10:55Z

> I might be missing something about the R-universe build process

Do you mention that SIMD is disabled in the R-universe builds?
#78 (comment)

sorhawell · 2023-05-10T14:14:04Z

> Do you mention that SIMD is disabled in the R-universe builds?

We do not :/ but should. I know of no benchmarks yet describing the performance difference.

etiennebacher · 2023-08-18T06:46:09Z

Can someone review the content of this vignette? I included the .md file so that it's easier to review from GitHub, but we'll need to remove it before merging since it's not expected by R CMD check.

eitsupi

Thanks for working on this!
Sorry for the late review.

etiennebacher · 2023-09-09T14:09:10Z

Thanks for the review @eitsupi

start performance vignette

ddc536a

vincentarelbundock reviewed May 1, 2023

performance.md

sorhawell reviewed May 2, 2023

performance.Rmd

sorhawell reviewed May 2, 2023

performance.Rmd

sorhawell reviewed May 2, 2023

performance.Rmd

etiennebacher added 2 commits May 5, 2023 08:44

remove readr, address some comments [skip ci]

1b5c052

add basic benchmark [skip ci]

ce349ee

vincentarelbundock reviewed May 5, 2023

performance.Rmd

etiennebacher added 2 commits May 11, 2023 08:46

Merge remote-tracking branch 'origin/main' into vignette-performance

5883ee4

remove first benchmark, comment out streaming section [skip ci]

363c18f

etiennebacher added 2 commits August 17, 2023 13:07

Merge branch 'main' into vignette-performance

b43bcd9

clean, remove CSV export and read, add the streaming part

dee570d

etiennebacher marked this pull request as ready for review August 17, 2023 11:36

convert to standard vignette format

bba71d8

etiennebacher requested review from eitsupi, vincentarelbundock and sorhawell August 18, 2023 06:46

eitsupi approved these changes Sep 8, 2023

vignettes/performance.Rmd

etiennebacher added 4 commits September 9, 2023 13:36

hardcode benchmarks so that they don't re-run every time

311dc25

Merge branch 'main' into vignette-performance

979572c

remove performance.md

d7f1320

bump news

eb3d37a

etiennebacher merged commit ead2b30 into main Sep 9, 2023
1 check passed

etiennebacher deleted the vignette-performance branch September 9, 2023 14:09

etiennebacher mentioned this pull request Sep 9, 2023

Add vignette "performance" in list of articles #381

Merged
