Improve performance of generating distinct interactions #28

halhen · 2022-07-07T11:56:40Z

When generating distinct intersections on data with hundreds of
thousands of elements, upsetjs grinds to a halt. The time seems to be
roughly O(n^2), meaning that doubling the data makes execution take
2^2 = 4 times as long. With the help of profvis, we find the main
source to be a Filter in pushCombination(), which causes a doubly
nested loop over the elements.
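
To illustrate why a Filter inside a per-element loop scales quadratically, here is a minimal sketch. It is not the upsetjs source: `elems`, `count_quadratic`, `count_linear`, and the per-element `key` are hypothetical names, standing in for whatever pushCombination() actually compares.

```r
# Hypothetical illustration of the complexity pattern; not the upsetjs code.
# Each element carries a key naming the combination of sets it belongs to.
set.seed(1)
n <- 500
elems <- lapply(seq_len(n), function(i) {
  list(name = i, key = paste(sample(c(0, 1), 3, replace = TRUE), collapse = ""))
})

# Quadratic pattern: for every element, Filter() scans all elements again,
# so the predicate runs n * n times.
count_quadratic <- function(elems) {
  vapply(elems, function(e) {
    length(Filter(function(o) o$key == e$key, elems))
  }, integer(1))
}

# Linear pattern: group once with table(), then look up each element's group.
count_linear <- function(elems) {
  keys <- vapply(elems, function(e) e$key, character(1))
  sizes <- table(keys)
  as.integer(sizes[keys])
}

# Both approaches give the same group sizes.
same <- identical(count_quadratic(elems), count_linear(elems))
```

Grouping once up front replaces the repeated scans with a single pass plus name lookups, which is the general idea behind removing a nested Filter.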

Minimal benchmark on a fairly beefy computer (5950X, 128 GB RAM)
running Fedora Linux, with R 4.1.3 and upsetjs 1.11.0 (git hash 4b375a8):

generate_data <- function(n) {
  tibble::tibble(
    col_0 = sample(c(0, 1), n, replace = TRUE),
    col_1 = sample(c(0, 1), n, replace = TRUE),
    col_2 = sample(c(0, 1), n, replace = TRUE),
    col_3 = sample(c(0, 1), n, replace = TRUE),
    col_4 = sample(c(0, 1), n, replace = TRUE),
    col_5 = sample(c(0, 1), n, replace = TRUE),
    col_6 = sample(c(0, 1), n, replace = TRUE),
    col_7 = sample(c(0, 1), n, replace = TRUE),
    col_8 = sample(c(0, 1), n, replace = TRUE),
    col_9 = sample(c(0, 1), n, replace = TRUE)
  )
}

Before this PR:

> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(10000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 24.85004 secs

With this PR:

> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(10000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 0.7690187 secs

Also, scaling is now closer to O(n) or slightly better.
With 10x the data:

> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(100000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 5.745839 secs
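
As a rough check of the scaling claim, the timings reported above can be turned into a speedup and an estimated exponent. This is only a sketch: it assumes time grows as n^k and uses just two data points, so the exponent is indicative at best. The variable names are made up for this example.

```r
# Timings copied from the benchmark runs above, in seconds.
t_before_10k <- 24.85004
t_after_10k  <- 0.7690187
t_after_100k <- 5.745839

# Speedup at n = 10000.
speedup <- t_before_10k / t_after_10k

# If time ~ n^k, then k = log(t2 / t1) / log(n2 / n1).
k <- log(t_after_100k / t_after_10k) / log(100000 / 10000)

cat(sprintf("speedup: %.1fx, estimated exponent: %.2f\n", speedup, k))
```

With these numbers the speedup comes out at roughly 32x and the exponent at about 0.87, which is consistent with "closer to O(n) or slightly better".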


sgratzl · 2022-07-07T12:34:06Z

thank you.

btw.

fromDataFrame already extracts some combinations from the sets. However, by default they are not the distinct ones.
You can customize this using the c_type parameter, e.g.:

upsetjs() |>
    upsetjs:::fromDataFrame(generate_data(10000), c_type = "distinctIntersections")

This should be enough and way faster, since it also uses an optimized version for this combination (data frame + distinct).

sgratzl · 2022-07-07T12:36:47Z
