Improve performance of generating distinct interactions #28

Merged
merged 1 commit into from
Jul 7, 2022

@halhen halhen commented Jul 7, 2022

When generating distinct intersections on data with hundreds of
thousands of elements, it grinds to a halt. The time seems to be
roughly O(n^2), meaning that doubling the data makes execution take
2^2 = 4 times as long. With the help of profvis, we find the main
source to be a Filter in pushCombination(), which causes a doubly
nested loop over the elements.
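
To illustrate the kind of pattern described above, here is a minimal base-R sketch. It is not the upsetjs code: `distinct_slow()` and `distinct_fast()` are hypothetical helpers, and the data is made up. The first rescans a second vector for every element via `Filter`, so its work grows roughly with the product of the two lengths; the second does one vectorised membership test.

```r
# Hypothetical helpers for illustration only; not taken from upsetjs.

# Keep the elements of `elems` that do not occur in `others`.
# Filter() calls the predicate once per element, and each call scans
# all of `others`, so this is roughly length(elems) * length(others) work.
distinct_slow <- function(elems, others) {
  Filter(function(e) !any(others == e), elems)
}

# Same result with a single vectorised membership test.
distinct_fast <- function(elems, others) {
  elems[!(elems %in% others)]
}

# Made-up data: 2000 element names, half of which also occur in `others`.
elems  <- paste0("e", seq_len(2000))
others <- paste0("e", seq(2, 2000, by = 2))

# Both approaches agree; only the amount of work differs.
stopifnot(identical(distinct_slow(elems, others), distinct_fast(elems, others)))
```

Wrapping each call in `system.time()` at a few sizes of `elems` is one way to see the difference in growth on your own machine.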

Minimal benchmark on a fairly beefy computer (5950X, 128 GB RAM)
on Fedora Linux, with R 4.1.3 and upsetjs 1.11.0 (git hash 4b375a8):

```
generate_data <- function(n) {
  tibble::tibble(
    col_0 = sample(c(0, 1), n, replace = TRUE),
    col_1 = sample(c(0, 1), n, replace = TRUE),
    col_2 = sample(c(0, 1), n, replace = TRUE),
    col_3 = sample(c(0, 1), n, replace = TRUE),
    col_4 = sample(c(0, 1), n, replace = TRUE),
    col_5 = sample(c(0, 1), n, replace = TRUE),
    col_6 = sample(c(0, 1), n, replace = TRUE),
    col_7 = sample(c(0, 1), n, replace = TRUE),
    col_8 = sample(c(0, 1), n, replace = TRUE),
    col_9 = sample(c(0, 1), n, replace = TRUE)
  )
}
```

Before this PR:

```
> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(10000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 24.85004 secs
```

With this PR:

```
> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(10000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 0.7690187 secs
```

Also, scaling is now closer to O(n) or slightly better.
With 10x the data:

```
> start <- Sys.time()
> upsetjs() |>
+     upsetjs:::fromDataFrame(generate_data(100000)) |>
+     upsetjs:::generateDistinctIntersections(limit = 5)
> Sys.time() - start
Time difference of 5.745839 secs
```
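
As a rough check on the scaling claim, the three timings above can be turned into a speedup and an empirical exponent. This is only a sketch: it assumes time grows like n^k and uses single runs, so the numbers are indicative at best. The variable names are made up for this example.

```r
# Timings copied from the benchmark runs above (seconds).
t_before_10k <- 24.85004   # before the PR, n = 10000
t_after_10k  <- 0.7690187  # with the PR,   n = 10000
t_after_100k <- 5.745839   # with the PR,   n = 100000

# Speedup at n = 10000.
speedup <- t_before_10k / t_after_10k

# If time ~ n^k, then k = log(t2 / t1) / log(n2 / n1).
k <- log(t_after_100k / t_after_10k) / log(100000 / 10000)

cat(sprintf("speedup at n = 10000: %.1fx\n", speedup))
cat(sprintf("empirical exponent k: %.2f\n", k))
```

An exponent a little below 1 is consistent with "closer to O(n) or slightly better"; repeating each run several times would give a more reliable estimate.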
@sgratzl sgratzl self-requested a review July 7, 2022 12:26
@sgratzl sgratzl self-assigned this Jul 7, 2022
@sgratzl sgratzl added the enhancement New feature or request label Jul 7, 2022
Member

sgratzl commented Jul 7, 2022

thank you.

btw.

  1. fromDataFrame already extracts some combinations from the sets. However, by default they are not the distinct ones.
  2. You can customize this with the c_type parameter, e.g.
upsetjs() |>
    upsetjs:::fromDataFrame(generate_data(10000), c_type = "distinctIntersections")

This should be enough and much faster, since it also uses an optimized version for this combination (data frame + distinct).

@sgratzl sgratzl merged commit a515fbe into upsetjs:main Jul 7, 2022
Member

sgratzl commented Jul 7, 2022

see also #14 (comment)

Author

halhen commented Jul 7, 2022

Thanks for the fromDataFrame(c_type) tip -- that solved my immediate performance needs! 🌷
