bigrquery integration #46

Closed · hadley opened this issue Mar 12, 2024 · 18 comments · Fixed by r-dbi/bigrquery#604

@hadley (Contributor) commented Mar 12, 2024

If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).
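
For illustration, one way such an opt-out could look; the option name bigrquery.use_storage_api and the helper are hypothetical, not an existing bigrquery setting:

# Hypothetical sketch: default to the Storage API whenever bigrquerystorage
# is installed, unless the user opts out via an option.
use_storage_api <- function() {
  isTRUE(getOption(
    "bigrquery.use_storage_api",
    default = requireNamespace("bigrquerystorage", quietly = TRUE)
  ))
}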

@meztez (Owner) commented Mar 13, 2024

> If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).

I've taken notice of your question.

I will do some further testing to validate my hypothesis and come back with a comprehensive response in a timely manner.

Thank you (Posit) for the coffee mugs. I will use them to top up during the experiments.

@meztez (Owner) commented Mar 13, 2024

Here is what I confirmed. I see two possible issues.

  • startIndex has no equivalent on the BigQuery Storage API side. The package currently downloads all the data, then cuts it off before returning results (see the sketch after the code excerpt below). This can be an issue for extremely large tables with a big startIndex and a small n_max. I thought offset could be used, but it does not seem to work that way.
  • Channel-creation overhead is a little larger than with the regular API. This might be an issue if your workload is mostly very small tables.

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#readrowsrequest
tensorflow/io#876

if (start_index > 0L) {
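
A rough R-level sketch of that behavior (illustrative only; rows, start_index, and n_max are stand-in names for whatever the package uses internally):

# All rows are streamed from the Storage API first; start_index is only
# applied client-side afterwards, so a large start_index still downloads
# every preceding row.
apply_start_index <- function(rows, start_index, n_max = Inf) {
  if (start_index > 0L) {
    rows <- rows[-seq_len(start_index), , drop = FALSE]
  }
  head(rows, n_max)
}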

@hadley (Contributor, Author) commented Mar 13, 2024

I'm not too concerned about your first point; we can just make sure that start_index gives a clear error if you use it with the arrow endpoint.

The second point is interesting, but I think it will make itself apparent once this is in bigrquery itself, because all the tests use tiny datasets (I always think it's hilarious that I'm using bigquery to query mtcars 😆).
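
A minimal sketch of such a guard (hypothetical names; check_start_index() and its arguments are made up for illustration, not bigrquery's actual API):

# Hypothetical check: the Storage API has no startIndex equivalent, so
# refuse the combination instead of silently downloading everything.
check_start_index <- function(start_index, using_storage_api) {
  if (start_index > 0L && using_storage_api) {
    stop("`start_index` is not supported when downloading via the BigQuery Storage API.")
  }
}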

@hadley (Contributor, Author) commented Apr 2, 2024

When I try and download a public dataset, I see:

Error:
! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.

Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

@hadley (Contributor, Author) commented Apr 2, 2024

Do you expect that this would take 30+ seconds?

system.time(x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.natality",
  parent = bigrquery::bq_test_project(),
  n_max = 100000
))

The initial download progress bar seems to flash by very quickly and then nothing happens for a while.

I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages. But if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.

Also, if I omit n_max I see a progress bar with an ETA of 5 minutes, and when I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

@meztez (Owner) commented Apr 2, 2024

> When I try and download a public dataset, I see:
>
> Error:
> ! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.
>
> Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

It needs a billable project with access to publicdata, as mentioned in the example code:

# TODO: (developer): Set the project_id variable to your billing project.
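
For example (the billing project id below is a placeholder):

bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.shakespeare",
  parent = "your-billing-project-id"  # a billable project you have access to
)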

@meztez (Owner) commented Apr 2, 2024

> Do you expect that this would take 30+ seconds?
>
> system.time(x <- bigrquerystorage::bqs_table_download(
>   x = "publicdata.samples.natality",
>   parent = bigrquery::bq_test_project(),
>   n_max = 100000
> ))
>
> The initial download progress bar seems to flash by very quickly and then nothing happens for a while.
>
> I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages. But if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.
>
> Also, if I omit n_max I see a progress bar with an ETA of 5 minutes, and when I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

I only see an interrupt check in the loop when in quiet mode; I vaguely remember experimenting with the interrupt/performance tradeoff:

Rcpp::checkUserInterrupt();
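// Context (assumed from the discussion above): this call sits inside the
// C++ streaming loop and throws back to R when the user has pressed
// Ctrl + C; calling it only rarely is what made cancellation feel slow.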

I will test performance on a similar dataset. I wonder if it's memory/swap related.

@hadley (Contributor, Author) commented Apr 2, 2024

One more data point: if I download publicdata.samples.shakespeare, it only takes 1s.

@meztez (Owner) commented Apr 2, 2024

Ok, there is something going on with the n_max / truncation method. This is not something we use a lot internally, so it was not detected. Fixing it.

@meztez (Owner) commented Apr 2, 2024

Really sorry you had to experience these issues. I believe I identified the issue and corrected the relevant code to allow for more frequent interrupt checks; checking every 100 messages should have only a marginal performance impact.

I've also reworked the n_max / truncation logic to use the recommended ClientContext TryCancel method, which cancels streaming cleanly when n_max is hit.

It should take 1-2 seconds now.

@hadley (Contributor, Author) commented Apr 2, 2024

No problem, and thanks so much for fixing this so quickly!

@meztez (Owner) commented Sep 21, 2024

One more thing that needs to be addressed is how circular imports can be managed. bigrquerystorage uses the following functions from bigrquery:

  • bq_table_fields
  • bq_has_token
  • the .auth$cred environment value

@meztez (Owner) commented Sep 21, 2024

> One more thing that needs to be addressed is how circular imports can be managed. bigrquerystorage uses the following functions from bigrquery:
>
>   • bq_table_fields
>   • bq_has_token
>   • the .auth$cred environment value

This should not be an issue now. Let me know if you need anything else. Have a good weekend.
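
A minimal sketch of the kind of pattern that avoids a hard circular dependency (hypothetical wrapper; this is not necessarily how bigrquerystorage implements it):

# Treat bigrquery as a soft (Suggests) dependency and resolve it at run
# time, so neither package needs a hard Imports on the other.
bqs_table_fields <- function(table) {
  if (!requireNamespace("bigrquery", quietly = TRUE)) {
    stop("Package 'bigrquery' is required to look up table fields.")
  }
  bigrquery::bq_table_fields(table)
}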

@meztez (Owner) commented Sep 28, 2024

It seems like @gaborcsardi fixed support back to R 4.0. Looking forward to this being integrated into bigrquery.

@hadley (Contributor, Author) commented Sep 30, 2024

@meztez do you think you could do a CRAN release in the near future? Then I'll advertise dev bigrquery to get some people actually using it and then do a CRAN release of my own.

@meztez (Owner) commented Sep 30, 2024

Submitted as 1.2.0; will release on GitHub when CRAN publishes.

@hadley (Contributor, Author) commented Sep 30, 2024

Awesome, thank you!

@meztez (Owner) commented Oct 1, 2024

@hadley released bigrquerystorage 1.2.0: https://cran.r-project.org/web/packages/bigrquerystorage/index.html
