bigrquery integration #46

Closed · hadley opened this issue Mar 12, 2024 · 18 comments · Fixed by r-dbi/bigrquery#604

@hadley (Contributor) commented Mar 12, 2024

If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).
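
For illustration, one way such an opt-out could look; the option name bigrquery.use_storage_api and the helper are hypothetical, not an existing bigrquery setting:

# Hypothetical sketch: default to the Storage API whenever bigrquerystorage
# is installed, unless the user opts out via an option.
use_storage_api <- function() {
  isTRUE(getOption(
    "bigrquery.use_storage_api",
    default = requireNamespace("bigrquerystorage", quietly = TRUE)
  ))
}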

@meztez (Owner) commented Mar 13, 2024

> If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).

I've taken notice of your question.

I will do some further testing to validate my hypothesis and come back with a comprehensive response in a timely manner.

Thank you (Posit) for the coffee mugs. I will use them to top up during the experiments.

@meztez (Owner) commented Mar 13, 2024

Here is what I confirmed. I see two possible issues.

  • startIndex has no equivalent on the BigQuery Storage API side. The package currently downloads all the data, then cuts it off before returning results (see the sketch after the code excerpt below). This can be an issue for extremely large tables with a big startIndex and a small n_max. I thought offset could be used, but it does not seem to work that way.
  • Channel-creation overhead is a little larger than with the regular API. This might be an issue if your workload is mostly very small tables.

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#readrowsrequest
tensorflow/io#876

if (start_index > 0L) {
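
A rough R-level sketch of that behavior (illustrative only; rows, start_index, and n_max are stand-in names for whatever the package uses internally):

# All rows are streamed from the Storage API first; start_index is only
# applied client-side afterwards, so a large start_index still downloads
# every preceding row.
apply_start_index <- function(rows, start_index, n_max = Inf) {
  if (start_index > 0L) {
    rows <- rows[-seq_len(start_index), , drop = FALSE]
  }
  head(rows, n_max)
}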

@hadley (Contributor, Author) commented Mar 13, 2024

I'm not too concerned about your first point; we can just make sure that start_index gives a clear error if you use it with the arrow endpoint.

The second point is interesting, but I think it will make itself apparent once this is in bigrquery itself, because all the tests use tiny datasets (I always think it's hilarious that I'm using bigquery to query mtcars 😆).
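
A minimal sketch of such a guard (hypothetical names; check_start_index() and its arguments are made up for illustration, not bigrquery's actual API):

# Hypothetical check: the Storage API has no startIndex equivalent, so
# refuse the combination instead of silently downloading everything.
check_start_index <- function(start_index, using_storage_api) {
  if (start_index > 0L && using_storage_api) {
    stop("`start_index` is not supported when downloading via the BigQuery Storage API.")
  }
}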

@hadley (Contributor, Author) commented Apr 2, 2024

When I try and download a public dataset, I see:

Error:
! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.

Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

@hadley (Contributor, Author) commented Apr 2, 2024

Do you expect that this would take 30+ seconds?

system.time(x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.natality",
  parent = bigrquery::bq_test_project(),
  n_max = 100000
))

The initial download progress bar seems to flash by very quickly and then nothing happens for a while.

I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages. But if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.

Also, if I omit n_max I see a progress bar with an ETA of 5 minutes, and when I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

@meztez (Owner) commented Apr 2, 2024

> When I try and download a public dataset, I see:
>
> Error:
> ! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.
>
> Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

It needs a billable project with access to publicdata, as mentioned in the example code:

# TODO: (developer): Set the project_id variable to your billing project.
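
For example (the billing project id below is a placeholder):

bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.shakespeare",
  parent = "your-billing-project-id"  # a billable project you have access to
)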

@meztez (Owner) commented Apr 2, 2024

> Do you expect that this would take 30+ seconds?
>
> system.time(x <- bigrquerystorage::bqs_table_download(
>   x = "publicdata.samples.natality",
>   parent = bigrquery::bq_test_project(),
>   n_max = 100000
> ))
>
> The initial download progress bar seems to flash by very quickly and then nothing happens for a while.
>
> I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages. But if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.
>
> Also, if I omit n_max I see a progress bar with an ETA of 5 minutes, and when I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

I only see an interrupt check in the loop when in quiet mode; I vaguely remember experimenting with the interrupt/performance tradeoff:

Rcpp::checkUserInterrupt();
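// Context (assumed from the discussion above): this call sits inside the
// C++ streaming loop and throws back to R when the user has pressed
// Ctrl + C; calling it only rarely is what made cancellation feel slow.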

I will test performance on a similar dataset. I wonder if it's memory/swap related.

@hadley (Contributor, Author) commented Apr 2, 2024

One more data point: if I download publicdata.samples.shakespeare, it only takes 1s.

@meztez (Owner) commented Apr 2, 2024

Ok, there is something going on with the n_max / truncation method. This is not something we use a lot internally, so it was not detected. Fixing it.

@meztez (Owner) commented Apr 2, 2024

Really sorry you had to experience these issues. I believe I identified the issue and corrected the relevant code to allow for more frequent interrupt checks; checking every 100 messages should have only a marginal performance impact.

I've also reworked the n_max / truncation logic to use the recommended ClientContext TryCancel method, which cancels streaming cleanly when n_max is hit.

It should take 1-2 seconds now.

@hadley (Contributor, Author) commented Apr 2, 2024

No problem, and thanks so much for fixing this so quickly!

@meztez (Owner) commented Sep 21, 2024

One more thing that needs to be addressed is how circular imports can be managed. bigrquerystorage uses the following functions from bigrquery:

  • bq_table_fields
  • bq_has_token
  • the .auth$cred environment value

@meztez (Owner) commented Sep 21, 2024

> One more thing that needs to be addressed is how circular imports can be managed. bigrquerystorage uses the following functions from bigrquery:
>
>   • bq_table_fields
>   • bq_has_token
>   • the .auth$cred environment value

This should not be an issue now. Let me know if you need anything else. Have a good weekend.
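
A minimal sketch of the kind of pattern that avoids a hard circular dependency (hypothetical wrapper; this is not necessarily how bigrquerystorage implements it):

# Treat bigrquery as a soft (Suggests) dependency and resolve it at run
# time, so neither package needs a hard Imports on the other.
bqs_table_fields <- function(table) {
  if (!requireNamespace("bigrquery", quietly = TRUE)) {
    stop("Package 'bigrquery' is required to look up table fields.")
  }
  bigrquery::bq_table_fields(table)
}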

@meztez (Owner) commented Sep 28, 2024

It seems like @gaborcsardi fixed support back to R 4.0. Looking forward to this being integrated into bigrquery.

@hadley (Contributor, Author) commented Sep 30, 2024

@meztez do you think you could do a CRAN release in the near future? Then I'll advertise dev bigrquery to get some people actually using it and then do a CRAN release of my own.

@meztez (Owner) commented Sep 30, 2024

Submitted as 1.2.0; will release on GitHub when CRAN publishes.

@hadley (Contributor, Author) commented Sep 30, 2024

Awesome, thank you!

@meztez (Owner) commented Oct 1, 2024

@hadley released bigrquerystorage 1.2.0: https://cran.r-project.org/web/packages/bigrquerystorage/index.html
