Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove pre-caching of remote function results #1028

Merged
merged 1 commit into from
Oct 1, 2024

Conversation

shobsi
Copy link
Contributor

@shobsi shobsi commented Sep 27, 2024

This will save redundant remote function execution as reported in b/370088754. The trade-off is that any remote function integration issues will be caughts only at a later point through a usage triggered sql execution. Why the caching is not working as indended in the reported use case in the bug? as per TrevorBergeron@ "the cached execution is useless when it is implicitly joined back to the base dataframe"

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 370088754 🦕

This will save redundant remote function execution as reported in
b/370088754. The trade-off is that any remote function integration
issues will be caughts only at a later point through a usage triggered
sql execution. Why the caching is not working as indended in the
reported use case in the bug? as per TrevorBergeron@ "the cached execution is useless when it is implicitly joined back to the base dataframe"
@shobsi shobsi requested review from a team as code owners September 27, 2024 23:04
@shobsi shobsi requested a review from GarrettWu September 27, 2024 23:04
@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Sep 27, 2024
GarrettWu
GarrettWu previously approved these changes Sep 27, 2024
@GarrettWu
Copy link
Contributor

So after the change, when does the execution happens?

df["col2"] = df["col1"].apply(remote_function)

or

df.to_gbq()

if at the later stage, if the user calls exec APIs (such as repr) multiple times, will the function be executed multiple times?

@GarrettWu GarrettWu self-requested a review September 27, 2024 23:41
@GarrettWu
Copy link
Contributor

Is there a way that we can add test to ensure that with multiple operations, there won't be redundant function calls?

@GarrettWu GarrettWu dismissed their stale review September 27, 2024 23:43

questions

@shobsi
Copy link
Contributor Author

shobsi commented Sep 30, 2024

So after the change, when does the execution happens?

df["col2"] = df["col1"].apply(remote_function)

or

df.to_gbq()

if at the later stage, if the user calls exec APIs (such as repr) multiple times, will the function be executed multiple times?

It would be the latter. After this change the behavior would be same as the standard bigframes behavior - if the result is cached it won't be executed, else it would be.

@shobsi
Copy link
Contributor Author

shobsi commented Oct 1, 2024

Is there a way that we can add test to ensure that with multiple operations, there won't be redundant function calls?

It's tied to caching. Chatted with @TrevorBergeron that caching is not a guarantee at this point but more like an internal optimization. Asserting how many times a function was called is a bit of a stretch, just like asserting how many number of sql jobs were run in an operation, so I'd skip adding a test. Let me know if you have s strong opinion otherwise.

@GarrettWu GarrettWu merged commit 0359bc8 into main Oct 1, 2024
23 checks passed
@GarrettWu GarrettWu deleted the shobs-setup-rf-no-precache branch October 1, 2024 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: s Pull request size is small.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants