✨ [source-google-sheets] add row_batch_size as an input parameter with higher increase #35404
Conversation
@tautvydas-v, regarding this comment in the previous PR: #35320 (comment)
Hey @darynaishchenko, a suitable batch size really depends on the case, I suppose. In some cases we could really use 10k records per batch, and it would speed up the extraction by a lot. Keep in mind that we're leaving the default batch size at 200 and it sits under Optional parameters, so even after a version update, anyone who doesn't need or care about batch size won't see any impact, as it's already 200 today. As a use case, we have a Google Sheet with about ~137,000 records, and here's a screenshot of how it goes for us so far. Meanwhile, we've tested different batch size inputs; here's one with 100k rows per batch. Do you see any potential problems with this setup, though? Having 100k rows as a batch size really seems to be an edge case, but it's possible to use it too. Also, regarding the backoff strategy: I think we could leave a hardcoded value by default, since percentages would have to be rounded to a whole number; I tested with 10% and didn't like how it worked.
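For context on the numbers in this discussion: each batch is one API request, so the request count for a full sync is roughly the total row count divided by the batch size. A quick illustrative calculation (only the ~137,000-row figure comes from the comment above; everything else is a sketch, not code from the PR):

```python
# Rough illustration: each request fetches one batch of rows, so a full
# sync needs ceil(total_rows / row_batch_size) Google Sheets API requests.
import math

TOTAL_ROWS = 137_000  # sheet size from the use case above

for batch_size in (200, 10_000, 100_000):
    requests = math.ceil(TOTAL_ROWS / batch_size)
    print(f"batch_size={batch_size:>7}: ~{requests} API requests")

# batch_size=    200: ~685 API requests
# batch_size=  10000: ~14 API requests
# batch_size= 100000: ~2 API requests
```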
@darynaishchenko @marcosmarxm Hi, what should be the next steps here?
@darynaishchenko can you review the contribution again, please?
@darynaishchenko @marcosmarxm Hey, are there any additional steps to be taken here? Just conscious that this doesn't get lost, as we would really like to implement this.
```yaml
type: integer
title: Row Batch Size
description: >-
  An integer representing row batch size with each sent request to Google Sheets API. Defaults to 200.
```
@tautvydas-v lgtm, could you please add a more detailed description here and in the .md file for Row Batch Size? Info about what impact this value has on rate limits, syncs, etc., and how to define the most suitable value for users.
Added more documentation.
```diff
@@ -149,6 +149,7 @@ def _read(
         catalog: ConfiguredAirbyteCatalog,
     ) -> Generator[AirbyteMessage, None, None]:
         client = GoogleSheetsClient(self.get_credentials(config))
+        client.Backoff.row_batch_size = config["batch_size"]
```
Suggested change:

```diff
-        client.Backoff.row_batch_size = config["batch_size"]
+        client.Backoff.row_batch_size = config.get("batch_size", 200)
```

for old configs without batch_size.
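The reason the suggestion matters: a config saved before this version has no `batch_size` key, so a plain dict lookup raises `KeyError`, while `.get()` falls back to the old hardcoded value. A minimal sketch of the difference (the `Backoff` class here is a simplified stand-in, not the connector's actual implementation):

```python
class Backoff:
    # Class-level default, matching the previous hardcoded value.
    row_batch_size = 200

old_config = {"spreadsheet_id": "abc123"}  # saved before batch_size existed
new_config = {"spreadsheet_id": "abc123", "batch_size": 1000}

# old_config["batch_size"] would raise KeyError; .get() keeps existing
# connections working after the connector upgrade.
Backoff.row_batch_size = old_config.get("batch_size", 200)  # -> 200
Backoff.row_batch_size = new_config.get("batch_size", 200)  # -> 1000
```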
Fixed
@darynaishchenko let me know if this is ready to ship 🚀
@darynaishchenko Hi, what's the status with this PR? Can this be reviewed again?
@tautvydas-v there are some internal discussions about the implementation, sorry for the delay. Hope to get an update next week! Thanks for your patience.
@marcosmarxm @darynaishchenko Hi, any news regarding this one? I've noticed that I have to update the connector version for the fourth time, but I'm really not sure what the status is here and whether it's even worth doing in the future. Would be great to get some feedback, thanks.
@tautvydas-v hi, thanks for your patience. I'm working through the internal discussions; once I receive answers, I'll let you know. I'm really sorry for the delay.
@tautvydas-v Hi, I'm ok with your solution.
Hey @darynaishchenko, in my opinion, even though setting the batch size dynamically could work, having the default batch_size of 200 seems to cover a lot of cases, is clearer when you have lots of connections, and makes checking logs a bit easier, since you see the same value for each sync. If a user just uses a connection and sees different values for each sync (where the Google Sheet's size is different), I think this could create confusion.
lgtm
@tautvydas-v thanks for your work on it.
@marcosmarxm it's ready to ship
Thanks @darynaishchenko. @tautvydas-v I'm going to proceed with the shipping process 🚢
What
Accidentally closed PR #35320, therefore raising this new one.
Resolves the issue I raised: #35274
Previously, the source-google-sheets connector had a hardcoded value of 200 for row_batch_size. This means that one request sent to the Google Sheets API processes only 200 rows, even though there is essentially no limit on rows per request, as long as the request completes in under 3 minutes; the rate limitation is on requests sent, not on rows processed. A problem sometimes arises when syncing a Google Sheet with over 100k records, where the exponential backoff fails and the sync silently fails. To overcome this, "batch_size" becomes an input parameter with a default value of 200, so nothing changes for existing setups when updating the connector.
How
Added "batch_size" to spec with a default value of 200 which is passed over source.py file, under _read method. Also, changed row_batch_size increase to 100 if the exception code is TOO_MANY_REQUESTS.