
✨ [source-google-sheets] add row_batch_size as an input parameter with higher increase #35404

Conversation

tautvydas-v
Contributor

What

I accidentally closed PR #35320, therefore raising this new one.
Resolves the issue I raised: #35274
Previously, the source-google-sheets connector had a hardcoded value of 200 for row_batch_size. This means each request sent to the Google Sheets API processes only 200 rows, even though there is effectively no row limit as long as the request completes in under 3 minutes; the limitation applies to the number of requests sent, not to the rows processed. A problem sometimes arises when syncing a Google Sheet with over 100k records, where exponential backoff fails and the sync fails silently. To overcome this, this PR turns "batch_size" into an input parameter with a default value of 200, so there is no behavior change when updating the connector.

How

Added "batch_size" to the spec with a default value of 200, which is passed through the source.py file in the _read method. Also changed the row_batch_size increase to 100 when the exception code is TOO_MANY_REQUESTS.
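A minimal sketch of the intended wiring, assuming a simplified stand-in for the connector's GoogleSheetsClient.Backoff class; the function name apply_batch_size is illustrative, not the connector's exact code:

```python
# Simplified sketch, not the connector's exact code.
TOO_MANY_REQUESTS = 429


class Backoff:
    # The default stays at 200, so configs that never set batch_size behave exactly as before.
    row_batch_size = 200

    @classmethod
    def increase_row_batch_size(cls, error_code: int) -> None:
        # On rate-limit errors, grow the batch in steps of 100 so retried requests cover more rows.
        if error_code == TOO_MANY_REQUESTS:
            cls.row_batch_size += 100


def apply_batch_size(config: dict) -> None:
    # Called on the way into _read(); .get() keeps old configs without "batch_size" working.
    Backoff.row_batch_size = config.get("batch_size", 200)
```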


@octavia-squidington-iii added the area/connectors, area/documentation, community, and connectors/source/google-sheets labels on Feb 19, 2024

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete but the CI check is failing:

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@tautvydas-v tautvydas-v changed the title ✨ [source-google-sheets] add row_batch_size an input parameter with higher increase ✨ [source-google-sheets] add row_batch_size as an input parameter with higher increase Feb 19, 2024
@darynaishchenko
Collaborator

@tautvydas-v, regarding this comment in the previous PR #35320 (comment):
What is a suitable input row batch size value for you? I'm OK with the idea of a percentage, but I'm afraid that not all users will understand the logic of this input parameter, and they would have to take some time to work out what value they need. Did you test these changes with a suitable row batch size value and the backoff logic on your data?

@tautvydas-v
Contributor Author

tautvydas-v commented Feb 21, 2024

Hey @darynaishchenko, a suitable batch size really depends on the case, I suppose. In some cases we could really use 10k records per batch and it would speed up the extraction a lot. Keep in mind that we're leaving the default batch size at 200 and it sits under Optional parameters, so even after a version update, if someone doesn't need or care about the batch size, there won't be any impact, since it's already 200 today. As a use case, we have a Google Sheet with about ~137,000 records, and here's a screenshot of how it goes for us so far:
[Screenshot 2024-02-21 at 12 09 56 PM]
As you can see, in some cases we extract 30k-80k records even though we have 137k records, so in a way it silently fails. Also, it takes up to 10 minutes for a full extraction.

Meanwhile, we've tested out with different batch size inputs:
Here's with 10k rows per batch:
[Screenshot 2024-02-21 at 12 40 48 PM]
[Screenshot 2024-02-21 at 12 42 25 PM]

Here's with 100k rows per batch:
[Screenshot 2024-02-21 at 1 02 25 PM]
[Screenshot 2024-02-21 at 12 52 50 PM]

Do you see any potential problems with this setup though? Having 100k rows as a batch size really seems to be an edge case, but it's possible to use it too.

Also, regarding the backoff strategy: I think we could actually keep a hardcoded increment by default, since percentages would have to be rounded to a whole number. I tested with 10% and didn't like how it worked:
The initial value is 200 -> after a 10% backoff increase it's 220 -> then the batch size is 242 -> then 266.2, etc. So if the default is 200, we could keep the increment at 10 or increase it a bit.
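Purely as an illustration of the arithmetic above (not connector code), the two increment strategies side by side:

```python
batch = 200
for _ in range(3):
    batch *= 1.1   # 10% backoff step: ~220 -> ~242 -> ~266.2, no longer a whole number of rows

batch = 200
for _ in range(3):
    batch += 100   # fixed step of 100 (as in this PR): 300 -> 400 -> 500, always an integer
```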

@tautvydas-v
Contributor Author

@darynaishchenko @marcosmarxm Hi, what should be the next steps here?

@marcosmarxm
Member

@darynaishchenko can you review the contribution again, please?

@tautvydas-v
Contributor Author

@darynaishchenko @marcosmarxm Hey, are there any additional steps to be taken here? Just conscious that this doesn't get lost, as we would really like to implement this.

type: integer
title: Row Batch Size
description: >-
  An integer representing row batch size with each sent request to Google Sheets API. Defaults to 200.
Collaborator


@tautvydas-v lgtm, could you please add a more detailed description here and in the .md file for Row Batch Size? Include info about what impact this value has on rate limits, syncs, etc., and how users can determine the most suitable value.

Contributor Author


Added more documentation.

@@ -149,6 +149,7 @@ def _read(
        catalog: ConfiguredAirbyteCatalog,
    ) -> Generator[AirbyteMessage, None, None]:
        client = GoogleSheetsClient(self.get_credentials(config))
        client.Backoff.row_batch_size = config["batch_size"]
Collaborator


Suggested change
client.Backoff.row_batch_size = config["batch_size"]
client.Backoff.row_batch_size = config.get("batch_size", 200)

for old configs without batch_size.
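For context, a tiny illustration of why .get matters here, using hypothetical config dicts rather than anything from the PR:

```python
new_config = {"spreadsheet_id": "abc", "batch_size": 1000}
old_config = {"spreadsheet_id": "abc"}  # created before the batch_size field existed

assert new_config.get("batch_size", 200) == 1000
assert old_config.get("batch_size", 200) == 200  # falls back to the previous default
# old_config["batch_size"] would raise KeyError and break existing connections
```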

Contributor Author


Fixed

@marcosmarxm
Member

@darynaishchenko let me know if this is ready to ship 🚀

@tautvydas-v
Contributor Author

@darynaishchenko Hi, what's the status with this PR? Can this be reviewed again?

@marcosmarxm
Member

@tautvydas-v there is some internal discussion about the implementation, sorry for the delay. Hope to get an update next week! Thanks for your patience.

@tautvydas-v
Contributor Author

@marcosmarxm @darynaishchenko Hi, any news regarding this one? I've noticed that I've had to update the connector version for the fourth time, but I'm really not sure what the status is here and whether it's even worth doing so in the future. It would be great to get some feedback, thanks.

tautvydas-v and others added 3 commits April 8, 2024 09:50
…utvydas-v/source-google-sheets-add-batch-size-as-input-parameter-with-increases
…-add-batch-size-as-input-parameter-with-increases
@darynaishchenko
Collaborator

@tautvydas-v hi, thanks for your patience. I'm working on the internal discussions; once I receive answers I'll let you know. I'm really sorry for the delay.

@darynaishchenko
Collaborator

@tautvydas-v Hi, I'm ok with your solution.
Just one last question: what do you think about calculating a proper batch size before the actual sync, by fetching the record count of the spreadsheet? For example, before reading we find that the spreadsheet has 100k records and set row_batch_size to 10,000, as you used. We could also use an enum for the row_batch_size value, for example: sheets with < 10,000 records get a batch size of 200; between 10,000 and 100,000 records, a batch size of 1,000; and so on.
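For illustration, a rough sketch of the tiered idea described above; the function name and exact thresholds are illustrative examples, not part of the PR:

```python
def pick_row_batch_size(total_records: int) -> int:
    """Map a spreadsheet's record count to a batch-size tier (illustrative thresholds)."""
    if total_records < 10_000:
        return 200        # small sheets keep today's default
    if total_records < 100_000:
        return 1_000
    return 10_000         # e.g. a ~137k-record sheet would sync in 10k-row requests

assert pick_row_batch_size(5_000) == 200
assert pick_row_batch_size(50_000) == 1_000
assert pick_row_batch_size(137_000) == 10_000
```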

@tautvydas-v
Contributor Author

Hey @darynaishchenko,

In my opinion, even though setting batch_size based on the Google Sheet's size seems like a more dynamic solution, I'm a bit concerned about debugging as an end user, more specifically about hitting API timeout limits (requests time out after 180 seconds, I believe). I'm not sure how often that can happen, but if a user has a slow connection and a huge Google Sheet, tries to sync it with a dynamically set batch size (which could be, for example, 5-10k records if the sheet is really big), and gets a timeout error, it can be confusing why that happened.

Keeping the default batch_size at 200 seems to cover a lot of cases, is clearer when you have lots of connections, and makes checking logs easier, because you see the same value for each sync. If a user just wants to use a connection and sees different values for each sync (where the Google Sheet sizes differ), I think that could create confusion.
Also, one source accepts one Google Sheet (although the number of sheets inside it can vary), so it should be easy enough to see how many rows the sheet has and decide on a fitting batch_size.

Collaborator

@darynaishchenko left a comment


lgtm
@tautvydas-v thanks for your work on it.
@marcosmarxm it's ready to ship

@marcosmarxm
Member

Thanks @darynaishchenko. @tautvydas-v I'm going to proceed with the shipping process 🚢

@marcosmarxm marcosmarxm merged commit bb9d374 into airbytehq:master Apr 11, 2024
49 of 55 checks passed
@marcosmarxm
Member

Hello @tautvydas-v your feedback matters a lot to us. Can you spare just 3 minutes to complete a survey? We're dedicated to making the contributor experience even better, and your input is key to achieving excellence. Thank you for helping us improve!

Labels
area/connectors, area/documentation, certified, community, connectors/source/google-sheets
4 participants