
✨ [source-google-sheets] add row_batch_size as an input parameter with higher increase #35404

Conversation

tautvydas-v
Contributor

What

I accidentally closed PR #35320, therefore raising this new one.
Resolves the issue I raised: #35274
Previously, the source-google-sheets connector had a hardcoded value of 200 for row_batch_size. This means each request sent to the Google Sheets API processes only 200 rows, even though there is effectively no row limit as long as the request completes in under 3 minutes; the limitation applies to the number of requests sent, not to the rows processed. A problem sometimes arises when syncing a Google Sheet with over 100k records, where exponential backoff fails and the sync fails silently. To overcome this, this PR turns "batch_size" into an input parameter with a default value of 200, so there is no behavior change when updating the connector.

How

Added "batch_size" to the spec with a default value of 200, which is passed through the source.py file in the _read method. Also changed the row_batch_size increase to 100 when the exception code is TOO_MANY_REQUESTS.
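A minimal sketch of the intended wiring, assuming a simplified stand-in for the connector's GoogleSheetsClient.Backoff class; the function name apply_batch_size is illustrative, not the connector's exact code:

```python
# Simplified sketch, not the connector's exact code.
TOO_MANY_REQUESTS = 429


class Backoff:
    # The default stays at 200, so configs that never set batch_size behave exactly as before.
    row_batch_size = 200

    @classmethod
    def increase_row_batch_size(cls, error_code: int) -> None:
        # On rate-limit errors, grow the batch in steps of 100 so retried requests cover more rows.
        if error_code == TOO_MANY_REQUESTS:
            cls.row_batch_size += 100


def apply_batch_size(config: dict) -> None:
    # Called on the way into _read(); .get() keeps old configs without "batch_size" working.
    Backoff.row_batch_size = config.get("batch_size", 200)
```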


@octavia-squidington-iii added the area/connectors, area/documentation, community, and connectors/source/google-sheets labels on Feb 19, 2024

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete but the CI check is failing:

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@tautvydas-v tautvydas-v changed the title ✨ [source-google-sheets] add row_batch_size an input parameter with higher increase ✨ [source-google-sheets] add row_batch_size as an input parameter with higher increase Feb 19, 2024
@darynaishchenko
Collaborator

@tautvydas-v, regarding this comment in the previous PR #35320 (comment):
What is a suitable input row batch size value for you? I'm OK with the idea of a percentage, but I'm afraid that not all users will understand the logic of this input parameter, and they would have to take some time to work out what value they need. Did you test these changes with a suitable row batch size value and the backoff logic on your data?

@tautvydas-v
Contributor Author

tautvydas-v commented Feb 21, 2024

Hey @darynaishchenko, a suitable batch size really depends on the case, I suppose. In some cases we could really use 10k records per batch and it would speed up the extraction a lot. Keep in mind that we're leaving the default batch size at 200 and it sits under Optional parameters, so even after a version update, if someone doesn't need or care about the batch size, there won't be any impact, since it's already 200 today. As a use case, we have a Google Sheet with about ~137,000 records, and here's a screenshot of how it goes for us so far:
[Screenshot 2024-02-21 at 12 09 56 PM]
As you can see, in some cases we extract 30k-80k records even though we have 137k records, so in a way it silently fails. Also, it takes up to 10 minutes for a full extraction.

Meanwhile, we've tested out with different batch size inputs:
Here's with 10k rows per batch:
[Screenshot 2024-02-21 at 12 40 48 PM]
[Screenshot 2024-02-21 at 12 42 25 PM]

Here's with 100k rows per batch:
[Screenshot 2024-02-21 at 1 02 25 PM]
[Screenshot 2024-02-21 at 12 52 50 PM]

Do you see any potential problems with this setup though? Having 100k rows as a batch size really seems to be an edge case, but it's possible to use it too.

Also, regarding the backoff strategy: I think we could actually keep a hardcoded increment by default, since percentages would have to be rounded to a whole number. I tested with 10% and didn't like how it worked:
The initial value is 200 -> after a 10% backoff increase it's 220 -> then the batch size is 242 -> then 266.2, etc. So if the default is 200, we could keep the increment at 10 or increase it a bit.
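Purely as an illustration of the arithmetic above (not connector code), the two increment strategies side by side:

```python
batch = 200
for _ in range(3):
    batch *= 1.1   # 10% backoff step: ~220 -> ~242 -> ~266.2, no longer a whole number of rows

batch = 200
for _ in range(3):
    batch += 100   # fixed step of 100 (as in this PR): 300 -> 400 -> 500, always an integer
```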

@tautvydas-v
Contributor Author

@darynaishchenko @marcosmarxm Hi, what should be the next steps here?

@marcosmarxm
Member

@darynaishchenko can you review the contribution again, please?

@tautvydas-v
Contributor Author

@darynaishchenko @marcosmarxm Hey, are there any additional steps to be taken here? Just conscious that this doesn't get lost, as we would really like to implement this.

type: integer
title: Row Batch Size
description: >-
  An integer representing row batch size with each sent request to Google Sheets API. Defaults to 200.
Collaborator


@tautvydas-v lgtm, could you please add a more detailed description here and in the .md file for Row Batch Size? Include info about what impact this value has on rate limits, syncs, etc., and how users can determine the most suitable value.

Contributor Author


Added more documentation.

@@ -149,6 +149,7 @@ def _read(
        catalog: ConfiguredAirbyteCatalog,
    ) -> Generator[AirbyteMessage, None, None]:
        client = GoogleSheetsClient(self.get_credentials(config))
        client.Backoff.row_batch_size = config["batch_size"]
Collaborator


Suggested change
client.Backoff.row_batch_size = config["batch_size"]
client.Backoff.row_batch_size = config.get("batch_size", 200)

for old configs without batch_size.
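For context, a tiny illustration of why .get matters here, using hypothetical config dicts rather than anything from the PR:

```python
new_config = {"spreadsheet_id": "abc", "batch_size": 1000}
old_config = {"spreadsheet_id": "abc"}  # created before the batch_size field existed

assert new_config.get("batch_size", 200) == 1000
assert old_config.get("batch_size", 200) == 200  # falls back to the previous default
# old_config["batch_size"] would raise KeyError and break existing connections
```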

Contributor Author


Fixed

@marcosmarxm
Member

@darynaishchenko let me know if this is ready to ship 🚀

@tautvydas-v
Contributor Author

@darynaishchenko Hi, what's the status with this PR? Can this be reviewed again?

@marcosmarxm
Member

@tautvydas-v there is some internal discussion about the implementation, sorry for the delay. Hope to get an update next week! Thanks for your patience.

@tautvydas-v
Contributor Author

@marcosmarxm @darynaishchenko Hi, any news regarding this one? I've noticed that I've had to update the connector version for the fourth time, but I'm really not sure what the status is here and whether it's even worth doing so in the future. It would be great to get some feedback, thanks.

tautvydas-v and others added 3 commits April 8, 2024 09:50
…utvydas-v/source-google-sheets-add-batch-size-as-input-parameter-with-increases
…-add-batch-size-as-input-parameter-with-increases
@darynaishchenko
Collaborator

@tautvydas-v hi, thanks for your patience. I'm working on the internal discussions; once I receive answers I'll let you know. I'm really sorry for the delay.

@darynaishchenko
Collaborator

@tautvydas-v Hi, I'm ok with your solution.
Just one last question: what do you think about calculating a proper batch size before the actual sync, by fetching the record count of the spreadsheet? For example, before reading we find that the spreadsheet has 100k records and set row_batch_size to 10,000, as you used. We could also use an enum for the row_batch_size value, for example: sheets with < 10,000 records get a batch size of 200; between 10,000 and 100,000 records, a batch size of 1,000; and so on.
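For illustration, a rough sketch of the tiered idea described above; the function name and exact thresholds are illustrative examples, not part of the PR:

```python
def pick_row_batch_size(total_records: int) -> int:
    """Map a spreadsheet's record count to a batch-size tier (illustrative thresholds)."""
    if total_records < 10_000:
        return 200        # small sheets keep today's default
    if total_records < 100_000:
        return 1_000
    return 10_000         # e.g. a ~137k-record sheet would sync in 10k-row requests

assert pick_row_batch_size(5_000) == 200
assert pick_row_batch_size(50_000) == 1_000
assert pick_row_batch_size(137_000) == 10_000
```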

@tautvydas-v
Contributor Author

Hey @darynaishchenko,

In my opinion, even though setting batch_size based on the Google Sheet's size seems like a more dynamic solution, I'm a bit concerned about debugging as an end user, more specifically about hitting API timeout limits (requests time out after 180 seconds, I believe). I'm not sure how often that can happen, but if a user has a slow connection and a huge Google Sheet, tries to sync it with a dynamically set batch size (which could be, for example, 5-10k records if the sheet is really big), and gets a timeout error, it can be confusing why that happened.

Keeping the default batch_size at 200 seems to cover a lot of cases, is clearer when you have lots of connections, and makes checking logs easier, because you see the same value for each sync. If a user just wants to use a connection and sees different values for each sync (where the Google Sheet sizes differ), I think that could create confusion.
Also, one source accepts one Google Sheet (although the number of sheets inside it can vary), so it should be easy enough to see how many rows the sheet has and decide on a fitting batch_size.

Collaborator

@darynaishchenko left a comment


lgtm
@tautvydas-v thanks for your work on it.
@marcosmarxm it's ready to ship

@marcosmarxm
Member

Thanks @darynaishchenko. @tautvydas-v I'm going to proceed with the shipping process 🚢

@marcosmarxm marcosmarxm merged commit bb9d374 into airbytehq:master Apr 11, 2024
49 of 55 checks passed
@marcosmarxm
Member

Hello @tautvydas-v your feedback matters a lot to us. Can you spare just 3 minutes to complete a survey? We're dedicated to making the contributor experience even better, and your input is key to achieving excellence. Thank you for helping us improve!

Labels
area/connectors, area/documentation, certified, community, connectors/source/google-sheets
4 participants