Update BigQuery connector documentation #25109

Open
wants to merge 2 commits into
base: master

SemionPar
Contributor

Description

Add a section on project ID resolution.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Feb 21, 2025
@github-actions github-actions bot added the docs label Feb 21, 2025
@SemionPar
Contributor Author

SemionPar commented Feb 21, 2025

I have one more thing I am considering adding here.

Use case: cross-project service accounts. Aspect: polymorphic table functions (PTF).

A little on the use case and how to set up service accounts this way:

> By default, you can't create an IAM service account in one Google Cloud project and attach it to a resource in another project. However, you might have centralized the service accounts for your organization in separate projects, which can make the service accounts easier to manage. This document outlines the steps required to support attaching a service account in one project to an Eventarc trigger in another project.

Illustration

Setup

I did some testing with three Google Cloud projects.

Projects:

  • base-30275 - 'parent' project
  • data-1-13082, data-2-644 - 'data' projects

Service accounts:

  • base-30275-service-account-34@base-30275.iam.gserviceaccount.com
    • lives in base-30275,
    • has BigQueryUser access to base-30275, data-2-644
  • base-30275-no-bq@base-30275.iam.gserviceaccount.com
    • lives in base-30275,
    • has BigQueryUser access to data-2-644 and data-1-13082, BUT NOT TO base-30275,
  • data-2-bq@data-2-644.iam.gserviceaccount.com
    • lives in data-2-644,
    • has BigQueryUser access to data-2-644

Findings

  • A PTF query executes with the service account's permissions, regardless of the values of bigquery.parent-project-id and bigquery.project-id.

This makes it possible to access ALL BigQuery resources that this service account has permissions for:

-Dtesting.bigquery.parent-project-id=data-2-644         <--------------- data project id set as parent project
-Dtesting.bigquery.project-id=data-2-644
-Dtesting.bigquery.credentials-key=base-30275-service-account-34@base-30275.iam.gserviceaccount.com
-- access to data project:
  SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT schema_name FROM `data-2-644.region-us.INFORMATION_SCHEMA.SCHEMATA`'));
-- access to parent project:
  SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT schema_name FROM `base-30275.region-us.INFORMATION_SCHEMA.SCHEMATA`'));
-- OK                                                   <--------------- has access because service account has permissions in parent
-- access to another data project:
  SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT schema_name FROM `data-1-13082.region-us.INFORMATION_SCHEMA.SCHEMATA`'));
-- OK                                                   <--------------- has access because service account has permissions in other data project as well

If, however, the catalog uses a service account that has no BigQuery permissions in the parent project, access to that project is denied:

-Dtesting.bigquery.parent-project-id=data-2-644
-Dtesting.bigquery.project-id=data-2-644
-Dtesting.bigquery.credentials-key=base-30275-no-bq@base-30275.iam.gserviceaccount.com
-- access to data project:
  SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT schema_name FROM `data-2-644.region-us.INFORMATION_SCHEMA.SCHEMATA`'));
-- access to parent project:
  SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT schema_name FROM `base-30275.region-us.INFORMATION_SCHEMA.SCHEMATA`'));
-- Failed to get destination table for query. Access Denied: Table base-30275:region-us. INFORMATION_SCHEMA. SCHEMATA: User does not have permission to query table base-30275:region-us. INFORMATION_SCHEMA. SCHEMATA, or perhaps it does not exist.

Summary

If a BigQuery catalog is configured with a service account (SA) JSON key and a project ID, then through the PTF one effectively gains access to every BigQuery project that the service account has permissions for.

Ultimately, do we want to mention this in the documentation?

Member

@hashhar hashhar left a comment

Great job demystifying this.

Everyone was so confused for so long and wasn't able to figure out how or why it works.

Member

@mosabua mosabua left a comment

Overall, we might want to rename the section, since it seems to be all about billing. Also, since it seems pretty critical, we might want to add an anchor and link to it from the configs for each of these properties with a "see also ..".

data in multiple GCP projects, You need to create several catalogs, each
pointing to a different GCP project. For example, if you have two GCP projects,
one for the sales and one for analytics, you can create two properties files in
`etc/catalog` named `sales.properties` and `analytics.properties`, both
having `connector.name=bigquery` but with different `project-id`. This will
create the two catalogs, `sales` and `analytics` respectively.
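
As a minimal sketch of the two-catalog setup described above, the two files might look like the following. The project IDs `sales-project-id` and `analytics-project-id` are placeholders for illustration, not values from this PR.

```properties
# etc/catalog/sales.properties
connector.name=bigquery
# Placeholder: the GCP project that holds the sales data
bigquery.project-id=sales-project-id

# etc/catalog/analytics.properties
connector.name=bigquery
# Placeholder: the GCP project that holds the analytics data
bigquery.project-id=analytics-project-id
```

Tables are then addressed through the catalog name, for example `sales.<dataset>.<table>` and `analytics.<dataset>.<table>`.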

### Understanding Project ID Resolution

The BigQuery connector determines the project ID to use based on the
Member

Somewhere, ideally, we need to link to BigQuery docs that explain more about the project ID. Maybe right here.

Contributor Author

Fine idea! I am having trouble finding any relevant documentation on this topic that is more than a brief mention in the BigQuery section of Google Cloud docs. There are some third party sources like https://productresources.collibra.com/docs/collibra/latest/Content/DataQuality/DBConnection/ta_bigquery-cross-account-dataset-access.htm

We could have a reference to a general cross project SA page like this one instead, WDYT? https://cloud.google.com/iam/docs/attach-service-accounts#attaching-different-project

Member

@mosabua mosabua Feb 25, 2025

Maybe use this link https://cloud.google.com/resource-manager/docs/creating-managing-projects .. it explicitly explains Project ID

Contributor Author

I went ahead and added a sentence, please edit as you see fit

Member

Link the project ID to the link I found

@mosabua
Member

mosabua commented Feb 25, 2025

With regards to the query table function and it being a potential security issue: that's already documented in a generic fashion. Adding more details specific to BigQuery would also be good.

@SemionPar SemionPar force-pushed the semionpar/update-bigquery-connector-documentation branch 2 times, most recently from 164193f to c71334f Compare February 25, 2025 18:24
@SemionPar SemionPar changed the title [wip] Update BigQuery connector documentation Update BigQuery connector documentation Feb 25, 2025
@SemionPar SemionPar marked this pull request as ready for review February 25, 2025 18:25
@SemionPar
Contributor Author

Thank you for the reviews @hashhar @mosabua!

Applied the changes and pushed, PTAL

@SemionPar SemionPar requested review from mosabua and hashhar February 25, 2025 18:27
@SemionPar
Contributor Author

Thank you @mosabua!

Gave it another try, PTAL

@SemionPar SemionPar requested a review from mosabua February 25, 2025 21:50
Member

@hashhar hashhar left a comment

LGTM % nits

@SemionPar SemionPar force-pushed the semionpar/update-bigquery-connector-documentation branch from 3758f98 to 211a2f1 Compare February 26, 2025 09:41
@SemionPar
Contributor Author

Thank you @mosabua!

Gave it another try, PTAL

Sorry, forgot to actually push my latest changes yesterday... All pushed now

Member

@mosabua mosabua left a comment

A few nits .. then ready to go ... ping me after last push and I can merge

The BigQuery connector can only access a single GCP project.Thus, if you have
data in multiple GCP projects, You need to create several catalogs, each
The BigQuery connector can only access a single GCP project. Thus, if you have
data in multiple GCP projects, you need to create several catalogs, each
Member

Suggested change
data in multiple GCP projects, you need to create several catalogs, each
data in multiple GCP projects, you must create several catalogs, each

@@ -74,14 +74,47 @@ bigquery.project-id=<your Google Cloud Platform project id>

### Multiple GCP projects

The BigQuery connector can only access a single GCP project.Thus, if you have
data in multiple GCP projects, You need to create several catalogs, each
The BigQuery connector can only access a single GCP project. Thus, if you have
Member

Suggested change
The BigQuery connector can only access a single GCP project. Thus, if you have
The BigQuery connector can only access a single GCP project. If you have

(bigquery-project-id-resolution)=
### Billing and data projects

The BigQuery connector determines the
Member

Fix wrapping in this paragraph to 80 chars .. currently it's weird

[project ID](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
to use based on the configuration settings.
This behavior provides users with flexibility in selecting both
the project to query and the project to be billed for BigQuery operations.
Member

Suggested change
the project to query and the project to be billed for BigQuery operations.
the project to query and the project to bill for BigQuery operations.
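
To illustrate the split between the project to query and the project to bill, here is a hedged sketch of a catalog file that sets both properties discussed in this PR. It assumes, as the section title suggests, that `bigquery.project-id` selects the data project and `bigquery.parent-project-id` selects the billing project. The file name and both project IDs are placeholders.

```properties
# etc/catalog/example.properties (hypothetical file name)
connector.name=bigquery
# Placeholder: project that holds the datasets to query
bigquery.project-id=data-project-id
# Placeholder: project to bill for BigQuery operations
bigquery.parent-project-id=billing-project-id
```

Credentials are left out of the sketch on purpose. Per the findings earlier in this thread, the service account's own permissions, not these two properties, decide which projects the `query` table function can reach.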

@findinpath
Contributor

@mosabua pls go ahead and change the wording yourself - seems straightforward.

@mosabua
Member

mosabua commented Feb 26, 2025

> @mosabua pls go ahead and change the wording yourself - seems straightforward.

I don't have time for that until Friday
