From d6e2d213b891c9f09d28b6000012800364321c76 Mon Sep 17 00:00:00 2001
From: Luke Lowery
Date: Thu, 12 Mar 2020 21:49:28 -0700
Subject: [PATCH] Adding BigQuery to the readme (#222)

* Adding Bigquery to the readme

* fixing link
---
 databuilder/README.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/databuilder/README.md b/databuilder/README.md
index 707d0cf3ab..add0f73287 100644
--- a/databuilder/README.md
+++ b/databuilder/README.md
@@ -224,6 +224,36 @@ job = DefaultJob(
 job.launch()
 ```
 
+#### [BigQueryMetadataExtractor](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/extractor/bigquery_metadata_extractor.py "BigQuery Metadata Extractor")
+An extractor that extracts table and column metadata, including database, schema, table name, table description, column name, and column description, from a BigQuery database.
+
+The API calls driving the extraction are defined [here](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/extractor/bigquery_metadata_extractor.py).
+
+You will need to create a service account for reading metadata and grant it "BigQuery Metadata Viewer" access to all of your datasets. This can all be done via the BigQuery UI.
+
+Download the credentials file and store it securely. Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the location of your credentials file, and your code will have access to everything it needs.
+
+You can configure BigQuery like this. You can optionally set a label filter if you only want to pull tables with a certain label.
+```python
+job_config = {
+    'extractor.bigquery_table_metadata.{}'.format(
+        BigQueryMetadataExtractor.PROJECT_ID_KEY
+    ): gcloud_project
+}
+if label_filter:
+    job_config[
+        'extractor.bigquery_table_metadata.{}'
+        .format(BigQueryMetadataExtractor.FILTER_KEY)
+    ] = label_filter
+task = DefaultTask(extractor=BigQueryMetadataExtractor(),
+                   loader=csv_loader,
+                   transformer=NoopTransformer())
+job = DefaultJob(conf=ConfigFactory.from_dict(job_config),
+                 task=task,
+                 publisher=Neo4jCsvPublisher())
+job.launch()
+```
+
 #### [Neo4jEsLastUpdatedExtractor](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/extractor/neo4j_es_last_updated_extractor.py "Neo4jEsLastUpdatedExtractor")
 An extractor that gets the current timestamp and passes it to GenericExtractor. It is used to create the timestamp for "Amundsen was last indexed on ..." in the Amundsen web page's footer.
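
The "last indexed on" idea above can be sketched in a few lines. This is an illustrative sketch only, not the extractor's actual API; the function name `current_last_updated_epoch` is an assumption introduced here for clarity:

```python
import time

# A minimal sketch: the "last updated" value is simply the current epoch
# timestamp, later rendered as "Amundsen was last indexed on ..." in the UI.
def current_last_updated_epoch() -> int:
    return int(time.time())

print(current_last_updated_epoch())
```

In the real extractor this single value flows through GenericExtractor like any other record, so the rest of the pipeline (loader, publisher) needs no special handling for it.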