Refactoring table creation in Big Query Pipeline #343
base: main
Conversation
Force-pushed from 6089a99 to e3ca3b1
@@ -130,6 +131,12 @@ def validate_arguments(cls, known_args: argparse.Namespace, pipeline_args: t.Lis
         pipeline_options = PipelineOptions(pipeline_args)
         pipeline_options_dict = pipeline_options.get_all_options()

+        if known_args.output_table:
+            # checking if the output table is in format (<project>.<dataset>.<table>).
+            output_table_pattern = r'^[\w-]+\.[\w-]+\.[\w-]+$'
Are the `[]`s necessary? Can it just be `r'^\w+\.\w+\.\w+$'`? (Please test my hand-written regex :))
Project IDs can contain hyphens: https://cloud.google.com/resource-manager/docs/creating-managing-projects. `\w+` doesn't match when the word is like `my-project`.
Thanks, that makes sense! Then, would it be `[\w\-]+`? IIRC, `[]` and `-` together are often used to express a range.
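A quick sketch to settle the regex question above: inside a character class, a trailing (or escaped) hyphen is literal, so `[\w-]+` accepts hyphenated project IDs while plain `\w+` does not. The helper name and sample table strings here are illustrative, not from the PR.

```python
import re

# Pattern under discussion: each dot-separated component of
# <project>.<dataset>.<table> may contain letters, digits, _ and -.
output_table_pattern = r'^[\w-]+\.[\w-]+\.[\w-]+$'

def is_valid_output_table(output_table: str) -> bool:
    """Return True if the string looks like <project>.<dataset>.<table>."""
    return re.match(output_table_pattern, output_table) is not None

# Hyphenated project IDs only match thanks to the character class;
# the simpler \w+ version rejects them.
assert is_valid_output_table('my-project.my_dataset.my_table')
assert not re.match(r'^\w+\.\w+\.\w+$', 'my-project.my_dataset.my_table')
assert not is_valid_output_table('only.two')  # too few components
```

Note that `[\w-]` and `[\w\-]` are equivalent; the escape only matters when the hyphen sits between two characters and would otherwise be read as a range.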
weather_mv/loader_pipeline/bq.py
Outdated
"""Initializes Sink by creating a BigQuery table based on user input.""" | ||
"""Initializes BigQuery table based on user input.""" | ||
self.project, self.dataset_id, self.table_id = self.output_table.split('.') | ||
self.table = None |
Extra space.
-            table=self.table.table_id,
+            project=self.project,
+            dataset=self.dataset_id,
+            table=self.table_id,
If we wanted to create the BQ table in the pipeline, a simpler solution would be to change this step's disposition: https://beam.apache.org/documentation/io/built-in/google-bigquery/#create-disposition
On second thought, we may want to keep your step since we create an opinionated schema...
@alxmrs I tried putting the create-table step as a transform in the pipeline, but faced several issues:
- I tried the `sample` transform as you suggested, but there is no way to ensure table creation happens before the usual pipeline flow (the apache_beam Python SDK doesn't have support for `Wait.on`, which could have made this possible).
- I also tried stateful processing, but since we are windowing our Pub/Sub reads and a state is discarded when the window expires, that solution didn't work.

I think there is potential in using the `create-disposition` flag in `WriteToBigQuery`. Can you please elaborate on what you mean by an opinionated schema?
cc: @mahrsee1997
Now that you mention it, I agree that using the create disposition is easiest! We need to make sure that we pass our computed schema (say, from the `init`) into this transform instead of having it make the schema automatically. Thinking it over now, that's what my concern was about: I wasn't sure we'd get the schema we wanted if it were computed automatically.
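A minimal sketch of the direction agreed on here, with hypothetical field names and a hypothetical helper: convert an inferred name-to-type mapping into the JSON-style schema dict that `WriteToBigQuery` accepts, so `CREATE_IF_NEEDED` creates the table with the computed schema rather than an auto-derived one.

```python
def to_bq_schema(columns: dict) -> dict:
    """Build a WriteToBigQuery-compatible schema dict.

    `columns` maps field name -> BigQuery type, e.g. {'temperature': 'FLOAT64'}.
    The keys 'name', 'type', and 'mode' follow BigQuery's JSON schema format.
    """
    return {
        'fields': [
            {'name': name, 'type': bq_type, 'mode': 'NULLABLE'}
            for name, bq_type in columns.items()
        ]
    }

schema = to_bq_schema({'timestamp': 'TIMESTAMP', 'temperature': 'FLOAT64'})
assert schema['fields'][0] == {
    'name': 'timestamp', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'}
# The dict would then be passed along the lines of:
#   beam.io.WriteToBigQuery(
#       table=known_args.output_table,
#       schema=schema,
#       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
```

The Beam call at the end is shown only as a comment since the surrounding pipeline wiring lives elsewhere in the PR.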
Was surfing the web for some other issue and found this; it might be helpful.
https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html#schemas
@@ -243,6 +261,7 @@ def expand(self, paths):
         """Extract rows of variables from data paths into a BigQuery table."""
         extracted_rows = (
             paths
+            | 'CreateTable' >> beam.Map(self.create_bq_table)
Concern: I am pretty sure this will try to create a BQ table for every element of `paths`. Remember, `self.tables` won't refer to the state of the class, since global state is not really a thing for parallel steps like this.
Consider using something like `Sample`: https://beam.apache.org/documentation/transforms/python/aggregation/sample/
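A plain-Python illustration (no Beam, hypothetical names) of what sampling buys here: pick one representative element from the collection and run table creation exactly once, instead of once per element as `beam.Map(self.create_bq_table)` would.

```python
import random

def create_table_once(paths, create_fn):
    """Run table creation for a single sampled path, then pass all paths on."""
    sample = random.choice(list(paths))
    create_fn(sample)  # table creation runs exactly once for the collection
    return paths

calls = []
create_table_once(['a.nc', 'b.nc', 'c.nc'], calls.append)
assert len(calls) == 1  # one creation call, regardless of collection size
```

In Beam terms this corresponds to sampling the PCollection down to one element before mapping the creation step over it, though (as noted earlier in the thread) ordering that step before the main flow remains the hard part.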
weather-mv
fixes: #311
As mentioned in this PR #250 (comment), skipping file-prefix validation does not work in the case of BigQuery, as it depended on the first URI to infer the schema for the BigQuery table.
Fixed this issue by creating a pipeline stage that infers the schema and creates the BigQuery table using the first URI. This stage is skipped for all subsequent URIs.
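A hedged sketch (hypothetical class and callback names) of the "create once, skip afterwards" stage the description outlines: a guard that infers the schema and creates the table for the first URI it sees, then becomes a no-op. Per the review discussion above, in a distributed runner this instance state is per worker rather than global, so the creation call still needs to be idempotent.

```python
class SchemaInferringTableCreator:
    """Creates the BigQuery table from the first URI seen; no-op afterwards.

    Caveat from the review: each worker holds its own copy of this state,
    so `create_table_fn` must tolerate being called more than once overall.
    """

    def __init__(self, create_table_fn):
        self.create_table_fn = create_table_fn
        self.table_created = False

    def process(self, uri: str) -> str:
        if not self.table_created:
            self.create_table_fn(uri)  # infer schema + create table, once
            self.table_created = True
        return uri  # URIs flow through to the rest of the pipeline unchanged
```

Usage: wrap the creation callback and feed URIs through `process`; only the first triggers creation.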