
[ADAP-1063] [Bug] BigQueryException: Error while reading data, error message: Schema mismatch #1047

Closed
gbmarc1 opened this issue Dec 6, 2023 · 11 comments · Fixed by #1205
Assignees
Labels
feature:nested-columns Related to BQ's nested and repeated columns (STRUCT / RECORD) feature:python-models type:bug Something isn't working type:regression

Comments

@gbmarc1

gbmarc1 commented Dec 6, 2023

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

The Dataproc job fails with a schema mismatch error when using dbt-core>1.7.0 and dbt-bigquery>1.7.0, but works using dbt-core<1.6.9 and dbt-bigquery<1.6.9.

Expected Behavior

Updating dbt should not cause the Dataproc job to fail.

Steps To Reproduce

The column type is:

column_name   | RECORD | REPEATED
  - my_list   | STRING | REPEATED
  - name      | STRING | NULLABLE

Relevant log output

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Error while reading data, error message: Schema mismatch: referenced variable 'column_name.list.element.my_list.list.element' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields File: gs://<my-bucket>/.spark-bigquery-application_1697116734723_1013-4f752a25-0f69-441e-8641-54bee7fcb8ce/part-00007-ab709b08-3bd5-481d-a6d4-f77989c194f8-c000.snappy.parquet

Environment

- OS: macos
- Python: 3.11
- dbt-core: >1.7.0
- dbt-bigquery: >1.7.0

Additional Context

This would impact anyone moving Python models from 1.6 to 1.7 on BigQuery.

Currently can't reproduce, will pair with Doug to repro.


@gbmarc1 gbmarc1 added type:bug Something isn't working triage:product labels Dec 6, 2023
@github-actions github-actions bot changed the title [Bug] BigQueryException: Error while reading data, error message: Schema mismatch [ADAP-1063] [Bug] BigQueryException: Error while reading data, error message: Schema mismatch Dec 6, 2023
@VersusFacit
Contributor

Hi there! Thanks for the issue submission. Could you help us out by explaining more of what you mean by:

Updating dbt should not result in dataproc job failing

What commands are you using, or is it just a dbt run?

Likewise, you've provided some column types, but is there specific model lineage we should be aware of?

@gbmarc1
Author

gbmarc1 commented Dec 12, 2023

Hi there! Thanks for the issue submission. Could you help us out by explaining more of what you mean by:

Updating dbt should not result in dataproc job failing

Meaning a minor upgrade should not break backward compatibility. TBH, I did not know what to write in that box; I can remove that sentence.

What's the commands you are using, or is it just a dbt run?

I run dbt build, but yes, it fails in the run step. The problem seems to occur at write time to BQ: it seems there is a problem writing a list nested inside a list of records.

@dbeatty10
Contributor

@gbmarc1 could you provide us with a simple dbt model that works using dbt-core<1.6.9 and dbt-bigquery<1.6.9 but raises that "Schema mismatch" exception with dbt-core>1.7.0 and dbt-bigquery>1.7.0?

We'll need to be able to reproduce this on our side in order to determine how to proceed. Could you provide us a detailed set of steps that would allow us to reproduce this?

@gbmarc1
Author

gbmarc1 commented Jan 9, 2024

@dbeatty10 Sorry for the delay. You can reproduce the error with this model:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="cluster", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": [{"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}]},
        ]
    )

    return df

@dlubawy

dlubawy commented Jan 23, 2024

I've just upgraded to dbt-core 1.7.5 and dbt-bigquery 1.7.3 and this is still an issue.

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Error while reading data, error message: Schema mismatch: referenced variable 'list_of_items.list.element.name' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields File: gs://xxxx-xxxx-xxxx/.spark-bigquery-app-20240123000000-xxxx-xxxx-xxxx-xxxx/part-xxxx-xxxx-xxxx-xxx.snappy.parquet

@dbeatty10
Contributor

@gbmarc1 We were able to reproduce the error report for 1.7, but we were not able to reproduce it for 1.6. Could you check again if the example you gave us works in 1.6 for you?

Alternatively, @dlubawy if you have an example that works on 1.6 but doesn't work on 1.7, would you please share it?

Namely, this didn't work for us:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": [{"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}]},
        ]
    )

    return df

But this did work:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": {"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}},
        ]
    )

    return df

@gbmarc1
Author

gbmarc1 commented Feb 3, 2024

@gbmarc1 We were able to reproduce the error report for 1.7, but we were not able to reproduce it for 1.6. Could you check again if the example you gave us works in 1.6 for you?

@dbeatty10 you are right. As specified in the issue description, it does NOT work on 1.7.x but DOES work on 1.6.x.
Thank you for looking at this!

@dlubawy

dlubawy commented Feb 6, 2024

@dbeatty10 the problem we are seeing is from an example like this:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": {"name": "hello", "my_list": [{"value": "hello"}]}},
        ]
    )

    return df

This code will run on v1.6 and materialize the table correctly, but it does not run in v1.7 (a regression):
Error while reading data, error message: Schema mismatch: referenced variable 'column_name.my_list.list.element.value' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields

@TianyiLi1025

I have exactly the same problem with dbt 1.7.x, so I am not able to store repeated records to BigQuery. Is anyone addressing this issue? Otherwise, I have to stay on version 1.6.8.

@Fleid
Contributor

Fleid commented Feb 22, 2024

Moving up the queue!

@TianyiLi1025

TianyiLi1025 commented Feb 23, 2024

For anyone still stuck with this problem, add this to your model code:

session.conf.set('intermediateFormat', "orc")

Or

session.conf.set('intermediateFormat', "parquet") 
session.conf.set('enableListInference', "true")

For indirect writes, Spark uses Parquet as the default format to stage the data in the temporary bucket, and storing repeated record data to BigQuery requires enabling list inference.

Alternatively, we can simply configure the format as 'orc', which is more efficient for data ingestion.

I think dbt should set session.conf.set('enableListInference', "true") by default.
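Putting the workaround together, a model might look like this. This is a sketch, not a tested fix; the serverless submission method and the DataFrame shape are borrowed from earlier examples in this thread:

```python
import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    # Workaround from this thread: keep Parquet as the intermediate format,
    # but enable list inference so repeated fields round-trip to BigQuery
    # instead of tripping the "array levels" schema mismatch.
    session.conf.set("intermediateFormat", "parquet")
    session.conf.set("enableListInference", "true")

    # The shape that previously failed: a list nested inside a list of records.
    df = pd.DataFrame(
        [{"column_name": [{"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}]}]
    )
    return df
```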
