
[ADAP-1063] [Bug] BigQueryException: Error while reading data, error message: Schema mismatch #1047

Closed
gbmarc1 opened this issue Dec 6, 2023 · 11 comments · Fixed by #1205
Assignees
Labels
feature:nested-columns Related to BQ's nested and repeated columns (STRUCT / RECORD) feature:python-models type:bug Something isn't working type:regression

Comments

@gbmarc1

gbmarc1 commented Dec 6, 2023

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

The Dataproc job fails with a schema mismatch error when using dbt-core>1.7.0 and dbt-bigquery>1.7.0, but works using dbt-core<1.6.9 and dbt-bigquery<1.6.9.

Expected Behavior

Updating dbt should not cause the Dataproc job to fail.

Steps To Reproduce

The column type is:

column_name   | RECORD | REPEATED
  - my_list   | STRING | REPEATED
  - name      | STRING | NULLABLE

Relevant log output

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Error while reading data, error message: Schema mismatch: referenced variable 'column_name.list.element.my_list.list.element' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields File: gs://<my-bucket>/.spark-bigquery-application_1697116734723_1013-4f752a25-0f69-441e-8641-54bee7fcb8ce/part-00007-ab709b08-3bd5-481d-a6d4-f77989c194f8-c000.snappy.parquet

Environment

- OS: macos
- Python: 3.11
- dbt-core: >1.7.0
- dbt-bigquery: >1.7.0

Additional Context

This would impact anyone moving Python models from 1.6 to 1.7 on BigQuery.

Currently can't reproduce, will pair with Doug to repro.


@gbmarc1 gbmarc1 added type:bug Something isn't working triage:product labels Dec 6, 2023
@github-actions github-actions bot changed the title [Bug] BigQueryException: Error while reading data, error message: Schema mismatch [ADAP-1063] [Bug] BigQueryException: Error while reading data, error message: Schema mismatch Dec 6, 2023
@VersusFacit
Contributor

Hi there! Thanks for the issue submission. Could you help us out by explaining more of what you mean by:

Updating dbt should not result in dataproc job failing

What commands are you using, or is it just a dbt run?

Likewise, you've provided some column types, but is there specific model lineage we should be aware of?

@gbmarc1
Author

gbmarc1 commented Dec 12, 2023

Hi there! Thanks for the issue submission. Could you help us out by explaining more of what you mean by:

Updating dbt should not result in dataproc job failing

Meaning a minor upgrade should not break backward compatibility. TBH, I did not know what to write in that box; I can remove that sentence.

What's the commands you are using, or is it just a dbt run?

I run dbt build, but yes, it fails in the run step. The problem seems to occur at write time to BQ: it seems there is a problem writing a list nested inside a list of records.

@dbeatty10
Contributor

@gbmarc1 could you provide us with a simple dbt model that works using dbt-core<1.6.9 and dbt-bigquery<1.6.9 but raises that "Schema mismatch" exception with dbt-core>1.7.0 and dbt-bigquery>1.7.0?

We'll need to be able to reproduce this on our side in order to determine how to proceed. Could you provide us a detailed set of steps that would allow us to reproduce this?

@gbmarc1
Author

gbmarc1 commented Jan 9, 2024

@dbeatty10 Sorry for the delay. You can reproduce the error with this model:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="cluster", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": [{"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}]},
        ]
    )

    return df

@dlubawy

dlubawy commented Jan 23, 2024

I've just upgraded to dbt-core 1.7.5 and dbt-bigquery 1.7.3 and this is still an issue.

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Error while reading data, error message: Schema mismatch: referenced variable 'list_of_items.list.element.name' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields File: gs://xxxx-xxxx-xxxx/.spark-bigquery-app-20240123000000-xxxx-xxxx-xxxx-xxxx/part-xxxx-xxxx-xxxx-xxx.snappy.parquet

@dbeatty10
Contributor

@gbmarc1 We were able to reproduce the error report for 1.7, but we were not able to reproduce it for 1.6. Could you check again if the example you gave us works in 1.6 for you?

Alternatively, @dlubawy if you have an example that works on 1.6 but doesn't work on 1.7, would you please share it?

Namely, this didn't work for us:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": [{"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}]},
        ]
    )

    return df

But this did work:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": {"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}},
        ]
    )

    return df

@gbmarc1
Author

gbmarc1 commented Feb 3, 2024

@gbmarc1 We were able to reproduce the error report for 1.7, but we were not able to reproduce it for 1.6. Could you check again if the example you gave us works in 1.6 for you?

@dbeatty10 you are right. As specified in the issue description, it does NOT work on 1.7.x but DOES work on 1.6.x.
Thank you for looking at this!

@dlubawy

dlubawy commented Feb 6, 2024

@dbeatty10 the problem we are seeing is from an example like this:

import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    df = pd.DataFrame(
        [
            {"column_name": {"name": "hello", "my_list": [{"value": "hello"}]}},
        ]
    )

    return df

This code will run on v1.6 and materialize the table correctly, but it does not run in v1.7 (a regression):
Error while reading data, error message: Schema mismatch: referenced variable 'column_name.my_list.list.element.value' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields

@TianyiLi1025

I have exactly the same problem with dbt 1.7.x, so I am not able to store repeated records to BigQuery. Is anyone addressing this issue? Otherwise, I have to stay on version 1.6.8.

@Fleid
Contributor

Fleid commented Feb 22, 2024

Moving up the queue!

@TianyiLi1025

TianyiLi1025 commented Feb 23, 2024

For anyone still stuck with this problem, add this to your model code:

session.conf.set('intermediateFormat', "orc")

Or

session.conf.set('intermediateFormat', "parquet") 
session.conf.set('enableListInference', "true")

For indirect writes, Spark uses Parquet as the default format to stage the data in the temporary bucket, and storing repeated record data to BigQuery requires enabling list inference.

Alternatively, we can simply configure the format as 'orc', which is more efficient for data ingestion.

I think dbt should set session.conf.set('enableListInference', "true") by default.
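Putting the workaround together, a model might look like this. This is a sketch, not a tested fix; the serverless submission method and the DataFrame shape are borrowed from earlier examples in this thread:

```python
import pandas as pd


def model(dbt, session):
    dbt.config(submission_method="serverless", materialized="table")

    # Workaround from this thread: keep Parquet as the intermediate format,
    # but enable list inference so repeated fields round-trip to BigQuery
    # instead of tripping the "array levels" schema mismatch.
    session.conf.set("intermediateFormat", "parquet")
    session.conf.set("enableListInference", "true")

    # The shape that previously failed: a list nested inside a list of records.
    df = pd.DataFrame(
        [{"column_name": [{"name": "hello", "my_list": ["h", "e", "l", "l", "o"]}]}]
    )
    return df
```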
