
INSERT INTO fails to append data [Spark] #213

Closed
osopardo1 opened this issue Aug 31, 2023 · 1 comment · Fixed by #214

@osopardo1
Member

What went wrong?

After a few tests with types and SQL statements for #211, we found that INSERT INTO was not behaving as expected: it either fails to write the data or does not find the expected columns of the schema.

How to reproduce?

Steps to reproduce the problem:

1. Code that triggered the bug, or steps to reproduce:

spark.sql( "CREATE TABLE tbl(c1 STRING, c2 TIMESTAMP) " +
    	"USING qbeast OPTIONS ('columnsToIndex’=‘c1’)”)

spark.sql("""INSERT INTO tbl VALUES('foo','2022-01-02 03:04:05.123456')""".stripMargin)

The test throws the following error (see the note after this list on where those default column names come from):

c1 does not exist. Available: col1, col2, col3

2. Branch and commit id:

main at commit f9c7ab0

3. Spark version:

3.2.1

4. Hadoop version:

3.4.0

5. How are you running Spark?

Running Spark on a local machine

6. Stack trace:

Described in 1.
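
Note: the "col1, col2, col3" names in that error are the column names Spark assigns automatically to an untyped VALUES relation. A quick check of that behavior (illustrative only, not part of the repro; the printed schema is roughly what I would expect):

// An untyped VALUES relation has no user-defined schema, so Spark names its
// columns col1, col2, ... automatically; the connector then cannot find the
// table column c1 among them.
spark.sql("SELECT * FROM VALUES ('foo', TIMESTAMP'2022-01-02 03:04:05.123456')").printSchema()
// root
//  |-- col1: string (nullable = true)
//  |-- col2: timestamp (nullable = true)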

@osopardo1 osopardo1 added the type: bug Something isn't working label Aug 31, 2023
@osopardo1
Member Author

After some analysis of the way Delta Lake handles INSERT INTO, I've found an explanation:

  • The command does not load the schema of the existing table. It tries to write the data as it arrives, and since the incoming data carries no user-defined schema, Spark generates one automatically with default column names: "col1, col2, col3".
  • According to comments on code in the Delta Lake project:
  /**
   * With Delta, we ACCEPT_ANY_SCHEMA, meaning that Spark doesn't automatically adjust the schema
   * of INSERT INTO. Here we check if we need to perform any schema adjustment for INSERT INTO by
   * name queries. We also check that any columns not in the list of user-specified columns must
   * have a default expression.
   */
  • A solution would be to delegate to the code in DeltaAnalysis so that we do not duplicate the same behavior; a minimal sketch of the kind of schema adjustment involved is shown after this list.
  • Since many of the methods that reconstruct and check the schema are complex, I would encourage us not to develop the same solution ourselves. But if there is no easy way to delegate, reimplementing it could be another possibility.
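
A minimal sketch of the kind of adjustment involved, assuming we align the INSERT INTO query to the target table's schema by position (illustrative only, not the actual DeltaAnalysis code; the helper name is hypothetical):

  import org.apache.spark.sql.catalyst.expressions.{Alias, Cast}
  import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
  import org.apache.spark.sql.types.StructType

  // Cast every incoming column to the corresponding table column's type and
  // rename it, so later stages see c1, c2, ... instead of the auto-generated
  // col1, col2, ... names.
  def alignToTableSchema(query: LogicalPlan, tableSchema: StructType): LogicalPlan = {
    require(query.output.length == tableSchema.length,
      s"INSERT INTO expects ${tableSchema.length} columns, got ${query.output.length}")
    val aligned = query.output.zip(tableSchema.fields.toSeq).map { case (attr, field) =>
      Alias(Cast(attr, field.dataType), field.name)()
    }
    Project(aligned, query)
  }

Delegating to DeltaAnalysis would presumably give us this plus the by-name resolution and default-expression checks mentioned in the code comment above.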
