Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Populate table name from the identifier in Iceberg conversion #630

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jalpan-randeri
Copy link

@jalpan-randeri jalpan-randeri commented Jan 28, 2025

What is the purpose of the pull request

While converting the iceberg table to other format such as Hudi, the icerberg source table do not populate the table name. This is due to iceberg table's behavior as it is treated as Hadoop tables. This leads to table identified as table-location, leading to confusing conversation.

BUG= #494

Brief change log

This commit handles the conversation logic, when icebege table manager provides HadoopTable, it populate the table name from provided input TableIdentifier. This ensures that source table name is carried over to the transformation.

Verify this pull request

  • Updated unit test to cover this scenario.

Manual Verification

#### Step 1. create iceberg table
./bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0\                                                                  
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
    --conf spark.sql.defaultCatalog=local

spark.sql(""" 
CREATE TABLE prod.db.sample (
 id bigint NOT NULL COMMENT 'unique id',
 data string)
USING iceberg 
""").show()

spark.sql("INSERT INTO prod.db.sample VALUES (1, 'jalpan) ").show()

#### Step 2. create xtable config
$ cat xtable_iceberg.yaml

sourceFormat: ICEBERG
targetFormats:
  - HUDI
datasets:
  -
    tableBasePath: /Users/jalpan/workspace/spark/warehouse/prod/db/sample
    tableName: sample

#### Step 3. perform sync
java -jar xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar --datasetConfig xtable_iceberg.yaml

#### Step 4. validate 
cat /Users/jalpan/workspace/spark/warehouse/prod/db/sample/.hoodie/hoodie.properties | grep name                                                                                      [14:21:05]
hoodie.table.name=sample

@jalpan-randeri jalpan-randeri changed the title [Bugfix] Populate table name from the identifier in Iceberge conversion [Bugfix] Populate table name from the identifier in Iceberg conversion Jan 28, 2025
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Can you squash down to a single commit?

What is the problem?
While converting the iceberg table to other format such as Hudi,
the icerberg source table do not populate the table name. This is
due to iceberg table's behavior as it is treated as Hadoop tables.
This leads to table identified as table-location, leading to confusing
conversation.

Solution:
This commit handles the conversation logic, when icebege table manager
provides HadoopTable, it populate the table name from provided input
TableIdentifier. This ensures that source table name is carried over
to the transformation.

Testing:
- Added unit test to cover this scenario.

Co-authored-by: Tim Brown <tim.brown126@gmail.com>
@jalpan-randeri jalpan-randeri force-pushed the jalpan/bugfix-iceberg-filebased-name branch from c357040 to 828112b Compare February 13, 2025 03:14
.name(
iceTable.name().contains(iceTable.location())
? sourceTableConfig.getName()
: iceTable.name())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please confirm if existing tests cases cover both the paths?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants