#203 Add spark catalog support #206

Merged on Jun 9, 2023 (11 commits)

@yruslan (Collaborator) commented Jun 8, 2023

This adds another HiveHelper implementation that is based on Spark Catalog, rather than running explicit SQL queries against the Hive metastore.

Spark Catalog allows natural management of Glue Catalog tables in Delta format, provided that the prerequisites for the Glue job are met: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-delta-lake.html

Global Hive implementation selection:

pramen {
  # ...

  # The API to use to query Hive. Valid values are: "sql", "spark_catalog"
  hive.api = "sql"
}

Per-metastore table Hive implementation selection:

pramen.metastore {
  tables = [
    {
      # ...

      # The API to use to query Hive. Valid values are: "sql", "spark_catalog"
      hive.api = "spark_catalog"
      hive.table = "table"
    }
  ]
}
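
As a sketch of how the global and per-table settings might be combined, the fragment below sets a global default and overrides it for a single table. It assumes that a per-table `hive.api` takes precedence over the global `pramen.hive.api`, which the description above suggests but does not state explicitly. The table names `table_a` and `table_b` are hypothetical, and the other table options are elided as in the examples above.

```hocon
pramen {
  # ...

  # Global default: tables that do not set their own API use explicit SQL
  hive.api = "sql"
}

pramen.metastore {
  tables = [
    {
      # ...

      # Hypothetical table that relies on the global default ("sql")
      hive.table = "table_a"
    },
    {
      # ...

      # Hypothetical Delta table in Glue Catalog, managed via Spark Catalog
      # (assumed to override the global setting for this table only)
      hive.api = "spark_catalog"
      hive.table = "table_b"
    }
  ]
}
```

With this layout only the tables that need Spark Catalog opt in, while the rest keep the existing SQL-based behavior.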

Standardization Sink Hive implementation selection:

pramen.sinks = [
  {
    name = "mysink"
    factory.class = "za.co.absa.pramen.extras.sink.StandardizationSink"

    # ...

    hive = {
      # The API to use to query Hive. Valid values are: "sql", "spark_catalog"
      api = "sql"
    }
  }
]

@AbsaOSS AbsaOSS deleted a comment from github-actions bot Jun 8, 2023
@yruslan yruslan force-pushed the feature/203-add-spark-catalog-support branch 2 times, most recently from 6936293 to f2618e3 Compare June 8, 2023 12:10
@yruslan yruslan force-pushed the feature/203-add-spark-catalog-support branch from f2618e3 to 63820c1 Compare June 8, 2023 12:12

github-actions bot commented Jun 8, 2023

Unit Test Coverage

| File | Coverage [89.57%] 🍏 |
|---|---|
| HiveConfig.scala | 100% 🍏 |
| HiveApi.scala | 100% 🍏 |
| HiveDefaultConfig.scala | 100% 🍏 |
| HiveHelperSql.scala | 100% 🍏 |
| SparkUtils.scala | 90.47% 🍏 |
| MetastoreImpl.scala | 88.62% 🍏 |
| HiveHelperSparkCatalog.scala | 88.36% 🍏 |
| HiveHelper.scala | 86.76% 🍏 |
| StandardizationSink.scala | 81.79% 🍏 |
| HiveFormat.scala | 81.48% 🍏 |
| Total Project Coverage | 78.7% 🍏 |

@AbsaOSS AbsaOSS deleted a comment from github-actions bot Jun 8, 2023
@AbsaOSS AbsaOSS deleted a comment from github-actions bot Jun 8, 2023
@yruslan yruslan marked this pull request as ready for review June 9, 2023 06:12
@yruslan yruslan requested a review from jirifilip as a code owner June 9, 2023 06:12
@jirifilip (Collaborator) left a comment

Very nice feature, looks great!

@yruslan yruslan merged commit 9b46bee into main Jun 9, 2023
@yruslan yruslan deleted the feature/203-add-spark-catalog-support branch June 9, 2023 07:11
@yruslan yruslan mentioned this pull request Jun 19, 2023