#222 Add file based source and metastore table format #223

Merged
yruslan merged 10 commits into main from feature/222-add-file-based-operations
Aug 4, 2023

@yruslan yruslan commented Aug 2, 2023

  • Added a new source, RawFileSource, that loads files into the metastore without inspecting their contents. The files are simply copied.
  • Added a new metastore table format, raw, that stores files as-is, without needing to know which data format they contain.
    • If you query such a table using metastore.getTable(), you get a DataFrame listing the files in that partition. Ranged queries are supported as well.
  • Added a new built-in transformer, ConversionTransformer, that converts a raw table into a parquet or delta metastore table.
  • Added a new method to the MetastoreReader trait (Scala only, for now):
    /**
      * Gets definition of a metastore table. Please, use with caution and do not write to the underlying path
      * from transformers.
      *
      * @param tableName The name of the table to query.
      * @return The table definition.
      */
    def getTableDef(tableName: String): MetaTableDef
    • It is used to properly handle tables that have the raw format in transformers and sinks.
    • It can also be used for logging table metadata (description, format, etc.) from custom transformers.
    • It should not be used to get direct access to table internals, and never for writing to them.
  • Bumped up the minor version to 1.5.0-SNAPSHOT due to the interface change.
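
To illustrate the intended use of getTableDef() described above, here is a minimal, self-contained sketch of a transformer-side helper that branches on the table format. The MetaTableDef, DataFormat and MetastoreReader types below are simplified, hypothetical stand-ins written for this example only; they are not the project's actual definitions, and the helper describeInput is an assumed name. The sketch only reads the definition and never touches the underlying path.

```scala
// Hypothetical, simplified stand-ins for illustration only.
// The real MetaTableDef and MetastoreReader in the project have more members.
sealed trait DataFormat
object DataFormat {
  case object Raw extends DataFormat
  case object Parquet extends DataFormat
  case object Delta extends DataFormat
}

final case class MetaTableDef(name: String, description: String, format: DataFormat)

trait MetastoreReader {
  /** Gets the definition of a metastore table (read-only use). */
  def getTableDef(tableName: String): MetaTableDef
}

object TableDefExample {
  /**
    * Returns a log-friendly description of how a transformer should treat
    * the given table, based only on its definition.
    */
  def describeInput(reader: MetastoreReader, tableName: String): String = {
    val tableDef = reader.getTableDef(tableName)

    tableDef.format match {
      case DataFormat.Raw =>
        // For raw tables, getTable() is described as returning a list of files.
        s"${tableDef.name}: raw table, expect a list of files"
      case other =>
        // Assumption: for other formats, getTable() returns the data itself.
        s"${tableDef.name}: $other table, expect the data itself"
    }
  }
}
```

A custom transformer could call such a helper before reading its input, so that raw tables are handled as file lists and other tables as regular data, while the table definition is used only for inspection and logging.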

github-actions bot commented Aug 2, 2023

Unit Test Coverage

File Coverage [89.54%] 🍏
ConversionTransformer.scala 100% 🍏
HiveConfig.scala 100% 🍏
TransformationJob.scala 100% 🍏
MetaTable.scala 98.39% 🍏
MetastorePersistenceRaw.scala 98.13% 🍏
TransferTable.scala 97.77% 🍏
PythonTransformationJob.scala 96.01% 🍏
RawFileSource.scala 92.98% 🍏
SparkUtils.scala 92.15% 🍏
MetastorePersistenceParquet.scala 90.89% 🍏
MetastoreImpl.scala 90.89% 🍏
FsUtils.scala 84.29% 🍏
DataFormatParser.scala 82.31% 🍏
MetastorePersistence.scala 76.15% 🍏
JobBase.scala 75.89% 🍏
AppRunner.scala 73.7%
Total Project Coverage 79.56% 🍏

@yruslan yruslan marked this pull request as ready for review August 3, 2023 07:17
@yruslan yruslan requested a review from jirifilip as a code owner August 3, 2023 07:17
@jirifilip jirifilip left a comment

This is very cool functionality! We used to do this stuff in Airflow and the DAG code got really large and unwieldy because of it. It's pretty cool that you were able to generalize this functionality (landing -> raw -> transformation) in a nice way.

@yruslan yruslan merged commit 7dda4ac into main Aug 4, 2023
@yruslan yruslan deleted the feature/222-add-file-based-operations branch August 4, 2023 13:31