Feature/84 metastore get latest available date for delta #116
Conversation
…r delta format table
Looks good. Just a couple of nitpicks
```python
df_select = self._read_table(
    metastore_table.format, metastore_table.path
).select(f"{metastore_table.info_date_settings.column}")
dates_list = [data[0] for data in df_select.distinct().collect()]
```
Take a look here. I'm not sure what you are doing, but I am confused by the `.collect()` and the hard-to-read `data[0] for data`.
`collect()` returns records as an array, and `[0]` is the first column. At least this is how it works in Scala.
How would you do it differently?
Something like this would be more "Sparktonic", based on my sense of the nature of Spark :) :

```python
latest_date = df.select("info_date").distinct().rdd.map(lambda x: x[0]).max()
```

Then replace `lambda x: x[0]` with some typed function, e.g. `def get_info_date_value(row: Row) -> str:`
And it is more readable (no obscure `data[0] for data`). In this case you need to know that `Row` supports indexing...
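For context, a minimal standalone sketch (names hypothetical) of how PySpark `Row` indexing behaves:

```python
from pyspark.sql import Row

row = Row(info_date="2022-11-02", value=17)

assert row[0] == "2022-11-02"      # positional indexing, as in data[0]
assert row["info_date"] == row[0]  # field-name access also works
```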
From a taste perspective I like the original code. I've seen many Spark functions that do this trick. Quoting my own code here, but I have seen similar things in other code bases, and it does not trigger a wtf for me 😄 : https://github.com/AbsaOSS/pramen/blob/main/pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala#L79-L79
Extracting the processing logic as a lambda won't do much, since you can't avoid `data[0]` or something like it.
Do I understand it correctly that you propose:

```python
def get_info_date_value(row: Row) -> str:
    return row[0]
```
Also, I'm not sure why you want to convert to an RDD. It seems redundant to me, and might even be less performant.
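For what it's worth, a sketch of a DataFrame-only way to get the latest date, assuming the goal is just the maximum of the info-date column (so neither collecting all distinct dates nor converting to an RDD is needed):

```python
from pyspark.sql import functions as F

# The max is computed on the executors; only a single
# one-row result comes back to the driver.
latest_date = df.agg(F.max("info_date")).collect()[0][0]
```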
I'd be happy to chat about this tomorrow; for sure, do not be blocked on this!
Sure, thanks!
Maybe the low-hanging fruit here would be to rename `data` to `row`, which (maybe) makes it clearer that we are extracting the first column, but I guess it's still pretty much a matter of preference. I have used this "pattern" many times, so to me it was familiar.
I like the idea of renaming `data` to `row`.
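Applied to the comprehension from the diff above, the rename would read:

```python
dates_list = [row[0] for row in df_select.distinct().collect()]
```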
```python
def test_metastore_get_latest_available_date_for_delta(
    spark, get_data_stub, tmp_path
):
    def save_delta_table(df: DataFrame, path: str) -> None:
```
Looks like something reusable. Consider moving it out as a fixture.
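A sketch of what that could look like as a factory fixture (the fixture name and the exact Delta write options are assumptions, not the project's actual code):

```python
import pytest
from pyspark.sql import DataFrame


@pytest.fixture
def save_delta_table():
    # Return a helper so tests can write any DataFrame as a Delta table.
    def _save(df: DataFrame, path: str) -> None:
        df.write.format("delta").mode("overwrite").save(path)

    return _save
```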
```python
df_union = get_data_stub.union(
    spark.createDataFrame(
        spark.sparkContext.parallelize([(17, 18, d(2022, 11, 2))]),
```
Sorry for nitpicking, but I think you can create a DataFrame directly from a list of tuples (without using an RDD).
```diff
-        spark.sparkContext.parallelize([(17, 18, d(2022, 11, 2))]),
+        [(17, 18, d(2022, 11, 2))],
```
You're right, thanks!
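For reference, a sketch of the direct construction; the schema string here is hypothetical and would need to match `get_data_stub`'s actual columns:

```python
df = spark.createDataFrame(
    [(17, 18, d(2022, 11, 2))],
    schema="A: int, B: int, info_date: date",
)
```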
```python
    self, path: str, target_partition_name: str
) -> Optional[str]:
    def is_file_hidden(column_name_: str, date_: str) -> bool:
        if len(column_name_) > 0 and column_name_[0] == "_":
```
Maybe this function could be simplified to be more readable, using `column_name_.startswith("_")`.
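A sketch of the simplified check (only the name test from the visible hunk; whatever the function does with `date_` afterwards is not shown in the diff, so this is just the predicate):

```python
# An empty string never starts with "_", so the explicit
# length check becomes redundant.
if column_name_.startswith("_"):
    ...
```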
Looks great. Just a couple of tiny things left.
Looks great!
Just a couple of optional considerations.
```diff
 @staticmethod
 def _apply_uppercase_to_columns_names(
-    self, df: DataFrame, uppercase_columns: bool
+    df: DataFrame, uppercase_columns: bool
```
Nice!
logger.error(f"Unable to access directory: {path}") | ||
raise Exception(f"Unable to access directory : {path}") |
Logging and throwing is an anti-pattern; pick one. Raising an exception is probably the best here.
Why do you want to raise a custom exception here? Is the `AnalysisException` error message meaningful?
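A sketch of the raise-only variant of this hunk, keeping the detail in the exception message and dropping the separate log call:

```python
# One signal instead of two: the caller (or a top-level handler)
# decides whether and how to log it.
raise Exception(f"Unable to access directory: {path}")
```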
```python
logger.error(
    f"The directory does not contain partitions by "
    f"'{metastore_table.info_date_settings.column}': {metastore_table.path}"
)
raise ValueError("No partitions are available")
```
Same here: you can raise the error only, but have it carry the more detailed message.
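A sketch of folding the logged detail into the exception itself, reusing the f-strings from the hunk above:

```python
raise ValueError(
    f"No partitions are available. The directory does not contain partitions "
    f"by '{metastore_table.info_date_settings.column}': {metastore_table.path}"
)
```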
```python
logger.error(
    f"No partitions are available for the given '{metastore_table.name}'.\n"
    f"The table is available for the following dates:\n"
    f"{str_date_list}\n"
    f"Only partitions earlier than {str(until)} might be included."
) from err
else:
    logger.info(f"Latest date for {table_name} is {latest_date}")
    return latest_date
)
raise ValueError("No partitions are available")
```
Same here: you can raise the error only, but have it carry the more detailed message.
```python
from pramen_py import MetastoreReader, MetastoreWriter
from pramen_py.models import InfoDateSettings, MetastoreTable, TableFormat


@pytest.mark.parametrize(
```
Nice test suite!
```python
try:
    return self.spark.read.format(table_format.value).load(path)
except AnalysisException:
    raise Exception(f"Unable to access directory: {path}")
```
You mentioned you wanted to improve the error message by adding the cause of the error (from the original `AnalysisException`)?
The original exception already comes out this way (raising inside the `except` block keeps it attached as context in the traceback).
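For completeness, a sketch of making the cause explicit with `raise ... from`, so the traceback reports the `AnalysisException` as the direct cause instead of just the implicit "during handling of the above exception" context:

```python
try:
    return self.spark.read.format(table_format.value).load(path)
except AnalysisException as err:
    # "from err" sets __cause__, chaining the original error explicitly.
    raise Exception(f"Unable to access directory: {path}") from err
```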
Looks good