Feature/65 delta format in pramen py #81
Conversation
…oreReader and corrected test for it
…rected test for it
… to "MetastoreTable" class
Hmm, the CI is still failing. Try changing the CI to use
…h x64 not found
I see it didn't work. Try removing Python 2.6 from CI
Looks good overall. Haven't finished the review yet. Will continue later.
But there are already a couple of issues to address.
pramen-py/pyproject.toml (Outdated)
 loguru = "^0.6.0"
 pytest = "6.2.5"
 pytest-asyncio = "0.16"
 pytest-cov = "2.12.1"
 types-PyYAML = "^6.0.4"
-pyspark-stubs = "^3.0.0"
+pyspark-stubs = "2.3.0.post2"
How are the PySpark and pyspark-stubs versions related?
…d all table, after filter by info date
Force-pushed 776e57e to 8417693
Force-pushed 8417693 to 3a1f6c3
Nice! So running the tests sequentially actually solved the CI issue?
Hey, this is excellent work! A couple of notes, though :)
pramen-py/Makefile (Outdated)
@@ -35,7 +35,7 @@ build: install
 poetry build

 test: install .env
-	poetry run pytest --cov
+	poetry run pytest -n 1
why did you remove --cov?
If we want the tests to run sequentially, then just uninstall pytest-xdist.
Ideally we want the tests to run in parallel, but something prevents it. They fail on an attempt to reuse a stopped Spark session, IIRC.
Can we use both --cov and -n 1 for now?
Yes, these options are unrelated: --cov is for coverage, -n 1 is for parallelization.
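A sketch of the Makefile target with both options combined (assuming pytest-cov and pytest-xdist both remain installed):

```make
test: install .env
	poetry run pytest --cov -n 1
```

Note that `-n 1` still runs through a single pytest-xdist worker process; a fully plain sequential run would disable the plugin with `-p no:xdist` instead.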
pramen-py/pyproject.toml
Outdated
loguru = "^0.6.0" | ||
pytest = "6.2.5" | ||
pytest-asyncio = "0.16" | ||
pytest-cov = "2.12.1" | ||
types-PyYAML = "^6.0.4" | ||
pyspark-stubs = "^3.0.0" | ||
pyspark-stubs = "2.3.0.post2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -42,6 +43,9 @@ class MetastoreReader(MetastoreReaderBase):
     a KeyError will be raised.
     """

+    def _read_table(self, format_value: str, path: str) -> DataFrame:
nit: all arguments are values, so there is no need to add a _value suffix to the argument names
I'd pass table_format: TableFormat and do the resolution logic inside the method.
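A minimal sketch of that suggestion (class and method names follow the diff; the Spark session is deliberately left untyped so the resolution logic stays visible, and the TableFormat members are assumed):

```python
from enum import Enum


class TableFormat(Enum):
    # Assumed to mirror the project's TableFormat enum.
    parquet = "parquet"
    delta = "delta"


class MetastoreReader:
    def __init__(self, spark):
        self.spark = spark

    def _read_table(self, table_format: TableFormat, path: str):
        # The enum-to-Spark-format resolution happens here, so callers
        # pass a TableFormat instead of a raw format string.
        return self.spark.read.format(table_format.value).load(path)
```

Callers then never touch the raw format string, which also makes the `_value` naming question go away.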
format_table_path = (table_path / format_.value).as_posix()
logger.info("Creating sample DataFrame partitioned by info_date")
get_data_stub.write.partitionBy("info_date").format(
    format_.value
That's a bit risky, as you don't know for sure that the TableFormat values are compatible with the Spark write formats.
We enforce this; I'd say we can trust it. If this is a real concern, I'd suggest adding a comment to TableFormat specifying that the string name of the format should match the Spark format name.
Sounds good. In that case, let's add this information to the TableFormat class docstring.
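For example, the docstring could state the invariant explicitly (a sketch; the member names are assumed):

```python
from enum import Enum, unique


@unique
class TableFormat(Enum):
    """Table formats supported by the metastore.

    Each member's value is passed verbatim to Spark's
    DataFrameReader.format() / DataFrameWriter.format(), so it must be
    a valid Spark data source name.
    """

    parquet = "parquet"
    delta = "delta"
```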
pramen-py/tests/conftest.py (Outdated)
@@ -121,12 +135,15 @@ def load_and_patch_config(
 object.__setattr__(
     config.metastore_tables[0],
     "path",
-    create_parquet_data_stubs[0].resolve().as_posix(),
+    pathlib.Path(create_data_stubs_and_paths["parquet"])
Maybe it makes more sense to use delta as a default format for tests?
I think it is better to parametrize this fixture, so it is possible to say which format to use.
For example, something like:
@pytest.mark.metastore(format="delta")
def test_foo(load_and_patch_config):...
That's easy to achieve; see, for example, https://jaketrent.com/post/pass-params-pytest-fixture/
)
expected = spark.read.parquet(
    load_and_patch_config.metastore_tables[0].path
for format_ in TableFormat:
It is better to use @pytest.mark.parametrize and set the two formats there, so pytest handles each case separately.
Also, you are missing a license header...
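A sketch of the parametrized version (TableFormat is mirrored here so the example is self-contained; the test name is illustrative):

```python
from enum import Enum

import pytest


class TableFormat(Enum):
    # Assumed to mirror the project's TableFormat.
    parquet = "parquet"
    delta = "delta"


@pytest.mark.parametrize("table_format", list(TableFormat), ids=lambda f: f.value)
def test_metastore_get_table(table_format):
    # pytest generates one test id per format, e.g.
    # test_metastore_get_table[parquet], so a failure names the format.
    assert table_format.value in {"parquet", "delta"}
```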
Great job!
logger.info(f"Looking for {table_name} in the metastore.")
logger.debug(f"info_date range: {info_date_from} - {info_date_to}")
logger.debug(
Maybe it makes sense to put this info into the logger.info() call above?
 if uppercase_columns:
-    return df.select([F.col(c).alias(c.upper()) for c in df.columns])
+    return df_filtered.select(
+        [F.col(c).alias(c.upper()) for c in df.columns]
+    )
-else:
-    return df
+return df_filtered
Consider extracting this as a separate method, since you use it in at least two places: get_table() and get_latest().
    until or info_date,
)
expected = expected.filter(F.col("info_date") == latest_date)
assert_df_equality(actual, expected, ignore_row_order=True)
Would this assert report which format failed the test?
Maybe you can add the format as part of the error message (it can be a parameter of the assert).
With pytest.mark.parametrize, yes, it will be explicitly visible: the test name is autogenerated from the parametrization values.
 os-name: [ ubuntu-latest ]
 runs-on: ${{ matrix.os-name }}
 steps:
   - uses: actions/checkout@v2
-  - uses: actions/setup-python@v2
+  - uses: actions/setup-python@v4
This change makes it impossible to test against Python 3.6.
Force-pushed cc44552 to 2b87532
Force-pushed 2b87532 to 8c424c4
Force-pushed 09cbf63 to fbe6887
@zhukovgreen, there are still a couple of pending comments related to improving the unit tests, but we decided to merge the code now and create separate issues for the parametrized tests and the other minor issues identified in this PR.
Force-pushed fbe6887 to cbef85f