Feature: added parquet sampling #1070

menglinw · 2023-11-14T23:59:12Z

Developed a parquet sampling function in data_utils.py;
Added a sample_nrows argument to the ParquetData class;
Added test_len_sampled_data to test_parquet_data.py
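
The description names the feature but not its mechanics. As a minimal sketch of how row sampling over a batched reader could work (an illustration only, not the PR's actual sample_parquet code): draw sorted global row indices up front, then scan the batches once and keep only the selected rows. The function name sample_rows_from_batches and the seed parameter are hypothetical. Plain lists stand in for record batches so the sketch runs on its own; with pyarrow, the batches would typically come from pyarrow.parquet.ParquetFile(path).iter_batches() and the row count from its metadata.num_rows.

```python
import random
from typing import Iterable, List, Sequence


def sample_rows_from_batches(
    batches: Iterable[Sequence],
    total_rows: int,
    sample_nrows: int,
    seed: int = 0,
) -> List:
    """Draw sample_nrows rows uniformly, without replacement, in one pass.

    Hypothetical helper: only the current batch needs to be in memory.
    """
    if sample_nrows >= total_rows:
        # Nothing to sample: return every row.
        return [row for batch in batches for row in batch]

    rng = random.Random(seed)
    # Sorted global row indices, so batches can be scanned once, in order.
    sample_index = sorted(rng.sample(range(total_rows), sample_nrows))

    sampled = []
    offset = 0  # global index of the first row in the current batch
    pos = 0  # next entry of sample_index still to be collected
    for batch in batches:
        batch_end = offset + len(batch)
        while pos < len(sample_index) and sample_index[pos] < batch_end:
            sampled.append(batch[sample_index[pos] - offset])
            pos += 1
        offset = batch_end
        if pos == len(sample_index):
            break  # all requested rows collected; skip remaining batches
    return sampled


# Three "batches" covering ten rows whose values equal their row index.
example_batches = [list(range(0, 4)), list(range(4, 8)), list(range(8, 10))]
example_rows = sample_rows_from_batches(
    example_batches, total_rows=10, sample_nrows=3, seed=1
)
```

Sorting the indices is the design choice that makes a single forward pass possible; the sampled rows therefore keep their original file order.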

CLAassistant · 2023-11-14T23:59:18Z

All committers have signed the CLA.

taylorfturner · 2023-11-15T14:22:21Z

@menglinw two things:

1. you'll want to open the PR into dev per docs here
2. you'll also want to rebase onto dev to resolve the conflict with dataprofiler/data_readers/data_utils.py


taylorfturner

couple comments / thoughts

dataprofiler/data_readers/data_utils.py

dataprofiler/tests/data_readers/test_parquet_data.py

1. added type of return in sample_parquet function; 2. changed variable names in sample_parquet function to more descriptive names (select -> sample_index, out -> sample_df); 3. created convert_unicode_col_to_utf8 function to reduce repeating code in sample_parquet and read_parquet_df functions

dataprofiler/data_readers/data_utils.py

…a_utils.py) to be more descriptive (types -> input_column_types, col -> iter_column), other part unchanged 2. test_parquet_data.py, move import statement to the top of file 3. test_parquet_data.py, merged all tests about parquet sample feature to their original tests

dataprofiler/data_readers/data_utils.py

dataprofiler/tests/data_readers/test_parquet_data.py

… sampling option enabled

…ent.txt

dataprofiler/data_readers/avro_data.py

taylorfturner

let's update the requirements.txt

requirements.txt

dataprofiler/data_readers/data_utils.py

dataprofiler/tests/data_readers/test_parquet_data.py

taylorfturner · 2023-12-11T18:30:30Z

dataprofiler/tests/data_readers/test_parquet_data.py

                if data_format == "dataframe":
-                    import pandas as pd
-
                    self.assertIsInstance(data, pd.DataFrame)
                elif data_format in ["records", "json"]:
                    self.assertIsInstance(data, list)
                    self.assertIsInstance(data[0], str)

+            input_data_obj_sampled = Data(
+                input_file["path"], options={"sample_nrows": 100}
+            )
+            for data_format in list(input_data_obj_sampled._data_formats.keys()):
+                input_data_obj_sampled.data_format = data_format
+                self.assertEqual(input_data_obj_sampled.data_format, data_format)
+                data_sampled = input_data_obj_sampled.data
+                if data_format == "dataframe":
+                    self.assertIsInstance(data_sampled, pd.DataFrame)
+                elif data_format in ["records", "json"]:
+                    self.assertIsInstance(data_sampled, list)
+                    self.assertIsInstance(data_sampled[0], str)
+


this reads a little verbose... wondering if it's needed, since it's adding an option to the test...

I think this is necessary. The new parquet sampling feature reads the parquet file with a different method before formatting it, so I think it is safer to check that the sampled data still meets our format requirements.

I don't mind too much if tests are verbose, as long as they're readable. Better to be verbose than miss edge cases imo.
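
One possible middle ground between the two positions in this thread, sketched under assumptions: loop over the option sets and the data formats with unittest's subTest, so the sampled and unsampled paths share one set of assertions. FakeData below is a hypothetical stand-in for the real Data reader (it takes the first rows instead of sampling and uses plain lists in place of a DataFrame), so this shows the test shape only, not the PR's test code.

```python
import unittest


class FakeData:
    """Hypothetical stand-in for a reader that exposes several data formats."""

    _data_formats = {"dataframe": None, "records": None, "json": None}

    def __init__(self, rows, options=None):
        options = options or {}
        nrows = options.get("sample_nrows")
        # Placeholder for sampling: simply keep the first nrows rows.
        self._rows = rows if nrows is None else rows[:nrows]
        self.data_format = "dataframe"

    @property
    def data(self):
        if self.data_format == "dataframe":
            return [dict(row) for row in self._rows]  # stands in for a DataFrame
        return [str(row) for row in self._rows]


class TestFormats(unittest.TestCase):
    def test_data_formats(self):
        rows = [{"a": i} for i in range(10)]
        # Same assertions run with and without the sampling option.
        for options in (None, {"sample_nrows": 3}):
            data_obj = FakeData(rows, options=options)
            for data_format in list(data_obj._data_formats):
                with self.subTest(options=options, data_format=data_format):
                    data_obj.data_format = data_format
                    data = data_obj.data
                    self.assertIsInstance(data, list)
                    if data_format in ("records", "json"):
                        self.assertIsInstance(data[0], str)
                    if options:
                        self.assertEqual(len(data), options["sample_nrows"])
```

Each failing combination is reported separately by subTest, which keeps the edge-case coverage while avoiding a second copy of the format checks.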

2. test_len_data method keep one sample length test 3. remove sampling test in test_specifying_data_type 4. remove sampling test in test_reload_data

* Feature: added parquet sampling (#1070)
* parquet sampling function developed in data_utils.py; Added sample_nrows argument in ParquetData class; Added test_len_sampled_data in test_parquet_data.py
* resolved conflict with dev, added more tests
* fixed sample empty column bug
* fixed comments in data_utils.py, including: 1. added type of return in sample_parquet function; 2. changed variable names in sample_parquet function to more descriptive names (select -> sample_index, out -> sample_df); 3. created convert_unicode_col_to_utf8 function to reduce repeating code in sample_parquet and read_parquet_df functions
* 1. renamed variable names in covert_unicode_col_to_utf8 function (data_utils.py) to be more descriptive (types -> input_column_types, col -> iter_column), other part unchanged 2. test_parquet_data.py, move import statement to the top of file 3. test_parquet_data.py, merged all tests about parquet sample feature to their original tests
* checked the datatype and input file path before and after reload with sampling option enabled
* test
* delete test edit in avro_data.py, updated fastavro version in requirment.txt
* remove fastavro.reader type
* change fastavro version back to original
* 1. sample_parquet function description 2. test_len_data method keep one sample length test 3. remove sampling test in test_specifying_data_type 4. remove sampling test in test_reload_data
* Depedency: `matplotlib` version bump (#1072)
* bump tag matplotlib
* bumpt to most recent
* 3.9.0 update
* Bump actions/setup-python from 4 to 5 (#1078)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
* Make _assimilate_histogram not use self (#1071)

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
* version bump

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: WML <36968256+menglinw@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>

menglinw requested a review from a team as a code owner November 14, 2023 23:59

taylorfturner changed the title ~~feat: added parquet sampling~~ Feature: added parquet sampling Nov 15, 2023

taylorfturner assigned menglinw Nov 15, 2023

taylorfturner requested review from taylorfturner, tyfarnan and micdavis November 15, 2023 14:21

menglinw changed the base branch from main to dev November 15, 2023 17:58

menglinw added 2 commits November 15, 2023 16:41

parquet sampling function developed in data_utils.py; Added sample_nrows argument in ParquetData class; Added test_len_sampled_data in test_parquet_data.py

09a868d

resolved conflict with dev, added more tests

a17b4df

menglinw force-pushed the parquet-sampling branch from 47785d9 to a17b4df Compare November 16, 2023 01:27

taylorfturner added the "New Feature" label (A feature addition not currently in the library) Nov 16, 2023

fixed sample empty column bug

05b1940

taylorfturner reviewed Nov 17, 2023

dataprofiler/data_readers/data_utils.py (outdated, resolved)

dataprofiler/data_readers/parquet_data.py (resolved)

taylorfturner added the 0.10.8 label Nov 21, 2023

micdavis reviewed Nov 21, 2023

dataprofiler/data_readers/data_utils.py (outdated, resolved)

micdavis reviewed Nov 21, 2023

dataprofiler/data_readers/data_utils.py (three threads, all outdated and resolved)

taylorfturner reviewed Nov 21, 2023

dataprofiler/data_readers/data_utils.py (outdated, resolved)

micdavis reviewed Dec 6, 2023

dataprofiler/data_readers/data_utils.py (resolved)

dataprofiler/tests/data_readers/test_parquet_data.py (outdated, resolved)

checked the datatype and input file path before and after reload with sampling option enabled

e60dda6

micdavis previously approved these changes Dec 8, 2023

test

897ffb6

menglinw dismissed micdavis’s stale review via 897ffb6 December 8, 2023 18:13

menglinw added 2 commits December 8, 2023 12:03

delete test edit in avro_data.py, updated fastavro version in requirment.txt

8c872bc

remove fastavro.reader type

07e755f

taylorfturner enabled auto-merge (squash) December 8, 2023 21:38

taylorfturner reviewed Dec 8, 2023

dataprofiler/data_readers/avro_data.py (resolved)

taylorfturner suggested changes Dec 11, 2023

requirements.txt (outdated, resolved)

change fastavro version back to original

80993bc

auto-merge was automatically disabled December 11, 2023 16:34
Head branch was pushed to by a user without write access

taylorfturner enabled auto-merge (squash) December 11, 2023 16:59

taylorfturner reviewed Dec 11, 2023

1. sample_parquet function description 2. test_len_data method keep one sample length test 3. remove sampling test in test_specifying_data_type 4. remove sampling test in test_reload_data

fa38188

auto-merge was automatically disabled December 11, 2023 19:37
Head branch was pushed to by a user without write access

taylorfturner approved these changes Dec 11, 2023

taylorfturner enabled auto-merge (squash) December 11, 2023 22:25

micdavis approved these changes Dec 12, 2023

taylorfturner merged commit 0d56dac into capitalone:dev Dec 12, 2023
5 checks passed

taylorfturner Dec 11, 2023

menglinw Dec 11, 2023

micdavis Dec 12, 2023
