Feature: added parquet sampling #1070
Conversation
…ows argument in ParquetData class; Added test_len_sampled_data in test_parquet_data.py
Force-pushed from 47785d9 to a17b4df
A couple of comments / thoughts:
1. Added a return type annotation to the sample_parquet function.
2. Renamed variables in sample_parquet to more descriptive names (select -> sample_index, out -> sample_df).
3. Created a convert_unicode_col_to_utf8 function to reduce repeated code in the sample_parquet and read_parquet_df functions.
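The refactor described above can be sketched roughly as follows. This is an illustrative, in-memory sketch only (the real data_utils.py implementation reads parquet files via pyarrow); the function bodies are assumptions, while the names sample_parquet, sample_index, sample_df, and convert_unicode_col_to_utf8 come from the discussion.

```python
import random

import pandas as pd


def convert_unicode_col_to_utf8(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize object (unicode) columns to plain UTF-8 strings.

    Factored out so sample_parquet and read_parquet_df can share it
    instead of repeating the conversion logic.
    """
    for iter_column in df.select_dtypes(include="object").columns:
        df[iter_column] = df[iter_column].astype(str)
    return df


def sample_parquet(df: pd.DataFrame, sample_nrows: int, seed: int = 0) -> pd.DataFrame:
    """Return a random sample of at most sample_nrows rows."""
    random.seed(seed)
    # "sample_index" / "sample_df" are the renamed "select" / "out".
    sample_index = random.sample(range(len(df)), min(sample_nrows, len(df)))
    sample_df = df.iloc[sorted(sample_index)].copy()
    return convert_unicode_col_to_utf8(sample_df)
```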
…a_utils.py) to be more descriptive (types -> input_column_types, col -> iter_column); other parts unchanged.
2. test_parquet_data.py: moved the import statement to the top of the file.
3. test_parquet_data.py: merged all tests for the parquet sampling feature into their original tests.
… sampling option enabled
Let's update the requirements.txt
Head branch was pushed to by a user without write access
if data_format == "dataframe":
    import pandas as pd

    self.assertIsInstance(data, pd.DataFrame)
elif data_format in ["records", "json"]:
    self.assertIsInstance(data, list)
    self.assertIsInstance(data[0], str)

input_data_obj_sampled = Data(
    input_file["path"], options={"sample_nrows": 100}
)
for data_format in list(input_data_obj_sampled._data_formats.keys()):
    input_data_obj_sampled.data_format = data_format
    self.assertEqual(input_data_obj_sampled.data_format, data_format)
    data_sampled = input_data_obj_sampled.data
    if data_format == "dataframe":
        self.assertIsInstance(data_sampled, pd.DataFrame)
    elif data_format in ["records", "json"]:
        self.assertIsInstance(data_sampled, list)
        self.assertIsInstance(data_sampled[0], str)
This reads a little verbose... wondering if it's needed, since it's just adding an option to the test...
I think this is necessary. The new parquet sampling feature uses a different parquet reading method and format, so it seems safer to check that the sampled data still meets our format requirements.
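The reviewer's point can be illustrated with a small self-contained check (pandas only; sample_rows here is a hypothetical stand-in for the real sampled read path): whatever sampling does internally, the result should satisfy the same format contract as a full read.

```python
import random

import pandas as pd


def sample_rows(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in for the sampled read path."""
    random.seed(seed)
    idx = random.sample(range(len(df)), min(n, len(df)))
    return df.iloc[sorted(idx)]


full = pd.DataFrame({"id": range(50), "name": [f"row{i}" for i in range(50)]})
sampled = sample_rows(full, 10)

# Same schema and dtypes as the full read...
assert list(sampled.columns) == list(full.columns)
assert sampled.dtypes.equals(full.dtypes)

# ...and the "records" format still yields one JSON string per row.
records = sampled.to_json(orient="records", lines=True).splitlines()
assert len(records) == 10 and all(isinstance(r, str) for r in records)
```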
I don't mind too much if tests are verbose, as long as they're readable. Better to be verbose than to miss edge cases, imo.
2. test_len_data: kept one sample-length test.
3. Removed the sampling test in test_specifying_data_type.
4. Removed the sampling test in test_reload_data.
Head branch was pushed to by a user without write access
* Feature: added parquet sampling (#1070)
  * parquet sampling function developed in data_utils.py; added sample_nrows argument in ParquetData class; added test_len_sampled_data in test_parquet_data.py
  * resolved conflict with dev, added more tests
  * fixed sample empty column bug
  * fixed comments in data_utils.py, including: 1. added return type to sample_parquet; 2. renamed variables in sample_parquet to more descriptive names (select -> sample_index, out -> sample_df); 3. created convert_unicode_col_to_utf8 to reduce repeated code in sample_parquet and read_parquet_df
  * 1. renamed variables in convert_unicode_col_to_utf8 (data_utils.py) to be more descriptive (types -> input_column_types, col -> iter_column); 2. test_parquet_data.py: moved import statement to the top of the file; 3. test_parquet_data.py: merged all parquet-sampling tests into their original tests
  * checked the datatype and input file path before and after reload with the sampling option enabled
  * test
  * deleted test edit in avro_data.py; updated fastavro version in requirements.txt
  * removed fastavro.reader type
  * changed fastavro version back to original
  * 1. sample_parquet function description; 2. test_len_data keeps one sample-length test; 3. removed sampling test in test_specifying_data_type; 4. removed sampling test in test_reload_data
* Dependency: `matplotlib` version bump (#1072)
  * bump tag matplotlib
  * bump to most recent
  * 3.9.0 update
* Bump actions/setup-python from 4 to 5 (#1078)
  Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5.
  - [Release notes](https://github.com/actions/setup-python/releases)
  - [Commits](actions/setup-python@v4...v5)
  updated-dependencies:
  - dependency-name: actions/setup-python
    dependency-type: direct:production
    update-type: version-update:semver-major
* Make _assimilate_histogram not use self (#1071)
* version bump

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: WML <36968256+menglinw@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
parquet sampling function developed in data_utils.py;
Added sample_nrows argument in ParquetData class;
Added test_len_sampled_data in test_parquet_data.py
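Putting the pieces together, the sample_nrows option could be threaded through a reader class roughly like this. This is a minimal sketch under stated assumptions: ParquetDataSketch and its in-memory loading are hypothetical, not dataprofiler's actual ParquetData (which takes a file path).

```python
import random
from typing import Optional

import pandas as pd


class ParquetDataSketch:
    """Minimal stand-in for a ParquetData-style reader with sampling."""

    def __init__(self, df: pd.DataFrame, sample_nrows: Optional[int] = None):
        # The real class takes a parquet file path; a DataFrame is used
        # here to keep the sketch self-contained.
        self.sample_nrows = sample_nrows
        self._data = self._load(df)

    def _load(self, df: pd.DataFrame) -> pd.DataFrame:
        if self.sample_nrows is None:
            return df  # full read: behavior unchanged when option is absent
        random.seed(0)
        idx = random.sample(range(len(df)), min(self.sample_nrows, len(df)))
        return df.iloc[sorted(idx)]

    @property
    def data(self) -> pd.DataFrame:
        return self._data
```

With dataprofiler itself, the PR's test suggests the option is passed through the generic factory as `Data(path, options={"sample_nrows": 100})`.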