Update main with change in memory-optimization (#799)

* Fixed testing to run on all feature branches for PRs (#793) * cleanup time sapce analysis code (#797) * quick update to feature/memory-optimization for merge to `main` (#802) * Encode update format (#789) * Update categorical test * Fix encode tests to agree with new JSONEncoder Fix categorical column tests Fix DateTimeColumn tests Fix IntColumn tests Fix NumericStatsMixin tests Fix OrderColumn tests Fix BaseColumn tests * Remove unnecessary cast() in csv_data.py (1) (#796) Removing this cast() doesn't cause a mypy error. self._header is set to type Optional[Union[str, int]] in the constructor. Also, self.guess_header_row() has return type Optional[int], so casting to int doesn't make sense here. --------- Co-authored-by: Kshitij Sinha <55467782+kshitijavis@users.noreply.github.com> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> * Update feat mem (#803) * Encode update format (#789) * Update categorical test * Fix encode tests to agree with new JSONEncoder Fix categorical column tests Fix DateTimeColumn tests Fix IntColumn tests Fix NumericStatsMixin tests Fix OrderColumn tests Fix BaseColumn tests * Remove unnecessary cast() in csv_data.py (1) (#796) Removing this cast() doesn't cause a mypy error. self._header is set to type Optional[Union[str, int]] in the constructor. Also, self.guess_header_row() has return type Optional[int], so casting to int doesn't make sense here. * Remove unnecessary cast() in csv_data.py (2) (#798) --------- Co-authored-by: Kshitij Sinha <55467782+kshitijavis@users.noreply.github.com> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> --------- Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Kshitij Sinha <55467782+kshitijavis@users.noreply.github.com> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
capitalone · Apr 27, 2023 · d9648fb · d9648fb
1 parent 23da09d
commit d9648fb
Show file tree

Hide file tree

Showing 3 changed files with 15 additions and 11 deletions.
diff --git a/.github/workflows/test-python-package.yml b/.github/workflows/test-python-package.yml
@@ -5,7 +5,9 @@ name: Test Python Package
 
 on:
   pull_request:
-    branches: [ main ]
+    branches:
+      - 'main'
+      - 'feature/**'
 
 jobs:
   build:

diff --git a/dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py b/dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py
@@ -279,7 +279,7 @@ def dp_space_time_analysis(
         )
         print(f"Dataset of size {max(SAMPLE_SIZES)} created.")
     else:
-        full_dataset = dp.Data(DATASET_PATH)
+        _full_dataset = dp.Data(DATASET_PATH)
 
     dp_space_time_analysis(
         _rng,

diff --git a/dataprofiler/tests/space_time_analysis/throughput-test-guidelines.md b/dataprofiler/tests/space_time_analysis/throughput-test-guidelines.md
@@ -7,18 +7,20 @@ testing mechanism for both structured and unstructured datasets.
 
 # Structured Dataset Throughput Evaluation
 
-The test script `structured_throughput_testing.py` has been provided to simplify
+The test script `structured_space_time_analysis.py` has been provided to simplify
 the throughput testing procedure. Simply running the script will provide a
 printed output as well as four files and saved to the working directory of where
 the script was ran.
 
-  * `structured_profile_times.json`: dict of total time, time to merge, and
+  * `time_analysis/structured_profile_times.json`: dict of total time, time to merge, and
       runtimes for each of the profiled functions within the library
-  * `structured_profile_times.csv`: a flattened table of the above json
-  * `profile_space_analysis.bin`: a bin file that contains information on the
+  * `time_analysis/structured_profile_times.csv`: a flattened table of the above json
+  * `space_analysis/profile_space_analysis_*.bin`: a bin files that contain information on the
       spatial analysis of running the dp.Profiler function
-  * `merge_space_analysis.bin`: a bin file that contains information on the
+  * `space_analysis/merge_space_analysis_*.bin`: a bin files that contain information on the
       spatial analysis of merging two profiles together
+  * `time_analysis/time_report_*.txt`: a text file that shows the total time taken for
+      profiling and merging a dataset
 
 Total time and merge time can be used for comparing the overall runtime changes,
 whereas the individual function times can detail bottlenecks or speed changes as
@@ -27,14 +29,14 @@ a result of alterations to a property's calculation.
 The spatial analysis `bin` files can be viewed in different report formats with memray.
 For example running:
 ```console
-python3 -m memray flamegraph profile_space_analysis.bin -o profile_space_analysis.html
+python3 -m memray flamegraph profile_space_analysis*.bin -o profile_space_analysis.html
 ```
 Gives a html formatted flamegraph that displays the distribution of space allocated by
 function calls involved in the dp.Profiler
 
 The script can be run as follows:
 ```console
-python structured_throughput_testing.py
+python structured_space_time_analysis.py
 ```
 
 ### Tunable parameters
@@ -108,12 +110,12 @@ data.to_csv('data/time_structured_profiler.csv', index=False)
 
 
 ### Obtaining outputs
-- Run `python structured_spect_time_analysis.py`
+- Run `python structured_space_time_analysis.py`
 - This will output:
   - `.bin` files in the `./space_analysis` folder:
     - To generate readable flamegraph reports run:
     ```console
-    ./create_flamegraph.sh
+    ./create_flamegraphs.sh
     ```
   - Text files in the `./time_analysis` folder