Merge pull request ukri-excalibur#260 from ukri-excalibur/ko/post_processing_multiple_axis

Picking columns to export to CSV
ilectra authored Apr 12, 2024
2 parents 9b801ed + 264a07f commit b505e07
Showing 5 changed files with 135 additions and 27 deletions.
14 changes: 11 additions & 3 deletions post-processing/README.md
@@ -39,7 +39,7 @@ python post_processing.py log_path config_path [-p plot_type]
- `config_path` - Path to a configuration file containing plot details.
- `plot_type` - (Optional.) Type of plot to be generated. (`Note: only a generic bar chart is currently implemented.`)

Run `post_processing.py -h` for more information (including debugging flags).
Run `post_processing.py -h` for more information (including debugging and file output flags).

#### Streamlit

@@ -68,12 +68,13 @@ Before running post-processing, create a config file including all necessary inf
- `Format: [column_name, value]`
- `column_types` - Pandas dtype for each relevant column (axes, units, filters, series). Specified with a dictionary.
- `Accepted types: "str"/"string"/"object", "int"/"int64", "float"/"float64", "datetime"/"datetime64"`
- `additional_columns_to_csv` - (Optional.) List of additional columns to export to the CSV file, in addition to the ones above. These columns are not used in plotting. (Specify an empty list if no additional columns are required.)

#### A Note on Replaced ReFrame Columns

A perflog contains certain columns that will not be present in the DataFrame available to the graphing script. Currently, these columns are `display_name`, `extra_resources`, and `env_vars`. Removed columns should not be referenced in a plot config file.
A perflog contains certain columns with complex information that has to be unpacked in order to be useful. Currently, these columns are `display_name`, `extra_resources`, `env_vars`, and `spack_spec_dict`. They are parsed during post-processing, removed from the DataFrame, and replaced with new columns containing the unpacked information. They will therefore not be present in the DataFrame available to the graphing script and should not be referenced in a plot config file.

When the row contents of `display_name` are parsed, they are separated into their constituent benchmark names and parameters. This column is replaced with a new `test_name` column and new parameter columns (if present). Similarly, the `extra_resources` and `env_vars` columns are replaced with their respective dictionary row contents (keys become columns, values become row contents).
When the row contents of `display_name` are parsed, they are separated into their constituent benchmark names and parameters. This column is replaced with a new `test_name` column and new parameter columns (if present). Similarly, the `extra_resources`, `env_vars`, and `spack_spec_dict` columns are replaced with their respective dictionary row contents (keys become columns, values become row contents).
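
As an illustration only (not the repository's actual parser, and assuming ReFrame-style `display_name` values of the form `TestName %param=value`), the unpacking might look like this sketch:

```python
# Hypothetical sketch of the unpacking described above; the sample values and
# helper are illustrative, not the post-processing script's real implementation.
import pandas as pd

df = pd.DataFrame({
    "display_name": ["SombreroBenchmark %tasks=2 %cpus_per_task=1"],
    "env_vars": [{"OMP_NUM_THREADS": "4"}],
})

def unpack_display_name(name: str) -> dict:
    # "TestName %key=value %key=value" -> test name plus one entry per parameter
    test_name, *params = name.split(" %")
    unpacked = {"test_name": test_name}
    unpacked.update(dict(p.split("=", 1) for p in params))
    return unpacked

# dict-valued columns expand so keys become columns and values become row contents
name_cols = df["display_name"].apply(unpack_display_name).apply(pd.Series)
env_cols = df["env_vars"].apply(pd.Series)
df = pd.concat([df.drop(columns=["display_name", "env_vars"]), name_cols, env_cols], axis=1)
print(df.columns.tolist())  # ['test_name', 'tasks', 'cpus_per_task', 'OMP_NUM_THREADS']
```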

#### Complete Config Template

@@ -121,6 +122,10 @@ series: <series_list>
# accepted types: string/object, int, float, datetime
column_types:
<column_name>: <column_type>

# optional (default: no extra columns exported to a csv file in addition to the ones above)
additional_columns_to_csv:
<columns_list>
```
#### Example Config
@@ -162,6 +167,9 @@ column_types:
filter_col_1: "datetime"
filter_col_2: "int"
series_col: "str"

additional_columns_to_csv:
["additional_col_1", "additional_col_2"]
```
#### X-axis Grouping
8 changes: 8 additions & 0 deletions post-processing/config_handler.py
@@ -25,6 +25,7 @@ def __init__(self, config: dict, template=False):
self.filters = config.get("filters")
self.series = config.get("series")
self.column_types = config.get("column_types")
self.extra_columns = config.get("additional_columns_to_csv")

# parse filter information
self.and_filters = []
@@ -153,6 +154,13 @@ def parse_columns(self):
dict.fromkeys((self.plot_columns + self.filter_columns +
([self.scaling_column.get("name")] if self.scaling_column else []))))

# remove duplicated columns from the extra_columns list
duplicates = set(self.all_columns) & set(self.extra_columns)
while len(duplicates) != 0:
for d in duplicates:
self.extra_columns.remove(d)
duplicates = set(self.all_columns) & set(self.extra_columns)

def remove_redundant_types(self):
"""
Check for columns that are no longer in use and remove them from the type dict.
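
A note on the de-duplication loop added to `parse_columns` above: `list.remove()` drops only the first occurrence, so the loop repeats until no overlap with `all_columns` remains, which also covers the case where `additional_columns_to_csv` itself contains repeated entries. A minimal standalone sketch of that behaviour, with assumed inputs:

```python
# Standalone sketch of the de-duplication behaviour (inputs are assumed,
# not taken from a real perflog).
all_columns = ["tasks", "flops_value", "flops_unit"]
extra_columns = ["tasks", "tasks", "spack_spec"]

# list.remove() only drops the first occurrence, so keep going until
# no column in extra_columns also appears in all_columns
duplicates = set(all_columns) & set(extra_columns)
while len(duplicates) != 0:
    for d in duplicates:
        extra_columns.remove(d)
    duplicates = set(all_columns) & set(extra_columns)

print(extra_columns)  # ['spack_spec']
```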
28 changes: 20 additions & 8 deletions post-processing/post_processing.py
@@ -2,6 +2,7 @@
import operator as op
import traceback
from functools import reduce
import os
from pathlib import Path

import pandas as pd
@@ -12,19 +13,23 @@

class PostProcessing:

def __init__(self, log_path: Path, debug=False, verbose=False):
def __init__(self, log_path: Path, debug=False, verbose=False, save=False, plotting=True):
"""
Initialise class.
Args:
log_path: Path, path to performance log file or directory.
debug: bool, flag to print additional information to console.
verbose: bool, flag to print more additional information to console.
save: bool, flag to save the filtered dataframe in csv file
plotting: bool, flag to generate and store a plot in html file
"""

# FIXME (issue #264): add proper logging
self.debug = debug
self.verbose = verbose
self.save = save
self.plotting = plotting
# find and read perflogs
self.original_df = PerflogHandler(log_path, self.debug).get_df()
# copy original data for modification during post-processing
@@ -58,16 +63,18 @@ def run_post_processing(self, config: ConfigHandler):
# scale y-axis
self.transform_df_data(
config.x_axis["value"], config.y_axis["value"], *config.get_y_scaling(), config.series_filters)

# FIXME (#issue #255): have an option to put this into a file (-s / --save flag?)
if self.debug:
print("Selected dataframe:")
print(self.df[self.mask][config.plot_columns])
print(self.df[self.mask][config.plot_columns + config.extra_columns])
if self.save:
self.df[self.mask][config.plot_columns + config.extra_columns].to_csv(
path_or_buf=os.path.join(Path(__file__).parent,'output.csv'), index=True) # Set index=False to exclude the DataFrame index from the CSV

# call a plotting script
self.plot = plot_generic(
config.title, self.df[self.mask][config.plot_columns],
config.x_axis, config.y_axis, config.series_filters, self.debug)
if self.plotting:
self.plot = plot_generic(
config.title, self.df[self.mask][config.plot_columns],
config.x_axis, config.y_axis, config.series_filters, self.debug)

# FIXME (#issue #255): maybe save this bit to a file as well for easier viewing
if self.debug & self.verbose:
@@ -396,6 +403,11 @@ def read_args():
parser.add_argument("-v", "--verbose", action="store_true",
help="verbose flag for printing more debug information \
(must be used in conjunction with the debug flag)")
parser.add_argument("-s", "--save", action="store_true",
help="save flag for saving the filtered dataframe in csv file")
parser.add_argument("-np", "--no_plot", action="store_true",
help="no-plot flag for disabling generating and storing a plot")


return parser.parse_args()

@@ -405,7 +417,7 @@ def main():
args = read_args()

try:
post = PostProcessing(args.log_path, args.debug, args.verbose)
post = PostProcessing(args.log_path, args.debug, args.verbose, args.save, not(args.no_plot))
config = ConfigHandler.from_path(args.config_path)
post.run_post_processing(config)

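
For reference, a minimal sketch of driving the pipeline from Python with the new flags, roughly equivalent to `python post_processing.py <log_path> <config_path> -s -np`; the perflog path is a placeholder:

```python
# Sketch only: the perflog path is a placeholder, not a file from the repository.
from pathlib import Path

from config_handler import ConfigHandler
from post_processing import PostProcessing

# save=True writes the filtered dataframe to output.csv next to post_processing.py;
# plotting=False skips generating and storing the html plot
post = PostProcessing(Path("perflogs/"), save=True, plotting=False)
config = ConfigHandler.from_path(Path("post_processing_config.yaml"))
df = post.run_post_processing(config)
print(df)
```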
5 changes: 5 additions & 0 deletions post-processing/post_processing_config.yaml
@@ -47,3 +47,8 @@ column_types:
flops_unit: "str"
system: "str"
cpus_per_task: "int"

# Optional setting to specify additional columns to export to csv file, in addition to
# the ones in axes/series/filters
additional_columns_to_csv:
["spack_spec"]
107 changes: 91 additions & 16 deletions post-processing/test_post_processing.py
@@ -236,7 +236,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"fake_column": "int",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
except KeyError as e:
assert e.args[1] == ["fake_column"]
else:
@@ -256,7 +257,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
except KeyError as e:
assert e.args[1] == "!!"
else:
@@ -276,7 +278,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
except ValueError:
assert True
else:
@@ -296,7 +299,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
except pd.errors.EmptyDataError:
assert True
else:
@@ -315,7 +319,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
except RuntimeError:
assert True
else:
@@ -334,7 +339,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"tasks": "int",
"cpus_per_task": "int",
"extra_param": "int"}}))
"extra_param": "int"},
"additional_columns_to_csv": []}))
except RuntimeError as e:
# three param columns found in changed log
EXPECTED_FIELDS = ["tasks", "cpus_per_task", "extra_param"]
@@ -356,7 +362,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"job_completion_time": "datetime",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
# check returned subset is as expected
assert len(df) == 2

@@ -374,7 +381,8 @@ def test_high_level_script(run_sombrero):
"column_types": {"tasks": "int",
"cpus_per_task": "int",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
# check returned subset is as expected
assert len(df) == 4

@@ -394,7 +402,8 @@ def test_high_level_script(run_sombrero):
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int",
"OMP_NUM_THREADS": "int"}}))
"OMP_NUM_THREADS": "int"},
"additional_columns_to_csv": []}))
# check flops values are halved compared to previous df
assert (dfs["flops_value"].values == df[df["cpus_per_task"] == 2]["flops_value"].values/2).all()

@@ -413,7 +422,8 @@ def test_high_level_script(run_sombrero):
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int"}}))
"cpus_per_task": "int"},
"additional_columns_to_csv": []}))
assert (dfs[dfs["cpus_per_task"] == 1]["flops_value"].values ==
df[df["cpus_per_task"] == 1]["flops_value"].values /
df[df["cpus_per_task"] == 1]["flops_value"].values).all()
@@ -437,7 +447,8 @@ def test_high_level_script(run_sombrero):
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int"}}))
"cpus_per_task": "int"},
"additional_columns_to_csv": []}))
assert (dfs["flops_value"].values == df["flops_value"].values /
df[(df["cpus_per_task"] == 1) & (df["tasks"] == 2)]["flops_value"].iloc[0]).all()

@@ -456,7 +467,8 @@ def test_high_level_script(run_sombrero):
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int"}}))
"cpus_per_task": "int"},
"additional_columns_to_csv": []}))
# check flops values are halved compared to previous df
assert (dfs["flops_value"].values == df[df["cpus_per_task"] == 2]["flops_value"].values/2).all()

@@ -476,7 +488,8 @@ def test_high_level_script(run_sombrero):
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int",
"OMP_NUM_THREADS": "str"}}))
"OMP_NUM_THREADS": "str"},
"additional_columns_to_csv": []}))
except TypeError:
assert True

@@ -496,7 +509,8 @@ def test_high_level_script(run_sombrero):
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int"}}))
"cpus_per_task": "int"},
"additional_columns_to_csv": []}))
except ValueError:
assert True

@@ -514,7 +528,8 @@ def test_high_level_script(run_sombrero):
"series": [],
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str"}}))
"flops_unit": "str"},
"additional_columns_to_csv": []}))
except RuntimeError as e:
# dataframe has records from both files
assert len(e.args[1]) == 8
@@ -535,9 +550,69 @@ def test_high_level_script(run_sombrero):
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int"}}))
"cpus_per_task": "int"},
"additional_columns_to_csv": []}))

EXPECTED_FIELDS = ["tasks", "flops_value", "flops_unit"]
# check returned subset is as expected
assert df.columns.tolist() == EXPECTED_FIELDS
assert len(df) == 1

# get filtered dataframe with extra columns for csv
df = PostProcessing(sombrero_log_path, save=True).run_post_processing(
ConfigHandler(
{"title": "Title",
"x_axis": {"value": "tasks",
"units": {"custom": None}},
"y_axis": {"value": "flops_value",
"units": {"column": "flops_unit"}},
"filters": {"and": [["tasks", ">", 1], ["cpus_per_task", "==", 2]],
"or": []},
"series": [],
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int"},
"additional_columns_to_csv": ["spack_spec"]}
))

EXPECTED_FIELDS = ["tasks", "flops_value", "flops_unit"]
# check returned subset is as expected
assert df.columns.tolist() == EXPECTED_FIELDS
assert len(df) == 1

EXPECTED_FIELDS.append("spack_spec")
# check subset written to csv is as expected
output_file = "output.csv"
df_saved = pd.read_csv(output_file, index_col=0)
assert df_saved.columns.tolist() == EXPECTED_FIELDS
assert len(df_saved) == 1

# get filtered dataframe with duplicated extra columns for csv
df = PostProcessing(sombrero_log_path, save=True).run_post_processing(
ConfigHandler(
{"title": "Title",
"x_axis": {"value": "tasks",
"units": {"custom": None}},
"y_axis": {"value": "flops_value",
"units": {"column": "flops_unit"}},
"filters": {"and": [["tasks", ">", 1], ["cpus_per_task", "==", 2]],
"or": []},
"series": [],
"column_types": {"tasks": "int",
"flops_value": "float",
"flops_unit": "str",
"cpus_per_task": "int"},
"additional_columns_to_csv": ["tasks", "tasks"]}
))

EXPECTED_FIELDS = ["tasks", "flops_value", "flops_unit"]
# check returned subset is as expected
assert df.columns.tolist() == EXPECTED_FIELDS
assert len(df) == 1

# check subset written to csv is as expected
output_file = "output.csv"
df_saved = pd.read_csv(output_file, index_col=0)
assert df_saved.columns.tolist() == EXPECTED_FIELDS
assert len(df_saved) == 1
