AutoGluon TimeSeries Support (first version) (#494)
* Add AutoGluon TimeSeries Prototype

* AutoMLBenchmark TimeSeries Prototype. (#6)

* fixed loading test & train, changed pred.-l. 5->30

* ignore launch.json of vscode

* ensuring timestamp parsing

* pass config, save pred, add results

* remove unused code

* add readability, remove slice from timer

* ensure autogluonts has required info

* add comments for readability

* setting defaults for timeseries task

* remove outer context manipulation

* corrected spelling error for quantiles

* adding mape, correct available metrics

* beautify config options

* fixed config for public access

* Update readme

* Autogluon timeseries, addressed comments by sebhrusen (#7)

* no outer context manipulation, add dataset subdir

* add more datasets

* include error raising for too large pred. length.

* merging AutoGluonTS framework folder into AutoGluon

* renaming ts.yaml to timeseries.yaml, plus ext.

* removing presets, correct latest config for AGTS

* move dataset timeseries ext to datasets/file.py

* don't bypass test mode

* move quantiles and y_past_period_error to opt_cols

* remove whitespaces

* deleting merge artifacts

* delete merge artifacts

* renaming prediction_length to forecast_range_in_steps

* use public dataset, reduced range to maximum

* fix format string works

* fix key error bug, remove magic time limit

* Addressed minor comments, and fixed version call for tabular and timeseries modularities (#8)

* swapped timeseries and tabular to set version

* make warning message more explicit

* remove outer context manipulation

* split timeseries / tabular into functions

Co-authored-by: Leo <LeonhardSommer96@gmail.com>
Innixma and limpbot authored Oct 10, 2022
1 parent f55cd0b commit 4029472
Showing 16 changed files with 437 additions and 38 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -16,6 +16,7 @@ venv/
 .idea/
 *.iml
 *.swp
+launch.json

 # tmp files
 .ipynb_checkpoints/
12 changes: 9 additions & 3 deletions amlb/benchmark.py
@@ -489,7 +489,9 @@ def load_data(self):
             # TODO
             raise NotImplementedError("OpenML datasets without task_id are not supported yet.")
         elif hasattr(self._task_def, 'dataset'):
-            self._dataset = Benchmark.data_loader.load(DataSourceType.file, dataset=self._task_def.dataset, fold=self.fold)
+            dataset_name_and_config = copy(self._task_def.dataset)
+            dataset_name_and_config.name = self._task_def.name
+            self._dataset = Benchmark.data_loader.load(DataSourceType.file, dataset=dataset_name_and_config, fold=self.fold)
         else:
             raise ValueError("Tasks should have one property among [openml_task_id, openml_dataset_id, dataset].")

@@ -522,7 +524,12 @@ def run(self):
                              predictions_dir=self.benchmark.output_dirs.predictions)
         framework_def = self.benchmark.framework_def
         task_config = copy(self.task_config)
-        task_config.type = 'regression' if self._dataset.type == DatasetType.regression else 'classification'
+        if self._dataset.type == DatasetType.regression:
+            task_config.type = 'regression'
+        elif self._dataset.type == DatasetType.timeseries:
+            task_config.type = 'timeseries'
+        else:
+            task_config.type = 'classification'
         task_config.type_ = self._dataset.type.name
         task_config.framework = self.benchmark.framework_name
         task_config.framework_params = framework_def.params
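The hunk above replaces a two-way conditional with a three-way branch on the dataset type. A minimal sketch of that mapping follows. The `DatasetType` enum here is a stand-in that mirrors the values in `amlb/data.py` after this commit, and the helper name `task_type_for` is hypothetical (it does not exist in amlb).

```python
from enum import Enum


class DatasetType(Enum):
    # Stand-in mirroring amlb/data.py after this commit.
    binary = 1
    multiclass = 2
    regression = 3
    timeseries = 4


def task_type_for(dataset_type: DatasetType) -> str:
    """Hypothetical helper: map a dataset type to the task type string."""
    if dataset_type == DatasetType.regression:
        return 'regression'
    if dataset_type == DatasetType.timeseries:
        return 'timeseries'
    # binary and multiclass both map to classification
    return 'classification'
```

Note that `task_config.type_` still carries the finer-grained name (`binary`, `multiclass`, ...), so the coarse `type` only needs these three values.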
@@ -552,4 +559,3 @@ def run(self):
         finally:
             self._dataset.release()
         return results.compute_score(result=result, meta_result=meta_result)
-
1 change: 1 addition & 0 deletions amlb/data.py
@@ -172,6 +172,7 @@ class DatasetType(Enum):
     binary = 1
     multiclass = 2
     regression = 3
+    timeseries = 4


 class Dataset(ABC):
76 changes: 61 additions & 15 deletions amlb/datasets/file.py
@@ -16,7 +16,7 @@
 from ..utils import Namespace as ns, as_list, lazy_property, list_all_files, memoize, path_from_split, profile, repr_def, split_path

 from .fileutils import is_archive, is_valid_url, unarchive_file, get_file_handler
-
+from copy import deepcopy

 log = logging.getLogger(__name__)

@@ -33,7 +33,7 @@ def __init__(self, cache_dir=None):
     def load(self, dataset, fold=0):
         dataset = dataset if isinstance(dataset, ns) else ns(path=dataset)
         log.debug("Loading dataset %s", dataset)
-        paths = self._extract_train_test_paths(dataset.path if 'path' in dataset else dataset, fold=fold)
+        paths = self._extract_train_test_paths(dataset.path if 'path' in dataset else dataset, fold=fold, name=dataset['name'] if 'name' in dataset else None)
         assert fold < len(paths['train']), f"No training dataset available for fold {fold} among dataset files {paths['train']}"
         # seed = rget().seed(fold)
         # if len(paths['test']) == 0:
@@ -51,21 +51,28 @@ def load(self, dataset, fold=0):
         if ext == '.arff':
             return ArffDataset(train_path, test_path, target=target, features=features, type=type_)
         elif ext == '.csv':
-            return CsvDataset(train_path, test_path, target=target, features=features, type=type_)
+            if DatasetType[dataset['type']] == DatasetType.timeseries and dataset['timestamp_column'] is None:
+                log.warning("Warning: For timeseries task setting undefined timestamp column to `timestamp`.")
+                dataset = deepcopy(dataset)
+                dataset['timestamp_column'] = "timestamp"
+            csv_dataset = CsvDataset(train_path, test_path, target=target, features=features, type=type_, timestamp_column=dataset['timestamp_column'] if 'timestamp_column' in dataset else None)
+            if csv_dataset.type == DatasetType.timeseries:
+                csv_dataset = self.extend_dataset_with_timeseries_config(csv_dataset, dataset)
+            return csv_dataset
         else:
             raise ValueError(f"Unsupported file type: {ext}")

-    def _extract_train_test_paths(self, dataset, fold=None):
+    def _extract_train_test_paths(self, dataset, fold=None, name=None):
         if isinstance(dataset, (tuple, list)):
             assert len(dataset) % 2 == 0, "dataset list must contain an even number of paths: [train_0, test_0, train_1, test_1, ...]."
             return self._extract_train_test_paths(ns(train=[p for i, p in enumerate(dataset) if i % 2 == 0],
                                                      test=[p for i, p in enumerate(dataset) if i % 2 == 1]),
-                                                  fold=fold)
+                                                  fold=fold, name=name)
         elif isinstance(dataset, ns):
-            return dict(train=[self._extract_train_test_paths(p)['train'][0]
+            return dict(train=[self._extract_train_test_paths(p, name=name)['train'][0]
                                if i == fold else None
                                for i, p in enumerate(as_list(dataset.train))],
-                        test=[self._extract_train_test_paths(p)['train'][0]
+                        test=[self._extract_train_test_paths(p, name=name)['train'][0]
                               if i == fold else None
                               for i, p in enumerate(as_list(dataset.test))])
         else:
@@ -116,7 +123,10 @@ def _extract_train_test_paths(self, dataset, fold=None):
                 assert len(paths) > 0, f"No dataset file found in {dataset}: they should follow the naming xxxx_train.ext, xxxx_test.ext or xxxx_train_0.ext, xxxx_test_0.ext, xxxx_train_1.ext, ..."
                 return paths
         elif is_valid_url(dataset):
-            cached_file = os.path.join(self._cache_dir, os.path.basename(dataset))
+            if name is None:
+                cached_file = os.path.join(self._cache_dir, os.path.basename(dataset))
+            else:
+                cached_file = os.path.join(self._cache_dir, name, os.path.basename(dataset))
             if not os.path.exists(cached_file):  # don't download if previously done
                 handler = get_file_handler(dataset)
                 assert handler.exists(dataset), f"Invalid path/url: {dataset}"
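The change above puts downloads into a per-dataset subdirectory when a name is available, which presumably avoids cache collisions between datasets whose URLs share a basename such as `train.csv`. A small sketch of the path logic, with a hypothetical helper name (`cached_file_path`) and made-up example URLs:

```python
import os
from typing import Optional


def cached_file_path(cache_dir: str, url: str, name: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the name/no-name branch in the hunk above."""
    basename = os.path.basename(url)
    if name is None:
        return os.path.join(cache_dir, basename)
    # With a dataset name, each dataset gets its own subdirectory in the cache.
    return os.path.join(cache_dir, name, basename)


# Illustrative URLs only; two datasets that would otherwise share a cache entry.
path_a = cached_file_path("/tmp/cache", "https://example.com/a/train.csv", name="dataset_a")
path_b = cached_file_path("/tmp/cache", "https://example.com/b/train.csv", name="dataset_b")
```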
@@ -129,6 +139,40 @@ def __repr__(self):
         return repr_def(self)


+    def extend_dataset_with_timeseries_config(self, dataset, dataset_config):
+        dataset = deepcopy(dataset)
+        dataset_config = deepcopy(dataset_config)
+        if dataset_config['id_column'] is None:
+            log.warning("Warning: For timeseries task setting undefined `id_column` to `item_id`.")
+            dataset_config['id_column'] = "item_id"
+        if dataset_config['forecast_range_in_steps'] is None:
+            log.warning("Warning: For timeseries task setting undefined `forecast_range_in_steps` to `1`.")
+            dataset_config['forecast_range_in_steps'] = "1"
+
+        dataset.timestamp_column=dataset_config['timestamp_column']
+        dataset.id_column=dataset_config['id_column']
+        dataset.forecast_range_in_steps=int(dataset_config['forecast_range_in_steps'])
+
+        train_seqs_lengths = dataset.train.X.groupby(dataset.id_column).count()
+        test_seqs_lengths = dataset.test.X.groupby(dataset.id_column).count()
+        forecast_range_in_steps_mean_diff_train_test = int((test_seqs_lengths - train_seqs_lengths).mean())
+        forecast_range_in_steps_max_min_train_test = int(min(int(test_seqs_lengths.min()), int(train_seqs_lengths.min()))) - 1
+        if not dataset.forecast_range_in_steps == forecast_range_in_steps_mean_diff_train_test:
+            msg = f"Warning: Forecast range {dataset.forecast_range_in_steps}, does not equal mean difference between test and train sequence lengths {forecast_range_in_steps_mean_diff_train_test}."
+            log.warning(msg)
+        if not (test_seqs_lengths - train_seqs_lengths).var().item() == 0.:
+            msg = "Error: Not all sequences of train and test set have same sequence length difference."
+            raise ValueError(msg)
+        if dataset.forecast_range_in_steps > forecast_range_in_steps_mean_diff_train_test:
+            msg = f"Error: Forecast range {dataset.forecast_range_in_steps} longer than difference between test and train sequence lengths {forecast_range_in_steps_mean_diff_train_test}."
+            raise ValueError(msg)
+        if dataset.forecast_range_in_steps > forecast_range_in_steps_max_min_train_test:
+            msg = f"Error: Forecast range {dataset.forecast_range_in_steps} longer than minimum sequence length - 1, {forecast_range_in_steps_max_min_train_test}."
+            raise ValueError(msg)
+        return dataset
+
+
+
 class FileDataset(Dataset):

     def __init__(self, train: Datasplit, test: Datasplit,
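The checks added in `extend_dataset_with_timeseries_config` compare per-series lengths of the train and test splits. The following is a standalone sketch of the same idea on plain DataFrames, not the committed code: the function name `check_forecast_range` is hypothetical, it uses `groupby(...).size()` rather than `.count()` to get one length per series, and it assumes both splits contain the same series ids.

```python
import pandas as pd


def check_forecast_range(train: pd.DataFrame, test: pd.DataFrame,
                         id_column: str, forecast_range_in_steps: int) -> None:
    """Hypothetical standalone version of the length checks in the hunk above."""
    train_lengths = train.groupby(id_column).size()
    test_lengths = test.groupby(id_column).size()
    diff = test_lengths - train_lengths  # aligned on series id

    # Every series must gain the same number of steps between train and test.
    if diff.isna().any() or diff.nunique() != 1:
        raise ValueError("Not all sequences have the same train/test length difference.")

    horizon = int(diff.iloc[0])
    if forecast_range_in_steps > horizon:
        raise ValueError(
            f"Forecast range {forecast_range_in_steps} is longer than the "
            f"train/test length difference {horizon}.")

    # The range must also leave at least one observed step in the shortest series.
    max_range = int(min(train_lengths.min(), test_lengths.min())) - 1
    if forecast_range_in_steps > max_range:
        raise ValueError(
            f"Forecast range {forecast_range_in_steps} is longer than the "
            f"minimum sequence length - 1 ({max_range}).")


# Illustrative data: two series, 5 steps in train and 7 in test.
train_df = pd.DataFrame({"item_id": ["a"] * 5 + ["b"] * 5, "timestamp": list(range(5)) * 2})
test_df = pd.DataFrame({"item_id": ["a"] * 7 + ["b"] * 7, "timestamp": list(range(7)) * 2})
check_forecast_range(train_df, test_df, "item_id", forecast_range_in_steps=2)
```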
@@ -302,25 +346,26 @@ def release(self, properties=None):
 class CsvDataset(FileDataset):

     def __init__(self, train_path, test_path,
-                 target=None, features=None, type=None):
+                 target=None, features=None, type=None, timestamp_column=None):
         # todo: handle auto-split (if test_path is None): requires loading the training set, split, save
         super().__init__(None, None,
                          target=target, features=features, type=type)
-        self._train = CsvDatasplit(self, train_path)
-        self._test = CsvDatasplit(self, test_path)
+        self._train = CsvDatasplit(self, train_path, timestamp_column=timestamp_column)
+        self._test = CsvDatasplit(self, test_path, timestamp_column=timestamp_column)
         self._dtypes = None


 class CsvDatasplit(FileDatasplit):

-    def __init__(self, dataset, path):
+    def __init__(self, dataset, path, timestamp_column=None):
         super().__init__(dataset, format='csv', path=path)
         self._ds = None
+        self.timestamp_column = timestamp_column

     def _ensure_loaded(self):
         if self._ds is None:
             if self.dataset._dtypes is None:
-                df = read_csv(self.path)
+                df = read_csv(self.path, timestamp_column=self.timestamp_column)
                 # df = df.convert_dtypes()
                 dt_conversions = {name: 'category'
                                   for name, dtype in zip(df.dtypes.index, df.dtypes.values)
@@ -336,8 +381,9 @@ def _ensure_loaded(self):

                 self._ds = df
                 self.dataset._dtypes = self._ds.dtypes
+
             else:
-                self._ds = read_csv(self.path, dtype=self.dataset._dtypes.to_dict())
+                self._ds = read_csv(self.path, dtype=self.dataset._dtypes.to_dict(), timestamp_column=self.timestamp_column)

     @profile(logger=log)
     def load_metadata(self):
@@ -348,7 +394,7 @@ def load_metadata(self):
                                       else 'number' if pat.is_numeric_dtype(dt)
                                       else 'category' if pat.is_categorical_dtype(dt)
                                       else 'string' if pat.is_string_dtype(dt)
-                                      # else 'datetime' if pat.is_datetime64_dtype(dt)
+                                      else 'datetime' if pat.is_datetime64_dtype(dt)
                                       else 'object')
         features = [Feature(i, col, to_feature_type(dtypes[i]))
                     for i, col in enumerate(self._ds.columns)]
10 changes: 8 additions & 2 deletions amlb/datautils.py
@@ -26,7 +26,7 @@
 log = logging.getLogger(__name__)


-def read_csv(path, nrows=None, header=True, index=False, as_data_frame=True, dtype=None):
+def read_csv(path, nrows=None, header=True, index=False, as_data_frame=True, dtype=None, timestamp_column=None):
     """
     read csv file to DataFrame.
@@ -37,13 +37,19 @@ def read_csv(path, nrows=None, header=True, index=False, as_data_frame=True, dty
     :param header: if the columns header should be read.
     :param as_data_frame: if the result should be returned as a data frame (default) or a numpy array.
     :param dtype: data type for columns.
+    :param timestamp_column: column name for timestamp, to ensure dates are correctly parsed by pandas.
     :return: a DataFrame
     """
+    if dtype is not None and timestamp_column is not None and timestamp_column in dtype:
+        dtype = dtype.copy()  # to avoid outer context manipulation
+        del dtype[timestamp_column]
+
     df = pd.read_csv(path,
                      nrows=nrows,
                      header=0 if header else None,
                      index_col=0 if index else None,
-                     dtype=dtype)
+                     dtype=dtype,
+                     parse_dates=[timestamp_column] if timestamp_column is not None else None)
     return df if as_data_frame else df.values


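The `read_csv` change drops the timestamp column from `dtype` and hands it to `parse_dates` instead, most likely because pandas expects datetime columns to be requested through `parse_dates` rather than through a datetime `dtype`. A reduced sketch of that behaviour, reading from an in-memory buffer; the function name `read_csv_with_timestamp` and the sample data are illustrative assumptions, not part of amlb.

```python
import io

import pandas as pd


def read_csv_with_timestamp(source, dtype=None, timestamp_column=None):
    """Hypothetical reduced version of the patched amlb.datautils.read_csv."""
    if dtype is not None and timestamp_column is not None and timestamp_column in dtype:
        # Work on a copy so the caller's dtype mapping is left untouched.
        dtype = {col: dt for col, dt in dtype.items() if col != timestamp_column}
    return pd.read_csv(source,
                       dtype=dtype,
                       parse_dates=[timestamp_column] if timestamp_column is not None else None)


# Made-up sample in the long format used for time series tasks.
csv_text = "item_id,timestamp,target\na,2020-01-01,1.0\na,2020-01-02,2.0\n"
dtypes = {"item_id": "object", "timestamp": "object", "target": "float64"}
df = read_csv_with_timestamp(io.StringIO(csv_text), dtype=dtypes, timestamp_column="timestamp")
```

After the call, `df["timestamp"]` holds parsed datetimes and the caller's `dtypes` dict still contains its `timestamp` entry.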
82 changes: 79 additions & 3 deletions amlb/results.py
@@ -228,12 +228,17 @@ def load_predictions(predictions_file):
             try:
                 df = read_csv(predictions_file, dtype=object)
                 log.debug("Predictions preview:\n %s\n", df.head(10).to_string())
+
                 if rconfig().test_mode:
                     TaskResult.validate_predictions(df)
-                if df.shape[1] > 2:
-                    return ClassificationResult(df)
+
+                if 'y_past_period_error' in df.columns:
+                    return TimeSeriesResult(df)
                 else:
-                    return RegressionResult(df)
+                    if df.shape[1] > 2:
+                        return ClassificationResult(df)
+                    else:
+                        return RegressionResult(df)
             except Exception as e:
                 return ErrorResult(ResultError(e))
         else:
@@ -254,6 +259,7 @@ def load_metadata(metadata_file):
 def save_predictions(dataset: Dataset, output_file: str,
                      predictions: Union[A, DF, S] = None, truth: Union[A, DF, S] = None,
                      probabilities: Union[A, DF] = None, probabilities_labels: Union[list, A] = None,
+                     optional_columns: Union[A, DF] = None,
                      target_is_encoded: bool = False,
                      preview: bool = True):
     """ Save class probabilities and predicted labels to file in csv format.
@@ -264,6 +270,7 @@ def save_predictions(dataset: Dataset, output_file: str,
     :param predictions:
     :param truth:
     :param probabilities_labels:
+    :param optional_columns:
     :param target_is_encoded:
     :param preview:
     :return: None
@@ -308,6 +315,10 @@ def save_predictions(dataset: Dataset, output_file: str,

     df = df.assign(predictions=preds)
     df = df.assign(truth=truth)
+
+    if optional_columns is not None:
+        df = pd.concat([df, optional_columns], axis=1)
+
     if preview:
         log.info("Predictions preview:\n %s\n", df.head(20).to_string())
     backup_file(output_file)
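Appending `optional_columns` after `predictions` and `truth` fixes the column order that `TimeSeriesResult` relies on when it slices quantile columns with `iloc[:, 2:-1]` and reads `y_past_period_error` by name. A small sketch of the resulting layout, assuming (as the slicing suggests) that the optional columns are quantile levels followed by `y_past_period_error`; all values and the choice of levels are made up for illustration.

```python
import pandas as pd

# Point forecasts and ground truth, as written for a regression-style task.
df = pd.DataFrame({"predictions": [10.0, 12.0], "truth": [11.0, 12.5]})

# Assumed optional columns: one column per quantile level, then the scaling error last.
optional_columns = pd.DataFrame({
    "0.1": [8.0, 10.0],
    "0.5": [10.0, 12.0],
    "0.9": [12.0, 14.0],
    "y_past_period_error": [1.5, 1.5],
})

df = pd.concat([df, optional_columns], axis=1)

# Everything between the first two columns and the last one is a quantile column.
quantile_levels = [float(col) for col in df.columns[2:-1]]
```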
@@ -656,6 +667,71 @@ def r2(self):
         """R^2"""
         return float(r2_score(self.truth, self.predictions))

+class TimeSeriesResult(RegressionResult):
+
+    def __init__(self, predictions_df, info=None):
+        super().__init__(predictions_df, info)
+        self.truth = self.df['truth'].values if self.df is not None else None #.iloc[:, 1].values if self.df is not None else None
+        self.predictions = self.df['predictions'].values if self.df is not None else None #.iloc[:, -2].values if self.df is not None else None
+        self.y_past_period_error = self.df['y_past_period_error'].values
+        self.quantiles = self.df.iloc[:, 2:-1].values
+        self.quantiles_probs = np.array([float(q) for q in self.df.columns[2:-1]])
+        self.truth = self.truth.astype(float, copy=False)
+        self.predictions = self.predictions.astype(float, copy=False)
+        self.quantiles = self.quantiles.astype(float, copy=False)
+        self.y_past_period_error = self.y_past_period_error.astype(float, copy=False)
+
+        self.target = Feature(0, 'target', 'real', is_target=True)
+        self.type = DatasetType.timeseries
+
+    @metric(higher_is_better=False)
+    def mase(self):
+        """Mean Absolute Scaled Error"""
+        return float(np.nanmean(np.abs(self.truth/self.y_past_period_error - self.predictions/self.y_past_period_error)))
+
+    @metric(higher_is_better=False)
+    def smape(self):
+        """Symmetric Mean Absolute Percentage Error"""
+        num = np.abs(self.truth - self.predictions)
+        denom = (np.abs(self.truth) + np.abs(self.predictions)) / 2
+        # If the denominator is 0, we set it to float('inf') such that any division yields 0 (this
+        # might not be fully mathematically correct, but at least we don't get NaNs)
+        denom[denom == 0] = math.inf
+        return np.mean(num / denom)
+
+    @metric(higher_is_better=False)
+    def mape(self):
+        """Mean Absolute Percentage Error"""
+        num = np.abs(self.truth - self.predictions)
+        denom = np.abs(self.truth)
+        # If the denominator is 0, we set it to float('inf') such that any division yields 0 (this
+        # might not be fully mathematically correct, but at least we don't get NaNs)
+        denom[denom == 0] = math.inf
+        return np.mean(num / denom)
+
+    @metric(higher_is_better=False)
+    def nrmse(self):
+        """Normalized Root Mean Square Error"""
+        return self.rmse() / np.mean(np.abs(self.truth))
+
+    @metric(higher_is_better=False)
+    def wape(self):
+        """Weighted Average Percentage Error"""
+        return np.sum(np.abs(self.truth - self.predictions)) / np.sum(np.abs(self.truth))
+
+    @metric(higher_is_better=False)
+    def ncrps(self):
+        """Normalized Continuous Ranked Probability Score"""
+        quantile_losses = 2 * np.sum(
+            np.abs(
+                (self.quantiles - self.truth[:, None])
+                * ((self.quantiles >= self.truth[:, None]) - self.quantiles_probs[None, :])
+            ),
+            axis=0,
+        )  # shape [num_quantiles]
+        denom = np.sum(np.abs(self.truth))  # scalar
+        weighted_losses = quantile_losses.sum(0) / denom  # scalar: sums the per-quantile losses
+        return weighted_losses.mean()

 _encode_predictions_and_truth_ = False

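The point-forecast metrics in `TimeSeriesResult` reduce to a few NumPy expressions. The sketch below restates them as free functions on small made-up arrays so the formulas can be checked by hand. The function names are hypothetical. The quantile loss shown is one common formulation (a weighted quantile loss averaged over quantile levels) and is not claimed to reproduce the committed `ncrps` value, which sums over quantile levels.

```python
import numpy as np


def mase(truth, predictions, scale):
    """Mean absolute error, scaled per row by an in-sample error (assumed positive)."""
    return float(np.nanmean(np.abs(truth - predictions) / scale))


def smape(truth, predictions):
    num = np.abs(truth - predictions)
    denom = (np.abs(truth) + np.abs(predictions)) / 2
    denom = np.where(denom == 0, np.inf, denom)  # a zero denominator contributes 0
    return float(np.mean(num / denom))


def mape(truth, predictions):
    denom = np.abs(truth)
    denom = np.where(denom == 0, np.inf, denom)  # a zero denominator contributes 0
    return float(np.mean(np.abs(truth - predictions) / denom))


def wape(truth, predictions):
    return float(np.sum(np.abs(truth - predictions)) / np.sum(np.abs(truth)))


def mean_weighted_quantile_loss(truth, quantiles, levels):
    """quantiles: shape [n, num_quantiles]; levels: shape [num_quantiles]."""
    indicator = (quantiles >= truth[:, None]).astype(float)
    per_level = 2 * np.sum(
        np.abs((quantiles - truth[:, None]) * (indicator - levels[None, :])), axis=0
    )  # shape [num_quantiles]
    return float(np.mean(per_level / np.sum(np.abs(truth))))


# Made-up values for illustration.
truth = np.array([10.0, 20.0, 30.0])
predictions = np.array([12.0, 18.0, 33.0])
scale = np.array([2.0, 2.0, 2.0])
levels = np.array([0.1, 0.5, 0.9])
quantiles = np.stack([predictions - 2.0, predictions, predictions + 2.0], axis=1)
```

With these arrays the absolute errors are 2, 2 and 3, so `wape` is 7/60, `mape` is the mean of 0.2, 0.1 and 0.1, and `mase` is the mean of 1, 1 and 1.5.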
16 changes: 16 additions & 0 deletions frameworks/AutoGluon/README.md
@@ -0,0 +1,16 @@
+# AutoGluon
+
+To run v0.5.2: ```python3 ../automlbenchmark/runbenchmark.py autogluon ...```
+
+To run mainline: ```python3 ../automlbenchmark/runbenchmark.py autogluon:latest ...```
+
+
+# AutoGluonTS
+
+AutoGluonTS stands for autogluon.timeseries. This framework handles time series problems.
+
+## Run Steps
+
+To run v0.5.2: ```python3 ../automlbenchmark/runbenchmark.py autogluonts timeseries ...```
+
+To run mainline: ```python3 ../automlbenchmark/runbenchmark.py autogluonts:latest timeseries ...```