forked from openml/automlbenchmark
validation.yaml
---
# A selection of tasks used to validate the benchmark app: various sizes and formats (with or without NaNs, missing values, ...)
- name: bioresponse
  openml_task_id: 9910
  description: |
    Binary, many features, all numerical.
- name: dresses-sales
  openml_task_id: 125920
  description: |
    Binary, mainly categorical features, with missing values in most categorical features.
    Also, the ARFF file contains a categorical feature with 2 labels for the same value (they differ only by case).
- name: eucalyptus
  openml_task_id: 2079
  description: |
    Multiclass, mix of numerical and categorical features, with missing values in several numerical features.
- name: internet-advertisements
  openml_task_id: 167125
  description: |
    Binary, many features, almost all categorical but with (0, 1) values, so they can be interpreted as int.
- name: micro-mass
  openml_task_id: 9950
  description: |
    Multiclass, many features, all numerical.
- name: kc1
  openml_task_id: 3917
  description: |
    Binary with (true, false) as target classes.
    This causes issues if the framework uses pandas when obtaining predictions:
    pandas automatically converts ("true", "false") strings to (True, False) booleans, which are then written back as ("True", "False") when saved to CSV.
    In those cases, pandas should be avoided at that step, or a string type/conversion should be enforced for the target predictions column.
    cf. H2OAutoML, where pandas could be avoided when reading predictions.
- name: APSFailure
  openml_task_id: 168868
  description: |
    The dataset doesn't have its target as the last column by default, and some frameworks may rely on this.
    cf. AutoWEKA for an example showing how to handle this when the framework requires the target in a specific position.
- name: diabetes130US
  openml_task_id: 168877
  description: |
    Missing values not formatted correctly.
- name: census-income
  openml_task_id: 211985
  description: |
    Many categorical features with labels starting or ending with spaces.