Replies: 3 comments 6 replies
-
Hiyo, I think that's perfectly reasonable and I often use AutoMLBenchmark as a test for handling all the subtleties of non-encoded data. With regard to backwards compatibility, I would suggest that perhaps
-
+1. In my opinion, even the base models of AutoML (think XGBoost, CatBoost, TabPFN, ...) should be able to handle raw data starting in 2025, even if scikit-learn does not follow this approach.
-
I am considering dropping support for encoded data. When we started this project in 2018, several AutoML frameworks could not natively handle categorical and/or text columns in tabular data. To accommodate them, we also supported providing frameworks with encoded, all-numerical numpy arrays. Nowadays, however, almost all frameworks support text data directly.
Going into 2025, I think it's completely fair to expect AutoML frameworks to 1) be able to take columns with text data as input and 2) be able to take textual class labels as input and produce them as predictions. For that reason, and to reduce the complexity of some of the data loading and serialization code (and consequently the options exposed to integration scripts), I propose to do away with the option to get encoded data directly from the benchmark software.
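To make the two interfaces concrete, here is a minimal sketch of the difference between the raw data a framework would receive under this proposal and the encoded all-numerical view the benchmark currently also provides. The column names, values, and the use of `pd.factorize` are purely illustrative, not the benchmark's actual serialization code:

```python
import pandas as pd

# Hypothetical toy task: column names and values are illustrative only.
raw_X = pd.DataFrame({
    "age": [34, 51, 29],
    "occupation": ["teacher", "engineer", "nurse"],            # categorical column
    "notes": ["likes sports", "works remotely", "part time"],  # free-text column
})
raw_y = pd.Series(["approved", "rejected", "approved"], name="target")

# The legacy encoded view: categoricals integer-encoded, textual class
# labels mapped to class indices, free text dropped entirely.
enc_occupation, occ_categories = pd.factorize(raw_X["occupation"])
enc_y, classes = pd.factorize(raw_y)
encoded_X = raw_X.assign(occupation=enc_occupation).drop(columns=["notes"])

# Under the proposal, frameworks receive raw_X / raw_y directly and must
# return string labels (e.g. "approved"), not class indices (e.g. 0).
print(list(enc_y))      # class indices the legacy path would expose
print(list(classes))    # the textual labels frameworks would now handle
```

Note how the encoded view silently discards the `notes` column: free text has no meaningful integer encoding, which is part of why passing raw data through is the cleaner contract.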
The drawback is that we may need to drop support for some frameworks that are no longer updated (Auto-WEKA, TPOT), and/or provide a legacy pipeline (also for, e.g., the baselines). On balance, I think this is very reasonable. Should these older systems need to be benchmarked, older versions of the AutoML benchmark can continue to be used.
I hope people in the AutoML community can weigh in on this decision.
(I am also pinging @mfeurer @eddiebergman @LennartPurucker @Innixma @shchur @fmohr, in particular because of their earlier contributions to the benchmark, or because they maintain a framework that currently uses encoded data; the discussion is open to everyone.)