Replies: 3 comments 6 replies
-
Hiyo, I think that's perfectly reasonable and I often use AutoMLBenchmark as a test for handling all the subtleties of non-encoded data. With regard to backwards compatibility, I would suggest that perhaps
-
+1. In my opinion, even the base models of AutoML (think XGBoost, CatBoost, TabPFN, ...) should be able to handle raw data starting in 2025, even if scikit-learn does not follow this approach.
-
I am considering dropping support for encoded data. When we started this project in 2018, several AutoML frameworks could not natively handle categorical and/or text columns in tabular data. To accommodate them, we also supported providing frameworks with encoded, all-numerical numpy arrays. Nowadays, however, almost all frameworks support text data directly.
Going into 2025, I think it's completely fair to expect AutoML frameworks to 1) be able to take columns with text data as input and 2) be able to take textual class labels as input and produce them as predictions. For that reason, and to reduce the complexity of some of the data loading and serialization code (and consequently the options exposed to integration scripts), I propose to do away with the option to get encoded data directly from the benchmark software.
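To make the two interfaces concrete, here is a minimal sketch of the difference between the raw data a framework would receive under this proposal and the encoded all-numerical view the benchmark currently also provides. The column names, values, and the use of `pd.factorize` are purely illustrative, not the benchmark's actual serialization code:

```python
import pandas as pd

# Hypothetical toy task: column names and values are illustrative only.
raw_X = pd.DataFrame({
    "age": [34, 51, 29],
    "occupation": ["teacher", "engineer", "nurse"],            # categorical column
    "notes": ["likes sports", "works remotely", "part time"],  # free-text column
})
raw_y = pd.Series(["approved", "rejected", "approved"], name="target")

# The legacy encoded view: categoricals integer-encoded, textual class
# labels mapped to class indices, free text dropped entirely.
enc_occupation, occ_categories = pd.factorize(raw_X["occupation"])
enc_y, classes = pd.factorize(raw_y)
encoded_X = raw_X.assign(occupation=enc_occupation).drop(columns=["notes"])

# Under the proposal, frameworks receive raw_X / raw_y directly and must
# return string labels (e.g. "approved"), not class indices (e.g. 0).
print(list(enc_y))      # class indices the legacy path would expose
print(list(classes))    # the textual labels frameworks would now handle
```

Note how the encoded view silently discards the `notes` column: free text has no meaningful integer encoding, which is part of why passing raw data through is the cleaner contract.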
The drawback is that we may need to drop support for some frameworks that are no longer updated (Auto-WEKA, TPOT), and/or provide a legacy pipeline (also for, e.g., the baselines). On balance, I think this is very reasonable. Should these older systems need to be benchmarked, older versions of the AutoML benchmark can continue to be used.
I hope people in the AutoML community can weigh in on this decision.
(I am also pinging @mfeurer @eddiebergman @LennartPurucker @Innixma @shchur @fmohr, in particular because of their earlier contributions to the benchmark, or because they maintain a framework that currently uses encoded data; the discussion is open to everyone.)