Release Multimodal Toolkit v0.4.0 · georgian-io/Multimodal-Toolkit

Features:

CategoricalFeatures now exposes fit(), transform(), and fit_transform() methods.
Created a new NumericalFeatures object with the same methods, for a consistent interface.
Decoupled CategoricalFeatures and the dataset. The object is now independent of the dataset and performs transformations based on information from the fit() step. It can now be used separately for inference.
Resolve #76 by saving numerical and categorical transformers for inference usage.
Updated NaN handling for both categorical and numerical features. Users can specify whether NaNs should be handled and what they should be replaced with. NaNs in numerical features can be replaced with the median, the mean, or a custom value, while NaNs in categorical features can be replaced with a custom value only. Also resolves #69.
Resolve #66 by adding handle_unknown argument for OneHotEncoders in the config.
Add a new inference.py script to showcase how to use the saved feature transformers.
Update default types for several variables such as categorical_cols and label_list to use lists instead of None.
Class weights have been removed from the dataset and preprocessing sections, where the setting was unusable and caused errors even when it was set. They are now a parameter in TabularConfig and are used by the model in the forward() call.
Update tests & main.py to support new features.
Update test configuration to reduce the maximum token length - this speeds up the testing and also prevents certain models with lower sequence lengths from throwing an error due to an unsupported sequence length.
Argument classes are now part of the library, so there is no need to redefine them each time.
Add a .gitignore file
Change license to Apache 2.0
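
To make the fit/transform and NaN-handling changes above concrete, here is a minimal pure-Python sketch of the pattern: fit on training data, persist the fitted state, and reuse it at inference. The class name SketchNumericalImputer, its arguments, and the JSON persistence format are all hypothetical, chosen only for illustration. This is not the toolkit's NumericalFeatures implementation, and these notes do not say how the toolkit stores its transformers.

```python
import json
import math
import statistics


class SketchNumericalImputer:
    """Hypothetical stand-in; NOT the toolkit's NumericalFeatures class."""

    def __init__(self, strategy="median", fill_value=None):
        if strategy not in ("median", "mean", "value"):
            raise ValueError(f"unknown strategy: {strategy}")
        if strategy == "value" and fill_value is None:
            raise ValueError("strategy='value' needs a fill_value")
        self.strategy = strategy
        self.fill_value = fill_value
        self.replacement_ = None  # learned in fit()

    def fit(self, values):
        # Learn the replacement from the non-NaN training values only.
        observed = [v for v in values if not math.isnan(v)]
        if self.strategy == "median":
            self.replacement_ = statistics.median(observed)
        elif self.strategy == "mean":
            self.replacement_ = statistics.mean(observed)
        else:
            self.replacement_ = self.fill_value
        return self

    def transform(self, values):
        # Uses only what fit() stored, so no training data is needed here.
        if self.replacement_ is None:
            raise RuntimeError("call fit() before transform()")
        return [self.replacement_ if math.isnan(v) else v for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

    def to_json(self):
        # Assumed persistence format, for illustration only.
        return json.dumps(self.__dict__)

    @classmethod
    def from_json(cls, payload):
        state = json.loads(payload)
        restored = cls(state["strategy"], state["fill_value"])
        restored.replacement_ = state["replacement_"]
        return restored


nan = float("nan")
train = [1.0, nan, 3.0, 8.0]
imputer = SketchNumericalImputer(strategy="median").fit(train)

# Save the fitted state, then restore it as an inference script might.
restored = SketchNumericalImputer.from_json(imputer.to_json())
inference_rows = restored.transform([nan, 5.0])
```

Because transform() depends only on the state learned in fit(), the restored object can process new rows without access to the original training data, which is the property the decoupling above is meant to provide.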

Fixes:

Add a note to the example notebook to address #71
#61 (thanks @DougTrajano!)
#62 (thanks @DougTrajano!)
Add importlib-metadata to setup.py as there was a dependency error without it.
Reset the index before preprocessing: categorical preprocessing resets the index, which caused issues when merging its output with the numerical & text features.
Fix: OneHotEncoder no longer uses a deprecated parameter.
Fix: Categorical features are now correctly processed as numpy arrays after transformation.
Misc bugfixes
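
The index fix above addresses a general pandas behaviour that can be reproduced outside the toolkit. The sketch below uses plain pandas with made-up column names and data. It is not the toolkit's preprocessing code; it only illustrates why a reset index on one part breaks a column-wise merge with parts that still carry the original index.

```python
import pandas as pd

# A frame whose index is not 0..n-1, e.g. after filtering or a train/test split.
df = pd.DataFrame(
    {"num": [1.0, 2.0, 3.0], "cat": ["a", "b", "a"]},
    index=[10, 11, 12],
)

# Categorical output with a fresh 0..n-1 index (stands in for the encoder output).
cat_part = pd.get_dummies(df["cat"]).reset_index(drop=True)

# Numerical part still carries the original index 10..12.
num_part = df[["num"]]

# concat aligns on index labels, so the two parts do not line up:
# the result has the union of both indexes and NaNs in the gaps.
misaligned = pd.concat([num_part, cat_part], axis=1)

# Resetting the index up front gives every part the same labels.
df_reset = df.reset_index(drop=True)
aligned = pd.concat([df_reset[["num"]], cat_part], axis=1)

print(misaligned.shape[0], aligned.shape[0])
```

Resetting once, before any feature is processed, keeps all parts on the same row labels, so a later merge pairs each row with its own features.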

Housekeeping:

Deps: Update requirements to resolve dependabot alerts.
Deps: Update setup.py to use latest versions of transformers, pandas, scikit-learn, scipy and accelerate.
Refactor: Rename the notebooks folder into an examples folder.
Refactor: Update all function calls to explicitly name parameters to avoid confusion.
Style: Reformat entire library with black.
Docs: Update repository maintainers
Docs: Add type hints & docstrings to data module.
Docs: Update Sphinx and regenerate documentation.
Chore: Update library to version 0.4.0.
