[data/preprocessors] Allow encoders to be used in append mode (#50324)

## Why are these changes needed?

This is part of #48133. Continuing the approach taken in #49426, this PR
makes all the encoders work in append mode.
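As a rough illustration of what append mode means here, the sketch below assumes it refers to writing encoded values to separate output columns instead of overwriting the input columns. `ordinal_encode` and the dict-of-lists batch are hypothetical stand-ins, not Ray APIs; only the `columns` / `output_columns` naming is taken from the diff in this commit.

```python
from typing import Any, Dict, List, Optional

Batch = Dict[str, List[Any]]


def ordinal_encode(
    batch: Batch, columns: List[str], output_columns: Optional[List[str]] = None
) -> Batch:
    """Hypothetical encoder: maps each category to its sorted position.

    Without output_columns the input columns are overwritten; with
    output_columns the encoded values are appended under the new names.
    """
    targets = output_columns or columns
    result = dict(batch)  # shallow copy so the caller's batch is untouched
    for src, dst in zip(columns, targets):
        categories = sorted(set(batch[src]))
        index = {value: i for i, value in enumerate(categories)}
        result[dst] = [index[value] for value in batch[src]]
    return result


batch = {"color": ["red", "blue", "red"]}

# Default behaviour: "color" is replaced by its encoded values.
overwritten = ordinal_encode(batch, ["color"])

# Append mode: "color" is kept and "color_encoded" is added next to it.
appended = ordinal_encode(batch, ["color"], output_columns=["color_encoded"])
```

In the appended result the original `color` values stay available alongside the encoded column, which is the property this PR extends to the encoders.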

## Related issue number

#49426

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Martin Bomio <martinbomio@spotify.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2 people authored and israbbani committed Feb 25, 2025
1 parent 0d81741 commit 39e6f71
Showing 5 changed files with 307 additions and 72 deletions.
23 changes: 22 additions & 1 deletion python/ray/data/preprocessor.py
@@ -4,7 +4,7 @@
 import pickle
 import warnings
 from enum import Enum
-from typing import TYPE_CHECKING, Any, Dict, Union
+from typing import TYPE_CHECKING, Any, Dict, Union, List, Optional

 from ray.air.util.data_batch_conversion import BatchFormat
 from ray.util.annotations import DeveloperAPI, PublicAPI
@@ -277,6 +277,27 @@ def _transform_batch(self, data: "DataBatchType") -> "DataBatchType":
         elif transform_type == BatchFormat.NUMPY:
             return self._transform_numpy(_convert_batch_type_to_numpy(data))

+    @classmethod
+    def _derive_and_validate_output_columns(
+        cls, columns: List[str], output_columns: Optional[List[str]]
+    ) -> List[str]:
+        """Returns the output columns after validation.
+
+        Checks if the columns are explicitly set, otherwise defaulting to
+        the input columns.
+
+        Raises:
+            ValueError if the length of the output columns does not match the
+            length of the input columns.
+        """
+
+        if output_columns and len(columns) != len(output_columns):
+            raise ValueError(
+                "Invalid output_columns: Got len(columns) != len(output_columns)."
+                "The length of columns and output_columns must match."
+            )
+        return output_columns or columns
+
     @DeveloperAPI
     def _transform_pandas(self, df: "pd.DataFrame") -> "pd.DataFrame":
         """Run the transformation on a data batch in a Pandas DataFrame format."""
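The validation helper in the hunk above can be tried outside of Ray. The following is a minimal standalone sketch, not the library's code: the logic mirrors the classmethod in the diff, but it is written as a free function with a hypothetical name, and a space is added between the two sentences of the error message (the adjacent string literals in the diff concatenate without one).

```python
from typing import List, Optional


def derive_and_validate_output_columns(
    columns: List[str], output_columns: Optional[List[str]]
) -> List[str]:
    """Standalone copy of the helper's logic, for illustration only."""
    # A non-empty output_columns must pair up one-to-one with columns.
    if output_columns and len(columns) != len(output_columns):
        raise ValueError(
            "Invalid output_columns: Got len(columns) != len(output_columns). "
            "The length of columns and output_columns must match."
        )
    # None or an empty list is falsy, so both fall back to the input columns.
    return output_columns or columns


in_place = derive_and_validate_output_columns(["a", "b"], None)
appended = derive_and_validate_output_columns(["a", "b"], ["a_enc", "b_enc"])

try:
    derive_and_validate_output_columns(["a", "b"], ["only_one"])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Because the fallback uses `or`, passing an empty list behaves the same as passing `None`: the input column names are reused and the transformation overwrites them.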
