-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create ensembler web service #165
Create ensembler web service #165
Conversation
COPY --from=builder /venv /venv | ||
|
||
RUN /bin/bash -c ". /venv/bin/activate && \ | ||
python -m pyfunc_ensembler_runner --mlflow_ensembler_dir /ensembler --dry_run" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly this --dry_run
does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh it's just an option to run the web service (to load the ensembler from the mlflow registry and the runner) without actually serving it. It's... actually kinda pointless but I observed @gojek/merlin having it for their pyfunc-server
s so I was wondering if that option would serve some other separate/greater purpose, which was why I included it to be safe.
But I can remove it also since I can't seem to find any reason why we should run it once without serving it in order to ensure it's working, before actually running the service again for serving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if that option would serve some other separate/greater purpose, which was why I included it to be safe.
I think the reason is that if you run this command at the image building step, then failure of this dry-run
step will prevent image from being even published and hence we will not spend more time trying to deploy the image if it's misconfigured
def _flatten_json(y: Dict[str, Any]) -> Dict[str, Any]: | ||
""" | ||
Helper function to normalise a nested dictionary into a dictionary of depth 1 with names following the | ||
convention: key_1.key_2.key_3..., with a period acting as a delimiter between nested keys | ||
|
||
Items in lists have their names rendered using their index numbers within the lists they are found in. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very comfortable with this. It seems like dict
could be a better container for this data, compared to pandas.DataFrame/pandas.Series
. The reason for this is:
- Latency – for real-time ensemblers, the latency requirement is much more critical, compared to batch ensembling jobs. Having all this json manipulation and transformation into a pandas containers is likely something we'd better avoid
- This "flat-json" is not something the end-users would expect to work with. Also, JSON keys could have
.
in their names, which is not handled here
With that said, re-working the interface of the PyFunc ensembler makes more sense, and then batch-ensembler
can be updated too, to transform pandas data into dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay with regards to the latency concerns for real-time ensemblers, I've reworked (overloaded) the predict
method of the PyFunc
base class a little to make it work differently depending on whether it gets passed a dict
(when called by a real-time ensembler engine) or a pandas.DataFrame
(when called by a batch-ensembler engine) as its argument:
https://github.com/gojek/turing/blob/1bce96a181324636367236df3e9ad6d98c285c68/sdk/turing/ensembler.py#L73
With this change, I've completely removed any of those pandas.DataFrame
transformations like flatten_json
or other preprocessing table operations that might introduce latency in the real-time ensemblers.
While it'd be nice to unify both the batch and real-time ensembling use cases to act on and return dict
inputs, the predict
method, as an inherited method of the mlflow.pyfunc.PythonModel
abstract class, gets utilised by other downstream methods such as mlflow.pyfunc.spark_udf
, that forces the predict
method to follow a certain contract by taking pandas.DataFrame
s as input and output.
In particular, the current batch ensembling engine uses mlflow.pyfunc.spark_udf
, which necessitates at least some implementation of predict
that takes in and outputs pandas.DataFrame
s. For now I've kept our original implementation intact to prevent the batch ensembler from breaking, but I'll see if I can find a solution that's more elegant.
…ith current batch ensemblers
8fb53c7
to
d5b9a40
Compare
…l-time ensembling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Thanks, @deadlycoconuts!
Alright thanks @romanwozniak for the useful comments; I'm gonna merge this now 🚀 |
Context
In PR #164, the abstract class
PyFunc
was modified in preparation for the introduction of a real-time pyfunc ensembler engine.This PR introduces the aforementioned real-time ensembler engine which is in effect a web service wrapped around by a container that serves real-time ensembling requests using an ensembler defined in the
mlflow
pyfunc flavour. This ensembler is to be created using the Turing SDK by implementing the samePyFunc
class currently used for implementing batch ensemblers. Hence with these changes, a user would be able to implement an pyfunc ensembler that works in both batch and real-time ensembling (if the same batch columns/live payload naming convention is used).Features
ensemble
method of child classes of thePyFunc
classtornado
framework to serve ensembling requests sent from a Turing router, by using the real-time ensembler engineMain Additions
engines/real-time-ensembler/pyfunc_ensembler_runner/ensembler_runner.py
- Class that holds the real-time ensembler engineengines/real-time-ensembler/pyfunc_ensembler_runner/handler.py
- Class containing the HTTP handler (according to thetornado
framework) that handles post requests from a Turing routerengines/real-time-ensembler/pyfunc_ensembler_runner/server.py
- Class containing the web serviceengines/real-time-ensembler/pyfunc_ensembler_runner/__main__.py
- Main entry point that runs the web serviceengines/real-time-ensembler/Dockerfile
- Base image for the web service; does not contain artefacts of the pyfunc ensemblerengines/real-time-ensembler/app.Dockerfile
- Docker image that builds upon the base image by loading artefacts of a given pyfunc ensemblersdk/turing/ensembler.py
- Changes to thepredict
method of thePyFunc
class to process the incoming arguments differently according to their input type (pandas.DataFrame
for batch ensembling vsdict
for real-time ensembling).github/workflows/real-time-ensembler.yaml
- GitHub workflow that tests and publishes the base Docker image containing the real-time ensembler engine