
PGV Python Approach #237

Closed
kothariaditya opened this issue Jul 10, 2019 · 16 comments
Labels
Enhancement Extend or improve functionality Wont Fix Out of scope of this project

Comments

@kothariaditya
Contributor

kothariaditya commented Jul 10, 2019

We are considering a new approach for Python that’s a bit different from existing language support: rather than doing meaningful compile-time generation, it would generate the validation function just-in-time (JIT).

What is the approach?

Hand-code a generate_validate function (proto.Message->(proto.Message->bool)) that would do JIT generation at runtime of the validate function for each proto.Message class.

The generated pb2.py files have Descriptors that provide access to the options and annotations on a proto message. In generate_validate we’d walk a Descriptor and get the validation rules for every field. Based on the validation rules, we generate a validation function at runtime and return that to the user.

For performance, we could cache the validate function for each message class. We could also allow pre-priming the cache to avoid cold starts.

A proof of concept of generate_validate:

# proto_message is an instance of a proto message
def _generate_validate(proto_message):
    func = "def validate_fn(p):\n"
    for field in proto_message.DESCRIPTOR.fields:
        if has_validation(field):  # has_validation() checks whether the field is annotated with validation rules
            for option_descriptor, option_value in field.GetOptions().ListFields():
                if option_descriptor.full_name == "validate.rules":
                    if option_value.HasField("string"):
                        func += string_template(option_value.string, field.name)
                    elif option_value.HasField("int32"):
                        func += int_template(option_value.int32, field.name)
                    # elif ... and so on for the remaining rule types
    func += "    return True\n"
    namespace = {}
    exec(func, namespace)
    return namespace["validate_fn"]
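The exec-based pattern in the sketch above can be shown end to end with a self-contained toy. The rule set and templates here are hypothetical stand-ins for the descriptor-driven rules PGV would read:

```python
# Toy version of the JIT pattern: build Python source for a validator,
# exec it into a fresh namespace, and hand back the resulting function.
# The "rules" dict is a made-up stand-in for rules read from proto descriptors.
def make_validator(rules):
    lines = ["def validate_fn(p):"]
    for field_name, (kind, value) in rules.items():
        if kind == "min_len":
            lines.append(f"    if len(p[{field_name!r}]) < {value}: return False")
        elif kind == "gte":
            lines.append(f"    if p[{field_name!r}] < {value}: return False")
    lines.append("    return True")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["validate_fn"]

validate_fn = make_validator({"name": ("min_len", 3), "age": ("gte", 0)})
print(validate_fn({"name": "abc", "age": 1}))  # True
print(validate_fn({"name": "ab", "age": 1}))   # False
```

The real implementation would derive the rules from `DESCRIPTOR.fields` rather than a dict, but the generate-source/exec/return shape is the same.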

This would be used through a validate function that the user can write themselves, or we could write a minimal Python package containing this.

#!/usr/bin/python
from validate import _generate_validate

validation_funcs = {}  # cache so each message type's validator is generated only once
def validate(proto_message):
    full_name = proto_message.DESCRIPTOR.full_name
    if full_name not in validation_funcs:
        validation_funcs[full_name] = _generate_validate(proto_message)
    return validation_funcs[full_name](proto_message)

Although this approach doesn’t do compile-time code generation, we plan to output the file containing generate_validate as part of PGV. This keeps the Python implementation in the same repo as the other languages, but more importantly allows for the same option verification performed now (e.g. an int field can’t have string rules, etc).

Benefits
This avoids the user needing to find and import each specific foo_validate.py file where validate() lives, which is difficult at runtime without a mess of extra code. Unlike Go, there isn’t a way to hook the validate method into the instance itself without extra imports.

This allows the validation code to be totally generic. As long as the user has imported the standard generated code (foo_pb2.py) before calling validate (as that registers the descriptor in the symbol db), generate_validate will work.

Potential problems
It would involve performance hits due to generating the validation code JIT. However, we minimize the pitfalls through caching, and allowing the cache to be primed to avoid cold starts.

@akonradi
Contributor

I have a couple issues with this approach:

  1. It makes it extremely difficult to debug the generated code since it only exists in memory, not in readable source code. You could take the generated code above and print it or something for debugging, but that's not the same as having source code you can step through in a debugger.
  2. It introduces more magic into the system. Right now the generated code is extremely straightforward, and while some things are spread across files, it's pretty easy to understand what's going on. With runtime-generated code, all that goes out the window.
  3. It introduces a runtime dependency on PGV. With the existing system, the generated code is self-contained, so if you #include/import the generated files you don't have to pull in validate.proto, nor any of the generating code.

@dkunitsk
Contributor

Hey @akonradi. Thanks for your input. You're right that this approach has tradeoffs, but we believe it's the right way to go considering the lack of a good alternative for finding a validate function at runtime, which is a major blocker to using PGV in a generic library. To respond more closely to your issues:

  1. Debugging is likely more of a concern for the PGV developer (i.e. us) than for an end user, but it's still a valid concern. Perhaps it can be mitigated by providing a way to get the string source code for a given class' validate function; a developer could then literally step through it in a debugger if they so choose.
  2. I would argue that a single function which does the generation by building up a string is not too magical. Combined with the ability to print the source code as described above, this should be easy to understand.
  3. There would be no runtime dependency on PGV. Perhaps you've misunderstood our approach. All the necessary code would be entirely self-contained and trivially generated by PGV. There would be no need to pull in validate.proto or anything else.

@akonradi
Contributor

Sorry, meant to reply and then got lost for two weeks.

I see, so the problem is that there's no way to do static registration in Python like in C++ without loading the modules containing the validate functions ahead of time. I believe Java has similar restrictions and the way the generated code works around it there is by using reflection to load the appropriate Validator class (with caching) when requested. Would it be possible to take a similar approach here, doing a programmatic import of the validation code instead of generating it dynamically?

I'm also a bit confused by how this works with the Bazel-based build. The norm there is to have a separate PGV build rule for each .proto file. Would that just check that the validation annotations are sane and then exit? Would the generate_validate function be provided by the PGV runtime library, independent of the per-.proto output?

Lastly, this seems like it might be a good use case for proto insertion points, provided we can add PGV to all the protoc invocations (in Bazel by changing the proto_library rules).

@dkunitsk
Contributor

dkunitsk commented Jul 23, 2019

Hey @akonradi. Thank you for the follow up and your good suggestions. See below for our thoughts:

  1. We mulled this over for a long time and could not come up with a solution we could live with. Since there is no relationship between the generated Python code and the validation code, the best we could do to locate the validation module would be: take the proto class, get its module path, pop off the filename, and import dir+<filename>_validator.py (i.e. presumably where the validation code for that proto would live).

Even then that presents other challenges such as: which of the classes or functions in that file is the right one? Two solutions there: either name the classes/functions descriptively and do name matching - i.e. to validate Foo, do

import importlib
# e.g. to validate Foo, look for a class named "FooValidator"
validation_class_name = "Foo" + "Validator"
module = importlib.import_module(dir + <filename> + "_validator")  # or something like that...
validator = getattr(module, validation_class_name)

OR we could try to have the generated code in dir+<filename>_validator.py import the Proto class and hook the validation method into it. But that presents a similar "module discovery" problem.
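That second option, having the generated module hook the validation method into the proto class, can be illustrated with a self-contained sketch (Foo is a placeholder here; real generated code would import the actual class from foo_pb2 and face exactly the module-discovery problem described above):

```python
# Hypothetical sketch: a generated foo_validator module attaches a
# validate() method directly onto the message class.
class Foo:  # placeholder standing in for the protoc-generated proto class
    def __init__(self, boo=""):
        self.boo = boo

def _validate_foo(self):
    # generated per-field checks would go here; one made-up rule:
    return len(self.boo) > 0

Foo.validate = _validate_foo  # hook the validator onto the class itself

print(Foo("hello").validate())  # True
print(Foo("").validate())       # False
```

Once the hooking module has been imported, every Foo instance carries its validator, which is the property the Go implementation gets for free.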

  2. Exactly, it would check that the validation annotations are sane, output the validate.py file - which is independent of the input, but for consistency and to allow for a future dependency we'd like this file to be output by PGV - and then exit.

  3. We thought quite a bit about insertion points. We believe the fundamental change to the API of PGV would be too large a departure. PGV is additive to the native code generation, and requiring that it be run in tandem with the native code generation isn't something we were willing to change. (@rodaine can also speak more to this.)

@kothariaditya kothariaditya mentioned this issue Jul 24, 2019
@rodaine
Member

rodaine commented Jul 25, 2019

@akonradi, given the gotchas Python presents here that Dmitriy and Aditya have laid out (as the first interpreted language compared to the rest of the pack), I think the JIT strategy makes sense as an initial endeavor; we can iterate on it to see if a long-term imperative strategy can grow from it.

As mentioned by @dkunitsk, insertion points would be a break from the existing behavior: where previously PGV was not required to run simultaneously with the base code-gen plugin, it would become necessary. Regarding your Bazel point, we should assume, as a default, that consumers of PGV do not use Bazel.

@rmichela rmichela added the Enhancement Extend or improve functionality label Jul 26, 2019
@akonradi
Contributor

Agreed, there's really not a great way to make the Java approach (where validators can be loaded by class name) or the C++ approach (where validators are registered statically) work for Python.

Agreed on not using insertion points, I think that would end up being more of a headache than we want. I'm still wary of how this will work with Bazel where we invoke PGV once per .proto file, and given that Envoy is one of the main users of PGV, I don't think that's a good assumption. I worry that we're going to end up with some weird asymmetries between using PGV to generate Python and C++/Java code.

@adriangb

adriangb commented Jul 7, 2020

Before I found this library, I was actually thinking of doing something similar to this project by using a combination of python-betterproto and pydantic. It looks something like this (this is a working but incomplete example):

import re

import betterproto
from pydantic import validator
from pydantic.dataclasses import dataclass


@dataclass
class Bar(betterproto.Message):
    boo: str = betterproto.string_field(1)

    @validator('boo')
    def validate_boo(cls, v: str) -> str:
        """A very bad regex email matcher."""
        pattern = r"\w+@\w+\.\w{2,5}"
        if re.fullmatch(pattern, v):
            return v
        else:
            raise ValueError

This seems to achieve what PGV is doing, but using pure idiomatic Python. Does anyone have any insights on this approach vs PGV, or how the two might be reconciled (e.g. compiling the PGV protobufs to this type of Python validation)?

@akonradi
Contributor

akonradi commented Jul 7, 2020

One of the goals with PGV is to have the validation specifications be language-neutral, so any approach would need to take protos as input and produce Python code.

@adriangb

adriangb commented Jul 7, 2020

Right, so couldn't code like the above be generated based on the protos? It's not clear to me how the PGV part of the protos is parsed. So the input to the above example would be:

import "validate/validate.proto";

message Bar {
  string boo = 1 [(validate.rules).string.pattern = "\\w+@\\w+\\.\\w{2,5}"];
}

@akonradi
Contributor

akonradi commented Jul 7, 2020

Absolutely. The PGV part is parsed by looking at the descriptor for the message being passed in, using jinja2 templates to generate some Python code, and then executing that. See here, for example.
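That render-then-execute pipeline can be sketched in a few lines with the stdlib's string.Template standing in for jinja2. Everything here (the template, the helper name, the single-rule scope) is hypothetical, not PGV's actual template:

```python
from string import Template
from types import SimpleNamespace

# Hypothetical per-rule template; PGV's real templates are jinja2.
RULE = Template("    if not re.fullmatch($pattern, p.$field): return False\n")

def render_validator(field, pattern):
    # render source for a validator of one string field, then exec it
    source = "import re\ndef validate_fn(p):\n"
    source += RULE.substitute(field=field, pattern=repr(pattern))
    source += "    return True\n"
    namespace = {}
    exec(source, namespace)
    return namespace["validate_fn"]

check_email = render_validator("boo", r"\w+@\w+\.\w{2,5}")
print(check_email(SimpleNamespace(boo="user@example.com")))  # True
print(check_email(SimpleNamespace(boo="not-an-email")))      # False
```

The real generator walks the message descriptor and renders one such check per annotated field instead of hard-coding a single rule.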

For the Go and C++ code, we've been reluctant to add additional run-time dependencies, including additional libraries, since that makes PGV code harder to use for consumers. I don't know what the stance is for Python code.

@adriangb

adriangb commented Jul 7, 2020

That looks pretty similar to how python-betterproto itself generates the compiled .py files (see here).

It should be possible to pre-write the Python validation functions and have a jinja2 template that ties them to the python-betterproto compiled .py files.

A quick example/summary:

.py file with pre-written PGV validators:

from typing import Callable
import re


def string_pattern_validator(pattern: str) -> Callable:
    def validator(cls, v: str) -> str:
        if not re.fullmatch(pattern, v):
            raise ValueError(...)
        return v
    validator.__qualname__ = 'validator' + pattern  # see https://github.com/samuelcolvin/pydantic/issues/806
    return validator

The pydantic dataclass/message that was generated by python-betterproto:

@dataclass
class Bar(betterproto.Message):
    boo: str = betterproto.string_field(1)

The jinja2-generated .py file produced by PGV:

from better_proto_generated_py_file import Bar as _Bar
from pydantic import validator

import pgv.python_validators as vals

class Bar(_Bar):
    boo_validator = validator("boo")(vals.string_pattern_validator(pattern="regex parsed from .proto"))

Then the end-user would import this constrained Bar class.

@adriangb

@akonradi any further thoughts on this approach?

I've been working with python-betterproto and it certainly is possible to bake in support for parsing custom options. But I'm not sure whether adding python-betterproto and pydantic as dependencies for using PGV with Python is acceptable.

In terms of steps, it would look something like:

  1. Get custom option parsing in betterproto.
  2. Build out Pydantic validators in PGV.
  3. Re-work PGV Python plugin to use betterproto/pydantic and attach validators to their respective fields.

@akonradi
Contributor

@rodaine @kothariaditya can you chime in?

@adriangb

adriangb commented Aug 5, 2020

Realizing that the same end goal (i.e. compiled .py files) could probably be accomplished by extending protoc-gen-star to build Python files that use python-betterproto + pydantic directly (instead of relying on python-betterproto to build them). Since protoc-gen-star is already used for Go/Java, that kind of makes more sense, I think. I'm fairly familiar with both python-betterproto and pydantic, so if someone can walk me through how to convert the PG* AST to actual Python code and how the custom options are parsed, I can work on the details.

@adriangb

adriangb commented Sep 4, 2020

@akonradi @rodaine @kothariaditya any updates on this? I'm happy to work on this if there is interest.

@so1n

so1n commented May 17, 2023

I wrote an open-source tool, compatible with PGV rules, that can generate pydantic model code through a protobuf plugin: https://github.com/so1n/protobuf_to_pydantic

@elliotmjackson elliotmjackson added the Wont Fix Out of scope of this project label Aug 10, 2023
@rodaine rodaine closed this as completed Oct 3, 2023