
PGV Python Approach #237

Closed
kothariaditya opened this issue Jul 10, 2019 · 16 comments
Labels
Enhancement Extend or improve functionality Wont Fix Out of scope of this project

Comments

@kothariaditya
Contributor

kothariaditya commented Jul 10, 2019

We are considering a new approach for Python that’s a bit different from existing language support: rather than doing meaningful compile-time generation, it would generate the validation function just-in-time (JIT).

What is the approach?

Hand-code a generate_validate function (proto.Message->(proto.Message->bool)) that would do JIT generation at runtime of the validate function for each proto.Message class.

The generated pb2.py files have Descriptors that provide access to the options and annotations on a proto message. In generate_validate we’d walk a Descriptor and get the validation rules for every field. Based on the validation rules, we generate a validation function at runtime and return that to the user.

For performance, we could cache the validate function for each message class. We could also allow pre-priming the cache to avoid cold starts.

A proof of concept of generate_validate:

# proto_message is an instance of a proto message
def _generate_validate(proto_message):
    func = "def validate_fn(p):\n"
    for field in proto_message.DESCRIPTOR.fields:
        if has_validation(field):  # has_validation() checks whether the field is annotated with validation rules
            for option_descriptor, option_value in field.GetOptions().ListFields():
                if option_descriptor.full_name == "validate.rules":
                    if option_value.HasField("string"):
                        func += string_template(option_value.string, field.name)
                    elif option_value.HasField("int32"):
                        func += int_template(option_value.int32, field.name)
                    # elif ... and so on for the remaining rule types
    func += "    return True\n"
    namespace = {}
    exec(func, namespace)
    return namespace["validate_fn"]
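The exec-based pattern in the sketch above can be shown end to end with a self-contained toy. The rule set and templates here are hypothetical stand-ins for the descriptor-driven rules PGV would read:

```python
# Toy version of the JIT pattern: build Python source for a validator,
# exec it into a fresh namespace, and hand back the resulting function.
# The "rules" dict is a made-up stand-in for rules read from proto descriptors.
def make_validator(rules):
    lines = ["def validate_fn(p):"]
    for field_name, (kind, value) in rules.items():
        if kind == "min_len":
            lines.append(f"    if len(p[{field_name!r}]) < {value}: return False")
        elif kind == "gte":
            lines.append(f"    if p[{field_name!r}] < {value}: return False")
    lines.append("    return True")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["validate_fn"]

validate_fn = make_validator({"name": ("min_len", 3), "age": ("gte", 0)})
print(validate_fn({"name": "abc", "age": 1}))  # True
print(validate_fn({"name": "ab", "age": 1}))   # False
```

The real implementation would derive the rules from `DESCRIPTOR.fields` rather than a dict, but the generate-source/exec/return shape is the same.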

This would be used through a validate function that the user can write themselves, or we could write a minimal Python package containing this.

#!/usr/bin/python
from validate import _generate_validate

validation_funcs = {}  # cache so each message type's validator is generated only once
def validate(proto_message):
    full_name = proto_message.DESCRIPTOR.full_name
    if full_name not in validation_funcs:
        validation_funcs[full_name] = _generate_validate(proto_message)
    return validation_funcs[full_name](proto_message)

Although this approach doesn’t do compile-time code generation, we plan to output the file containing generate_validate as part of PGV. This keeps the Python implementation in the same repo as the other languages, but more importantly allows for the same option verification performed now (e.g. an int field can’t have string rules, etc).

Benefits
This avoids the user needing to find and import each specific foo_validate.py file where validate() lives, which is difficult at runtime without a mess of extra code. Unlike Go, there isn’t a way to hook the validate method into the instance itself without extra imports.

This allows the validation code to be totally generic. As long as the user has imported the standard generated code (foo_pb2.py) before calling validate (as that registers the descriptor in the symbol db), generate_validate will work.

Potential problems
It would involve performance hits due to generating the validation code JIT. However, we minimize the pitfalls through caching, and allowing the cache to be primed to avoid cold starts.

@akonradi
Contributor

I have a couple issues with this approach:

  1. It makes it extremely difficult to debug the generated code since it only exists in memory, not in readable source code. You could take the generated code above and print it or something for debugging, but that's not the same as having source code you can step through in a debugger.
  2. It introduces more magic into the system. Right now the generated code is extremely straightforward, and while some things are spread across files, it's pretty easy to understand what's going on. With runtime-generated code, all that goes out the window.
  3. It introduces a runtime dependency on PGV. With the existing system, the generated code is self-contained, so if you #include/import the generated files you don't have to pull in validate.proto, nor any of the generating code.

@dkunitsk
Contributor

Hey @akonradi. Thanks for your input. You're right that this approach has tradeoffs, but we believe it's the right way to go considering the lack of a good alternative for finding a validate function at runtime, which is a major blocker to using PGV in a generic library. To respond more closely to your issues:

  1. Debugging is likely more of a concern for the PGV developer (i.e. us) than for an end user, but it's still a valid concern. Perhaps it can be mitigated by providing a way to get the string source code for a given class' validate function; a developer could then literally step through it in a debugger if they so choose.
  2. I would argue that a single function which does the generation by building up a string is not too magical. Combined with the ability to print the source code as described above, this should be easy to understand.
  3. There would be no runtime dependency on PGV. Perhaps you've misunderstood our approach. All the necessary code would be entirely self-contained and trivially generated by PGV. There would be no need to pull in validate.proto or anything else.

@akonradi
Contributor

Sorry, meant to reply and then got lost for two weeks.

I see, so the problem is that there's no way to do static registration in Python like in C++ without loading the modules containing the validate functions ahead of time. I believe Java has similar restrictions and the way the generated code works around it there is by using reflection to load the appropriate Validator class (with caching) when requested. Would it be possible to take a similar approach here, doing a programmatic import of the validation code instead of generating it dynamically?

I'm also a bit confused by how this works with the Bazel-based build. The norm there is to have a separate PGV build rule for each .proto file. Would that just check that the validation annotations are sane and then exit? Would the generate_validate function be provided by the PGV runtime library, independent of the per-.proto output?

Lastly, this seems like it might be a good use case for proto insertion points, provided we can add PGV to all the protoc invocations (in Bazel by changing the proto_library rules).

@dkunitsk
Contributor

dkunitsk commented Jul 23, 2019

Hey @akonradi. Thank you for the follow up and your good suggestions. See below for our thoughts:

  1. We mulled this over for a long time and could not come up with a solution we could live with. Since there is no relationship between the generated Python code and the validation code, the best we could do to locate the validation module would be: take the proto class, get its module path, pop off the filename, and import dir+<filename>_validator.py (i.e. presumably where the validation code for that proto would live).

Even then that presents other challenges such as: which of the classes or functions in that file is the right one? Two solutions there: either name the classes/functions descriptively and do name matching - i.e. to validate Foo, do

import importlib
# e.g. to validate Foo, look for a class named "FooValidator"
validation_class_name = "Foo" + "Validator"
module = importlib.import_module(dir + <filename> + "_validator")  # or something like that...
validator = getattr(module, validation_class_name)

OR we could try to have the generated code in dir+<filename>_validator.py import the Proto class and hook the validation method into it. But that presents a similar "module discovery" problem.
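That second option, having the generated module hook the validation method into the proto class, can be illustrated with a self-contained sketch (Foo is a placeholder here; real generated code would import the actual class from foo_pb2 and face exactly the module-discovery problem described above):

```python
# Hypothetical sketch: a generated foo_validator module attaches a
# validate() method directly onto the message class.
class Foo:  # placeholder standing in for the protoc-generated proto class
    def __init__(self, boo=""):
        self.boo = boo

def _validate_foo(self):
    # generated per-field checks would go here; one made-up rule:
    return len(self.boo) > 0

Foo.validate = _validate_foo  # hook the validator onto the class itself

print(Foo("hello").validate())  # True
print(Foo("").validate())       # False
```

Once the hooking module has been imported, every Foo instance carries its validator, which is the property the Go implementation gets for free.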

  2. Exactly, it would check that the validation annotations are sane, output the validate.py file - which is independent of the input, but for consistency and to allow for a future dependency we'd like this file to be output by PGV - and then exit.

  3. We thought quite a bit about insertion points. We believe the fundamental change to the API of PGV would be too large a departure. PGV is additive to the native code generation, and requiring that it be run in tandem with the native code generation isn't something we were willing to change. (@rodaine can also speak more to this.)

@kothariaditya kothariaditya mentioned this issue Jul 24, 2019
@rodaine
Member

rodaine commented Jul 25, 2019

@akonradi, given the gotchas Python presents here that Dmitriy and Aditya have laid out (as the first interpreted language compared to the rest of the pack), I think the JIT strategy makes sense as an initial endeavor; we can iterate on it to see if a long-term imperative strategy can grow from it.

As mentioned by @dkunitsk, insertion points would be a break from the existing behavior: where previously PGV was not required to run simultaneously with the base code-gen plugin, it would become necessary. Regarding your Bazel point, we should assume, as a default, that consumers of PGV do not use Bazel.

@rmichela rmichela added the Enhancement Extend or improve functionality label Jul 26, 2019
@akonradi
Contributor

Agreed, there's really not a great way to make the Java approach (where validators can be loaded by class name) or the C++ approach (where validators are registered statically) work for Python.

Agreed on not using insertion points, I think that would end up being more of a headache than we want. I'm still wary of how this will work with Bazel where we invoke PGV once per .proto file, and given that Envoy is one of the main users of PGV, I don't think that's a good assumption. I worry that we're going to end up with some weird asymmetries between using PGV to generate Python and C++/Java code.

@adriangb

adriangb commented Jul 7, 2020

Before I found this library, I was actually thinking of doing something similar to this project by using a combination of python-betterproto and pydantic. It looks something like this (this is a working but incomplete example):

import re

import betterproto
from pydantic import validator
from pydantic.dataclasses import dataclass


@dataclass
class Bar(betterproto.Message):
    boo: str = betterproto.string_field(1)

    @validator('boo')
    def validate_boo(cls, v: str) -> str:
        """A very bad regex email matcher."""
        pattern = r"\w+@\w+\.\w{2,5}"
        if re.fullmatch(pattern, v):
            return v
        else:
            raise ValueError

This seems to achieve what PGV is doing, but using pure idiomatic Python. Does anyone have any insights on this approach vs PGV, or how the two might be reconciled (e.g. compiling the PGV protobufs to this type of Python validation)?

@akonradi
Contributor

akonradi commented Jul 7, 2020

One of the goals with PGV is to have the validation specifications be language-neutral, so any approach would need to take protos as input and produce Python code.

@adriangb

adriangb commented Jul 7, 2020

Right, so couldn't code like the above be generated based on the protos? It's not clear to me how the PGV part of the protos is parsed. So the input to the above example would be:

import "validate/validate.proto";

message Bar {
  string boo = 1 [(validate.rules).string.pattern = "\\w+@\\w+\\.\\w{2,5}"];
}

@akonradi
Contributor

akonradi commented Jul 7, 2020

Absolutely. The PGV part is parsed by looking at the descriptor for the message being passed in, using jinja2 templates to generate some Python code, and then executing that. See here, for example.
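That render-then-execute pipeline can be sketched in a few lines with the stdlib's string.Template standing in for jinja2. Everything here (the template, the helper name, the single-rule scope) is hypothetical, not PGV's actual template:

```python
from string import Template
from types import SimpleNamespace

# Hypothetical per-rule template; PGV's real templates are jinja2.
RULE = Template("    if not re.fullmatch($pattern, p.$field): return False\n")

def render_validator(field, pattern):
    # render source for a validator of one string field, then exec it
    source = "import re\ndef validate_fn(p):\n"
    source += RULE.substitute(field=field, pattern=repr(pattern))
    source += "    return True\n"
    namespace = {}
    exec(source, namespace)
    return namespace["validate_fn"]

check_email = render_validator("boo", r"\w+@\w+\.\w{2,5}")
print(check_email(SimpleNamespace(boo="user@example.com")))  # True
print(check_email(SimpleNamespace(boo="not-an-email")))      # False
```

The real generator walks the message descriptor and renders one such check per annotated field instead of hard-coding a single rule.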

For the Go and C++ code, we've been reluctant to add additional run-time dependencies, including additional libraries, since that makes PGV code harder to use for consumers. I don't know what the stance is for Python code.

@adriangb

adriangb commented Jul 7, 2020

That looks pretty similar to how python-betterproto itself generates the compiled .py files (see here).

It should be possible to pre-write the Python validation functions and have a jinja2 template that ties them to the python-betterproto compiled .py files.

A quick example/summary:

.py file with pre-written PGV validators:

from typing import Callable
import re


def string_pattern_validator(pattern: str) -> Callable:
    def validator(cls, v: str) -> str:
        if not re.fullmatch(pattern, v):
            raise ValueError(...)
        return v
    validator.__qualname__ = 'validator' + pattern  # see https://github.com/samuelcolvin/pydantic/issues/806
    return validator

The pydantic dataclass/message that was generated by python-betterproto:

@dataclass
class Bar(betterproto.Message):
    boo: str = betterproto.string_field(1)

The jinja2-generated .py file produced by PGV:

from better_proto_generated_py_file import Bar as _Bar
from pydantic import validator

import pgv.python_validators as vals

class Bar(_Bar):
    boo_validator = validator("boo")(vals.string_pattern_validator(pattern="regex parsed from .proto"))

Then the end-user would import this constrained Bar class.

@adriangb

@akonradi any further thoughts on this approach?

I've been working with python-betterproto and it certainly is possible to bake in support for parsing custom options. But I'm not sure whether adding python-betterproto and pydantic as dependencies for using PGV with Python is acceptable.

In terms of steps, it would look something like:

  1. Get custom option parsing in betterproto.
  2. Build out Pydantic validators in PGV.
  3. Re-work PGV Python plugin to use betterproto/pydantic and attach validators to their respective fields.

@akonradi
Contributor

@rodaine @kothariaditya can you chime in?

@adriangb

adriangb commented Aug 5, 2020

Realizing that the same end goal (i.e. compiled .py files) could probably be accomplished by extending protoc-gen-star to build Python files that use python-betterproto + pydantic directly (instead of relying on python-betterproto to build them). Since protoc-gen-star is already used for Go/Java, that kind of makes more sense, I think. I'm fairly familiar with both python-betterproto and pydantic, so if someone can walk me through how to convert the PG* AST to actual Python code and how the custom options are parsed, I can work on the details.

@adriangb

adriangb commented Sep 4, 2020

@akonradi @rodaine @kothariaditya any updates on this? I'm happy to work on this if there is interest.

@so1n

so1n commented May 17, 2023

I wrote an open-source tool, compatible with PGV rules, that can generate pydantic model code through a protobuf plugin: https://github.com/so1n/protobuf_to_pydantic

@elliotmjackson elliotmjackson added the Wont Fix Out of scope of this project label Aug 10, 2023
@rodaine rodaine closed this as completed Oct 3, 2023