Protobuf Validation #111

xfbs · 2023-10-10T07:18:57Z

xfbs
Oct 10, 2023

This conversation tracks the progress of coming up with a plan to validate protobuf definitions.

What are we proposing?

One of the ideas that we have discussed in the meeting yesterday was the possibility of doing some validation of protobuf before it is published to the registry. There are two validation categories that we could do:

As far as I understand, there are both api and lib packages in buffrs. For the lib packages, we could validate that they do not expose any service or rpc blocks. This just prevents you from publishing incorrectly.
We can enforce semantic versioning. This stipulates that version numbers must be of the shape 1.2.3, where the three numbers in order are the major, minor and patch numbers.

The rules are that whenever you release a new version, you must incremement the:
- major version when you make incompatible API changes
- minor version when you add functionality in a backward compatible manner
- patch version when you make backward compatible bug fixes
Validating this means that we do not let you push a new buffrs package with a breaking change unless you also increment the major version, and we do not let you add functionality unless you increment the minor version.

Why is this important?

With buffrs, we can actually build a very powerful primitive for our services to communicate. It allows us to reuse type definitions for our RPC calls and easily update them. But in order for this to be something that can be deployed on a large scale, it needs to have some properties, such as stability guarantees.

Let's consider an example. If we have a library called physics that includes types such as Temperature and Coordinates, these are very useful primitive types that may be used in many services, either directly or indirectly.

graph BT
    physics
    flightpath-->physics
    serviceA-->physics
    serviceB-->physics
    serviceC-->flightpath
    serviceA-->flightpath

Let's assume that this physics package exists in multiple versions: physics 3.1.0, physics 3.1.1, physics 3.1.2. What I believe we want to avoid is having to pin dependencies to a specific version. Because when we do that, what inevitably happens is that in one service, we have multiple versions of a package (serviceA 0.1.0 depends on physics 3.1.0, but it also depends on flightpath 0.2.5 which depends on physics 3.1.2). Now, we have not really made our lives much easier because now our service has several, different Temperature types that need to be converted between.

Instead, we would like to be able to express dependencies such as physics 3.*. By doing that, we can essentially communicate that we are okay with using any updates of this dependency, so long as there are no breaking changes.

But this is only possible or useful if we can rely on one important property: we need to know for sure that updates to the physics package do not introduce breaking changes. Because if they do, then we have no bricked serviceA, serviceB and serviceC, as they all depend on physics.

How can we parse Protobuf?

We should be able to use protobuf-parse to parse the protobuf files and determine breakage. It has a parser implemented in Rust, but it also allows using protoc as a parser (which might be useful in case we run into compatibility issues). It can spit out a FileDescriptorProto per protobuf file. That looks like this:

///  Describes a complete .proto file.
#[derive(PartialEq,Clone,Default,Debug)]
pub struct FileDescriptorProto {
    pub name: Option<String>,
    pub package: Option<String>,
    pub dependency: Vec<String>,
    pub public_dependency: Vec<i32>,
    pub weak_dependency: Vec<i32>,
    pub message_type: Vec<DescriptorProto>,
    pub enum_type: Vec<EnumDescriptorProto>,
    pub service: Vec<ServiceDescriptorProto>,
    pub extension: Vec<FieldDescriptorProto>,
    pub options: MessageField<FileOptions>,
    pub source_code_info: MessageField<SourceCodeInfo>,
    pub syntax: Option<String>,
    pub special_fields: SpecialFields,
}

Some interesting properties:

We can extract the package name from this. This is quite useful, as it allows us to mandate that you use a package name (in the protobuf file) that aligns with the name of the buffrs package.
We get all the parsed definitions, letting us easily check if something was added/removed.
The parser does not care if things that are included do not exist, which is good for us because we don't really care about the dependencies at this point (or maybe we do?)

How can we determine breakage?

Determine previous relevant version
Figure out that kind of semver bump this is
Diff the current and the previous version to figure out what the differences are
Determine if semver bump aligns with code differences.

Determine version to compare to

Get previous version. This is not as trivial as it sounds, because the versions might not be published in order. For example, you might already be at version 4.5.12, but you want to publish a bugfix for 3.4.19. So the way this works is very simple

Get a list of all published versions: [1.0.0, 1.0.1, 2.0.4, 2.0.5, 2.1.0, 2.1.1, 3.0.0, 3.0.1, 3.0.2].
Exclude every version that is higher than the version that is supposed to be published. For example, if I am trying to publish version 2.0.6, the list becomes: [1.0.0, 1.0.1, 2.0.4, 2.0.5].
Get the last version of that resulting list.

Semver bump classification

Determine if this is a major, minor or patch bump, by comparing the two version numbers. For example:

3.5.12 to 3.5.13 is a patch bump.
3.5.12 to 3.6.0 is a minor bump.
3.5.12 to 4.0.0 is a major bump.

Compare two versions

Next we need to compare the two protobuf definitions to determine what the breakage level is in there.

Remove message, service or rpc: requires major bump.
Add new message, service or rpc: requires minor bump.
Rename or renumber message fields: major bump.
Add message fields, if optional: minor bump.
Add message fields, if required: major bump.
Add comments: patch bump.

It looks to me that the simplest way to accomplish this would be to:

Populate the results of the parsing into custom structs (we need this because having the data in a slightly different shape is more optimal for this comparison, and we are unable to derive custom properties on the external types).
Use a crate such as structdiff or diff to programmatically get a diff type that shows us all of the differences.
Run a simple decider over that set of differences which classifies if this is a patch, minor or major bump.

Acceptance criteria

If the semver bump aligns with the code differences, accept it. Otherwise, print helpful message stating that it was rejected, and why it was rejected.

Game plan

The question is where this validation should take place and how it should work. A good place to validate is always the registry, but looking at cargo for example, they also do some validation client-side because it allows you to do a local dry run.

I think a good approach is to write a validation crate that exposes the necessary primitives. This crate can then be used anywhere this validation needs to be performed. This way, the functionality is also decoupled from buffrs and can be easily tested.

Open questions

Do we enforce that the protobuf package name matches the buffrs package name?
What do we do with yanked releases?
What do we do with alpha releases?
Should we support a force override, which disables the semver check?
Should we, similar to the Rust project, allow beaking changes in versions below 1.0.0?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Protobuf Validation #111

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Protobuf Validation #111

xfbs Oct 10, 2023

What are we proposing?

Why is this important?

How can we parse Protobuf?

How can we determine breakage?

Determine version to compare to

Semver bump classification

Compare two versions

Acceptance criteria

Game plan

Open questions

Replies: 0 comments

xfbs
Oct 10, 2023