Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add replace_import_package Python option. #7470

Closed

Conversation

tpboudreau
Copy link

@tpboudreau tpboudreau commented May 5, 2020

Overview

This patch proposes a new python_opt option, replace_import_package, for the Python code generator that provides a lightweight mechanism for replacing the package name ordinarily generated by protoc in Python import statements (the "generated" package name) with a user supplied package name (the "preferred" package name) in order to allow the imported modules to be found more easily at runtime.

In the simplest case, given a message file (say, "MainMessage.proto") that imports another message file (say, "ImportedMessage.proto") from within the same directory, invoking the new option like:

protoc --proto_path /protobuf/files --python_out ./my_package --python_opt replace_import_package MainMessage.proto

will replace the currently generated statement import ImportedMessage_pb2 as ... (with no from clause), with the preferred statement from . import ImportedMessage_pb2 as ... (with a from clause referencing an explicit package).

Issue Background

When generating Python code from input files that import other protobuf files, protoc will generate Python import statements that either contain no from clause (if the input file's import statement contains no relative path prefix) or contain a from clause listing a package name derived solely from the relative path used in the protobuf import statement (protoc always ignores the protobuf package statement when generating Python code). This behavior occasionally results in Python import statements that don't work as expected: for example, import statements with no from clause are no longer interpreted as implicit relative imports in Python3.

Current workarounds include:

  • extending the PYTHONPATH at runtime to include packages that previously were implicitly searched
  • structuring the protobuf input directories to match the desired Python package structure
  • post processing the generated code with 2to3 (in simple cases) or sed/awk/etc (in more complex cases) to recast the Python import statements

Each has drawbacks. Extended discussion of issues and workarounds can be found here.

A Few Preliminaries

Note that the scope of the new option is limited. In particular:

  1. If you do not supply the new replace_import_package option for Python compiles, there is no change in behavior for protoc.

  2. If you do not use the protobuf "import" statement in any input files, the new replace_import_package option (even if present) is inoperative; it only changes the form of the Python import statements produced by protoc when generating Python code from protobuf files that themselves contain import statements.

  3. If your protobuf files happen to be arranged in a directory hierarchy that precisely matches the desired Python package structure, then the new replace_import_package option is unnecessary; protoc already creates output files in the desired locations (relative to the directory named in --python_out) and produces correct import statements in the generated Python code.

  4. The new replace_import_package option does not change the protoc dataflow; output files are still generated 1:1 with input files and the location of the generated files is still controlled by the --python_out option

Proposed Option Operation
  • the new option is supplied as a flag to the --python_opt command line argument
  • the option has the form replace_import_package=MAPPINGS
  • each mapping in the MAPPINGS string has the form generated.package|preferred.package
    • the generated.package name:
      • may be empty (with a special meaning, see below)
      • if present, must be absolute (because protoc generates only absolute package names)
    • the preferred.package name:
      • must be non-empty
      • may be relative (like . or ..my_package)
  • multiple mappings may be specified separated by semicolons
  • if the package name in a generated import statement exactly matches the generated.package in one of the mappings, that generated package name is fully replaced by the preferred.package name in that mapping

For example, given the mapping string "a|p;b|q.b;c|c", if protoc generates the import statements:
from a import m
from b import n
from c import o
They will be transformed into:
from p import m
from q.b import n
from c import o

  • an import statement with no generated package name (that is, a generated import statement with no from clause) matches a mapping with an empty generated.package, causing the insertion of a from clause containing the preferred.package name in that mapping

For example, given the mapping string "|.", if protoc generates the import statement:
import m
It will be transformed into:
from . import m

  • a generated package name that is not matched by any mapping is left unchanged

For example, given the mapping string "|.;a|p;b|q", if protoc generates the import statement:
from c import o
It will not be transformed in any way.

  • the MAPPINGS value is optional; if no value is supplied then the implied mapping string is "|."
  • some minutiae:
    • if the replace_import_package flag appears more than once, the effective set of mappings will be the union of the mappings from all the mapping strings
    • if the same generated.package value appears in more than one mapping, the latest appearing mapping (the rightmost mapping in the rightmost mapping string) is operative

For example, the command line option --python_opt 'replace_import_package=a|p;b|q,replace_import_package=c|r', is equivalent to --python_opt 'replace_import_package=a|p;b|q;c|r'.

And the mapping string "a|p;b|q;a|z" is equivalent to "b|q;a|z".

Usage Scenarios

Some example usage scenarios for the new option (with runnable code) can be found here.

@tpboudreau
Copy link
Author

tpboudreau commented May 5, 2020

I believe the failing check requires the addition of labels that I'm not authorized to add.

@tpboudreau
Copy link
Author

Is it possible to get some feedback on this PR, or to have the proper labels applied so that I can see the results of an initial CI run? Thanks in advance.

@stackedsax
Copy link

@jamescooke, I saw you mentioned this PR yesterday. Would this approach help your issue?

@jamescooke
Copy link

@stackedsax Just some background: the dsrf project uses protocol buffers and ideally would support Python 3. It suffers from the problem mentioned in the pull text above:

import statements with no from clause are no longer interpreted as implicit relative imports in Python3.

Which means that Python 3 support is broken with protocol buffer compilation as it is at the moment. I went for a quick work-around with sed as mentioned is possible above:

sed/awk/etc (in more complex cases) to recast the Python import statements

If there was a way to adjust the imports used in compiled Python files, as I believe this PR provides, then yes - I think that would help the dsrf project support Python 3 more easily. I don't think that doing a sed replacement on the imports in the compiled outputs is a long term or stable solution.

@stackedsax
Copy link

Thanks for the feedback, @jamescooke.

Certainly, sed isn't a great solution, but it may be long-term if we can't get eyes on possible fixes like this PR...

@dlj-NaN
Copy link
Contributor

dlj-NaN commented Jul 7, 2020

Fixing relative imports seems like a good thing to do.

However, renaming packages doesn't strike me as a sound practice. The background in #1491 lays out a lot of the issues... what's relevant here is that aligning the proto package and Python module is partly a correctness guarantee, not just a style consideration.

In other words, remapping the Python package path would actually be a correctness regression, since it would make it possible to import two different "foo.bar.Baz" messages (which may be completely unrelated).

If package styling is truly necessary, that can be done in pure Python, without any changes to protoc:

# my/package/__init__.py

# Generated from foo/bar/my_message.proto:
from foo.bar.my_message_pb2 import MyMessage

This ensures that the correctness guarantees provided by Python's import mechanism (and depended upon by the runtime) are not lost.

@tpboudreau
Copy link
Author

@dlj-NaN thanks for the feedback. And thanks for agreeing that the relative import fix seems useful. That was the original motivation for this patch and I at first considered a narrower option that just added a from clause to generated import statements that had none. But I think that to fully address even the relative import problem, you really need a mechanism that allows the user to change an absolute generated from clause to a relative one for those python import statements that are generated from a protobuf import statement that targets a file with a path prefix (there's a fuller example in Scenario Three here). Once you go there, you need to be able to distinguish between those generated from clauses that should be changed from those that shouldn't (see Senario Four here). That can either be an exemption mechanism or an inclusion mechanism -- I thought the latter was more direct.

I'm not sure I completely follow your objection to this remapping mechanism, however.

It's true that this option provides an override to the default generated import statements and that of course introduces the possibility of mistake or misuse. But that just means it should be used judiciously and carefully.

I'm not sure exactly what you're referring to when you say "aligning the proto package and Python module" ensures correctness. That alignment doesn't happen automatically, as currently, protoc completely ignores (for sound reasons) the protobuf package statement when generating python code. So it's up to the user to arrange the directory structure of the protobuf message files to match the desired generated python package structure to get properly generated python import statements.

Perhaps you're saying this filesystem based approach is the preferable? That's not unreasonable -- I'm sure lots of users are doing it. But that option may not be available (if the user doesn't control the protobuf message repo) and may not be desirable (if, for example, in a polyglot environment there's a better directory structure that's not dictated by the python output).

@stackedsax
Copy link

@dlj-NaN there were a couple of questions in @tpboudreau's comment that I'd love to hear your thoughts on. Thanks!

@bpeake-illuscio
Copy link

bpeake-illuscio commented Oct 13, 2020

Any movement on this? It's currently a big pain point on the python side of our protobuf packages. Would be nice to have an equivalent of go's go_package option.

Thank you!

@justinfx
Copy link

@bpeake-illuscio makes a great point. I currently achieve this goal in my Go language target through an option in the proto file to set the generated package import name in the Go source. Other languages have similar options, and I think being able to do the same to control the python import statement would align well and have a consistent place for the setting instead of externally from the command line.

@theunkn0wn1
Copy link

theunkn0wn1 commented Oct 23, 2020

I currently work around this by post-processing the generated python files.
You can find the gist here https://gist.github.com/theunkn0wn1/9b2407669d607e8d15bacab6e6fb4cc9
In effect, i convert protoc-generated file imports to be relative to the current directory.
In my projects, this allows me to touch a __init__.py in a subdirectory and import the generated modules from there instead of being bound to tossing them at the root level.

@tpboudreau tpboudreau force-pushed the replace_import_package branch from 37a3061 to 5f10c6c Compare November 3, 2020 05:32
@wyattanderson
Copy link

what's relevant here is that aligning the proto package and Python module is partly a correctness guarantee, not just a style consideration.

For me, I think the proposed functionality would help me improve correctness. Let me describe my scenario.

  1. I'm working in a large, existing monorepo that's currently undergoing adoption of Bazel. I would like to be able to graft Bazel on top without making too many changes to the existing repository structure.
  2. Existing .proto definitions live in some/awfully/long/prefix/a.proto and are generated relative to some/awfully/long/prefix. Imports are relative to that directory, so a.proto might have b.proto.
  3. Currently, our legacy Python tooling (that I'm working on replacing) generates Python from Protobuf out into python/some/other/directory. Let's say that python/ ends up on the PYTHONPATH, and users import a_pb2 as some.other.directory.a_pb2. This all works fine and dandy, with one major exception.
  4. If a.proto imports b.proto, the generated a_pb2.py includes an import for b_pb2. This, to me, is undesirable, as I would very much like to avoid adding python/some/other/directory to the PYTHONPATH.
  5. I would like to force the import statements in the generated Python code to be more specific, i.e. to import some.other.directory.b_pb2 instead of b_pb2. I also don't want to import from some.awfully.long.prefix (because facilitating this would be disruptive and require substantial changes to existing code).
  6. Ideally, I don't want to change or otherwise pre/postprocess Python or Protobuf files to accomplish this.

Does this use case seem reasonable?

@aggieNick02
Copy link

The same issue occurs when you utilize grpc generation. For foo.proto, foo_pb2_grpc.py does import foo_pb2 . Does this feature allow for fixing this as well, even though technically I only have one proto file, not importing another one?

@c-bata

This comment has been minimized.

@GoddessLuBoYan
Copy link

So, Now is 2022, the patch is still not merged. Oh No!

@dlj-NaN
Copy link
Contributor

dlj-NaN commented Feb 9, 2022

Unfortunately, remapping at the Python package level isn't something we would be interested in pursuing. The existing path import semantics are directly mapped to the import structure of the generated Python code, and although I certainly sympathize with having to live with past choices in code structure, I don't think adding a knob is actually in the overall best interest.

One thing that isn't particularly well-documented is that the import paths can be arbitrarily remapped and fine-tuned with the --proto_path=VIRTUAL=REAL flag. The major caveat is that it only works on disk sources (i.e., it won't remap descriptor sets), but it effectively makes the generated code import structure match however the imports are stylized in the .proto files.

In other words, if the desired generated output is (notionally):

import a.b.c_pb2

then the .proto file import should be spelled:

import "a/b/c.proto";

and the protocol compiler step should stitch those paths together, say, --proto_path=a/b=some/other_really/long/path, or even --proto_path=a/b/c.proto=d/e/f.proto.

The .proto file path is ultimately the most-cardinal part of identifiers, so keeping consistency there is pretty important for ending up with a working tree of generated code. The best way to approach this tends to be to ensure that imports follow a consistent absolute import structure everywhere, and controlling that path resolution as needed with --proto_path=X=Y remapping.

This tends to fall out nicely with (reasonable :-) ) language-level semantics, like PYTHONPATH (or -I in C/C++, etc.)... ultimately, however you spell the import/#include/whatever in your target language, the .proto file should say the same thing, and then it's up to you how much complexity you want to pay to lay out the files differently and remap them at parse time.

On a side note, I would medium-strongly advise against post-processing the generated sources. We explicitly do not promise stability of how imports are spelled, so it might work today, but it's not guaranteed behavior and might block a future upgrade for you. Even in extreme cases, a forwarding-import-only is usually more principled than trying to fake out things like PYTHONPATH:

# your/target/package/__init__.py:
from actual.generated.path import your_proto_file_pb2
# etc.

or:

# your/target/package/your_proto_file_pb2.py
from actual.generated.path.your_proto_file_pb2 import *

Unfortunately, I think I will have to close this PR. We certainly do appreciate the contribution, but on balance, it sounds to me like this particular problem is just surfacing an issue that probably should be resolved at the .proto file schema level. :-/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.