Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

specify RPATH when linking to libpq #225

Merged
merged 15 commits into from
Sep 15, 2021
Merged

specify RPATH when linking to libpq #225

merged 15 commits into from
Sep 15, 2021

Conversation

davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented Sep 8, 2021

See #213 for context.

I don't think we should land this PR, but I wanted to keep it here for future reference. This was my best attempt at a generalizable pattern for linking to a native library that's not on the runtime linker's default search path. The basic idea is to have the *-sys crate that wraps the native library expose metadata that indicates where it found the native library. Then we use this at the "top level" (here in omicron) to pass an appropriate -R flag to the linker (wrapped in a -W flag because we're really passing it to gcc). This change depends on the "oxide" branch of https://github.com/oxidecomputer/pq-sys, which adds that metadata. (This dependency shows up as a "patch" section in the top level Cargo.toml.)

The good things about this approach:

  • We're still able to leverage the pq-sys crate's logic for locating the library and the resulting RPATH entry is necessarily the same.
  • The logic that adds the RPATH is at the top level, not in pq-sys. That's where it belongs -- see this comment.
  • This approach gives us the flexibility to decide that official builds will put the library at some known place on the deployed system, even if that doesn't match what where it is on the build machine. (We'd need to update the top-level build script logic to do this, and then pass the env var to pq-sys that causes it to find the library we want, but this seems straightforward.)

There's a bunch of ugly here:

  • For every omicron crate that this applies to, I had to add a build.rs script that checks for the pq-sys metadata and uses the cargo:rustc-link-arg option to emit the -R flag that we want.
  • The cargo:rustc-link-arg option is currently unstable, so we need to be on "nightly". It's stabilized in Rust 1.56, which is a couple of months away.
  • Metadata emitted by Cargo build scripts is only available to immediate dependents. We don't depend on pq-sys -- we depend on Diesel. To work around this, I added dependencies on pq-sys wherever we're currently depending on diesel. This relies on our dependency matching Diesel's, too -- if Diesel updated to one that was incompatible with our semver, I'm not sure what would happen.
  • All of this is copy/pasted across most of the crates in this repo. That's because omicron-common depends on diesel (see Use fewer features for the diesel dependency in omicron-common #221?), and most other things depend on omicron-common. We're deliberately trying to set RPATH at the top level, so naturally all of the affected top-level crates need this.

If we were to go with this approach, we could:

  • cleanup the duplicated logic in the build.rs files by putting it into a common crate and using that as a build-dependency
  • fix the XXX comments to explain the situation.
  • double-check the "nightly" toolchain we're using. As I write this up, I realize that I removed the notes about the unstable "asm" feature because my build worked without them, but that may be because I wasn't actually using propolis.

The real killer, though, is that "cargo:rustc-link-arg" is not used when building rustdoc tests. The net result is that builders of omicron still need to set RUSTDOCFLAGS if they want to run the test suite. All this ugliness just prevents them from having to set RUSTFLAGS.

@@ -145,8 +145,10 @@ async fn collection_task(
struct CollectionTask {
// Channel used to send messages from the agent to the actual task. The task owns the other
// side.
#[allow(dead_code)]
Copy link
Collaborator Author

@davepacheco davepacheco Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnaecker I added this here because these unused fields were generating clippy warnings, which were treated as errors by the clippy-lint target. It looks like these are unused but I wasn't sure what the right answer was here. It'd be good to remove these allow directives, one way or another.

I'm not sure why these only generated warnings now. I did update to a newer nightly, and maybe that one has different behavior here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, thanks for cc-ing me. I'm not sure why they never triggered errors before, either, but the updated toolchain seems likely.

These fields are there so that we can communicate with the collection task. For example, updating the interval on which metrics are collected from a producer is accomplished by sending something on the inbox channel. The join handle might be used if we want to stop collecting metrics from a producer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like either one is actually used in the code today, though. At least, I think that's what clippy is saying and I couldn't find a case that contradicted it.

@davepacheco
Copy link
Collaborator Author

After exploring the static linking approach under #234, I've come back around and think this is the most promising direction.

The problem I mentioned earlier regarding rustdoc tests is rust-lang/cargo#9895. It appears confirmed as a bug and I believe someone is working on fixing it. With that fixed, this will be a complete solution, even if it's ugly. In the meantime, we can just tell people to set RUSTDOCFLAGS. This is annoying, but at least the failure mode is only that test binaries are broken.

In terms of the ugliness: on @cbiffle's suggestion I refactored the build script logic into a new omicron-rpaths package in the same workspace. The remaining ugliness is (1) that nearly every crate in the workspace grows a dummy dependency on pq-sys in order to get its build script metadata, and (2) that same list of crates also grows an identical, tiny build.rs that just calls out to omicron-rpaths, and (3) we're on nightly until 1.56, but we were anyway for "asm".

The only thing here that I know needs to be updated before landing this change is the patch in the top level Cargo.toml. We should probably update the "oxide" branch in pq-sys and use that instead.

@davepacheco
Copy link
Collaborator Author

davepacheco commented Sep 15, 2021

Pending the CI checks and the change I mentioned in my last comment, I think this is ready for review.

Reviewers: I would suggest:

@bnaecker
Copy link
Collaborator

bnaecker commented Sep 15, 2021 via email

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working in such depth to fix this. Frankly I had no idea this linkage would be so hairy prior to reading your work in #213 , but hopefully this makes life easier when we inevitably need to link against another native library in the future.

@luqmana I'm super interested in your thoughts here - from my point of view, this whole case is a good reason why rpaths should become a first-class citizen in cargo. Do you think there's anything we could/should do to signal boost this to the cargo team?

I still think an "ideal solution" - which would likely require language/cargo changes to be supported - would be one where a "-sys" crate can define explicit rpaths that are transitively picked up by anyone who depends, directly or indirectly, on that "-sys" crate. That said, I understand the concern about "implicit changes to linker-flags", but having a section like:

[build]
transitive_rpath = true

In a workspace seems much more natural than altering the dependency hierarchy to acquire / act on "DEP_" variables.

Regardless, thank you @dap - this solution looks like it'll cover our native linking use cases while protecting us from unexpected missing rpaths for the foreseeable future.

Comment on lines +69 to +79
* So we need to emit the linker argument here. How do we know what value to
* use? We take the approach used by most build systems: we use the path where
* the library was found on the build machine. But how do we know where that
* was? *-sys packages have non-trivial mechanisms for locating the desired
* library. We don't want to duplicate those here. Instead, we make use of
* metadata emitted by those build scripts, which shows up as "DEP_*"
* environment variables for us.
*
* **Important note:** In order for us to have metadata for these dependencies,
* we must *directly* depend on them. This may mean adding an otherwise unused
* dependency from the top-level package to the *-sys package.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting hack - I recall reading about the "DEP_" variables and playing around with them, but thought this was a non-starter because of the lack of transitiveness. Good idea just adding the direct dependency!

Comment on lines +143 to +145
* expect these to always be set, and if they're not, it's most likely that
* somebody has forgotten to include a required dependency. We want to tell
* them that rather than silently produce unrunnable binaries.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At build time especially, this seems like a reasonable inclination.

Copy link
Contributor

@luqmana luqmana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @davepacheco for the incredible depth in researching all the possible solutions! I think this is a reasonable approach right now barring any specific support from cargo/rust. Mirroring @smklein sentiments, this got incredibly hairy fast.

I think it'll probably be some time before we can realistically expect a first-class cargo solution there given the limited bandwidth of the cargo team (correct me if I'm wrong @steveklabnik). To really get it accelerated I'd think maybe writing up a pre-RFC and posting to rust-internals would be good. Our own issues and the ones linked above definitely indicate this isn't just a pain point for us. If we can come up with a reasonably scoped surface I'd be happy to help implement/drive it forward.

@davepacheco
Copy link
Collaborator Author

Thanks @luqmana and @smklein for taking a look and for all the help along the way.

I'd love to move the Cargo solution forward, too. There's a lot of good context in the Cargo issue. I'm not sure we have consensus internally on the best approach. It might be helpful to write down the goals and see if we can't find a common solution. I think @smklein wants a package to be able to encapsulate its RPATH requirements (correct me if that's wrong), and other folks (both on the Cargo issue and internally) want to specify this from the top level build. It'd help to write down the use cases where each is desirable. I expect we can come up with a solution where, for folks using pkg-config and its ilk that assume the build machine mirrors the production machine, this all "just works" (i.e., top-level builders don't need to do anything), while still allowing the folks who want to control this from the top level to do so.

@davepacheco davepacheco marked this pull request as ready for review September 15, 2021 22:43
@davepacheco davepacheco merged commit e34626f into main Sep 15, 2021
@davepacheco davepacheco deleted the issue-213-2 branch September 15, 2021 22:44
@smklein
Copy link
Collaborator

smklein commented Sep 15, 2021

I'd love to move the Cargo solution forward, too. There's a lot of good context in the Cargo issue. I'm not sure we have consensus internally on the best approach. It might be helpful to write down the goals and see if we can't find a common solution. I think @smklein wants a package to be able to encapsulate its RPATH requirements (correct me if that's wrong), and other folks (both on the Cargo issue and internally) want to specify this from the top level build. It'd help to write down the use cases where each is desirable. I expect we can come up with a solution where, for folks using pkg-config and its ilk that assume the build machine mirrors the production machine, this all "just works" (i.e., top-level builders don't need to do anything), while still allowing the folks who want to control this from the top level to do so.

I spent a while chatting with @rmustacc on this topic today, as I understand it:

The advantage of specifying the rpath at a binary / top-level point-of-view is that, at the point a binary is being produced, the target machine should be "known" - as should the target rpath. After all, the .dynamic section of the ELF file needs to be filled with something at this point. I'm definitely convinced that a binary which depends on pq-sys should be able to influence the rpath in a way that, in isolation, pq-sys may not be able to do on its own. One solution is very similar to the result of this PR: all top-level targets explicitly specify the native libraries on which they depend, even indirectly.

I 100% understand the need for top-level targets to be able to determine where rpaths should be uncovered - but it still seems like a bummer that these targets need to basically have hard-coded knowledge about all used native libraries, even if they're the result of a transitive crate dependency. That's the key part I'm wondering about - if the top-level build was responsible for specifying the rpath, but internal "*-sys" crates could emit the required rpaths in some well-known format that could be aggregated somewhere, I think this could accomplish the same goal.

Like, right now, the DEP_PQ_LIBDIRS environment variable gets passed from the pq-sys crate up to the top-level crates, but only because:

  • The top-level crates directly depend on pq-sys (artificially! just to see the environment variable)
  • omicron-rpaths has hardcoded this environment variable (within RPATH_ENV_VARS).

I could imagine a feature where - similar to how how the DEP_* variables are propagated upward from dependencies - a set of rpaths are bubbled up transitively from dependencies as an environment variable.

The top-level build could opt-in to using this environment variable (... or whatever mechanism cargo prefers), passing it to the linker, but IMO this would have some advantages over what we're being forced to do now:

  • No more direct dependencies on the "*-sys" crates! "omicron-nexus" wouldn't need to enumerate pq-sys explicitly, it could just indicate that "for each RPATH in TRANSITIVE_RPATHS, pass 'em to cargo:rustc-link-arg=-Wl,-R...".
  • Dependencies could change which "*-sys" crates they're using, and the top-level targets wouldn't need to be modified (explicitly).

This would basically change the framing of the problem to: "when you build a top-level target, you see all the rpaths recommended by your dependencies. What do you wanna do with 'em?"

With this setup, there would be the option to customize behavior arbitrarily, but also the potential for really concise semantic sugar to say "... yeah, give them all to the linker".

@davepacheco
Copy link
Collaborator Author

I think the basic idea makes sense. It's the details I'm not sure about: exactly what information do the *-sys crates expose, how do they get that information, how do they expose it, how do top level crates see that, and how do top level crates use it to control the final RPATH?

I still think it's essential that we articulate the two different use cases. Correct me if this is wrong, but I believe the one you're describing is: "my build machine matches my production machine, and I don't want to the top level crate to have to think about the RPATHs for native libraries that are pulled in by transitive dependencies." I think that's a common goal and important to support.

I'd like to have a concrete example of what I'll call the "release engineering" case. A simple way to package up Nexus might be to deliver /opt/oxide/bin/nexus with an RPATH pointing to /opt/ooce/pgsql-13/lib/amd64 so it can find libpq. That's easy, though. More interesting would be if we wanted to avoid depending on /opt/ooce and instead build an artifact that contains:

/opt/oxide/bin/nexus (RPATH: $ORIGIN/../lib)
/opt/oxide/lib/libpq.so

Now suppose some Nexus saga wants to use pg_dump (a PostgreSQL command built on libpq), and for whatever reason it needs a newer (or older, or just different) build of libpq. We might want to ship:

/opt/oxide/bin/nexus (RPATH: $ORIGIN/../lib/nexus)
/opt/oxide/bin/pg_dump (RPATH: $ORIGIN/../lib/pg_dump)
/opt/oxide/lib/nexus/libpq.so
/opt/oxide/lib/pg_dump/libpq.so

(openssl might be a better example than libpq.)

Is that the release engineering case? Is that too contrived?

@smklein
Copy link
Collaborator

smklein commented Sep 16, 2021

I think that's a super reasonable case - and while I want to make the "host = target" case easy, anything proposed should definitely be able to support that case of "the target machine may not look like my own".

This is super handwavey, but I think the case you mentioned could basically still be handled effectively if the rpath information emitted by "*-sys" crates was more structured.

As an example, suppose the "*-sys" crate could emit:

  • "This is a native library I need (by name, not path)"
  • "This is the path to the native library (on your host machine)"

At a top-level, a resolver in the build.rs file could be used to identify "when you're building this binary, you need libpq.so. Here's where you should look for that path on your target machine". This "resolver" would be bespoke for whatever target environment we're talking about - as an example, this could be where, for helios, we say "look in /opt/oxide/wherever" - but presumably that mapping of "libpq.so -> path on target to libpq.so" needs to happen somewhere.

Regarding "how does it get exposed up", I dunno. The rust book section on native libraries suggests the "key-value pairs" are basically the go-to.

Also, regarding "*-sys packages", the book mentions the following:

The set of *-sys packages provides a common set of dependencies for linking to native libraries. There are a number of benefits earned from having this convention of native-library-related packages:

  • Common dependencies on foo-sys alleviates the rule about one package per value of links.
  • Other -sys packages can take advantage of the DEP_NAME_KEY=value environment variables to better integrate with other packages. See the "Using another sys crate" example.
  • A common dependency allows centralizing logic on discovering libfoo itself (or building it from source).
  • These dependencies are easily overridable.

The language here implies - to me, at least - that the goal of the "*-sys" packages is to provide a single, centralized location regarding "how do we find the underlying native library" - or in their words:

centralizing logic on discovering libfoo itself (or building it from source)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants