Examples of Running a JAX function in C++ #5337

drebain · 2021-01-07T00:56:25Z

Are there any examples available of running jit functions defined in python from C++? I see that there is an interface for generating something usable by XLA but it is a bit unclear how to use the result of this when the function is dependent on variables/weights (e.g. a flax module).

@zhangqiaorjc I am told you have some knowledge of this?

Thanks

The text was updated successfully, but these errors were encountered:

zhangqiaorjc · 2021-01-08T03:19:36Z

I'll show an example for a simple JAX function. Let me know if it suffices for your use case.

There are two steps involved here:

Save a JAX program as an xla::HloModule
Load the HloModule from file, and run it using the C++ Runtime.

Step 1: Use jax/tools/jax_to_hlo.py to save a JAX program.

Suppose we have a dummy JAX program jax/tools/prog.py,

import jax.numpy as jnp

def fn(x, y, z):
  return jnp.dot(x, y) / z

Let's convert it to HLO, with input shapes and constants provided (see usage in jax_to_hlo.py). The commandline options roughly correspond to jax.api.jit options.

$ python3 jax_to_hlo.py \
--fn prog.fn \
--input_shapes '[("x", "f32[2,2]"), ("y", "f32[2,2]")]' \
--constants '{"z": 2.0}' \
--hlo_text_dest /tmp/fn_hlo.txt \
--hlo_proto_dest /tmp/fn_hlo.pb

Pay special attention to the order of parameters specified in --input_shapes. If we had specified --input_shapes '[("y", "f32[2,2]"), ("x", "f32[2,2]")]', x and y would have flipped, even though they are named in the tuple.

Let's see the saved HloModule

$ cat /tmp/fn_hlo.txt
HloModule xla_computation_ordered_wrapper.9

ENTRY xla_computation_ordered_wrapper.9 {
  constant.3 = pred[] constant(false)
  parameter.1 = f32[2,2]{1,0} parameter(0)
  parameter.2 = f32[2,2]{1,0} parameter(1)
  dot.4 = f32[2,2]{1,0} dot(parameter.1, parameter.2), lhs_contracting_dims={1}, rhs_contracting_dims={0}
  constant.5 = f32[] constant(2)
  broadcast.6 = f32[2,2]{1,0} broadcast(constant.5), dimensions={}
  divide.7 = f32[2,2]{1,0} divide(dot.4, broadcast.6)
  ROOT tuple.8 = (f32[2,2]{1,0}) tuple(divide.7)
}

Note the single output with shape f32[2,2]{1,0}.

Step 2: Use PJRT runtime to run the saved HloModule with user-provided input values

Note that there are multiple C++ runtime APIs that we can use to run an HloModule -- HloRunner, LocalClient/LocalExecutable, or PJRT, which are all in the Tensorflow tree. Since I'm most familiar with JAX's runtime API, I'll show an example using PJRT (see tensorflow/compiler/xla/pjrt/pjrt_client.h).

Suppose we have the following BUILD and cc files

/Users/zhangqiaorjc/repo/tensorflow/tensorflow/compiler/xla/examples
└── jax_cpp
    ├── BUILD
    └── main.cc

The BUILD file looks like

load(
    "//tensorflow:tensorflow.bzl",
    "tf_cc_binary",
)

tf_cc_binary(
    name = "main",
    srcs = ["main.cc"],
    deps = [
      "//tensorflow/compiler/xla/tools:hlo_module_loader",
      "//tensorflow/compiler/xla/pjrt:pjrt_client",
      "//tensorflow/compiler/xla/pjrt:cpu_device",
      "//tensorflow/compiler/xla:status",
      "//tensorflow/compiler/xla:statusor",
      "//tensorflow/compiler/xla:literal",
      "//tensorflow/compiler/xla:literal_util",
      "//tensorflow/compiler/xla:shape_util",
      "//tensorflow/compiler/xla/service:hlo_proto_cc",
    ],
)

The cc file looks like

// An example for reading a HloModule from a HloProto file and execute the
// module on PJRT CPU client.

#include <memory>
#include <string>
#include <vector>

#include "tensorflow/compiler/xla/literal.h"
#include "tensorflow/compiler/xla/literal_util.h"
#include "tensorflow/compiler/xla/pjrt/cpu_device.h"
#include "tensorflow/compiler/xla/pjrt/pjrt_client.h"
#include "tensorflow/compiler/xla/status.h"
#include "tensorflow/compiler/xla/statusor.h"
#include "tensorflow/compiler/xla/tools/hlo_module_loader.h"
#include "tensorflow/core/platform/init_main.h"
#include "tensorflow/core/platform/logging.h"

int main(int argc, char** argv) {
  tensorflow::port::InitMain("", &argc, &argv);

  // Load HloModule from file.
  std::string hlo_filename = "/tmp/fn_hlo.txt";
  std::function<void(xla::HloModuleConfig*)> config_modifier_hook =
      [](xla::HloModuleConfig* config) { config->set_seed(42); };
  std::unique_ptr<xla::HloModule> test_module =
      LoadModuleFromFile(hlo_filename, xla::hlo_module_loader_details::Config(),
                         "txt", config_modifier_hook)
          .ValueOrDie();
  const xla::HloModuleProto test_module_proto = test_module->ToProto();

  // Run it using JAX C++ Runtime (PJRT).

  // Get a CPU client.
  std::unique_ptr<xla::PjRtClient> client =
      xla::GetCpuClient(/*asynchronous=*/true).ValueOrDie();

  // Compile XlaComputation to PjRtExecutable.
  xla::XlaComputation xla_computation(test_module_proto);
  xla::CompileOptions compile_options;
  std::unique_ptr<xla::PjRtExecutable> executable =
      client->Compile(xla_computation, compile_options).ValueOrDie();

  // Prepare inputs.
  xla::Literal literal_x =
      xla::LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}});
  xla::Literal literal_y =
      xla::LiteralUtil::CreateR2<float>({{1.0f, 1.0f}, {1.0f, 1.0f}});
  std::unique_ptr<xla::PjRtBuffer> param_x =
      client->BufferFromHostLiteral(literal_x, client->local_devices()[0])
          .ValueOrDie();
  std::unique_ptr<xla::PjRtBuffer> param_y =
      client->BufferFromHostLiteral(literal_y, client->local_devices()[0])
          .ValueOrDie();

  // Execute on CPU.
  xla::ExecuteOptions execute_options;
  // One vector<buffer> for each device.
  std::vector<std::vector<std::unique_ptr<xla::PjRtBuffer>>> results =
      executable->Execute({{param_x.get(), param_y.get()}}, execute_options)
          .ValueOrDie();

  // Get result.
  std::shared_ptr<xla::Literal> result_literal =
      results[0][0]->ToLiteral().ValueOrDie();
  LOG(INFO) << "result = " << *result_literal;
  return 0;
}

To run it

$ bazel run -c opt :main

2021-01-07 17:30:23.472798: I tensorflow/compiler/xla/examples/jax_cpp/main.cc:69] result = (
f32[2,2] {
  { 1.5, 1.5 },
  { 3.5, 3.5 }
}
)

gnecula · 2021-01-08T07:27:54Z

Do you think it makes sense to link this from the FAQ?

drebain · 2021-01-08T17:29:51Z

Awesome! This will be very helpful, thank you.

I only have one question: if I understand correctly, the arguments to PjRtExecutable::Execute correspond to a list of arrays/tensors, so to run a neural network with weights you either need to pass the weights as individual tensor arguments or compile them into the HLO.

Is one of these approaches preferred, or would they be more or less equivalent in terms of performance?

Thanks again!

zhangqiaorjc · 2021-01-08T18:38:29Z

@gnecula we could turn this into a developer doc to make it more discoverable?

@drebain For the question on NN weights as constants or parameters, I believe making them constants will improve dispatch performance. There may be issues with increased memory consumption. I would recommend try both and measure performance and memory consumption.

Do let me know if you find any part of the example unclear, so I can make a developer doc that others can use (running JAX program via C++ API seems to have come up a few times)

Just a clarification on the PjRtExecutable::Execute signature since that part may be confusing to other readers.

  // Executes on devices addressable by the client. Requires executable has a
  // device_assignment and all devices in the device_assignment are addressable
  // by the client.
  virtual StatusOr<std::vector<std::vector<std::unique_ptr<PjRtBuffer>>>>
  Execute(absl::Span<const std::vector<PjRtBuffer*>> argument_handles,
          const ExecuteOptions& options) = 0;

The inner vector correspond to the parameters of the HLO computation. It's a Span<vector> because it's one set of parameters per device.

If it's only a single device, then you just pass in {{param_x.get(), param_y.get()}}. Note the double braces.

zhangqiaorjc · 2021-01-08T19:39:04Z

@drebain a bit of nuance on constants vs parameters, there seems to be some nontrivial tradeoffs (may depend on platforms too)

some XLA experts recommend turning small or scalars to constants while staying away from large constants

unfortunately, it's not easy to experiment with this, so you will have to play around with it to know

drebain · 2021-01-08T19:48:24Z

I see. I guess It makes sense that this would depend on the size of the constants. I will try both on a medium-sized MLP network and report back on which is faster for future reference.

drebain · 2021-01-15T18:17:26Z

Ok, as promised I have done some basic benchmarks and it seems that compiling network weights in as constants gets me a roughly 30% performance increase over passing them as arguments. This is with a 6-layer, 256-unit MLP, running on an RTX 3090.

zhangqiaorjc · 2021-06-22T21:47:48Z

For gpu, you may need the following changes

header include
#include "tensorflow/compiler/xla/pjrt/cpu_device.h"
-->
#include "tensorflow/compiler/xla/pjrt/gpu_device.h"
xla::GetCpuClient --> xla::GetGpuClient
BUILD dep change

"//tensorflow/compiler/xla/pjrt:cpu_device",
-->
"//tensorflow/compiler/xla/pjrt:gpu_device",

zhangqiaorjc · 2021-06-22T21:53:24Z

PJRT is the lowest level runtime API, and only advisable if one wants to avoid direct TF deps.

Otherwise, please use jax2tf and the usual TF C++ server for running a SavedModel.

https://github.com/google/jax/tree/main/jax/experimental/jax2tf

lucagrementieri · 2022-05-19T15:33:12Z

Thank you for the script, it's very useful!
Here, at the end of the script the result is printed, instead I would like to convert it to a Eigen::Matrix object to use it.

Do you know how to cast a xla::Literal to Eigen::Matrix?
Reading the header file for xla::Literal I have found no specific function to do it, but maybe I missed the relevant piece of code.

xueeinstein · 2022-06-18T14:45:23Z

@zhangqiaorjc Thank you for the example, it's very useful! I tried it using the newest codebase, but the output is

2022-06-18 22:26:10.013679: I jax_cpp/main.cc:68] result = (
f32[2,2] {
  { 2, 3 },
  { 2, 3 }
}
)

It looks like xla::LiteralUtil::CreateR2 creates column-major matrix now. It is different from your previous result.

Did this API change recently? Is there any building option to switch to row-major matrix like it in Python?

joaospinto · 2024-06-25T00:19:04Z

I'm trying to determine to what extent something like what @zhangqiaorjc shared above could be used to deploy JAX code into real-time/low-memory/restricted (e.g. no dynamic memory allocation) environments.

For example, would it be possible to "pre-compute" client->Compile, and auto-generate a Bazel library with functionality equivalent to the executable->Execute method of the xla::PjRtExecutable object?

Does the executable->Execute method perform any dynamic memory allocation?

joaospinto · 2024-06-29T00:54:06Z

^ This is discussed here: #22184

kranipa · 2024-07-05T19:43:31Z

I am trying to build the JAX CPP example for GPU Backend. It runs into linking error.
I am using se_gpu_pjrt_client for jax cpp example.

Build file is as follows


load("@tsl//tsl/platform:rules_cc.bzl", "cc_library")
load("//xla:xla.bzl", "xla_cc_binary")


xla_cc_binary(
    name = "main_gpu",
    srcs = ["main_gpu.cc"],
  
    tags = [
        "manual",
    ],
    deps = [
        "//xla/pjrt/gpu:gpu_topology",
        "//xla/pjrt/gpu:se_gpu_pjrt_client",
        "//xla:literal",
        "//xla:literal_util",
        "//xla:shape_util",
        "//xla:status_macros",
        "//xla:xla_data_proto_cc",
        "//xla/client:xla_computation",
        "//xla/ffi",
        "//xla/ffi:ffi_api",
        "//xla/pjrt:host_memory_spaces",
        "//xla/pjrt:pjrt_client",
        "//xla/pjrt:pjrt_executable",
        "//xla/pjrt:pjrt_future",
        "//xla/pjrt:pjrt_stream_executor_client",
        "//xla/pjrt/distributed:in_memory_key_value_store",
        "//xla/service:gpu_plugin",
        "//xla/service:hlo_parser",
        "//xla/service:platform_util",
        "//xla/stream_executor",
        "//xla/stream_executor:executor_cache",
        "@com_google_absl//absl/log:check",
        "@com_google_absl//absl/status",
        "@com_google_absl//absl/status:statusor",
        "@com_google_absl//absl/strings",
        "@com_google_absl//absl/synchronization",
        "@com_google_absl//absl/time",
        "@tsl//tsl/platform:env",
        "@tsl//tsl/platform:errors",
        "@tsl//tsl/platform:status",
        "@tsl//tsl/platform:statusor",
        "//xla/tools:hlo_module_loader",
        "@tsl//tsl/platform:logging",
        "@tsl//tsl/platform:platform_port",
    ],
)

Build Commands

./configure.py --backend=CUDA --host_compiler=gcc --nccl

bazel build --verbose_failures xla/jax_pjrt_demo:main_gpu

Runs into following linking error

  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/k8-opt/bin/xla/jax_pjrt_demo/main_gpu-2.params)
# Configuration: 2e0f9574b87d96e39f67c831c7dc6a1546625e87b8c3e3ca3e42b8605ed3a5c8
# Execution platform: @local_execution_config_platform//:platform
/usr/bin/ld: bazel-out/k8-opt/bin/xla/stream_executor/cuda/libcuda_platform_cuda_only.lo(cuda_platform.o): in function `stream_executor::gpu::CudaPlatform::~CudaPlatform()':
cuda_platform.cc:(.text._ZN15stream_executor3gpu12CudaPlatformD2Ev+0x21): undefined reference to `stream_executor::ExecutorCache::~ExecutorCache()'
/usr/bin/ld: bazel-out/k8-opt/bin/xla/stream_executor/cuda/libcuda_platform_cuda_only.lo(cuda_platform.o): in function `stream_executor::gpu::CudaPlatform::GetExecutor(stream_executor::StreamExecutorConfig const&)':
cuda_platform.cc:(.text._ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE+0x30): undefined reference to `stream_executor::ExecutorCache::Get(stream_executor::StreamExecutorConfig const&)'
/usr/bin/ld: cuda_platform.cc:(.text._ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE+0x79): undefined reference to `stream_executor::ExecutorCache::GetOrCreate(stream_executor::StreamExecutorConfig const&, std::function<absl::lts_20230802::StatusOr<std::unique_ptr<stream_executor::StreamExecutor, std::default_delete<stream_executor::StreamExecutor> > > ()> const&)'
/usr/bin/ld: bazel-out/k8-opt/bin/xla/stream_executor/cuda/libcuda_platform_cuda_only.lo(cuda_platform.o): in function `stream_executor::gpu::CudaPlatform::CudaPlatform()':
cuda_platform.cc:(.text._ZN15stream_executor3gpu12CudaPlatformC2Ev+0x43): undefined reference to `stream_executor::ExecutorCache::ExecutorCache()'
collect2: error: ld returned 1 exit status
Target //xla/jax_pjrt_demo:main_gpu failed to build

peastman · 2024-10-16T18:50:48Z

I tried to use the example above (#5337 (comment)), but in TensorFlow 2.17 the header tensorflow/compiler/xla/tools/hlo_module_loader.h no longer exists. I don't think it's a matter of the API having changed, because it does still exist in the OpenXLA repository. It seems TensorFlow no longer includes the necessary classes to run HLO from C++? Is there any way it can still be done?

zhangqiaorjc self-assigned this Jan 7, 2021

zhangqiaorjc mentioned this issue Jan 8, 2021

jax cpp interface #1871

Closed

zhangqiaorjc mentioned this issue Jan 12, 2021

Add example code to save JAX program and run using C++ runtime. #5383

Merged

zhangqiaorjc linked a pull request Jan 13, 2021 that will close this issue

Add example code to save JAX program and run using C++ runtime. #5383

Merged

copybara-service bot closed this as completed in #5383 Jan 26, 2021

maxstupp mentioned this issue Jan 28, 2021

Failed precondition error when trying to compile hlo #5544

Closed

maxstupp mentioned this issue Feb 23, 2021

Compiling an HLO module without bazel #5814

Open

lucagrementieri mentioned this issue May 5, 2022

Wrong values in tensors for JAX function executed in C++ as xla::HloModule #10588

Open

agoscinski mentioned this issue Mar 16, 2023

How will we interface with different molecular dynamics (MD) codes? lab-cosmo/equisolve#42

Open

VivekPanyam mentioned this issue Sep 29, 2023

XLA runner (to support JAX) VivekPanyam/carton#163

Open

peastman mentioned this issue Jul 25, 2024

JAX/XLA plugin openmm/openmm#4594

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Examples of Running a JAX function in C++ #5337

Examples of Running a JAX function in C++ #5337

drebain commented Jan 7, 2021

zhangqiaorjc commented Jan 8, 2021 •

edited

Loading

gnecula commented Jan 8, 2021

drebain commented Jan 8, 2021

zhangqiaorjc commented Jan 8, 2021

zhangqiaorjc commented Jan 8, 2021

drebain commented Jan 8, 2021

drebain commented Jan 15, 2021

zhangqiaorjc commented Jun 22, 2021

zhangqiaorjc commented Jun 22, 2021

lucagrementieri commented May 19, 2022

xueeinstein commented Jun 18, 2022

joaospinto commented Jun 25, 2024 •

edited

Loading

joaospinto commented Jun 29, 2024

kranipa commented Jul 5, 2024

peastman commented Oct 16, 2024

Examples of Running a JAX function in C++ #5337

Examples of Running a JAX function in C++ #5337

Comments

drebain commented Jan 7, 2021

zhangqiaorjc commented Jan 8, 2021 • edited Loading

gnecula commented Jan 8, 2021

drebain commented Jan 8, 2021

zhangqiaorjc commented Jan 8, 2021

zhangqiaorjc commented Jan 8, 2021

drebain commented Jan 8, 2021

drebain commented Jan 15, 2021

zhangqiaorjc commented Jun 22, 2021

zhangqiaorjc commented Jun 22, 2021

lucagrementieri commented May 19, 2022

xueeinstein commented Jun 18, 2022

joaospinto commented Jun 25, 2024 • edited Loading

joaospinto commented Jun 29, 2024

kranipa commented Jul 5, 2024

peastman commented Oct 16, 2024

zhangqiaorjc commented Jan 8, 2021 •

edited

Loading

joaospinto commented Jun 25, 2024 •

edited

Loading