Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Kernel and Op Implementation and Registration API #133

Merged
merged 6 commits into from
Jun 4, 2020

Conversation

sjamesr
Copy link
Contributor

@sjamesr sjamesr commented Aug 14, 2019

Comment period is open until 2019-08-28

Kernel and Op Implementation and Registration API

Status Proposed
Author(s) James Ring (sjr@google.com), Anna Revinskaya (annarev@google.com)
Sponsor Günhan Gülsoy (gunan@google.com)
Updated 2019-08-14

Objective

Tensorflow (TF) currently provides a C++ API for implementing kernels and ops. The Voltron project aims to create a modular/plugin-based TF implementation with API and ABI surfaces. Plugins will be able to create and register custom kernel and op implementations.

In order to provide a stable ABI, the Voltron team has chosen to provide C APIs to plugin authors. This document introduces the C API for op and kernel registration. For authors who wish to continue using C++ to interface with TensorFlow, an ABI-stable C++ header-only API is provided.

@ewilderj ewilderj self-assigned this Aug 14, 2019
@ewilderj ewilderj added the RFC: Proposed RFC Design Document label Aug 14, 2019
@ewilderj ewilderj changed the title add draft kernel RFC RFC: Kernel and Op Implementation and Registration API Aug 14, 2019
Copy link
Contributor

@alextp alextp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some key concerns are still missing from this proposal.

For example, we need to handle forward compatibility (is there a way for a single kernel to work across versions of TF which are from before it was built?) similar to how we discussed it in the filesystem API RFC.

This proposal would also be much easier to read if it had proposed code files for a full simple example op registration and kernel definition, so we can properly understand the consequences of this API (including error handling, etc).


// Bitcast_Create, Bitcast_Compute and Bitcast_Delete actually implement the
// kernel. See the section below for discussion on kernel implementation.
static void* Bitcast_Create(TF_OpKernelConstruction* context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In TF today op kernel construction can fail (and if often does, as it validates attributes etc). This should have a status somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpKernelConstruction itself contains a status which can be set with TF_OpKernelConstruction_Failure. I'll update the doc.

return (void*) k;
}

static void* Bitcast_Compute(void* k, TF_OpKernelContext* context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the return value here? Outputs are not supposed to be allocated and then returned (they are allocated via the context) so I don't know why this is not void.

Copy link
Contributor Author

@sjamesr sjamesr Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a typo, well spotted :)

void (*compute_func)(void*, TF_OpKernelContext*),
void (*delete_func)(void*));
void TF_RegisterKernelBuilder(const char* name, TF_KernelBuilder* builder,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why two functions instead of one function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kernel builder may be customized with other calls, such as TF_KernelBuilder_TypeConstraint and TF_KernelBuilder_HostMemory, I'll add these to the doc.

int64_t cost_per_unit,
void (*fn)(int64_t, int64_t));
TF_CAPI_EXPORT extern void TF_ThreadPool_ParallelForWithWorkerId(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing which is not obvious when reading this proposal is that in C++ we need to wrap almost every call to a method in the context with the op_requires_ok macro to terminate execution in case of an error.

The same will need to be done here as all of these can fail, so I think we need to be explicit about the code the user is expected to write.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context is just used to call accessors, for e.g.:
auto thread_pool = context->device()->tensorflow_cpu_worker_threads()->workers;

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/matrix_diag_op.cc#L323

We could modify ThreadPool interface though to also return a Status. Then, we can also take TF_Status output parameter in C API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a section below called "Getting Status when using device APIs". Basically, since the status is not on the surface of APIs that we are wrapping (we could add it to ThreadPool, but probably not to Eigen::StreamInterface), I propose to have a separate function call to get status.

Let me know if this sounds good.

@karllessard
Copy link
Contributor

karllessard commented Aug 27, 2019

I don't know if this is in the scope of this RFC but how will be registered ops be exposed through the C API? Right now, both Ops defs and API defs are returned as proto buffers, will that remain the same?

Since at some points we want to move out the different language bindings out of the core repository (e.g. Java), for me it does not make sense to keep the per-language API defs in there. What would be the best strategy then?

@sjamesr, @annarev : any reply for this? thanks

@annarev
Copy link
Contributor

annarev commented Sep 8, 2019

@karllessard sorry I missed your question earlier. See replies inline.

I don't know if this is in the scope of this RFC but how will be registered ops be exposed through the C API? Right now, both Ops defs and API defs are returned as proto buffers, will that remain the same?

Yes, the way Op defs and API defs are returned will remain the same. These accessors are already a part of our C API. But let us know if you think some part of the API won't work well for some languages.

Since at some points we want to move out the different language bindings out of the core repository (e.g. Java), for me it does not make sense to keep the per-language API defs in there. What would be the best strategy then?

I think language bindings can already be removed with current API. You can keep ApiDef files with the language bindings (they are just read when generating language APIs and aren't used anywhere else). For example, Go API is generated based on a set of directories that contain ApiDef files: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/go/genop/main.go#L39 that are added using calls to TF_ApiDefMapPut (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/go/genop/internal/api_def_map.go#L91). So, I would think we can move language API designed in this way to a separate repo. However, I haven't checked how Java APIs are generated.

@karllessard
Copy link
Contributor

However, I haven't checked how Java APIs are generated.

The Java API works pretty much the same, it has a op generator written in C++ that uses directly TF internal classes to read the API defs found in this directory merged with the base one.

But what I'm wondering is that if the op kernels are now implemented and registered outside the core, those API defs cannot reside in the core anymore. Having them in the language bindings would mean that the owners of a plugin must go in each client library (Python, Java, Go...) to add the definition of their op kernels in there as well, which does not sounds right neither.

So then, maybe what would be best is if the API defs are registered by the plugins using similar mechanisms than the ops (e.g. via TF_NewOpDefinitionBuilder, or via a new TF_NewApiDefinitionBuilder method?), and we should let diligent plugin maintainers the ability to provide a API def per language.

What do you think?

@annarev
Copy link
Contributor

annarev commented Sep 11, 2019

Good point. We definitely need a way to register an ApiDef for an op.

@theadactyl
Copy link
Contributor

@sjamesr what's the status of this revision?

@seanpmorgan
Copy link
Member

@sjamesr @gunan @annarev This is the most important change we are looking forward to for TF-Addons as we get a ~few issues per week on ABI issues. Would it be possible to get a status update on this & expected timeline. It would be very beneficial for us to know how much work we should put into workarounds given that this is on the horizon.

@byronyi
Copy link
Contributor

byronyi commented Dec 22, 2019

What’s the current ABI in TF core wheels?

@bhack
Copy link
Contributor

bhack commented Apr 14, 2020

Comment period is open until 2019-08-28. /cc @ematejska

I removed device section of this doc. We never made a final decision on the device solution and then I moved to work on something else. If we come up with a final decision for device, we can create another RFC (I removed my name from authors as well since my only contribution was device API. ).

The rest of the doc can be merged.
@annarev annarev requested a review from ematejska as a code owner June 3, 2020 00:51
@annarev
Copy link
Contributor

annarev commented Jun 3, 2020

@theadactyl , @seanpmorgan , @byronyi Really sorry for not providing any updates earlier. We really haven't had much work on this after initial Kernel API implementation by @sjamesr : https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/kernels.h . Since then, James moved to work on another project and I was focusing on other tasks in TensorFlow.

It seems best to merge this RFC since this doc has been pending for a while and initial implementation was submitted. I removed the device RFC part, since it wasn't finalized. Also, I will keep in mind that we need to support ApiDefs and look into a separate RFC for it when I get time.

For now, this RFC as it stands can be merged.

@byronyi
Copy link
Contributor

byronyi commented Jun 3, 2020

@annarev Thanks for rolling things forward.

Btw, would you mind to share with the community the current status of the "Voltron" project?

@annarev
Copy link
Contributor

annarev commented Jun 4, 2020

I think @mihaimaruseac completed the filesystem part (https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md). Unfortunately, we got caught up in other work and didn't get to spend as much time working on other tasks listed in the doc.

  • @gunan Adding Gunhan in case he has more details

@mihaimaruseac
Copy link
Contributor

Only the POSIX part. Windows should come by end of this quarter and GCS and S3 are part of a GSoC project. If there are no more delays (like there have been in the past), by TF 2.4 we should be able to convert filesystems to the Voltron world

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Jun 4, 2020
@ematejska ematejska merged commit 5001b24 into tensorflow:master Jun 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes RFC: Accepted RFC Design Document: Accepted by Review
Projects
None yet
Development

Successfully merging this pull request may close these issues.