-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tools to build TF AOT models #9093
Add tools to build TF AOT models #9093
Conversation
A new Pull Request was created by @riga for branch IB/CMSSW_14_1_X/master. @cmsbuild, @iarspider, @smuzaffar, @aandvalenzuela can you please review it and eventually sign? Thanks. |
cms-bot internal usage |
py3-cms-tfaot.spec
Outdated
@@ -0,0 +1,22 @@ | |||
### RPM external py3-cms-tfaot 1.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@riga , for pip based packages it is batter to use the default way of building ( i.e. using https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/build-with-pip.file ). So please drop this spec and add cmsdist/pip/cms-tfaot.file
with following contents
Requires: py3-cmsml py3-tensorflow
%define github_user riga
%define tag a2bbd06cbed0efa1fa191cc2f50a6b03067e59d2
%define branch master
%define source0 git+https://github.com/%{github_user}/cms-tfaot.git?obj=%{branch}/%{tag}&export=%{n}-%{realversion}&output=/%{n}-%{realversion}-%{tag}.tgz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Changed in 7838448.
tensorflow-xla-runtime.spec
Outdated
|
||
# build the shared library, linking missing symbols from abseil | ||
gcc -shared -o libtf_xla_runtime.so -Wl,--whole-archive libtf_xla_runtime.a -Wl,--no-whole-archive \ | ||
-L${ABSEIL_CPP_ROOT}/lib -labsl_strings -labsl_str_format_internal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you understand why there are missing symbols? May be it is not picking up our abseil-cpp properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@riga, I changed the cmake file (CMakeLists.txt) to have
add_library(tf_xla_runtime SHARED
$<TARGET_OBJECTS:tf_xla_runtime_objects>
)
and then looking at the missing abseil
symbols, I get
U _ZN4absl12lts_2022062319str_format_internal10FormatPackB5cxx11ENS1_21UntypedFormatSpecImplENS0_4SpanIKNS1_13FormatArgImplEEE
U _ZN4absl12lts_2022062319str_format_internal13FormatArgImpl8DispatchIiEEbNS2_4DataENS1_24FormatConversionSpecImplEPv
U _ZN4absl12lts_2022062319str_format_internal13FormatArgImpl8DispatchISt17basic_string_viewIcSt11char_traitsIcEEEEbNS2_4DataENS1_24FormatConversionSpecImplEPv
U _ZN4absl12lts_202206239StrAppendEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_8AlphaNumE
all of these are defined in the absl_strings
library. We can patch [a] xla_aot_runtime_src/CMakeLists.txt
and then it should generate the shared library (properly linked to absl::strings)
[a]
--- xla_aot_runtime_src/CMakeLists.txt 2024-03-24 08:28:34.000000000 +0100
+++ xla_aot_runtime_src/CMakeLists.txt 2024-03-25 11:17:58.108587945 +0100
@@ -14,6 +14,8 @@
-Wno-sign-compare
)
-add_library(tf_xla_runtime STATIC
+find_package(absl REQUIRED)
+add_library(tf_xla_runtime SHARED
$<TARGET_OBJECTS:tf_xla_runtime_objects>
)
+target_link_libraries(tf_xla_runtime absl::strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you understand why there are missing symbols? May be it is not picking up our abseil-cpp properly?
Only partially. Indeed it seems that it didn't find the abseil-cpp libraries which is now fixed. For the missing symbols of the tsl library I think the reason is that the tensorflow_xla_runtime
sources (shipped with py3-tensorflow
) only provide a subset of the implementations defined by headers.
These symbols aren't needed anyway - unlike the absl ones - hence the -Wl,--unresolved-symbols=ignore-in-shared-libs
in the toolfile. (I guess one could also fetch and compile the missing sources, though at the risk of inflating the runtime with unnecessary symbols.)
I added a patch in 7838448 and removed the initial static build step.
Pull request #9093 was updated. |
</client> | ||
<lib name="tf_xla_runtime-static"/> | ||
<lib name="tf_xla_runtime"/> | ||
<flags LDFLAGS="-Wl,--unresolved-symbols=ignore-in-shared-libs"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@riga , we do not need these ld flags any more .. right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok, I now have read your #9093 (comment) . I will check what else we can do to avoid the missing symbols. We should not add this flag here otherwise scram will not fail for missing symbols for every cmssw shared lib which depend on tensorflow-xml-runtime (directly or in-directly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this already felt like a dangerous thing to do. Do you think there is any other option besides a) fetching the missing source files to tensorflow/tsl
before compiling, or b) installing the full tsl as a dependency (still seems a bit like work in progress)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@riga , all the missing TSL symbols are coming from https://github.com/tensorflow/tensorflow/blob/v2.12.1/tensorflow/compiler/xla/service/cpu/runtime_fork_join.cc e.g various macros calls VLOG, CHECK_*
and tsl::BlockingCounter
. If you think we can run without these missing symbols ( e.g. we never call these) so why not just then drop the runtime_fork_join.cc
from the compilation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also add -Wl,-z,defs
to cmake CXXFLAGS to make sure that the shared library has no missing symbols. I mean at link time make
will fail if there are any missing symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think we can run without these missing symbols ( e.g. we never call these) so why not just then drop the runtime_fork_join.cc from the compilation?
Ok, so when we used the static lib before, the corresponding object files were likely dropped already, makes sense. And since we will potentially never use eigen thread pools for inference, we can skip it right away. Then I guess we are lucky that the tsl dependence only affects that part of the xla runtime source that we anyway don't need (well, at least for now).
We just verified that everything still works 👍
Pull request #9093 was updated. |
test parameters: |
please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38430/summary.html The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic: You can see more details here: Comparison SummarySummary:
|
please test |
Pull request #9093 was updated. |
please test lets run the final tests |
please test for el8_aarch64_gcc12 |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38490/summary.html |
+externals |
This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_14_1_X/master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2) |
REMINDER @antoniovilela, @rappoccio, @sextonkennedy: This PR was tested with cms-sw/cmssw#44519, please check if they should be merged together |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38489/summary.html Comparison SummarySummary:
|
please test lets see if this can go in without cmssw PR |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38504/summary.html The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic: You can see more details here: Comparison SummarySummary:
|
+1 |
This PR is a continuation of #9005 and adds all necessary tools to AOT compile TensorFlow models provided via external resources or data repositories.
cms_tfaot
is provided which forwards the compilation commands to the already existing cmsml package and dynamically builds tool files. It also has adev
mode which produces more verbose outputs with instructions for local compilations.tfaot-compile.file
which only requires 2 variables. With that, each model should have its own spec file which can##INCLUDE tfaot-compile.file
.*.so
normally, the plugin build process in CMSSW will complain about missing symbols from absl and tsl. For this reason we manually linked to 2 absl libraries and disabled complaints about unresolved symbols of the tsl library (shipped incomplete within the xla runtime source, i.e., some symbols are missing but also not required when built as a static lib) via-Wl,--unresolved-symbols=ignore-in-shared-libs
in the tool file.I enabled edit permissions, so feel free to move the
cms-tfaot
tool to cms-externals.This PR is needed by cms-sw/cmssw#44519.
@Bogdan-Wiederspan