Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tools to build TF AOT models #9093

Merged

Conversation

riga
Copy link
Contributor

@riga riga commented Mar 22, 2024

This PR is a continuation of #9005 and adds all necessary tools to AOT compile TensorFlow models provided via external resources or data repositories.

  • A new, small external package cms_tfaot is provided which forwards the compilation commands to the already existing cmsml package and dynamically builds tool files. It also has a dev mode which produces more verbose outputs with instructions for local compilations.
  • The package also provides 2 test models which will be required by the unittests in PhysicsTools/TensorFlowAOT.
  • The compilation instructions are handled by a tfaot-compile.file which only requires 2 variables. With that, each model should have its own spec file which can ##INCLUDE tfaot-compile.file.
  • The tensorflow-xla-runtime is now provided as a shared library. However, it seems like this is not necessarily the intention. When building the *.so normally, the plugin build process in CMSSW will complain about missing symbols from absl and tsl. For this reason we manually linked to 2 absl libraries and disabled complaints about unresolved symbols of the tsl library (shipped incomplete within the xla runtime source, i.e., some symbols are missing but also not required when built as a static lib) via -Wl,--unresolved-symbols=ignore-in-shared-libs in the tool file.

I enabled edit permissions, so feel free to move the cms-tfaot tool to cms-externals.

This PR is needed by cms-sw/cmssw#44519.

@Bogdan-Wiederspan

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @riga for branch IB/CMSSW_14_1_X/master.

@cmsbuild, @iarspider, @smuzaffar, @aandvalenzuela can you please review it and eventually sign? Thanks.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 22, 2024

cms-bot internal usage

@@ -0,0 +1,22 @@
### RPM external py3-cms-tfaot 1.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga , for pip based packages it is batter to use the default way of building ( i.e. using https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/build-with-pip.file ). So please drop this spec and add cmsdist/pip/cms-tfaot.file with following contents

Requires: py3-cmsml py3-tensorflow
%define github_user riga
%define tag a2bbd06cbed0efa1fa191cc2f50a6b03067e59d2
%define branch master
%define source0 git+https://github.com/%{github_user}/cms-tfaot.git?obj=%{branch}/%{tag}&export=%{n}-%{realversion}&output=/%{n}-%{realversion}-%{tag}.tgz 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Changed in 7838448.


# build the shared library, linking missing symbols from abseil
gcc -shared -o libtf_xla_runtime.so -Wl,--whole-archive libtf_xla_runtime.a -Wl,--no-whole-archive \
-L${ABSEIL_CPP_ROOT}/lib -labsl_strings -labsl_str_format_internal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you understand why there are missing symbols? May be it is not picking up our abseil-cpp properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga, I changed the cmake file (CMakeLists.txt) to have

add_library(tf_xla_runtime SHARED
  $<TARGET_OBJECTS:tf_xla_runtime_objects>
)

and then looking at the missing abseil symbols, I get

                 U _ZN4absl12lts_2022062319str_format_internal10FormatPackB5cxx11ENS1_21UntypedFormatSpecImplENS0_4SpanIKNS1_13FormatArgImplEEE
                 U _ZN4absl12lts_2022062319str_format_internal13FormatArgImpl8DispatchIiEEbNS2_4DataENS1_24FormatConversionSpecImplEPv
                 U _ZN4absl12lts_2022062319str_format_internal13FormatArgImpl8DispatchISt17basic_string_viewIcSt11char_traitsIcEEEEbNS2_4DataENS1_24FormatConversionSpecImplEPv
                 U _ZN4absl12lts_202206239StrAppendEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_8AlphaNumE

all of these are defined in the absl_strings library. We can patch [a] xla_aot_runtime_src/CMakeLists.txt and then it should generate the shared library (properly linked to absl::strings)

[a]

--- xla_aot_runtime_src/CMakeLists.txt    2024-03-24 08:28:34.000000000 +0100
+++ xla_aot_runtime_src/CMakeLists.txt      2024-03-25 11:17:58.108587945 +0100
@@ -14,6 +14,8 @@
   -Wno-sign-compare
 )
 
-add_library(tf_xla_runtime STATIC
+find_package(absl REQUIRED)
+add_library(tf_xla_runtime SHARED
   $<TARGET_OBJECTS:tf_xla_runtime_objects>
 )
+target_link_libraries(tf_xla_runtime absl::strings)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you understand why there are missing symbols? May be it is not picking up our abseil-cpp properly?

Only partially. Indeed it seems that it didn't find the abseil-cpp libraries which is now fixed. For the missing symbols of the tsl library I think the reason is that the tensorflow_xla_runtime sources (shipped with py3-tensorflow) only provide a subset of the implementations defined by headers.

tsl

These symbols aren't needed anyway - unlike the absl ones - hence the -Wl,--unresolved-symbols=ignore-in-shared-libs in the toolfile. (I guess one could also fetch and compile the missing sources, though at the risk of inflating the runtime with unnecessary symbols.)


I added a patch in 7838448 and removed the initial static build step.

@cmsbuild
Copy link
Contributor

Pull request #9093 was updated.

</client>
<lib name="tf_xla_runtime-static"/>
<lib name="tf_xla_runtime"/>
<flags LDFLAGS="-Wl,--unresolved-symbols=ignore-in-shared-libs"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga , we do not need these ld flags any more .. right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, I now have read your #9093 (comment) . I will check what else we can do to avoid the missing symbols. We should not add this flag here otherwise scram will not fail for missing symbols for every cmssw shared lib which depend on tensorflow-xml-runtime (directly or in-directly)

Copy link
Contributor Author

@riga riga Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this already felt like a dangerous thing to do. Do you think there is any other option besides a) fetching the missing source files to tensorflow/tsl before compiling, or b) installing the full tsl as a dependency (still seems a bit like work in progress)?

Copy link
Contributor

@smuzaffar smuzaffar Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga , all the missing TSL symbols are coming from https://github.com/tensorflow/tensorflow/blob/v2.12.1/tensorflow/compiler/xla/service/cpu/runtime_fork_join.cc e.g various macros calls VLOG, CHECK_* and tsl::BlockingCounter. If you think we can run without these missing symbols ( e.g. we never call these) so why not just then drop the runtime_fork_join.cc from the compilation?

Copy link
Contributor

@smuzaffar smuzaffar Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also add -Wl,-z,defs to cmake CXXFLAGS to make sure that the shared library has no missing symbols. I mean at link time make will fail if there are any missing symbols.

Copy link
Contributor Author

@riga riga Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think we can run without these missing symbols ( e.g. we never call these) so why not just then drop the runtime_fork_join.cc from the compilation?

Ok, so when we used the static lib before, the corresponding object files were likely dropped already, makes sense. And since we will potentially never use eigen thread pools for inference, we can skip it right away. Then I guess we are lucky that the tsl dependence only affects that part of the xla runtime source that we anyway don't need (well, at least for now).

We just verified that everything still works 👍

@cmsbuild
Copy link
Contributor

Pull request #9093 was updated.

@smuzaffar
Copy link
Contributor

smuzaffar commented Mar 26, 2024

test parameters:

@smuzaffar
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38430/summary.html
COMMIT: d2d2575
CMSSW: CMSSW_14_1_X_2024-03-26-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9093/38430/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38430/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38430/git-merge-result

Comparison Summary

Summary:

  • You potentially removed 9 lines from the logs
  • Reco comparison results: 39 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3297857
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3297837
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

Pull request #9093 was updated.

@smuzaffar
Copy link
Contributor

please test

lets run the final tests

@smuzaffar
Copy link
Contributor

please test for el8_aarch64_gcc12

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38490/summary.html
COMMIT: a964e74
CMSSW: CMSSW_14_1_X_2024-03-27-2300/el8_aarch64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9093/38490/install.sh to create a dev area with all the needed externals and cmssw changes.

@smuzaffar
Copy link
Contributor

smuzaffar commented Mar 28, 2024

+externals

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_14_1_X/master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)
Notice This PR was tested with additional Pull Request(s), please also merge them if necessary: cms-sw/cmssw#44519

@cmsbuild
Copy link
Contributor

REMINDER @antoniovilela, @rappoccio, @sextonkennedy: This PR was tested with cms-sw/cmssw#44519, please check if they should be merged together

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38489/summary.html
COMMIT: a964e74
CMSSW: CMSSW_14_1_X_2024-03-27-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9093/38489/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 124 lines to the logs
  • Reco comparison results: 42 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3297437
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3297411
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Copy link
Contributor

please test

lets see if this can go in without cmssw PR

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38504/summary.html
COMMIT: a964e74
CMSSW: CMSSW_14_1_X_2024-03-28-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9093/38504/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38504/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e51cbd/38504/git-merge-result

Comparison Summary

Summary:

  • You potentially added 11 lines to the logs
  • Reco comparison results: 40 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3297469
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3297443
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@antoniovilela
Copy link

+1

@cmsbuild cmsbuild merged commit 5b35acb into cms-sw:IB/CMSSW_14_1_X/master Mar 31, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants