
Fix $PYTHONPATH for hermetic python in TensorFlow builds #3568

Merged
1 commit merged into easybuilders:5.0.x on Feb 21, 2025


@yqshao (Contributor) commented Jan 27, 2025

Trying to fix #3566

@yqshao (Contributor, Author) commented Jan 27, 2025

While testing this, I also realized that the -Og flag no longer works for TF > 2.14: the patch that fixed it was rolled back because it conflicted with some JAX CI (which seems to be running Python 2).

Reapplying that patch does not trivially fix our debug=True build either. For now I'm working around this by disabling debug at our site, but I'm not sure what the best way to handle it in EB is.

@yqshao yqshao changed the title Fix PYTHONPATH for hermetic python in TF builds with EB 5.x Fix PYTHONPATH for hermetic python in TensorFlow builds with EB 5.x Jan 28, 2025
@boegel boegel requested a review from Micket January 29, 2025 08:47
@boegel boegel added bug fix EasyBuild-5.0 EasyBuild 5.0 labels Jan 29, 2025
@boegel boegel added this to the 5.0 milestone Jan 29, 2025
@Micket (Contributor) left a comment

Well, LGTM: it only affects the build step, so it seems safe.

@Micket (Contributor) commented Feb 21, 2025

@yqshao, I would appreciate it if you could upload a test report for this:

eb --upload-test-report --include-easyblocks-from-pr 3568 TensorFlow-2.14...eb

@yqshao (Contributor, Author) commented Feb 21, 2025

Test report by @yqshao

Overview of tested easyconfigs (in order)

  • SUCCESS NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.1.1.eb
  • SUCCESS Zip-3.0-GCCcore-12.3.0.eb
  • SUCCESS Bazel-6.1.0-GCCcore-12.3.0.eb
  • SUCCESS dill-0.3.7-GCCcore-12.3.0.eb
  • SUCCESS flatbuffers-23.5.26-GCCcore-12.3.0.eb
  • SUCCESS flatbuffers-python-23.5.26-GCCcore-12.3.0.eb
  • SUCCESS JsonCpp-1.9.5-GCCcore-12.3.0.eb
  • SUCCESS ml_dtypes-0.3.2-gfbf-2023a.eb
  • SUCCESS nsync-1.26.0-GCCcore-12.3.0.eb
  • SUCCESS OpenSSL-1.1.eb
  • SUCCESS Abseil-20230125.3-GCCcore-12.3.0.eb
  • SUCCESS protobuf-24.0-GCCcore-12.3.0.eb
  • SUCCESS protobuf-python-4.24.0-GCCcore-12.3.0.eb
  • SUCCESS RE2-2023-08-01-GCCcore-12.3.0.eb
  • SUCCESS grpcio-1.57.0-GCCcore-12.3.0.eb
  • SUCCESS tensorboard-2.15.1-gfbf-2023a.eb
  • FAIL (build issue) TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb (partial log available at https://gist.github.com/yqshao/e0da6eafde440dd72596d2e1179f8b8e)

Build succeeded for 16 out of 17 (1 easyconfigs in total)
vera-r05-01 - Linux Rocky Linux 9.4, x86_64, AMD EPYC 9354 32-Core Processor, 2 x NVIDIA NVIDIA H100 NVL, 565.57.01, Python 3.9.18
See https://gist.github.com/yqshao/f0638c6f7b71f68e467d9ae2bff57c12 for a full test report.

@yqshao (Contributor, Author) commented Feb 21, 2025

The report above is an example of the failure with -Og; there are probably some flags to get around it. In any case, TF doesn't seem to care about keeping this working, and hopefully people won't hit it as often after framework#4764, so I gave up. Giving it another go with --disable-keep-debug-symbols.

@yqshao (Contributor, Author) commented Feb 21, 2025

Test report by @yqshao

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
vera-r05-01 - Linux Rocky Linux 9.4, x86_64, AMD EPYC 9354 32-Core Processor, 2 x NVIDIA NVIDIA H100 NVL, 565.57.01, Python 3.9.18
See https://gist.github.com/yqshao/7b1377946ac903fcb787939674172edd for a full test report.

@Micket Micket merged commit 8656ac9 into easybuilders:5.0.x Feb 21, 2025
19 checks passed
@yqshao yqshao deleted the 5.0.x branch February 22, 2025 13:33
@boegel boegel changed the title Fix PYTHONPATH for hermetic python in TensorFlow builds with EB 5.x Fix PYTHONPATH for hermetic python in TensorFlow builds Feb 22, 2025
@boegel boegel changed the title Fix PYTHONPATH for hermetic python in TensorFlow builds Fix $PYTHONPATH for hermetic python in TensorFlow builds Feb 22, 2025
3 participants