feat(pt): allow PT OP CXXABI different from TF #3891

Merged: 16 commits merged into deepmodeling:devel on Jun 21, 2024

Conversation

@njzjz njzjz (Member) commented Jun 20, 2024

  • Build PT OP libraries with a compatible CXXABI if PT has a different CXX ABI from TF;
  • Enable PT OP in test_cuda workflow;
  • Update documentation.

Summary by CodeRabbit

  • Documentation

    • Removed outdated instructions related to setting environment variables for enabling customized C++ OPs in PyTorch.
  • Chores

    • Updated build configuration to handle PyTorch CXX11 ABI compatibility with TensorFlow.
    • Refactored library creation processes for better handling of CUDA and ROCm toolkits.
    • Improved build scripts to dynamically adjust compile definitions and installation paths based on different build configurations.
  • CI/CD

    • Enhanced the continuous integration workflow to include PyTorch variable assignments and settings for testing with CUDA.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
coderabbitai bot (Contributor) commented Jun 20, 2024

Walkthrough

This update enhances the build process for managing PyTorch and TensorFlow compatibility by introducing checks and appropriate settings for the CXX11 ABI flag, refactoring GPU library creation, and adding conditional linking. These modifications support customized C++ operations without requiring explicit environment variables, ensuring seamless integration and compatibility within different build configurations.
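
As a concrete illustration, a minimal CMake sketch of the kind of ABI check described above follows. `OP_CXX_ABI_PT` appears in the PR discussion; `OP_CXX_ABI` (TensorFlow's ABI) is an assumption here, and the actual logic in source/CMakeLists.txt may differ.

```cmake
# Sketch only: read the CXX11 ABI from libtorch's exported compile flags
# and compare it with the ABI already chosen for TensorFlow.
# OP_CXX_ABI (TensorFlow's ABI, 0 or 1) is assumed to be set earlier.
find_package(Torch REQUIRED) # defines TORCH_CXX_FLAGS
string(REGEX MATCH "_GLIBCXX_USE_CXX11_ABI=([01])" _abi_match
             "${TORCH_CXX_FLAGS}")
set(OP_CXX_ABI_PT "${CMAKE_MATCH_1}")
if(NOT "${OP_CXX_ABI_PT}" STREQUAL "${OP_CXX_ABI}")
  message(
    STATUS
      "PyTorch CXX11 ABI (${OP_CXX_ABI_PT}) differs from TensorFlow's "
      "(${OP_CXX_ABI}); PT OP libraries will be built with their own ABI.")
endif()
```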

Changes

| File | Change Summary |
| --- | --- |
| doc/install/install-from-source.md | Removed instructions and explanations related to the DP_ENABLE_PYTORCH environment variable and CXX11 ABI flags. |
| source/CMakeLists.txt | Handled PyTorch CXX11 ABI compatibility checks with TensorFlow, setting compatibility flags accordingly. |
| source/api_cc/CMakeLists.txt | Linked libraries conditionally based on ABI flag comparisons to adjust build configurations for libraries. |
| source/lib/CMakeLists.txt | Introduced a create_library function for dynamic library creation with suffixes based on CUDA/ROCm toolkits. |
| source/lib/src/gpu/CMakeLists.txt | Refactored GPU library creation and linking process, removed unused compile definitions, and handled ABI compatibility. |
| source/op/pt/CMakeLists.txt | Implemented conditional linking and adjusted compile definitions based on PyTorch ABI settings. |
| .github/workflows/test_cuda.yml | Added environment variable exports and assignments to streamline testing configurations. |

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR, between commits c85b559 and ee0f409.

Files selected for processing (2)
  • .github/workflows/test_cuda.yml (1 hunks)
  • doc/install/install-from-source.md (1 hunks)
Additional context used
LanguageTool
doc/install/install-from-source.md

[style] ~89-~89: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing. (REP_WANT_TO_VB)
Context: ...e $deepmd_venv/bin/activate ``` if one wants to skip out of the virtual environment, he...


[uncategorized] ~146-~146: Did you mean: “By default,”? (BY_DEFAULT_COMMA)
Context: ... The path to TensorFlow Python library. By default the installer only finds TensorFlow und...


[uncategorized] ~190-~190: “your” (belonging to you) seems less likely than “you”. (AI_HYDRA_LEO_CP_YOUR_YOU)
Context: ...nccl pip install horovod mpi4py ``` If your work in a CPU environment, please prepa...


[style] ~295-~295: Consider using “incompatible” to avoid wordiness. (NOT_ABLE_PREMIUM)
Context: ...rch/pytorch/issues/51039), which may be not compatible with other libraries. ```bash cmake -D...

Markdownlint
doc/install/install-from-source.md

335-335: MD014 (commands-show-output)
Dollar signs used before commands without showing output


336-336: MD014 (commands-show-output)
Dollar signs used before commands without showing output


107-107: MD040 (fenced-code-language)
Fenced code blocks should have a language specified


153-153: MD040 (fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (1)
.github/workflows/test_cuda.yml (1)

52-58: Ensure correct environment variable assignment and usage.

The environment variables PYTORCH_ROOT and TENSORFLOW_ROOT are being set using Python commands, which is a robust way to dynamically determine their paths. However, the DP_ENABLE_PYTORCH variable is set to 1 without conditional checks, which might not be suitable for all build configurations or might conflict with other settings. Consider adding conditions or configurations to enable/disable this based on specific needs.
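
For context, the kind of step being reviewed might look like the sketch below. This is illustrative, not the verbatim workflow; in particular, the importlib lookup for the TensorFlow path is an assumption.

```bash
# Resolve the installed package locations via Python, then expose them
# to the build; DP_ENABLE_PYTORCH=1 unconditionally enables the PT OPs.
export PYTORCH_ROOT=$(python -c 'import torch; print(torch.__path__[0])')
export TENSORFLOW_ROOT=$(python -c 'import importlib.util; print(importlib.util.find_spec("tensorflow").submodule_search_locations[0])')
export DP_ENABLE_PYTORCH=1
```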

Verification successful

The search output confirms that DP_ENABLE_PYTORCH is set to 1 without any conditional logic in the .github/workflows/test_cuda.yml file. This suggests that the variable is always enabled in this workflow.

To ensure that this is appropriate for all configurations, we should inspect the workflow file for any other relevant configurations or steps that might influence the necessity of DP_ENABLE_PYTORCH.

Let's search for other environment variables or steps in the workflow that might provide context on whether DP_ENABLE_PYTORCH should always be enabled or if there are scenarios where it should be conditional.


To understand the context and necessity of DP_ENABLE_PYTORCH, we need to extract and analyze the sections of the workflow file containing environment variables (env:) and steps (steps:).


The extracted sections from the workflow file show that DP_ENABLE_PYTORCH is set to 1 without any conditional logic. Additionally, the presence of other environment variables and steps does not indicate any scenarios where DP_ENABLE_PYTORCH should be disabled or conditionally set.

Given this, it appears that DP_ENABLE_PYTORCH is intended to be enabled for all configurations within this workflow. Therefore, the current setup seems appropriate, and no conditional logic is necessary for DP_ENABLE_PYTORCH.

  • .github/workflows/test_cuda.yml: DP_ENABLE_PYTORCH is set to 1 unconditionally.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Ensure that DP_ENABLE_PYTORCH is set correctly based on the build configurations.
# Test: Search for conditional logic related to DP_ENABLE_PYTORCH in the workflow file. Expect: Conditional logic that sets DP_ENABLE_PYTORCH based on certain parameters.
rg --type yaml 'DP_ENABLE_PYTORCH' .github/workflows/test_cuda.yml

Length of output: 95



Script:

#!/bin/bash
# Description: Search for environment variables and steps in the workflow file to understand the context of DP_ENABLE_PYTORCH.
# Test: Look for other environment variables and steps in the workflow file that might influence the necessity of DP_ENABLE_PYTORCH.
rg --type yaml 'env:' .github/workflows/test_cuda.yml
rg --type yaml 'steps:' .github/workflows/test_cuda.yml

Length of output: 175



Script:

#!/bin/bash
# Description: Extract sections containing environment variables and steps for detailed inspection.
# Test: Use rg to extract the sections containing env: and steps: from the workflow file.
rg --type yaml -A 5 'env:' .github/workflows/test_cuda.yml
rg --type yaml -A 5 'steps:' .github/workflows/test_cuda.yml

Length of output: 1340


@njzjz njzjz (Member, Author) commented Jun 20, 2024

It seems to me that we still cannot enable PyTorch by default, as the default PyTorch wheel on PyPI requires NVCC...

CUDA_TOOLKIT_ROOT_DIR not found or specified
CMake Warning at /home/runner/.cache/uv/environments-v0/.tmpgyfA4c/lib/python3.11/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:31 (message):
  Caffe2: CUDA cannot be found.  Depending on whether you are building Caffe2
  or a Caffe2 dependent library, the next warning / error will give you more
  info.
Call Stack (most recent call first):
  /home/runner/.cache/uv/environments-v0/.tmpgyfA4c/lib/python3.11/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:87 (include)
  /home/runner/.cache/uv/environments-v0/.tmpgyfA4c/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:185 (find_package)


CMake Error at /home/runner/.cache/uv/environments-v0/.tmpgyfA4c/lib/python3.11/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:91 (message):
  Your installed Caffe2 version uses CUDA but I cannot find the CUDA
  libraries.  Please set the proper CUDA prefixes and / or install CUDA.
Call Stack (most recent call first):
  /home/runner/.cache/uv/environments-v0/.tmpgyfA4c/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:185 (find_package)

xref: pytorch/pytorch#78530

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Outside diff range and nitpick comments (5)
source/lib/CMakeLists.txt (1)

15-58: Approve the introduction of create_library function for modularity.

The create_library function significantly improves the modularity and reusability of the library creation process. Ensure that the function is used consistently across the project to maintain uniformity.

Consider using this pattern in other parts of the project where similar functionality is required to maintain consistency and improve maintainability.
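
For illustration, a pared-down version of such a function might look like this (the argument list and LIB_SRC are assumptions; the real function in source/lib/CMakeLists.txt is more involved):

```cmake
# Build the same sources into variants whose names carry a toolkit
# suffix, so CPU and CUDA builds can coexist in a single install tree.
function(create_library _suffix)
  set(_name "deepmd${_suffix}")
  add_library(${_name} SHARED ${LIB_SRC})
  target_include_directories(
    ${_name} PUBLIC $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>)
  install(TARGETS ${_name} DESTINATION lib)
endfunction()

create_library("") # CPU-only variant
if(USE_CUDA_TOOLKIT)
  create_library("_cuda") # variant built against the CUDA toolkit
endif()
```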

source/lib/src/gpu/CMakeLists.txt (1)

82-118: Approve the introduction of create_gpu_lib function for GPU libraries.

The create_gpu_lib function enhances the modularity and reusability for GPU library creation. This is a positive change that aligns with best practices in modern CMake usage.

Encourage the use of this pattern in other parts of the project dealing with similar library creation requirements to maintain consistency and enhance modularity.
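
In the same spirit, a sketch of the create_gpu_lib idea, with the device runtime passed in as a parameter (target and variable names are illustrative, not the actual ones):

```cmake
# One function body, parameterized by the runtime library to link.
function(create_gpu_lib _suffix _runtime)
  add_library(deepmd_op_device${_suffix} SHARED ${GPU_SRC})
  target_link_libraries(deepmd_op_device${_suffix} PRIVATE ${_runtime})
endfunction()

if(USE_CUDA_TOOLKIT)
  find_package(CUDAToolkit REQUIRED)
  create_gpu_lib("_cuda" CUDA::cudart)
elseif(USE_ROCM_TOOLKIT)
  find_package(hip REQUIRED)
  create_gpu_lib("_rocm" hip::host)
endif()
```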

doc/install/install-from-source.md (3)

Line range hint 89-89: Improve clarity in virtual environment exit instructions.

The phrase "if one wants to skip out of the virtual environment" could be simplified to enhance clarity. Consider rephrasing to make the instructions more direct.

- if one wants to skip out of the virtual environment, he/she can do
+ To exit the virtual environment, simply run:

Line range hint 145-145: Clarify installation path details for TensorFlow.

The sentence regarding the TensorFlow path could benefit from a comma for better readability.

- By default the installer only finds TensorFlow under user site-package directory...
+ By default, the installer only finds TensorFlow under user site-package directory...

Line range hint 189-189: Correct grammatical error in CPU environment setup.

There's a minor grammatical mistake in the description of setting up the environment for CPU usage.

- If your work in a CPU environment, please prepare
+ If you work in a CPU environment, please prepare

Resolved review threads on: source/CMakeLists.txt; source/op/pt/CMakeLists.txt (2 threads)
@njzjz njzjz added the Test CUDA label (Trigger test CUDA workflow) Jun 21, 2024
@github-actions github-actions bot removed the Test CUDA label (Trigger test CUDA workflow) Jun 21, 2024
@njzjz njzjz removed the Python label Jun 21, 2024
@njzjz njzjz marked this pull request as ready for review June 21, 2024 00:56
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (3)
doc/install/install-from-source.md (3)

Line range hint 146-146: Clarify installation instructions for TensorFlow.

The sentence "By default the installer only finds TensorFlow under user site-package directory" might be confusing due to the lack of a comma after "By default". Adding a comma can improve the readability.

- By default the installer only finds TensorFlow under user site-package directory
+ By default, the installer only finds TensorFlow under user site-package directory

Line range hint 190-190: Correct grammatical error to enhance clarity.

The phrase "If your work in a CPU environment" should be corrected to "If you work in a CPU environment" to avoid confusion and improve the grammatical structure.

- If your work in a CPU environment, please prepare runtime as below:
+ If you work in a CPU environment, please prepare runtime as below:

Line range hint 295-295: Clarify compatibility note regarding PyTorch.

The phrase "which may be not compatible with other libraries" is somewhat awkward. Rewording it to "which may not be compatible with other libraries" could enhance clarity.

- which may be not compatible with other libraries
+ which may not be compatible with other libraries

@njzjz njzjz requested a review from wanghan-iapcm June 21, 2024 01:29
@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Jun 21, 2024
Merged via the queue into deepmodeling:devel with commit 7a0ec5d Jun 21, 2024
54 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Jun 22, 2024
`TORCH_CXX_FLAGS` on macOS and Windows doesn't have
`_GLIBCXX_USE_CXX11_ABI`. This PR sets `OP_CXX_ABI_PT` to a default
value to fix the error introduced in #3891.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Updated build configuration to set `OP_CXX_ABI_PT` conditionally for
improved compatibility with macOS and Windows environments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
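
A minimal sketch of the fallback that this follow-up commit describes (OP_CXX_ABI_PT follows the commit message; OP_CXX_ABI, TensorFlow's ABI, is an assumption, and the actual change may differ):

```cmake
# On macOS and Windows, TORCH_CXX_FLAGS carries no
# _GLIBCXX_USE_CXX11_ABI, so the regex match comes back empty;
# fall back to the ABI already used for TensorFlow.
string(REGEX MATCH "_GLIBCXX_USE_CXX11_ABI=([01])" _abi_match
             "${TORCH_CXX_FLAGS}")
if(_abi_match)
  set(OP_CXX_ABI_PT "${CMAKE_MATCH_1}")
else()
  set(OP_CXX_ABI_PT "${OP_CXX_ABI}") # default when the flag is absent
endif()
```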
mtaillefumier pushed a commit to mtaillefumier/deepmd-kit that referenced this pull request Sep 18, 2024
- Build PT OP libraries with a compatible CXXABI if PT has a different CXX
ABI from TF;
- Enable PT OP in test_cuda workflow;
- Update documentation.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
mtaillefumier pushed a commit to mtaillefumier/deepmd-kit that referenced this pull request Sep 18, 2024
`TORCH_CXX_FLAGS` on macOS and Windows doesn't have
`_GLIBCXX_USE_CXX11_ABI`. This PR sets `OP_CXX_ABI_PT` to a default
value to fix the error introduced in deepmodeling#3891.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>