Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Fix compilation for large tensor with MKL #19067

Merged
merged 6 commits into from
Nov 18, 2020

Conversation

anko-intel
Copy link
Contributor

@anko-intel anko-intel commented Sep 1, 2020

It fixes the issue #18954
The main changes:

  • define MKL_INT to be consistent with mshadow index_t definition; This way LAPACKE functions declaration in mkl/include/mkl_lapacke.h use the same type as MxNet is used in mshadow::index_t
  • define lapack_index_t to use int64_t as index type only for MKL
  • define IndexT to use it in common code for CPU and GPU implementations. This way, when -DUSE_CUDA=1 -DUSE_INT64_TENSOR_SIZE=ON -DUSE_BLAS=mkl are set, we have 2 different instantiation for GPU and CPU.
  • convert the pivot tensor for det and slodget operators from/to int64_t on GPU path to have common API with CPU
  • eliminate needs to set MKL_USE_ILP64 separately, and force to set it consistently with USE_INT64_TENSOR_SIZE

@mxnet-bot
Copy link

Hey @anko-intel , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, website, windows-gpu, sanity, edge, centos-gpu, clang, miscellaneous, windows-cpu, centos-cpu, unix-gpu]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@anko-intel anko-intel force-pushed the anko_large_tensor_18954_01 branch from 08c47e6 to d2f1bd0 Compare September 15, 2020 14:14
@anko-intel anko-intel force-pushed the anko_large_tensor_18954_01 branch 2 times, most recently from 2d73d73 to 4c95d87 Compare September 15, 2020 18:45
@mxnet-bot
Copy link

Unauthorized access detected.
Only following 3 categories can trigger CI :
PR Author, MXNet Committer, Jenkins Admin.

@Zha0q1
Copy link
Contributor

Zha0q1 commented Sep 16, 2020

@mxnet_bot run ci [centos-gpu]

@anko-intel
Copy link
Contributor Author

@mxnet-bot run ci [centos-gpu, unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu, centos-gpu]

@Zha0q1
Copy link
Contributor

Zha0q1 commented Sep 16, 2020

@mxnet-bot run ci [centos-gpu, unix-gpu]

Test passed. Looks like it was just being flaky

@mxnet-bot
Copy link

Unauthorized access detected.
Only following 3 categories can trigger CI :
PR Author, MXNet Committer, Jenkins Admin.

1 similar comment
@mxnet-bot
Copy link

Unauthorized access detected.
Only following 3 categories can trigger CI :
PR Author, MXNet Committer, Jenkins Admin.

@anko-intel
Copy link
Contributor Author

@mxnet-bot run ci [centos-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [centos-cpu]

@anko-intel anko-intel force-pushed the anko_large_tensor_18954_01 branch from 67c5b08 to b4d19fc Compare September 28, 2020 06:17
@access2rohit
Copy link
Contributor

re-triggerred sanity

@anko-intel anko-intel force-pushed the anko_large_tensor_18954_01 branch from cb5dec2 to 6dd567e Compare October 7, 2020 07:32
@anko-intel anko-intel force-pushed the anko_large_tensor_18954_01 branch from 6dd567e to 9434d01 Compare October 14, 2020 14:37
@lanking520 lanking520 added the pr-work-in-progress PR is still work in progress label Oct 14, 2020
@anko-intel anko-intel changed the title [WIP] Fix compilation for large tensor with MKL Fix compilation for large tensor with MKL Oct 14, 2020
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Oct 14, 2020
@lanking520 lanking520 added the pr-work-in-progress PR is still work in progress label Nov 13, 2020
@anko-intel anko-intel force-pushed the anko_large_tensor_18954_01 branch from a649634 to 0647c7f Compare November 16, 2020 14:14
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Nov 16, 2020
anko-intel and others added 6 commits November 18, 2020 15:32
When Mxnet is compiled with
-DUSE_CUDA=1 -DUSE_INT64_TENSOR_SIZE=ON -DUSE_BLAS=mkl
interface for det and slogdet operators have API with int64_t type for
pivot, but internally CUDA path support only int type. It requires
additional conversion on GPU path to have common API.
@anko-intel anko-intel force-pushed the anko_large_tensor_18954_01 branch from 0647c7f to 24b698d Compare November 18, 2020 14:50
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Nov 18, 2020
@anko-intel
Copy link
Contributor Author

@mxnet-bot run ci [windows-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [windows-cpu]

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Nov 18, 2020
@Zha0q1
Copy link
Contributor

Zha0q1 commented Nov 18, 2020

PR is ready! Would you review @leezu

Copy link
Contributor

@leezu leezu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@leezu leezu merged commit 6fa208b into apache:master Nov 18, 2020
leezu pushed a commit that referenced this pull request Dec 4, 2020
…19174)

This PR forces static link to libopenblas.a on all CI pipelines to avoid the name clashing issue (except for windows where visual studio does not support static link to openblas according to this https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio see bottom). On ubuntu and centos7 we changed from apt get/ yum openblas and lapack to building openblas(blas+lapack) from source.

Also we add support LAPACKE api for non-mkl builds. The background is:

In mxnet we wrap lapack functions in c_lapack_api.h so that we hide the differences in the underlying blas/lapack libraries such as mkl, openblas, atlas, and accelerate.

For mkl we use/wrap the LAPACKE (https://www.netlib.org/lapack/lapacke.html) c interfaces which are the cleanest. For the rest of the libraries we wrap the old CLAPACK interfaces.

Openblas is by default built with lapack support and the binary will contain a full set of LAPACKE functions. Thus this poc adds a build option with which we can use the LAPACKE apis even for openblas.

This change will make ilp64 blas/lapack support easier as now we have the same wrapping logic or both mkl and openblas with USE_LAPACKE_INTERFACE = ON. Support for ilp64 mkl #19067 will automatically mean support for ilp64 openblas.

Known issue: linux distros ship openblas binaries WITHOUT lapack so this option can only work if we build openblas from source. This is not a problem for our mxnet binary distributions as we always use our own openblas build there. However, For users building mxnet from source they must be advised against using this option unless they also build openblas from source
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
MKLDNN pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants