Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Doc] Add device code split feature design #631

Merged

Conversation

Fznamznon
Copy link
Contributor

Signed-off-by: Mariya Podchishchaeva mariya.podchishchaeva@intel.com

Copy link
Contributor

@keryell keryell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great documentation!
But could you put an SVG version of your nice picture instead?

@Fznamznon
Copy link
Contributor Author

Great documentation!
But could you put an SVG version of your nice picture instead?

Yes, if you tell me how I can do this :)

BTW, What do you think about this feature and it's design? It's not implemented yet and we are finding the better ways how to implement it.

* Generate special meta-data with translation unit ID for each kernel in SYCL
front-end
* Link all LLVM modules using llvm-link
* Run the module splitter pass on fully linked device code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitter pass is supposed to undo what llvm-link did at previous step.
This seems like an opportunity for optimization.

Did you consider using llvm-link to build the table of dependencies and importing them w/o "linking all LLVM modules using llvm-link"?
I'm not an expert here, but these llvm-link options seem relevant:

  --import=<function:filename>             - Pair of function name and filename, where function should be imported from bitcode in filename
  --import-all-index                       - Import all external functions in index.
  --only-needed                            - Link only needed symbols
  --summary-file=<string>                  - The summary file to use for function importing.
  --summary-index=<filename>               - Module summary index filename

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, linking all device modules then split them back to separate modules sounds strange, but we need it support SYCL_EXTERNAL.

Approach with using llvm-link to build the table of dependencies and importing them w/o doesn't sound flexible for extending kernels grouping scheme. For example we will still need some splitting mechanism if we decide leave one kernel per module.

I took a brief look into these options, but It looks like all of them do not allow building the table of dependencies, only using them, except only-needed which just links only needed symbols to the destination module.
So, with using llvm-link options we still need figure out how to tell llvm-link which symbols we want to import for each module.
Or run llvm-link -only-needed for all combinations of device modules which doesn't seem effective. And we will still need generate info about symbols which each module contains to feed it to clang-offload-wrapper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach with using llvm-link to build the table of dependencies and importing them w/o doesn't sound flexible for extending kernels grouping scheme. For example we will still need some splitting mechanism if we decide leave one kernel per module.

If user cares about separating kernels into stand-alone modules, it can be done by following "one kernel per translation unit" code organization rule. No compiler changes are required for that.

So, with using llvm-link options we still need figure out how to tell llvm-link which symbols we want to import for each module.
Or run llvm-link -only-needed for all combinations of device modules which doesn't seem effective. And we will still need generate info about symbols which each module contains to feed it to clang-offload-wrapper.

We can link all module into one big module upfront and use llvm-link -only-needed to import all the necessary dependencies. Although this doesn't sound optimal neither it doesn't require new passes - just changes to the driver, which seems to be needed for original proposed too.

Copy link
Contributor Author

@Fznamznon Fznamznon Sep 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, with using llvm-link options we still need figure out how to tell llvm-link which symbols we want to import for each module.
Or run llvm-link -only-needed for all combinations of device modules which doesn't seem effective. And we will still need generate info about symbols which each module contains to feed it to clang-offload-wrapper.

We can link all module into one big module upfront and use llvm-link -only-needed to import all the necessary dependencies. Although this doesn't sound optimal neither it doesn't require new passes - just changes to the driver, which seems to be needed for original proposed too.

It's unclear how to import all the necessary dependencies from fully linked module. Could you please clarify?
I'm thinking about reusing llvm-extract tool or re-using GVExtractor pass, these solves problem of extraction of functions with dependencies from some single big module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider using llvm-link to build the table of dependencies and importing them w/o "linking all LLVM modules using llvm-link"?

I think the potential a solution based on llvm-link building export/import tables and their analysis can be considered, but at this point it is too raw, and we don't even know will it work or not.
Doing full llvm-link is, in contrast, known to give desired result, and the design based on it is very simple. So I think we should start with it. Besides, llvm-link is very fast, so it is not quite clear what the potential optimizations will give.

Yes, linking all device modules then split them back to separate modules sounds strange

This is not quite "splitting back". This is rather re-grouping based on callgraph knowledge, which produces very different modules compared to what was the input. Reliable building the callgraph based on just import/export tables w/o linking would be very tricky, I suppose.

Copy link
Contributor

@keryell keryell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great documentation!
But could you put an SVG version of your nice picture instead?

Yes, if you tell me how I can do this :)

I guess that if you are able to write a SYCL compiler by yourself, explaining you how to select the SVG format when you save your picture in your drawing software sounds like insulting you and I do not want to be rude... ;-)

BTW, What do you think about this feature and it's design? It's not implemented yet and we are finding the better ways how to implement it.

You are trying to solve a problem I have not thought about in this way, so I do not know.
Probably your method works for your problem.

My plan in triSYCL was to generate as many kernel IR as there are different targets by calling the device compiler with a specific preprocessor macro so the C++ code can be potentially different for each target to pick a different kernel implementation better fitted for each target. It solves the extension problem, but if you think to it you might need also a different host code to use or not these extensions, so it is not that simple.

This is why I presented an extension related somehow to this problem during the last SYCL F2F with this example
https://github.com/triSYCL/triSYCL/blob/master/tests/scope/queue.cpp
which, besides introducing the concept of scope, replace actually the runtime polymorphic SYCL object by compile-time static objects (which are actually C++ concepts) that allow you to have different host+device code for each device.

But all this is outside of the scope of this PR which is more related with a JIT approach.
I am not very inspired here since I only target embedded devices with AoT compilation flow...

sycl/doc/SYCLCompilerAndRuntimeDesign.md Show resolved Hide resolved
keryell
keryell previously approved these changes Sep 16, 2019
Copy link
Contributor

@keryell keryell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the SVG picture.
It will be compatible with the high-definition displays of the future. :-)

kbobrovs
kbobrovs previously approved these changes Sep 20, 2019
@Fznamznon Fznamznon dismissed stale reviews from kbobrovs and keryell via c2d686e September 25, 2019 09:37
@Fznamznon Fznamznon force-pushed the private/mpodchis/device-code-split-design branch from c06f874 to c2d686e Compare September 25, 2019 09:37
@Fznamznon
Copy link
Contributor Author

Rebased.

* Emitting a separate module for each kernel

The current approach is:
* Generate special meta-data with translation unit ID for each kernel in SYCL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this ID is used later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ID will be used to group kernels per translation unit.
But I think it can be better clarified in this document, I'll made a change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

![Device code splitting](DeviceCodeSplitting.svg)

The "split" box will be implemented as dedicated tool ("sycl-split"?). The tool
will run splitter pass and generate a symbol table for each produced device module.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How kernels are going to be split with this approach? One kernel (+ its dependencies) per module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically we will group kernels per translation unit basis and I'd like to implement possibility to emit one kernel with dependencies per module.

* Emitting a separate module for each kernel

The current approach is:
* Generate special meta-data with translation unit ID for each kernel in SYCL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping.

Device code splitting process:
![Device code splitting](DeviceCodeSplitting.svg)

The "split" box will be implemented as dedicated tool ("sycl-split"?). The tool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sycl-split"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sycl-post-link.

Signed-off-by: Mariya Podchishchaeva <mariya.podchishchaeva@intel.com>
Signed-off-by: Mariya Podchishchaeva <mariya.podchishchaeva@intel.com>
Signed-off-by: Mariya Podchishchaeva <mariya.podchishchaeva@intel.com>
Move picture to images folder
Apply comment

Signed-off-by: Mariya Podchishchaeva <mariya.podchishchaeva@intel.com>
@Fznamznon Fznamznon force-pushed the private/mpodchis/device-code-split-design branch from c2d686e to 41091ba Compare November 30, 2019 10:58
@Fznamznon Fznamznon changed the title [SYCL][Doc] Add device code splitting feature design [SYCL][Doc] Add device code split feature design Nov 30, 2019
@romanovvlad romanovvlad merged commit 3d37c91 into intel:sycl Nov 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants