Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] thinLTO for SYCL #15083

Open
wants to merge 10 commits into
base: sycl
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions clang/tools/clang-linker-wrapper/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
set(LLVM_LINK_COMPONENTS
${LLVM_TARGETS_TO_BUILD}
BitReader
BitWriter
Core
BinaryFormat
Expand Down
536 changes: 423 additions & 113 deletions clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions libdevice/device.h
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,9 @@

#if defined(__SPIR__) || defined(__SPIRV__) || defined(__NVPTX__)
#ifdef __SYCL_DEVICE_ONLY__
#define DEVICE_EXTERNAL SYCL_EXTERNAL __attribute__((weak))
#define DEVICE_EXTERNAL SYCL_EXTERNAL
#else // __SYCL_DEVICE_ONLY__
#define DEVICE_EXTERNAL __attribute__((weak))
#define DEVICE_EXTERNAL
#endif // __SYCL_DEVICE_ONLY__
Comment on lines +20 to 23
Copy link
Contributor Author

@sarnex sarnex Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required to get libdevice functions linked in by the thinLTO function importing infrastructure, see here. I'm looking for a better solution for this, I just kept this here in case anybody plans on trying the prototype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose importing devicelib symbols at compile step can be a solution (see #15114).

On the other hand, I recall discussing the possibility of linking device libraries with upstream maintainers, who expressed a preference for shifting device library linking from the "compile" to the "link" step. It would be ideal if we could discover a solution that aligns with the long-term strategy of upstream and enables us to utilize the thinLTO framework for offload code linking, thereby avoiding the use of weak symbols.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attention to @mdtoguchi who has been looking at importing devicelib at compile step from the SYCL perspective.
Point to note: During one of the LLVM community presentation, it was mentioned that they are trying to move importing devicelib to link time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we already perform device library linking at link time we can consider abandoning the efforts to pull them into the compilation step. My main concern with performing at the link step is the communication required from the driver to the clang-linker-wrapper informing which device libraries should be linked. The less tie-in we have between the driver and the clang-linker-wrapper at link time, the better. IMO, at the very least the linker wrapper should know a minimum default device libraries to link and any communication from the driver is manipulating that list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mdtoguchi. Unless user wants to change the names/location or disable linking of device libraries, driver should not have any logic to handle device code linking other than invoking clang-linker-wrapper. It makes sense to have driver options for additional configuration of device libraries, but driver's implementation should be just passing corresponding values to clang-linker-wrapper where these options should be processed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is very interesting.
While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step.
Could I get some clarification on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is very interesting. While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step. Could I get some clarification on that?

@Naghasan, @npmiller, are you able to help here?


#define DEVICE_EXTERN_C DEVICE_EXTERNAL EXTERN_C
Expand Down
5 changes: 0 additions & 5 deletions libdevice/fallback-cassert.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -114,9 +114,4 @@ DEVICE_EXTERN_C void __devicelib_assert_fail(const char *expr, const char *file,
__assertfail(expr, file, line, func, 1);
}

DEVICE_EXTERN_C void _wassert(const char *_Message, const char *_File,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is a change that can be merged and submitted separately. _wassert is a wrapper for MSVC's assert implementation to redirect it to ours, so it really shouldn't be implemented in fallback library

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't even know if it's correct, I just hit a build error on windows about _wassert defined twice, probably it works today because they're all weak symbols but I remove that as part of this PR.

unsigned _Line) {
__assertfail(_Message, _File, _Line, 0, 1);
}

#endif
1 change: 1 addition & 0 deletions llvm/include/llvm/SYCLLowerIR/ModuleSplitter.h
Original file line number Diff line number Diff line change
Expand Up @@ -321,6 +321,7 @@ splitSYCLModule(std::unique_ptr<Module> M, ModuleSplitterSettings Settings);

bool isESIMDFunction(const Function &F);
bool canBeImportedFunction(const Function &F);
bool isEntryPoint(const Function &F, bool EmitOnlyKernelsAsEntryPoints);

} // namespace module_split

Expand Down
22 changes: 22 additions & 0 deletions llvm/include/llvm/SYCLLowerIR/SYCLLinkedModuleProcessor.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
//===-- SYCLLinkedModuleProcessor.h - finalize a fully linked module ---===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// The file contains a number of functions to create a pass that can be called
// by the LTO backend that will finalize a fully-linked module.
//===----------------------------------------------------------------------===//
#pragma once
#include "SpecConstants.h"
namespace llvm {

class PassRegistry;
class ModulePass;
ModulePass *
createSYCLLinkedModuleProcessorPass(llvm::SpecConstantsPass::HandlingMode);
void initializeSYCLLinkedModuleProcessorPass(PassRegistry &);

} // namespace llvm
4 changes: 2 additions & 2 deletions llvm/lib/LTO/LTO.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1077,8 +1077,8 @@ Error LTO::addThinLTO(BitcodeModule BM, ArrayRef<InputFile::Symbol> Syms,
for (const std::string &Name : Conf.ThinLTOModulesToCompile) {
if (BM.getModuleIdentifier().contains(Name)) {
ThinLTO.ModulesToCompile->insert({BM.getModuleIdentifier(), BM});
llvm::errs() << "[ThinLTO] Selecting " << BM.getModuleIdentifier()
<< " to compile\n";
LLVM_DEBUG(dbgs() << "[ThinLTO] Selecting " << BM.getModuleIdentifier()
sarnex marked this conversation as resolved.
Show resolved Hide resolved
<< " to compile\n");
}
}
}
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/SYCLLowerIR/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ add_llvm_component_library(LLVMSYCLLowerIR
SYCLDeviceRequirements.cpp
SYCLKernelParamOptInfo.cpp
SYCLJointMatrixTransform.cpp
SYCLLinkedModuleProcessor.cpp
SYCLPropagateAspectsUsage.cpp
SYCLPropagateJointMatrixUsage.cpp
SYCLVirtualFunctionsAnalysis.cpp
Expand Down
52 changes: 26 additions & 26 deletions llvm/lib/SYCLLowerIR/ModuleSplitter.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -117,32 +117,6 @@ bool isKernel(const Function &F) {
F.getCallingConv() == CallingConv::AMDGPU_KERNEL;
}

bool isEntryPoint(const Function &F, bool EmitOnlyKernelsAsEntryPoints) {
// Skip declarations, if any: they should not be included into a vector of
// entry points groups or otherwise we will end up with incorrectly generated
// list of symbols.
if (F.isDeclaration())
return false;

// Kernels are always considered to be entry points
if (isKernel(F))
return true;

if (!EmitOnlyKernelsAsEntryPoints) {
// If not disabled, SYCL_EXTERNAL functions with sycl-module-id attribute
// are also considered as entry points (except __spirv_* and __sycl_*
// functions)
return llvm::sycl::utils::isSYCLExternalFunction(&F) &&
!isSpirvSyclBuiltin(F.getName()) && !isESIMDBuiltin(F.getName()) &&
!isGenericBuiltin(F.getName());
}

// Even if we are emitting only kernels as entry points, virtual functions
// should still be treated as entry points, because they are going to be
// outlined into separate device images and linked in later.
return F.hasFnAttribute("indirectly-callable");
}

// Represents "dependency" or "use" graph of global objects (functions and
// global variables) in a module. It is used during device code split to
// understand which global variables and functions (other than entry points)
Expand Down Expand Up @@ -445,6 +419,32 @@ class ModuleSplitter : public ModuleSplitterBase {
namespace llvm {
namespace module_split {

bool isEntryPoint(const Function &F, bool EmitOnlyKernelsAsEntryPoints) {
// Skip declarations, if any: they should not be included into a vector of
// entry points groups or otherwise we will end up with incorrectly generated
// list of symbols.
if (F.isDeclaration())
return false;

// Kernels are always considered to be entry points
if (isKernel(F))
return true;

if (!EmitOnlyKernelsAsEntryPoints) {
// If not disabled, SYCL_EXTERNAL functions with sycl-module-id attribute
// are also considered as entry points (except __spirv_* and __sycl_*
// functions)
return llvm::sycl::utils::isSYCLExternalFunction(&F) &&
!isSpirvSyclBuiltin(F.getName()) && !isESIMDBuiltin(F.getName()) &&
!isGenericBuiltin(F.getName());
}

// Even if we are emitting only kernels as entry points, virtual functions
// should still be treated as entry points, because they are going to be
// outlined into separate device images and linked in later.
return F.hasFnAttribute("indirectly-callable");
}

std::optional<IRSplitMode> convertStringToSplitMode(StringRef S) {
static const StringMap<IRSplitMode> Values = {{"kernel", SPLIT_PER_KERNEL},
{"source", SPLIT_PER_TU},
Expand Down
45 changes: 45 additions & 0 deletions llvm/lib/SYCLLowerIR/SYCLLinkedModuleProcessor.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
//===-- SYCLLinkedModuleProcessor.cpp - finalize a fully linked module ---===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
// See comments in the header.
//===----------------------------------------------------------------------===//

#include "llvm/SYCLLowerIR/SYCLLinkedModuleProcessor.h"

#include "llvm/Pass.h"

#define DEBUG_TYPE "sycl-linked-module-processor"
using namespace llvm;

namespace {
class SYCLLinkedModuleProcessor : public ModulePass {
public:
static char ID;
SYCLLinkedModuleProcessor(SpecConstantsPass::HandlingMode Mode)
: ModulePass(ID), Mode(Mode) {
initializeSYCLLinkedModuleProcessorPass(*PassRegistry::getPassRegistry());
}

bool runOnModule(Module &M) override {
// TODO: determine if we need to run other passes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, that's an equivalent of what's being run by sycl-post-link after device code split is performed. If so, then we have the following other transformations applied at this stage:

  • ESIMD handling, which includes some special module fixup for invoke_simd, as well as potential additional split by ESIMD followed up by optional linking that back
  • Generation of a separate device image with default values of spec constants

If we also taking about what happens after llvm-link but before device code split, then it is also:

  • Something about invoke_simd
  • Sanitizer-related passes
  • Joint matrix passes

Copy link
Contributor Author

@sarnex sarnex Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when we do early splitting in -c we actually run sycl-post-link in full, including all those passes. So in that case, we only need to run passes here that need the fully linked module. If we decide to change the design such that we do only split in -c but no passes, then we would need every pass that sycl-post-link runs. In the current implementation ~2100/2200 E2E tests are passing, so it seems most passes don't need the full module and running it early does the right thing, at least for the test cases we have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation ~2100/2000 E2E tests are passing, so it seems most passes don't need the full module, at least for the test cases we have.

I believe that most of E2E are single-file tests with no SYCL_EXTERNAL dependencies. Even SYCL-CTS won't help you here. I suppose that we need more or less real-life applications here to be sure and gather more data if we need it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding tests similar to sycl/test-e2e/Basic/multisource.cpp.

ModuleAnalysisManager MAM;
SpecConstantsPass SCP(Mode);
auto PA = SCP.run(M, MAM);
return !PA.areAllPreserved();
}

private:
SpecConstantsPass::HandlingMode Mode;
};
} // namespace
char SYCLLinkedModuleProcessor::ID = 0;
INITIALIZE_PASS(SYCLLinkedModuleProcessor, "SYCLLinkedModuleProcessor",
"Finalize a fully linked SYCL module", false, false)
ModulePass *llvm::createSYCLLinkedModuleProcessorPass(
SpecConstantsPass::HandlingMode Mode) {
return new SYCLLinkedModuleProcessor(Mode);
}
3 changes: 3 additions & 0 deletions sycl/doc/design/CompilerAndRuntimeDesign.md
Original file line number Diff line number Diff line change
Expand Up @@ -550,6 +550,9 @@ unit)
- `off` - disables device code split. If `-fno-sycl-rdc` is specified, the behavior is
the same as `per_source`

If ThinLTO is enabled, device code splitting is run during the compilation
stage. See [here](ThinLTO.md) for more information.

##### Symbol table generation

TBD
Expand Down
149 changes: 149 additions & 0 deletions sycl/doc/design/ThinLTO.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
# ThinLTO for SYCL

This document describes the purpose and design of ThinLTO for SYCL.

**NOTE**: This is not the final version. The document is still in progress.

## Background

With traditional SYCL device code linking, all user code is linked together
along with device libraries into a single huge module and then split and
processed by `sycl-post-link`. This requires sequential processing, has a large
memory footprint, and differs from the linking flow for AMD and NVIDIA devices.

## Summary

SYCL ThinLTO will hook into the existing community mechanism to run LTO as part
of device linking inside `clang-linker-wrapper`. We split the device images
early at compilation time, and at link time we use ThinLTO's function importing
feature to bring in the definitions for referenced functions. Only the new
offload model is supported.

## Device code compilation time changes

Most of the changes for ThinLTO occur during device link time, however there is
one major change during compilation (-c) time: we now run device code split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
one major change during compilation (-c) time: we now run device code split
one major change during compilation (`-c`) time: we now run device code split

during compilation instead of linking. The main reason for doing this is
increased parallelization. Many compilation jobs can be run at the same time,
but linking happens once total for the application. Device code split is
currently a common source of performance issues.

Splitting early means that the resulting IR after splitting is not complete, it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that early split leads to incomplete modules. Lack of linking leads to incomplete modules, i.e. a single translation unit with a call to undefined SYCL_EXTERNAL function is incomplete regardless of whether we split it or not.

If it was complete, then current version of the splitting mechanism won't break it, i.e. every resulting module would still be complete (at the cost of code duplication). However, I'm not 100% about the latter, if shared libraries are involved, that behavior definitely changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, you're right, let me reword it

still may contain calls to functions (user code and/or the SYCL device
libraries) defined in other translation units.

We rely on the assumption that all function definitions matching a declaration
will be the same and we can let ThinLTO pull in any one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ one definition rule guarantees this property of the code, doesn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends what the original IR linkage is. if the original IR is linkonce_odr or something similar I think yes, but I don't know if we can guarantee every SYCL function will have that linkage (at least for libdevice it not this way in syclos HEAD)


For example, let's start with user device code that defines a `SYCL_EXTERNAL`
function `foo` in translation unit `tu_foo`. There is also another translation
unit `tu_bar` that references `foo`. During the early device code splitting run
of `tu_foo`, we may find that more than one of the resultant device images
contain a definition for `foo`.
Comment on lines +40 to +42
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we clarify the reason why foo could be duplicated? I'm not sure if that's obvious for a person who is not familiar with device code split details.

Also, we have a flag (originally introduced to support shared libraries) which allows to disable cloning of functions which are shared between device images produced by code splitting. We could probably use that mode for thinLTO to avoid unnecessary duplicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add this, thanks


We assert that any function definition for `foo` that is deemed a match by the
ThinLTO infrastructure during the processing of `tu_bar` is valid.

As a result of running early device code split, the fat object file generated as
part of device compilation may contain multiple device code images.

## Device code link time changes

Before we go into the link time changes for SYCL, let's understand the device
linking flow for AMD/NVIDIA devices:
Comment on lines +52 to +53
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this phrase as SYCL linking flow for AMD/NVIDIA devices, but if that reading is correct, then requirement (2) looks weird, because it is not dependent on a device.

Perhaps this phrase was referring to upstream handling of OpenMP offloading?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's for omp, will reword. thanks


![Community linking flow](images/ThinLTOCommunityFlow.svg)

SYCL has two differentiating requirements:

1) The SPIR-V backend is not production ready and the SPIR-V translator is used.
2) The SYCL runtime requires metadata (module properties and module symbol
table) computed from device images that will be stored along the device images
in the fat executable.

The effect of requirement 1) is that instead of letting ThinLTO call the SPIR-V
backend, we add a callback that runs right before CodeGen would run. In that
callback, we call the SPIR-V translator and store the resultant file path for
use later, and we instruct the ThinLTO framework to not perform CodeGen.
Comment on lines +66 to +67
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we should be able to call the translator as a library and store the final binary (in form of a MemoryBuffer, I assume) instead of file path. The former is supposedly more in line with what happens in the upstream.


An interesting additional fact about requirement 2) is that we actually need to
process fully linked module to accurate compute the module properties. One
example where we need the full module is to [compute the required devicelib
mask](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLLowerIR/SYCLDeviceLibReqMask.cpp).
If we only process the device code that was included in the original fat object
input to `clang-linker-wrapper`, we will miss devicelib calls in referenced
`SYCL_EXTERNAL` functions.

The effect of requirement 2) is that we store the fully linked device image for
metadata computation in the SYCL-specific handing code after the ThinLTO
framework has completed. Another option would be to try to compute the metadata
inside the ThinLTO framework callbacks, but this would require SYCL-specific
arguments to many caller functions in the stack and pollute community code.

Here is the current ThinLTO flow for SYCL:

![SYCL linking flow](images/ThinLTOSYCLFlow.svg)

We add a `PreCodeGenModuleHook` function to the `LTOConfig` object so that we
can process the fully linked module without running the backend.

However, the flow is not ideal for many reasons:

1) We are relying on the external `llvm-spirv` tool instead of the SPIR-V
backend. We could slightly improve this issue by using a library call to the
SPIR-V translator instead of the tool, however the library API requires setting
up an object to represent the arguments while we only have strings, and it's
non-trivial to parse the strings to figure out how to create the argument
Comment on lines +94 to +96
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right that you are talking about extensions here?

I don't think that it is necessarily non-trivial. If what we get from clang driver is what we should pass to llvm-spirv, then we can simply copy an existing function from llvm-spirv which does this parsing:

https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/main/tools/llvm-spirv/llvm-spirv.cpp#L516

Copy link
Contributor Author

@sarnex sarnex Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not just that, if it were the extensions it would be not that bad.

let's consider the flag llvm-spirv flag --spirv-max-version. This is a cl::opt global variable MaxSPIRVVersion and is set in llvm-spirv's main by parsing argv as is standard.

In clang-linker-wrapper, this is just a string from the driver
If we want to set the same flag, we would need to somehow get that same global variable set, either by externing it and setting it ourselves (which means we need to parse the strings from the driver and map them to a global variable as well as all the enum mappings for the options). Doing that for all options is what I meant by non-trivial.

If I missed some obvious easy way of doing it let me know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's consider the flag llvm-spirv flag --spirv-max-version. This is a cl::opt global variable MaxSPIRVVersion and is set in llvm-spirv's main by parsing argv as is standard.

Do we have a public clang driver flag which allows to set maximum SPIR-V version? If not, do we pass different values of that option to the translator based on some other flags?

I guess my high-level point is: do we really have to pass all those flags through the linker-wrapper, or can we just come up with their values within clang-linker wrapper based on the module itself?
That point of view stems from the fact that I don't really know which flags to the translator are currently "dynamic" and my assumption (which could be wrong one) that we don't have that many "dynamic" options which would actually require any handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, --spirv-max-version was just a random option I picked, that one in particular might always have the same value. But even if it does, we need to implement some way to get that global variable inside the llvm-spirv library set.
Even if clang-linker-wrapper knows all the options that need to be passed, we need to come up with how to set all the globals which seems flaky.

The consensuses from other reviewers here is to just use the SPIR-V backend now since it is apparently pretty good, so I'm not planning to use the translator anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to be talking about how to initialize globals within llvm-spirv/main.cpp (or whatever name of the file is), but that shouldn't be necessary. API entry points accept all those options via TranslatorOpts object. The translator library does not use any globals to access options

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you're right, I'm too tried. Still, having to parse the strings from the driver and map it to the getter/setter for the object seems not ideal.

object. Since we plan to use the SPIR-V backend in the long term, this does not
seem to be worth the effort.

2) We manually run passes inside `PreCodeGenModuleHook`. This is because we
don't run CodeGen, so we can't take advantage of the `PreCodeGenPassesHook`
field of `LTOConfig` to run some custom passes, as those passes are only run
when we actually are going to run CodeGen.

3) We have to store the fully linked module. This is needed because we need a
fully linked module to accurately compute metadata, see the above explanation of
SYCL requirement 2). We could get around storing the module by computing the
metadata inside the LTO framework and storing it for late use by the SYCL
bundling code, but doing this would require even more SYCL-only customizations
including even more new function arguments and modifications of the
`OffloadFile` class. There are also compilations because the LTO framework is
Comment on lines +109 to +111
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it require modifications to OffloadFile? I mean, it doesn't seem to have an accessor to StringData map stored by OffloadBinary contained within OffloadFile, but adding such a getter method doesn't seem too SYCL-specific

multithreaded, and not all LLVM data structures are thread safe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me what is there that requires multithreading when each thinLTO thread should be working on its own resulting module alone. Or am I missing something?


The proposed long-term SYCL ThinLTO flow is as follows:

![SYCL SPIR-V backend linking flow](images/ThinLTOSYCLSPIRVBackendFlow.svg)

The biggest difference here is that we are running CodeGen using the SPIR-V
backend.

Also, instead of using a lambda function in the `PreCodeGenModuleHook` callback
to run SYCL finalization passes, we can take advantage of the
`PreCodeGenPassesHook` field to add passes to the pass manager that the LTO
framework will run.
Comment on lines +121 to +124
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got confused by this paragraph, because it contradicts some previous statement. Does it make sense to create sub-sections for short- and long-term designs of thinLTO to better split those two descriptions?


It is possible that the number of device images in the fat executable and which
device image contains which kernel is different with ThinLTO enabled, but we do
expect this to have any impact on correctness or performance, nor we do expect
Comment on lines +127 to +128
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"but we do expect this..." - missing "not"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, thx

users to care.

## Current limitations

`-O0`: Compiling with `-O0` prevent clang from generating ThinLTO metadata
during the compilation phase. In the current implementation, this is an error.
In the final version, we could either silently fall back to full LTO or generate
ThinLTO metadata even for `-O0`.

SYCL libdevice: Current all `libdevice` functions are explicitly marked to be
weak symbols. The ThinLTO framework does not consider a definition of function
with weak linkage as it cannot be sure that this definition is the correct one.
Ideally we could remove the weak symbol annotation.

No binary linkage: The SPIR-V target does not currently have a production
quality binary linker. This means that we must generate a fully linked image as
part of device linkage. At least for AMD devices, this is not a requirement as
Comment on lines +143 to +145
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPIR-V may be incomplete, there is nothing in the standard which disallows that. However, producing incomplete SPIR-V modules in the compiler will require SYCL RT changes so that SYCL RT can pass a set of SPIR-V modules into device JIT compiler which when linked together produce a complete module.

`lld` is used for the final link which can resolve any unresolved symbols.
`-fno-gpu-rdc` is default for AMD, so in that case it can call `lld` during
compile, but if `-fno-gpu-rdc` is passed, the lld call happens as part of
`clang-linker-wrapper` to resolve any symbols not resolved by ThinLTO.
Loading
Loading