//===-- AMDGPULowerModuleLDSPass.cpp ------------------------------*- C++ -*-=//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This pass eliminates local data store, LDS, uses from non-kernel functions.
// LDS is contiguous memory allocated per kernel execution.
//
// Background.
//
// The programming model is global variables, or equivalently function local
// static variables, accessible from kernels or other functions. For uses from
// kernels this is straightforward - assign an integer to the kernel for the
// memory required by all the variables combined, allocate them within that.
// For uses from functions there are performance tradeoffs to choose between.
//
// This model means the GPU runtime can specify the amount of memory allocated.
// If this is more than the kernel assumed, the excess can be made available
// using a language specific feature, which IR represents as a variable with
// no initializer. This feature is not yet implemented for non-kernel functions.
// This lowering could be extended to handle that use case, but would probably
// require closer integration with promoteAllocaToLDS.
//
// Consequences of this GPU feature:
// - memory is limited and exceeding it halts compilation
// - a global accessed by one kernel exists independent of other kernels
// - a global exists independent of simultaneous execution of the same kernel
// - the address of the global may be different from different kernels as they
//   do not alias, which permits only allocating variables they use
// - if the address is allowed to differ, functions need help to find it
//
// Uses from kernels are implemented here by grouping them in a per-kernel
// struct instance. This duplicates the variables, accurately modelling their
// aliasing properties relative to a single global representation. It also
// permits control over alignment via padding.
//
// Uses from functions are more complicated and the primary purpose of this
// IR pass. Several different lowerings are chosen between to meet requirements
// to avoid allocating any LDS where it is not necessary, as that impacts
// occupancy and may fail the compilation, while not imposing overhead on a
// feature whose primary advantage over global memory is performance. The basic
// design goal is to avoid one kernel imposing overhead on another.
//
// Implementation.
//
// LDS variables with constant annotation or non-undef initializer are passed
// through unchanged for simplification or error diagnostics in later passes.
// Non-undef initializers are not yet implemented for LDS.
//
// LDS variables that are always allocated at the same address can be found
// by lookup at that address. Otherwise runtime information/cost is required.
//
// The simplest strategy possible is to group all LDS variables in a single
// struct and allocate that struct in every kernel such that the original
// variables are always at the same address. LDS is however a limited resource
// so this strategy is unusable in practice. It is not implemented here.
//
// Strategy | Precise allocation | Zero runtime cost | General purpose |
// ---------+--------------------+-------------------+-----------------+
//   Module |                 No |               Yes |             Yes |
//    Table |                Yes |                No |             Yes |
//   Kernel |                Yes |               Yes |              No |
//   Hybrid |                Yes |           Partial |             Yes |
//
// Module spends LDS memory to save cycles. Table spends cycles and global
// memory to save LDS. Kernel is as fast as kernel allocation but only works
// for variables that are known reachable from a single kernel. Hybrid picks
// between all three. When forced to choose between LDS and cycles it minimises
// LDS use.
//
// The "module" lowering implemented here finds LDS variables which are used by
// non-kernel functions and creates a new struct with a field for each of those
// LDS variables. Variables that are only used from kernels are excluded.
// Kernels that do not use this struct are annotated with the attribute
// amdgpu-elide-module-lds which allows the back end to elide the allocation.
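//
// For illustration only (the variable set and struct layout below are
// hypothetical, not emitted verbatim), module lowering of two
// function-reachable variables produces IR along the lines of:
//
//   %llvm.amdgcn.module.lds.t = type { float, i32 }
//   @llvm.amdgcn.module.lds = internal addrspace(3) global
//       %llvm.amdgcn.module.lds.t undef, align 4
//
// and non-kernel uses of each original variable become constant GEPs into
// that instance, e.g.
//   getelementptr inbounds (%llvm.amdgcn.module.lds.t,
//       ptr addrspace(3) @llvm.amdgcn.module.lds, i32 0, i32 1)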
//
// The "table" lowering implemented here has three components.
// First, kernels are assigned a unique integer identifier which is available
// in functions it calls through the intrinsic amdgcn_lds_kernel_id. The
// integer is passed through a specific SGPR, thus works with indirect calls.
// Second, each kernel allocates LDS variables independent of other kernels and
// writes the addresses it chose for each variable into an array in consistent
// order. If the kernel does not allocate a given variable, it writes undef to
// the corresponding array location. These arrays are written to a constant
// table in the order matching the kernel unique integer identifier.
// Third, uses from non-kernel functions are replaced with a table lookup using
// the intrinsic function to find the address of the variable.
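//
// As a sketch (array sizes, indices and value names here are illustrative,
// not emitted verbatim), a non-kernel use of an LDS variable @v becomes
// roughly:
//
//   %id  = call i32 @llvm.amdgcn.lds.kernel.id()
//   %gep = getelementptr inbounds [N x [M x i32]], ptr addrspace(4)
//            @llvm.amdgcn.lds.offset.table, i32 0, i32 %id, i32 <index of v>
//   %off = load i32, ptr addrspace(4) %gep
//   %ptr = inttoptr i32 %off to ptr addrspace(3)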
//
// "Kernel" lowering is only applicable for variables that are unambiguously
// reachable from exactly one kernel. For those cases, accesses to the variable
// can be lowered to ConstantExpr address of a struct instance specific to that
// one kernel. This is zero cost in space and in compute. It will raise a fatal
// error on any variable that might be reachable from multiple kernels and is
// thus most easily used as part of the hybrid lowering strategy.
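//
// As an illustrative sketch (kernel and variable names are hypothetical), a
// variable @v reachable only from kernel @k is folded to a constant such as:
//   getelementptr inbounds (%llvm.amdgcn.kernel.k.lds.t,
//       ptr addrspace(3) @llvm.amdgcn.kernel.k.lds, i32 0, i32 <field of v>)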
//
// Hybrid lowering is a mixture of the above. It uses the zero cost kernel
// lowering where it can. It lowers the variable accessed by the greatest
// number of kernels using the module strategy as that is free for the first
// variable. Any further variables that can be lowered with the module strategy
// without incurring LDS memory overhead are. The remaining ones are lowered
// via table.
//
// Consequences
// - No heuristics or user controlled magic numbers, hybrid is the right choice
// - Kernels that don't use functions (or have had them all inlined) are not
//   affected by any lowering for kernels that do.
// - Kernels that don't make indirect function calls are not affected by those
//   that do.
// - Variables which are used by lots of kernels, e.g. those injected by a
//   language runtime in most kernels, are expected to have no overhead
// - Implementations that instantiate templates per-kernel where those templates
//   use LDS are expected to hit the "Kernel" lowering strategy
// - The runtime properties impose a cost in compiler implementation complexity
//
//===----------------------------------------------------------------------===//
119
120#include "AMDGPU.h"
121#include "Utils/AMDGPUBaseInfo.h"
123#include "llvm/ADT/BitVector.h"
124#include "llvm/ADT/DenseMap.h"
125#include "llvm/ADT/DenseSet.h"
126#include "llvm/ADT/STLExtras.h"
128#include "llvm/ADT/SetVector.h"
130#include "llvm/IR/Constants.h"
131#include "llvm/IR/DerivedTypes.h"
132#include "llvm/IR/IRBuilder.h"
133#include "llvm/IR/InlineAsm.h"
134#include "llvm/IR/Instructions.h"
135#include "llvm/IR/IntrinsicsAMDGPU.h"
136#include "llvm/IR/MDBuilder.h"
139#include "llvm/Pass.h"
141#include "llvm/Support/Debug.h"
145
146#include <tuple>
147#include <vector>
148
149#include <cstdio>
150
151#define DEBUG_TYPE "amdgpu-lower-module-lds"
152
153using namespace llvm;
154
155namespace {
156
157cl::opt<bool> SuperAlignLDSGlobals(
158 "amdgpu-super-align-lds-globals",
159 cl::desc("Increase alignment of LDS if it is not on align boundary"),
160 cl::init(true), cl::Hidden);
161
162enum class LoweringKind { module, table, kernel, hybrid };
163cl::opt<LoweringKind> LoweringKindLoc(
164 "amdgpu-lower-module-lds-strategy",
165 cl::desc("Specify lowering strategy for function LDS access:"), cl::Hidden,
166 cl::init(LoweringKind::hybrid),
168 clEnumValN(LoweringKind::table, "table", "Lower via table lookup"),
169 clEnumValN(LoweringKind::module, "module", "Lower via module struct"),
171 LoweringKind::kernel, "kernel",
172 "Lower variables reachable from one kernel, otherwise abort"),
173 clEnumValN(LoweringKind::hybrid, "hybrid",
174 "Lower via mixture of above strategies")));
175
176bool isKernelLDS(const Function *F) {
177 // Some weirdness here. AMDGPU::isKernelCC does not call into
178 // AMDGPU::isKernel with the calling conv, it instead calls into
179 // isModuleEntryFunction which returns true for more calling conventions
180 // than AMDGPU::isKernel does. There's a FIXME on AMDGPU::isKernel.
181 // There's also a test that checks that the LDS lowering does not hit on
182 // a graphics shader, denoted amdgpu_ps, so stay with the limited case.
183 // Putting LDS in the name of the function to draw attention to this.
184 return AMDGPU::isKernel(F->getCallingConv());
185}
186
187class AMDGPULowerModuleLDS : public ModulePass {
188
189 static void
190 removeLocalVarsFromUsedLists(Module &M,
191 const DenseSet<GlobalVariable *> &LocalVars) {
192 // The verifier rejects used lists containing an inttoptr of a constant
193 // so remove the variables from these lists before replaceAllUsesWith
194 SmallPtrSet<Constant *, 8> LocalVarsSet;
195 for (GlobalVariable *LocalVar : LocalVars)
196 LocalVarsSet.insert(cast<Constant>(LocalVar->stripPointerCasts()));
197
    removeFromUsedLists(
        M, [&LocalVarsSet](Constant *C) { return LocalVarsSet.count(C); });
200
201 for (GlobalVariable *LocalVar : LocalVars)
202 LocalVar->removeDeadConstantUsers();
203 }
204
205 static void markUsedByKernel(IRBuilder<> &Builder, Function *Func,
206 GlobalVariable *SGV) {
207 // The llvm.amdgcn.module.lds instance is implicitly used by all kernels
208 // that might call a function which accesses a field within it. This is
209 // presently approximated to 'all kernels' if there are any such functions
210 // in the module. This implicit use is redefined as an explicit use here so
211 // that later passes, specifically PromoteAlloca, account for the required
212 // memory without any knowledge of this transform.
213
214 // An operand bundle on llvm.donothing works because the call instruction
215 // survives until after the last pass that needs to account for LDS. It is
216 // better than inline asm as the latter survives until the end of codegen. A
217 // totally robust solution would be a function with the same semantics as
218 // llvm.donothing that takes a pointer to the instance and is lowered to a
219 // no-op after LDS is allocated, but that is not presently necessary.
220
221 // This intrinsic is eliminated shortly before instruction selection. It
222 // does not suffice to indicate to ISel that a given global which is not
223 // immediately used by the kernel must still be allocated by it. An
224 // equivalent target specific intrinsic which lasts until immediately after
225 // codegen would suffice for that, but one would still need to ensure that
    // the variables are allocated in the anticipated order.
227
228 LLVMContext &Ctx = Func->getContext();
229
    Builder.SetInsertPoint(Func->getEntryBlock().getFirstNonPHI());

    FunctionType *FTy = FunctionType::get(Type::getVoidTy(Ctx), {});

    Function *Decl =
        Intrinsic::getDeclaration(Func->getParent(), Intrinsic::donothing, {});
236
237 Value *UseInstance[1] = {Builder.CreateInBoundsGEP(
238 SGV->getValueType(), SGV, ConstantInt::get(Type::getInt32Ty(Ctx), 0))};
239
240 Builder.CreateCall(FTy, Decl, {},
241 {OperandBundleDefT<Value *>("ExplicitUse", UseInstance)},
242 "");
243 }
244
245 static bool eliminateConstantExprUsesOfLDSFromAllInstructions(Module &M) {
246 // Constants are uniqued within LLVM. A ConstantExpr referring to a LDS
247 // global may have uses from multiple different functions as a result.
248 // This pass specialises LDS variables with respect to the kernel that
249 // allocates them.
250
251 // This is semantically equivalent to (the unimplemented as slow):
252 // for (auto &F : M.functions())
253 // for (auto &BB : F)
254 // for (auto &I : BB)
255 // for (Use &Op : I.operands())
256 // if (constantExprUsesLDS(Op))
257 // replaceConstantExprInFunction(I, Op);
258
259 SmallVector<Constant *> LDSGlobals;
260 for (auto &GV : M.globals())
261 if (AMDGPU::isLDSVariableToLower(GV))
262 LDSGlobals.push_back(&GV);
263
264 return convertUsersOfConstantsToInstructions(LDSGlobals);
265 }
266
267public:
268 static char ID;
269
  AMDGPULowerModuleLDS() : ModulePass(ID) {
    initializeAMDGPULowerModuleLDSPass(*PassRegistry::getPassRegistry());
  }
273
274 using FunctionVariableMap = DenseMap<Function *, DenseSet<GlobalVariable *>>;
275
276 using VariableFunctionMap = DenseMap<GlobalVariable *, DenseSet<Function *>>;
277
278 static void getUsesOfLDSByFunction(CallGraph const &CG, Module &M,
279 FunctionVariableMap &kernels,
280 FunctionVariableMap &functions) {
281
282 // Get uses from the current function, excluding uses by called functions
283 // Two output variables to avoid walking the globals list twice
284 for (auto &GV : M.globals()) {
285 if (!AMDGPU::isLDSVariableToLower(GV)) {
286 continue;
287 }
288
289 SmallVector<User *, 16> Stack(GV.users());
290 for (User *V : GV.users()) {
291 if (auto *I = dyn_cast<Instruction>(V)) {
292 Function *F = I->getFunction();
293 if (isKernelLDS(F)) {
294 kernels[F].insert(&GV);
295 } else {
296 functions[F].insert(&GV);
297 }
298 }
299 }
300 }
301 }
302
303 struct LDSUsesInfoTy {
304 FunctionVariableMap direct_access;
305 FunctionVariableMap indirect_access;
306 };
307
308 static LDSUsesInfoTy getTransitiveUsesOfLDS(CallGraph const &CG, Module &M) {
309
310 FunctionVariableMap direct_map_kernel;
311 FunctionVariableMap direct_map_function;
312 getUsesOfLDSByFunction(CG, M, direct_map_kernel, direct_map_function);
313
314 // Collect variables that are used by functions whose address has escaped
315 DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
316 for (Function &F : M.functions()) {
317 if (!isKernelLDS(&F))
318 if (F.hasAddressTaken(nullptr,
319 /* IgnoreCallbackUses */ false,
320 /* IgnoreAssumeLikeCalls */ false,
321 /* IgnoreLLVMUsed */ true,
322 /* IgnoreArcAttachedCall */ false)) {
323 set_union(VariablesReachableThroughFunctionPointer,
324 direct_map_function[&F]);
325 }
326 }
327
328 auto functionMakesUnknownCall = [&](const Function *F) -> bool {
329 assert(!F->isDeclaration());
330 for (CallGraphNode::CallRecord R : *CG[F]) {
331 if (!R.second->getFunction()) {
332 return true;
333 }
334 }
335 return false;
336 };
337
338 // Work out which variables are reachable through function calls
339 FunctionVariableMap transitive_map_function = direct_map_function;
340
341 // If the function makes any unknown call, assume the worst case that it can
342 // access all variables accessed by functions whose address escaped
343 for (Function &F : M.functions()) {
344 if (!F.isDeclaration() && functionMakesUnknownCall(&F)) {
345 if (!isKernelLDS(&F)) {
346 set_union(transitive_map_function[&F],
347 VariablesReachableThroughFunctionPointer);
348 }
349 }
350 }
351
352 // Direct implementation of collecting all variables reachable from each
353 // function
354 for (Function &Func : M.functions()) {
355 if (Func.isDeclaration() || isKernelLDS(&Func))
356 continue;
357
      DenseSet<Function *> seen; // catches cycles
      SmallVector<Function *, 4> wip = {&Func};

      while (!wip.empty()) {
362 Function *F = wip.pop_back_val();
363
364 // Can accelerate this by referring to transitive map for functions that
365 // have already been computed, with more care than this
366 set_union(transitive_map_function[&Func], direct_map_function[F]);
367
368 for (CallGraphNode::CallRecord R : *CG[F]) {
369 Function *ith = R.second->getFunction();
370 if (ith) {
371 if (!seen.contains(ith)) {
372 seen.insert(ith);
373 wip.push_back(ith);
374 }
375 }
376 }
377 }
378 }
379
380 // direct_map_kernel lists which variables are used by the kernel
381 // find the variables which are used through a function call
382 FunctionVariableMap indirect_map_kernel;
383
384 for (Function &Func : M.functions()) {
385 if (Func.isDeclaration() || !isKernelLDS(&Func))
386 continue;
387
388 for (CallGraphNode::CallRecord R : *CG[&Func]) {
389 Function *ith = R.second->getFunction();
390 if (ith) {
391 set_union(indirect_map_kernel[&Func], transitive_map_function[ith]);
392 } else {
393 set_union(indirect_map_kernel[&Func],
394 VariablesReachableThroughFunctionPointer);
395 }
396 }
397 }
398
399 return {std::move(direct_map_kernel), std::move(indirect_map_kernel)};
400 }
401
402 struct LDSVariableReplacement {
403 GlobalVariable *SGV = nullptr;
404 DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP;
405 };
406
407 // remap from lds global to a constantexpr gep to where it has been moved to
408 // for each kernel
409 // an array with an element for each kernel containing where the corresponding
410 // variable was remapped to
411
  static Constant *getAddressesOfVariablesInKernel(
      LLVMContext &Ctx, ArrayRef<GlobalVariable *> Variables,
      DenseMap<GlobalVariable *, Constant *> &LDSVarsToConstantGEP) {
415 // Create a ConstantArray containing the address of each Variable within the
416 // kernel corresponding to LDSVarsToConstantGEP, or poison if that kernel
417 // does not allocate it
418 // TODO: Drop the ptrtoint conversion
419
420 Type *I32 = Type::getInt32Ty(Ctx);
421
422 ArrayType *KernelOffsetsType = ArrayType::get(I32, Variables.size());
423
    SmallVector<Constant *> Elements;
    for (size_t i = 0; i < Variables.size(); i++) {
426 GlobalVariable *GV = Variables[i];
427 if (LDSVarsToConstantGEP.count(GV) != 0) {
428 auto elt = ConstantExpr::getPtrToInt(LDSVarsToConstantGEP[GV], I32);
429 Elements.push_back(elt);
430 } else {
431 Elements.push_back(PoisonValue::get(I32));
432 }
433 }
434 return ConstantArray::get(KernelOffsetsType, Elements);
435 }
436
  static GlobalVariable *buildLookupTable(
      Module &M, ArrayRef<GlobalVariable *> Variables,
      ArrayRef<Function *> kernels,
      DenseMap<Function *, LDSVariableReplacement> &KernelToReplacement) {
    if (Variables.empty()) {
442 return nullptr;
443 }
444 LLVMContext &Ctx = M.getContext();
445
446 const size_t NumberVariables = Variables.size();
447 const size_t NumberKernels = kernels.size();
448
449 ArrayType *KernelOffsetsType =
450 ArrayType::get(Type::getInt32Ty(Ctx), NumberVariables);
451
452 ArrayType *AllKernelsOffsetsType =
453 ArrayType::get(KernelOffsetsType, NumberKernels);
454
455 std::vector<Constant *> overallConstantExprElts(NumberKernels);
456 for (size_t i = 0; i < NumberKernels; i++) {
457 LDSVariableReplacement Replacement = KernelToReplacement[kernels[i]];
458 overallConstantExprElts[i] = getAddressesOfVariablesInKernel(
459 Ctx, Variables, Replacement.LDSVarsToConstantGEP);
460 }
461
462 Constant *init =
463 ConstantArray::get(AllKernelsOffsetsType, overallConstantExprElts);
464
465 return new GlobalVariable(
466 M, AllKernelsOffsetsType, true, GlobalValue::InternalLinkage, init,
467 "llvm.amdgcn.lds.offset.table", nullptr, GlobalValue::NotThreadLocal,
469 }
470
471 void replaceUsesInInstructionsWithTableLookup(
472 Module &M, ArrayRef<GlobalVariable *> ModuleScopeVariables,
473 GlobalVariable *LookupTable) {
474
475 LLVMContext &Ctx = M.getContext();
476 IRBuilder<> Builder(Ctx);
477 Type *I32 = Type::getInt32Ty(Ctx);
478
479 for (size_t Index = 0; Index < ModuleScopeVariables.size(); Index++) {
480 auto *GV = ModuleScopeVariables[Index];
481
482 for (Use &U : make_early_inc_range(GV->uses())) {
483 auto *I = dyn_cast<Instruction>(U.getUser());
484 if (!I)
485 continue;
486
487 Value *tableKernelIndex =
488 getTableLookupKernelIndex(M, I->getFunction());
489
490 // So if the phi uses this value multiple times, what does this look
491 // like?
492 if (auto *Phi = dyn_cast<PHINode>(I)) {
493 BasicBlock *BB = Phi->getIncomingBlock(U);
494 Builder.SetInsertPoint(&(*(BB->getFirstInsertionPt())));
495 } else {
496 Builder.SetInsertPoint(I);
497 }
498
        Value *GEPIdx[3] = {
            ConstantInt::get(I32, 0),
            tableKernelIndex,
            ConstantInt::get(I32, Index),
        };
504
505 Value *Address = Builder.CreateInBoundsGEP(
506 LookupTable->getValueType(), LookupTable, GEPIdx, GV->getName());
507
508 Value *loaded = Builder.CreateLoad(I32, Address);
509
510 Value *replacement =
511 Builder.CreateIntToPtr(loaded, GV->getType(), GV->getName());
512
513 U.set(replacement);
514 }
515 }
516 }
517
518 static DenseSet<Function *> kernelsThatIndirectlyAccessAnyOfPassedVariables(
519 Module &M, LDSUsesInfoTy &LDSUsesInfo,
520 DenseSet<GlobalVariable *> const &VariableSet) {
521
522 DenseSet<Function *> KernelSet;
523
524 if (VariableSet.empty())
525 return KernelSet;
526
527 for (Function &Func : M.functions()) {
528 if (Func.isDeclaration() || !isKernelLDS(&Func))
529 continue;
530 for (GlobalVariable *GV : LDSUsesInfo.indirect_access[&Func]) {
531 if (VariableSet.contains(GV)) {
532 KernelSet.insert(&Func);
533 break;
534 }
535 }
536 }
537
538 return KernelSet;
539 }
540
541 static GlobalVariable *
542 chooseBestVariableForModuleStrategy(const DataLayout &DL,
543 VariableFunctionMap &LDSVars) {
544 // Find the global variable with the most indirect uses from kernels
545
546 struct CandidateTy {
547 GlobalVariable *GV = nullptr;
548 size_t UserCount = 0;
549 size_t Size = 0;
550
551 CandidateTy() = default;
552
553 CandidateTy(GlobalVariable *GV, uint64_t UserCount, uint64_t AllocSize)
554 : GV(GV), UserCount(UserCount), Size(AllocSize) {}
555
556 bool operator<(const CandidateTy &Other) const {
557 // Fewer users makes module scope variable less attractive
558 if (UserCount < Other.UserCount) {
559 return true;
560 }
561 if (UserCount > Other.UserCount) {
562 return false;
563 }
564
565 // Bigger makes module scope variable less attractive
566 if (Size < Other.Size) {
567 return false;
568 }
569
570 if (Size > Other.Size) {
571 return true;
572 }
573
574 // Arbitrary but consistent
575 return GV->getName() < Other.GV->getName();
576 }
577 };
578
579 CandidateTy MostUsed;
580
581 for (auto &K : LDSVars) {
582 GlobalVariable *GV = K.first;
583 if (K.second.size() <= 1) {
584 // A variable reachable by only one kernel is best lowered with kernel
585 // strategy
586 continue;
587 }
588 CandidateTy Candidate(
589 GV, K.second.size(),
590 DL.getTypeAllocSize(GV->getValueType()).getFixedValue());
591 if (MostUsed < Candidate)
592 MostUsed = Candidate;
593 }
594
595 return MostUsed.GV;
596 }
597
598 static void recordLDSAbsoluteAddress(Module *M, GlobalVariable *GV,
599 uint32_t Address) {
600 // Write the specified address into metadata where it can be retrieved by
601 // the assembler. Format is a half open range, [Address Address+1)
602 LLVMContext &Ctx = M->getContext();
603 auto *IntTy =
604 M->getDataLayout().getIntPtrType(Ctx, AMDGPUAS::LOCAL_ADDRESS);
605 auto *MinC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address));
606 auto *MaxC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address + 1));
607 GV->setMetadata(LLVMContext::MD_absolute_symbol,
608 MDNode::get(Ctx, {MinC, MaxC}));
609 }
610
611 DenseMap<Function *, Value *> tableKernelIndexCache;
612 Value *getTableLookupKernelIndex(Module &M, Function *F) {
613 // Accesses from a function use the amdgcn_lds_kernel_id intrinsic which
614 // lowers to a read from a live in register. Emit it once in the entry
615 // block to spare deduplicating it later.
616 if (tableKernelIndexCache.count(F) == 0) {
617 LLVMContext &Ctx = M.getContext();
      IRBuilder<> Builder(Ctx);
      FunctionType *FTy = FunctionType::get(Type::getInt32Ty(Ctx), {});
      Function *Decl =
          Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_lds_kernel_id, {});

      BasicBlock::iterator it =
          F->getEntryBlock().getFirstNonPHIOrDbgOrAlloca();
      Instruction &i = *it;
      Builder.SetInsertPoint(&i);
627
628 tableKernelIndexCache[F] = Builder.CreateCall(FTy, Decl, {});
629 }
630
631 return tableKernelIndexCache[F];
632 }
633
634 static std::vector<Function *> assignLDSKernelIDToEachKernel(
635 Module *M, DenseSet<Function *> const &KernelsThatAllocateTableLDS) {
    // Associate kernels in the set with an arbitrary but reproducible order and
637 // annotate them with that order in metadata. This metadata is recognised by
638 // the backend and lowered to a SGPR which can be read from using
639 // amdgcn_lds_kernel_id.
640
641 std::vector<Function *> OrderedKernels;
642
643 for (Function &Func : M->functions()) {
644 if (Func.isDeclaration())
645 continue;
646 if (!isKernelLDS(&Func))
647 continue;
648
649 if (KernelsThatAllocateTableLDS.contains(&Func)) {
650 assert(Func.hasName()); // else fatal error earlier
651 OrderedKernels.push_back(&Func);
652 }
653 }
654
655 // Put them in an arbitrary but reproducible order
656 llvm::sort(OrderedKernels.begin(), OrderedKernels.end(),
657 [](const Function *lhs, const Function *rhs) -> bool {
658 return lhs->getName() < rhs->getName();
659 });
660
661 // Annotate the kernels with their order in this vector
662 LLVMContext &Ctx = M->getContext();
663 IRBuilder<> Builder(Ctx);
664
665 if (OrderedKernels.size() > UINT32_MAX) {
666 // 32 bit keeps it in one SGPR. > 2**32 kernels won't fit on the GPU
667 report_fatal_error("Unimplemented LDS lowering for > 2**32 kernels");
668 }
669
670 for (size_t i = 0; i < OrderedKernels.size(); i++) {
671 Metadata *AttrMDArgs[1] = {
672 ConstantAsMetadata::get(Builder.getInt32(i)),
673 };
674 OrderedKernels[i]->setMetadata("llvm.amdgcn.lds.kernel.id",
675 MDNode::get(Ctx, AttrMDArgs));
676 }
677
678 return OrderedKernels;
679 }
680
681 static void partitionVariablesIntoIndirectStrategies(
682 Module &M, LDSUsesInfoTy const &LDSUsesInfo,
683 VariableFunctionMap &LDSToKernelsThatNeedToAccessItIndirectly,
684 DenseSet<GlobalVariable *> &ModuleScopeVariables,
685 DenseSet<GlobalVariable *> &TableLookupVariables,
686 DenseSet<GlobalVariable *> &KernelAccessVariables) {
687
688 GlobalVariable *HybridModuleRoot =
689 LoweringKindLoc != LoweringKind::hybrid
690 ? nullptr
691 : chooseBestVariableForModuleStrategy(
692 M.getDataLayout(), LDSToKernelsThatNeedToAccessItIndirectly);
693
694 DenseSet<Function *> const EmptySet;
695 DenseSet<Function *> const &HybridModuleRootKernels =
696 HybridModuleRoot
697 ? LDSToKernelsThatNeedToAccessItIndirectly[HybridModuleRoot]
698 : EmptySet;
699
700 for (auto &K : LDSToKernelsThatNeedToAccessItIndirectly) {
701 // Each iteration of this loop assigns exactly one global variable to
702 // exactly one of the implementation strategies.
703
704 GlobalVariable *GV = K.first;
705 assert(AMDGPU::isLDSVariableToLower(*GV));
706 assert(K.second.size() != 0);
707
708 switch (LoweringKindLoc) {
709 case LoweringKind::module:
710 ModuleScopeVariables.insert(GV);
711 break;
712
713 case LoweringKind::table:
714 TableLookupVariables.insert(GV);
715 break;
716
717 case LoweringKind::kernel:
718 if (K.second.size() == 1) {
719 KernelAccessVariables.insert(GV);
        } else {
          report_fatal_error(
              "cannot lower LDS '" + GV->getName() +
              "' to kernel access as it is reachable from multiple kernels");
724 }
725 break;
726
727 case LoweringKind::hybrid: {
728 if (GV == HybridModuleRoot) {
729 assert(K.second.size() != 1);
730 ModuleScopeVariables.insert(GV);
731 } else if (K.second.size() == 1) {
732 KernelAccessVariables.insert(GV);
733 } else if (set_is_subset(K.second, HybridModuleRootKernels)) {
734 ModuleScopeVariables.insert(GV);
735 } else {
736 TableLookupVariables.insert(GV);
737 }
738 break;
739 }
740 }
741 }
742
743 // All LDS variables accessed indirectly have now been partitioned into
744 // the distinct lowering strategies.
745 assert(ModuleScopeVariables.size() + TableLookupVariables.size() +
746 KernelAccessVariables.size() ==
747 LDSToKernelsThatNeedToAccessItIndirectly.size());
748 }
749
750 static GlobalVariable *lowerModuleScopeStructVariables(
751 Module &M, DenseSet<GlobalVariable *> const &ModuleScopeVariables,
752 DenseSet<Function *> const &KernelsThatAllocateModuleLDS) {
753 // Create a struct to hold the ModuleScopeVariables
754 // Replace all uses of those variables from non-kernel functions with the
755 // new struct instance Replace only the uses from kernel functions that will
756 // allocate this instance. That is a space optimisation - kernels that use a
757 // subset of the module scope struct and do not need to allocate it for
758 // indirect calls will only allocate the subset they use (they do so as part
759 // of the per-kernel lowering).
760 if (ModuleScopeVariables.empty()) {
761 return nullptr;
762 }
763
764 LLVMContext &Ctx = M.getContext();
765
766 LDSVariableReplacement ModuleScopeReplacement =
767 createLDSVariableReplacement(M, "llvm.amdgcn.module.lds",
768 ModuleScopeVariables);
769
    appendToCompilerUsed(M, {static_cast<GlobalValue *>(
                            ConstantExpr::getPointerBitCastOrAddrSpaceCast(
                                cast<Constant>(ModuleScopeReplacement.SGV),
                                Type::getInt8PtrTy(Ctx)))});
774
775 // module.lds will be allocated at zero in any kernel that allocates it
776 recordLDSAbsoluteAddress(&M, ModuleScopeReplacement.SGV, 0);
777
778 // historic
779 removeLocalVarsFromUsedLists(M, ModuleScopeVariables);
780
781 // Replace all uses of module scope variable from non-kernel functions
782 replaceLDSVariablesWithStruct(
783 M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) {
784 Instruction *I = dyn_cast<Instruction>(U.getUser());
785 if (!I) {
786 return false;
787 }
788 Function *F = I->getFunction();
789 return !isKernelLDS(F);
790 });
791
792 // Replace uses of module scope variable from kernel functions that
793 // allocate the module scope variable, otherwise leave them unchanged
794 // Record on each kernel whether the module scope global is used by it
795
796 IRBuilder<> Builder(Ctx);
797
798 for (Function &Func : M.functions()) {
799 if (Func.isDeclaration() || !isKernelLDS(&Func))
800 continue;
801
802 if (KernelsThatAllocateModuleLDS.contains(&Func)) {
803 replaceLDSVariablesWithStruct(
804 M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) {
805 Instruction *I = dyn_cast<Instruction>(U.getUser());
806 if (!I) {
807 return false;
808 }
809 Function *F = I->getFunction();
810 return F == &Func;
811 });
812
813 markUsedByKernel(Builder, &Func, ModuleScopeReplacement.SGV);
814
815 } else {
816 Func.addFnAttr("amdgpu-elide-module-lds");
817 }
818 }
819
820 return ModuleScopeReplacement.SGV;
  }

  static DenseMap<Function *, LDSVariableReplacement>
  lowerKernelScopeStructVariables(
825 Module &M, LDSUsesInfoTy &LDSUsesInfo,
826 DenseSet<GlobalVariable *> const &ModuleScopeVariables,
827 DenseSet<Function *> const &KernelsThatAllocateModuleLDS,
828 GlobalVariable *MaybeModuleScopeStruct) {
829
830 // Create a struct for each kernel for the non-module-scope variables.
831
    DenseMap<Function *, LDSVariableReplacement> KernelToReplacement;
    for (Function &Func : M.functions()) {
834 if (Func.isDeclaration() || !isKernelLDS(&Func))
835 continue;
836
837 DenseSet<GlobalVariable *> KernelUsedVariables;
838 // Allocating variables that are used directly in this struct to get
839 // alignment aware allocation and predictable frame size.
840 for (auto &v : LDSUsesInfo.direct_access[&Func]) {
841 KernelUsedVariables.insert(v);
842 }
843
844 // Allocating variables that are accessed indirectly so that a lookup of
845 // this struct instance can find them from nested functions.
846 for (auto &v : LDSUsesInfo.indirect_access[&Func]) {
847 KernelUsedVariables.insert(v);
848 }
849
850 // Variables allocated in module lds must all resolve to that struct,
851 // not to the per-kernel instance.
852 if (KernelsThatAllocateModuleLDS.contains(&Func)) {
853 for (GlobalVariable *v : ModuleScopeVariables) {
854 KernelUsedVariables.erase(v);
855 }
856 }
857
858 if (KernelUsedVariables.empty()) {
859 // Either used no LDS, or the LDS it used was all in the module struct
860 continue;
861 }
862
863 // The association between kernel function and LDS struct is done by
864 // symbol name, which only works if the function in question has a
865 // name This is not expected to be a problem in practice as kernels
866 // are called by name making anonymous ones (which are named by the
867 // backend) difficult to use. This does mean that llvm test cases need
868 // to name the kernels.
869 if (!Func.hasName()) {
870 report_fatal_error("Anonymous kernels cannot use LDS variables");
871 }
872
873 std::string VarName =
874 (Twine("llvm.amdgcn.kernel.") + Func.getName() + ".lds").str();
875
876 auto Replacement =
877 createLDSVariableReplacement(M, VarName, KernelUsedVariables);
878
879 // This struct is allocated at a predictable address that can be
880 // calculated now, recorded in metadata then used to lower references to
881 // it during codegen.
882 {
883 // frame layout, starting from 0
884 //{
885 // module.lds
886 // alignment padding
887 // kernel instance
888 //}
889
890 if (!MaybeModuleScopeStruct ||
891 Func.hasFnAttribute("amdgpu-elide-module-lds")) {
892 // There's no module.lds for this kernel so this replacement struct
893 // goes first
894 recordLDSAbsoluteAddress(&M, Replacement.SGV, 0);
895 } else {
896 const DataLayout &DL = M.getDataLayout();
897 TypeSize ModuleSize =
898 DL.getTypeAllocSize(MaybeModuleScopeStruct->getValueType());
899 GlobalVariable *KernelStruct = Replacement.SGV;
900 Align KernelAlign = AMDGPU::getAlign(DL, KernelStruct);
901 recordLDSAbsoluteAddress(&M, Replacement.SGV,
902 alignTo(ModuleSize, KernelAlign));
903 }
904 }
905
906 // remove preserves existing codegen
907 removeLocalVarsFromUsedLists(M, KernelUsedVariables);
908 KernelToReplacement[&Func] = Replacement;
909
910 // Rewrite uses within kernel to the new struct
911 replaceLDSVariablesWithStruct(
912 M, KernelUsedVariables, Replacement, [&Func](Use &U) {
913 Instruction *I = dyn_cast<Instruction>(U.getUser());
914 return I && I->getFunction() == &Func;
915 });
916 }
917 return KernelToReplacement;
918 }
919
920 bool runOnModule(Module &M) override {
921 CallGraph CG = CallGraph(M);
922 bool Changed = superAlignLDSGlobals(M);
923
924 Changed |= eliminateConstantExprUsesOfLDSFromAllInstructions(M);
925
926 Changed = true; // todo: narrow this down
927
928 // For each kernel, what variables does it access directly or through
929 // callees
930 LDSUsesInfoTy LDSUsesInfo = getTransitiveUsesOfLDS(CG, M);
931
932 // For each variable accessed through callees, which kernels access it
933 VariableFunctionMap LDSToKernelsThatNeedToAccessItIndirectly;
934 for (auto &K : LDSUsesInfo.indirect_access) {
935 Function *F = K.first;
936 assert(isKernelLDS(F));
937 for (GlobalVariable *GV : K.second) {
938 LDSToKernelsThatNeedToAccessItIndirectly[GV].insert(F);
939 }
940 }
941
942 DenseSet<GlobalVariable *> ModuleScopeVariables;
943 DenseSet<GlobalVariable *> TableLookupVariables;
944 DenseSet<GlobalVariable *> KernelAccessVariables;
945 partitionVariablesIntoIndirectStrategies(
946 M, LDSUsesInfo, LDSToKernelsThatNeedToAccessItIndirectly,
947 ModuleScopeVariables, TableLookupVariables, KernelAccessVariables);
948
949 // If the kernel accesses a variable that is going to be stored in the
950 // module instance through a call then that kernel needs to allocate the
951 // module instance
952 DenseSet<Function *> KernelsThatAllocateModuleLDS =
953 kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo,
954 ModuleScopeVariables);
955 DenseSet<Function *> KernelsThatAllocateTableLDS =
956 kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo,
957 TableLookupVariables);
958
959 GlobalVariable *MaybeModuleScopeStruct = lowerModuleScopeStructVariables(
960 M, ModuleScopeVariables, KernelsThatAllocateModuleLDS);
961
    DenseMap<Function *, LDSVariableReplacement> KernelToReplacement =
        lowerKernelScopeStructVariables(M, LDSUsesInfo, ModuleScopeVariables,
                                        KernelsThatAllocateModuleLDS,
                                        MaybeModuleScopeStruct);
966
967 // Lower zero cost accesses to the kernel instances just created
968 for (auto &GV : KernelAccessVariables) {
969 auto &funcs = LDSToKernelsThatNeedToAccessItIndirectly[GV];
970 assert(funcs.size() == 1); // Only one kernel can access it
971 LDSVariableReplacement Replacement =
972 KernelToReplacement[*(funcs.begin())];
973
      DenseSet<GlobalVariable *> Vec;
      Vec.insert(GV);
976
977 // TODO: Looks like a latent bug, Replacement may not be marked
978 // UsedByKernel here
979 replaceLDSVariablesWithStruct(M, Vec, Replacement, [](Use &U) {
980 return isa<Instruction>(U.getUser());
981 });
982 }
983
984 if (!KernelsThatAllocateTableLDS.empty()) {
985 LLVMContext &Ctx = M.getContext();
986 IRBuilder<> Builder(Ctx);
987
988 // The ith element of this vector is kernel id i
989 std::vector<Function *> OrderedKernels =
990 assignLDSKernelIDToEachKernel(&M, KernelsThatAllocateTableLDS);
991
992 for (size_t i = 0; i < OrderedKernels.size(); i++) {
993 markUsedByKernel(Builder, OrderedKernels[i],
994 KernelToReplacement[OrderedKernels[i]].SGV);
995 }
996
997 // The order must be consistent between lookup table and accesses to
998 // lookup table
999 std::vector<GlobalVariable *> TableLookupVariablesOrdered(
1000 TableLookupVariables.begin(), TableLookupVariables.end());
1001 llvm::sort(TableLookupVariablesOrdered.begin(),
1002 TableLookupVariablesOrdered.end(),
1003 [](const GlobalVariable *lhs, const GlobalVariable *rhs) {
1004 return lhs->getName() < rhs->getName();
1005 });
1006
1007 GlobalVariable *LookupTable = buildLookupTable(
1008 M, TableLookupVariablesOrdered, OrderedKernels, KernelToReplacement);
1009 replaceUsesInInstructionsWithTableLookup(M, TableLookupVariablesOrdered,
1010 LookupTable);
1011 }
1012
1013 for (auto &GV : make_early_inc_range(M.globals()))
1014 if (AMDGPU::isLDSVariableToLower(GV)) {
1015 // probably want to remove from used lists
        GV.removeDeadConstantUsers();
        if (GV.use_empty())
1018 GV.eraseFromParent();
1019 }
1020
1021 return Changed;
1022 }
1023
1024private:
1025 // Increase the alignment of LDS globals if necessary to maximise the chance
1026 // that we can use aligned LDS instructions to access them.
1027 static bool superAlignLDSGlobals(Module &M) {
1028 const DataLayout &DL = M.getDataLayout();
1029 bool Changed = false;
1030 if (!SuperAlignLDSGlobals) {
1031 return Changed;
1032 }
1033
    for (auto &GV : M.globals()) {
      if (GV.getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS) {
        // Only changing alignment of LDS variables
1037 continue;
1038 }
1039 if (!GV.hasInitializer()) {
1040 // cuda/hip extern __shared__ variable, leave alignment alone
1041 continue;
1042 }
1043
1044 Align Alignment = AMDGPU::getAlign(DL, &GV);
1045 TypeSize GVSize = DL.getTypeAllocSize(GV.getValueType());
1046
1047 if (GVSize > 8) {
1048 // We might want to use a b96 or b128 load/store
1049 Alignment = std::max(Alignment, Align(16));
1050 } else if (GVSize > 4) {
1051 // We might want to use a b64 load/store
1052 Alignment = std::max(Alignment, Align(8));
1053 } else if (GVSize > 2) {
1054 // We might want to use a b32 load/store
1055 Alignment = std::max(Alignment, Align(4));
1056 } else if (GVSize > 1) {
1057 // We might want to use a b16 load/store
1058 Alignment = std::max(Alignment, Align(2));
1059 }
1060
1061 if (Alignment != AMDGPU::getAlign(DL, &GV)) {
1062 Changed = true;
1063 GV.setAlignment(Alignment);
1064 }
1065 }
1066 return Changed;
1067 }
1068
1069 static LDSVariableReplacement createLDSVariableReplacement(
1070 Module &M, std::string VarName,
1071 DenseSet<GlobalVariable *> const &LDSVarsToTransform) {
1072 // Create a struct instance containing LDSVarsToTransform and map from those
1073 // variables to ConstantExprGEP
1074 // Variables may be introduced to meet alignment requirements. No aliasing
1075 // metadata is useful for these as they have no uses. Erased before return.
1076
1077 LLVMContext &Ctx = M.getContext();
1078 const DataLayout &DL = M.getDataLayout();
1079 assert(!LDSVarsToTransform.empty());
1080
    SmallVector<OptimizedStructLayoutField, 8> LayoutFields;
    LayoutFields.reserve(LDSVarsToTransform.size());
1083 {
1084 // The order of fields in this struct depends on the order of
      // variables in the argument which varies when changing how they
1086 // are identified, leading to spurious test breakage.
1087 std::vector<GlobalVariable *> Sorted(LDSVarsToTransform.begin(),
1088 LDSVarsToTransform.end());
1089 llvm::sort(Sorted.begin(), Sorted.end(),
1090 [](const GlobalVariable *lhs, const GlobalVariable *rhs) {
1091 return lhs->getName() < rhs->getName();
1092 });
      for (GlobalVariable *GV : Sorted) {
        OptimizedStructLayoutField F(GV,
                                     DL.getTypeAllocSize(GV->getValueType()),
                                     AMDGPU::getAlign(DL, GV));
1097 LayoutFields.emplace_back(F);
1098 }
1099 }
1100
1101 performOptimizedStructLayout(LayoutFields);
1102
1103 std::vector<GlobalVariable *> LocalVars;
1104 BitVector IsPaddingField;
1105 LocalVars.reserve(LDSVarsToTransform.size()); // will be at least this large
1106 IsPaddingField.reserve(LDSVarsToTransform.size());
1107 {
1108 uint64_t CurrentOffset = 0;
1109 for (size_t I = 0; I < LayoutFields.size(); I++) {
1110 GlobalVariable *FGV = static_cast<GlobalVariable *>(
1111 const_cast<void *>(LayoutFields[I].Id));
1112 Align DataAlign = LayoutFields[I].Alignment;
1113
1114 uint64_t DataAlignV = DataAlign.value();
1115 if (uint64_t Rem = CurrentOffset % DataAlignV) {
1116 uint64_t Padding = DataAlignV - Rem;
1117
1118 // Append an array of padding bytes to meet alignment requested
1119 // Note (o + (a - (o % a)) ) % a == 0
1120 // (offset + Padding ) % align == 0
1121
1122 Type *ATy = ArrayType::get(Type::getInt8Ty(Ctx), Padding);
1123 LocalVars.push_back(new GlobalVariable(
            M, ATy, false, GlobalValue::InternalLinkage, UndefValue::get(ATy),
            "", nullptr, GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS,
            false));
1127 IsPaddingField.push_back(true);
1128 CurrentOffset += Padding;
1129 }
1130
1131 LocalVars.push_back(FGV);
1132 IsPaddingField.push_back(false);
1133 CurrentOffset += LayoutFields[I].Size;
1134 }
1135 }
1136
1137 std::vector<Type *> LocalVarTypes;
1138 LocalVarTypes.reserve(LocalVars.size());
1139 std::transform(
1140 LocalVars.cbegin(), LocalVars.cend(), std::back_inserter(LocalVarTypes),
1141 [](const GlobalVariable *V) -> Type * { return V->getValueType(); });
1142
1143 StructType *LDSTy = StructType::create(Ctx, LocalVarTypes, VarName + ".t");
1144
1145 Align StructAlign = AMDGPU::getAlign(DL, LocalVars[0]);
1146
1147 GlobalVariable *SGV = new GlobalVariable(
        M, LDSTy, false, GlobalValue::InternalLinkage, UndefValue::get(LDSTy),
        VarName, nullptr, GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS,
        false);
1151 SGV->setAlignment(StructAlign);
1152
    DenseMap<GlobalVariable *, Constant *> Map;
    Type *I32 = Type::getInt32Ty(Ctx);
1155 for (size_t I = 0; I < LocalVars.size(); I++) {
1156 GlobalVariable *GV = LocalVars[I];
1157 Constant *GEPIdx[] = {ConstantInt::get(I32, 0), ConstantInt::get(I32, I)};
1158 Constant *GEP = ConstantExpr::getGetElementPtr(LDSTy, SGV, GEPIdx, true);
1159 if (IsPaddingField[I]) {
1160 assert(GV->use_empty());
1161 GV->eraseFromParent();
1162 } else {
1163 Map[GV] = GEP;
1164 }
1165 }
1166 assert(Map.size() == LDSVarsToTransform.size());
1167 return {SGV, std::move(Map)};
1168 }
1169
1170 template <typename PredicateTy>
1171 static void replaceLDSVariablesWithStruct(
1172 Module &M, DenseSet<GlobalVariable *> const &LDSVarsToTransformArg,
1173 LDSVariableReplacement Replacement, PredicateTy Predicate) {
1174 LLVMContext &Ctx = M.getContext();
1175 const DataLayout &DL = M.getDataLayout();
1176
1177 // A hack... we need to insert the aliasing info in a predictable order for
1178 // lit tests. Would like to have them in a stable order already, ideally the
1179 // same order they get allocated, which might mean an ordered set container
1180 std::vector<GlobalVariable *> LDSVarsToTransform(
1181 LDSVarsToTransformArg.begin(), LDSVarsToTransformArg.end());
1182 llvm::sort(LDSVarsToTransform.begin(), LDSVarsToTransform.end(),
1183 [](const GlobalVariable *lhs, const GlobalVariable *rhs) {
1184 return lhs->getName() < rhs->getName();
1185 });
1186
1187 // Create alias.scope and their lists. Each field in the new structure
1188 // does not alias with all other fields.
1189 SmallVector<MDNode *> AliasScopes;
1190 SmallVector<Metadata *> NoAliasList;
1191 const size_t NumberVars = LDSVarsToTransform.size();
1192 if (NumberVars > 1) {
1193 MDBuilder MDB(Ctx);
      AliasScopes.reserve(NumberVars);
      MDNode *Domain = MDB.createAnonymousAliasScopeDomain();
      for (size_t I = 0; I < NumberVars; I++) {
        MDNode *Scope = MDB.createAnonymousAliasScope(Domain);
        AliasScopes.push_back(Scope);
1199 }
1200 NoAliasList.append(&AliasScopes[1], AliasScopes.end());
1201 }
1202
1203 // Replace uses of ith variable with a constantexpr to the corresponding
1204 // field of the instance that will be allocated by AMDGPUMachineFunction
1205 for (size_t I = 0; I < NumberVars; I++) {
1206 GlobalVariable *GV = LDSVarsToTransform[I];
1207 Constant *GEP = Replacement.LDSVarsToConstantGEP[GV];
1208
1209 GV->replaceUsesWithIf(GEP, Predicate);
1210
1211 APInt APOff(DL.getIndexTypeSizeInBits(GEP->getType()), 0);
1212 GEP->stripAndAccumulateInBoundsConstantOffsets(DL, APOff);
1213 uint64_t Offset = APOff.getZExtValue();
1214
1215 Align A =
1216 commonAlignment(Replacement.SGV->getAlign().valueOrOne(), Offset);
1217
1218 if (I)
1219 NoAliasList[I - 1] = AliasScopes[I - 1];
1220 MDNode *NoAlias =
1221 NoAliasList.empty() ? nullptr : MDNode::get(Ctx, NoAliasList);
1222 MDNode *AliasScope =
1223 AliasScopes.empty() ? nullptr : MDNode::get(Ctx, {AliasScopes[I]});
1224
1225 refineUsesAlignmentAndAA(GEP, A, DL, AliasScope, NoAlias);
1226 }
1227 }
1228
1229 static void refineUsesAlignmentAndAA(Value *Ptr, Align A,
1230 const DataLayout &DL, MDNode *AliasScope,
1231 MDNode *NoAlias, unsigned MaxDepth = 5) {
1232 if (!MaxDepth || (A == 1 && !AliasScope))
1233 return;
1234
1235 for (User *U : Ptr->users()) {
1236 if (auto *I = dyn_cast<Instruction>(U)) {
1237 if (AliasScope && I->mayReadOrWriteMemory()) {
1238 MDNode *AS = I->getMetadata(LLVMContext::MD_alias_scope);
1239 AS = (AS ? MDNode::getMostGenericAliasScope(AS, AliasScope)
1240 : AliasScope);
1241 I->setMetadata(LLVMContext::MD_alias_scope, AS);
1242
1243 MDNode *NA = I->getMetadata(LLVMContext::MD_noalias);
1244 NA = (NA ? MDNode::intersect(NA, NoAlias) : NoAlias);
1245 I->setMetadata(LLVMContext::MD_noalias, NA);
1246 }
1247 }
1248
1249 if (auto *LI = dyn_cast<LoadInst>(U)) {
1250 LI->setAlignment(std::max(A, LI->getAlign()));
1251 continue;
1252 }
1253 if (auto *SI = dyn_cast<StoreInst>(U)) {
1254 if (SI->getPointerOperand() == Ptr)
1255 SI->setAlignment(std::max(A, SI->getAlign()));
1256 continue;
1257 }
1258 if (auto *AI = dyn_cast<AtomicRMWInst>(U)) {
1259 // None of atomicrmw operations can work on pointers, but let's
1260 // check it anyway in case it will or we will process ConstantExpr.
1261 if (AI->getPointerOperand() == Ptr)
1262 AI->setAlignment(std::max(A, AI->getAlign()));
1263 continue;
1264 }
1265 if (auto *AI = dyn_cast<AtomicCmpXchgInst>(U)) {
1266 if (AI->getPointerOperand() == Ptr)
1267 AI->setAlignment(std::max(A, AI->getAlign()));
1268 continue;
1269 }
1270 if (auto *GEP = dyn_cast<GetElementPtrInst>(U)) {
1271 unsigned BitWidth = DL.getIndexTypeSizeInBits(GEP->getType());
1272 APInt Off(BitWidth, 0);
1273 if (GEP->getPointerOperand() == Ptr) {
1274 Align GA;
1275 if (GEP->accumulateConstantOffset(DL, Off))
1276 GA = commonAlignment(A, Off.getLimitedValue());
1277 refineUsesAlignmentAndAA(GEP, GA, DL, AliasScope, NoAlias,
1278 MaxDepth - 1);
1279 }
1280 continue;
1281 }
1282 if (auto *I = dyn_cast<Instruction>(U)) {
1283 if (I->getOpcode() == Instruction::BitCast ||
1284 I->getOpcode() == Instruction::AddrSpaceCast)
1285 refineUsesAlignmentAndAA(I, A, DL, AliasScope, NoAlias, MaxDepth - 1);
1286 }
1287 }
1288 }
1289};
1290
1291} // namespace
1292char AMDGPULowerModuleLDS::ID = 0;
1293
1294char &llvm::AMDGPULowerModuleLDSID = AMDGPULowerModuleLDS::ID;
1295
1296INITIALIZE_PASS(AMDGPULowerModuleLDS, DEBUG_TYPE,
1297 "Lower uses of LDS variables from non-kernel functions", false,
1298 false)
1299
ModulePass *llvm::createAMDGPULowerModuleLDSPass() {
  return new AMDGPULowerModuleLDS();
1302}
1303
PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M,
                                                ModuleAnalysisManager &) {
  return AMDGPULowerModuleLDS().runOnModule(M) ? PreservedAnalyses::none()
                                               : PreservedAnalyses::all();
}