Go to the documentation of this file.
#ifndef DEMANGLE_STRINGVIEW_H
#define DEMANGLE_STRINGVIEW_H

// Member excerpts from class StringView (a lightweight, non-owning character range):
StringView(const char *First_, const char *Last_)
    : First(First_), Last(Last_) {}
StringView(const char *First_, size_t Len)
    : First(First_), Last(First_ + Len) {}

StringView substr(size_t Pos, size_t Len = npos) const {
  if (Len > size() - Pos)
    Len = size() - Pos;
  return StringView(begin() + Pos, Len);
}

bool startsWith(StringView Str) const {
  if (Str.size() > size())
    return false;
  return std::strncmp(Str.begin(), begin(), Str.size()) == 0;
}

const char *end() const { return Last; }
size_t size() const { return static_cast<size_t>(Last - First); }

// Free comparison operator defined in the same header:
inline bool operator==(const StringView &LHS, const StringView &RHS) {
  return LHS.size() == RHS.size() &&
         std::strncmp(LHS.begin(), RHS.begin(), LHS.size()) == 0;
}
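A minimal usage sketch of the members shown above. The include path and the llvm::itanium_demangle namespace are assumptions based on LLVM's copy of the demangler sources; treat this as illustrative, not as the header's own documentation.

// Usage sketch (assumed include path and namespace).
#include "llvm/Demangle/StringView.h"
#include <cassert>

using llvm::itanium_demangle::StringView;

int main() {
  StringView Name("_Z1fv");                 // built from a string literal
  assert(Name.size() == 5);
  assert(Name.startsWith(StringView("_Z")));
  assert(Name.consumeFront('_'));           // drops the leading '_' on success
  StringView Tail = Name.substr(1);         // Len defaults to npos and is clamped to the end
  assert(Tail == StringView("1fv"));        // operator== compares size, then bytes
  return 0;
}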
Itanium Name Demangler, i.e. convert the string _Z1fv into "f()". Both [sub]projects need to demangle symbols, but neither can depend on the other: libcxxabi needs the demangler to implement __cxa_demangle, which is part of the Itanium ABI spec, while LLVM needs its own copy in a bunch of places and cannot rely on the system's __cxa_demangle because it (a) might not be available and (b) may not be up to date on the latest language features. The copy of the demangler in LLVM has some extra pieces that aren't needed in libcxxabi, which depend on the shared generic components. Despite these differences, we want to keep the core generic demangling library identical between both copies to simplify development and testing. If you're working on the generic library, do the work first in libcxxabi.
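For context, a short sketch of what "demangle _Z1fv into f()" means, shown here through the system __cxa_demangle (the interface libcxxabi implements) rather than LLVM's bundled copy:

// Sketch only: standard Itanium ABI demangling entry point.
#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
  int Status = 0;
  char *Demangled = abi::__cxa_demangle("_Z1fv", nullptr, nullptr, &Status);
  if (Status == 0 && Demangled)
    std::printf("%s\n", Demangled);  // prints: f()
  std::free(Demangled);              // caller owns the malloc'd buffer
  return 0;
}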
StringView(const char *First_, size_t Len)
const char & operator[](size_t Idx) const
Itanium Name Demangler Library
StringView(const char(&Str)[N])
itanium_demangle::ManglingParser< DefaultAllocator > Demangler
StringView substr(size_t Pos, size_t Len=npos) const
Itanium Name Demangler, i.e. convert the string _Z1fv into "f()". You can also use the CRTP base ManglingParser to perform some simple analysis on the mangled name.
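A hedged sketch of driving ManglingParser directly, as mentioned above. The node-allocator type is supplied by the embedding code; the LeakyAllocator below is an illustrative stand-in (it never frees its nodes), not LLVM's actual DefaultAllocator, and the include path and namespace are assumptions based on LLVM's copy.

// Sketch: parse a mangled name and report whether it is valid Itanium mangling.
#include "llvm/Demangle/ItaniumDemangle.h"
#include <cstddef>
#include <cstring>
#include <utility>

using namespace llvm::itanium_demangle;

struct LeakyAllocator {
  void reset() {}
  template <class T, class... Args> T *makeNode(Args &&...As) {
    return new T(std::forward<Args>(As)...);   // never freed; fine for a demo
  }
  void *allocateNodeArray(size_t Sz) { return new Node *[Sz]; }
};

bool isItaniumMangled(const char *Mangled) {
  ManglingParser<LeakyAllocator> Parser(Mangled, Mangled + std::strlen(Mangled));
  return Parser.parse() != nullptr;            // parse() returns null on invalid input
}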
bool consumeFront(char C)
std::string demangle(const std::string &MangledName)
Attempt to demangle a string using different demangling schemes.
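A small sketch of the high-level entry point referenced above. The header path is assumed from LLVM's layout; llvm::demangle tries the known schemes in turn and, to my knowledge, returns the input unchanged when none apply.

// Sketch: top-level convenience API.
#include "llvm/Demangle/Demangle.h"
#include <iostream>

int main() {
  std::cout << llvm::demangle("_Z1fv") << "\n";       // "f()"
  std::cout << llvm::demangle("plain_name") << "\n";  // echoed back unchanged if not mangled
  return 0;
}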
bool startsWith(StringView Str) const
The object format emitted by the WebAssembly backed is documented see the home and packaging for producing WebAssembly applications that can run in browsers and other environments wasi sdk provides a more minimal C C SDK based on llvm and a libc based on for producing WebAssemmbly applictions that use the WASI ABI Rust provides WebAssembly support integrated into Cargo There are two main which provides a relatively minimal environment that has an emphasis on being native wasm32 unknown which uses Emscripten internally and provides standard C C filesystem GL and SDL bindings For more and br_table instructions can support having a value on the value stack across the jump(sometimes). We should(a) model this
Current eax eax eax ret Ideal eax eax ret Re implement atomic builtins x86 does not have to use add to implement these Instead
Common register allocation spilling lr str ldr sxth r3 ldr mla r4 can lr mov lr str ldr sxth r3 mla r4 and then merge mul and lr str ldr sxth r3 mla r4 It also increase the likelihood the store may become dead bb27 Successors according to LLVM ID Predecessors according to mbb< bb27, 0x8b0a7c0 > Note ADDri is not a two address instruction its result reg1037 is an operand of the PHI node in bb76 and its operand reg1039 is the result of the PHI node We should treat it as a two address code and make sure the ADDri is scheduled after any node that reads reg1039 Use info(i.e. register scavenger) to assign it a free register to allow reuse the collector could move the objects and invalidate the derived pointer This is bad enough in the first but safe points can crop up unpredictably **array_addr i32
gets compiled into this on rsp movaps rsp movaps rsp movaps xmm5
void sort(IteratorTy Start, IteratorTy End)
Eliminate PHI nodes for register allocation
Error backend(const Config &C, AddStreamFn AddStream, unsigned ParallelCodeGenParallelismLevel, Module &M, ModuleSummaryIndex &CombinedIndex)
Runs a regular LTO backend.
#define DEMANGLE_NAMESPACE_END
ValuesClass values(OptsTy... Options)
Helper to build a ValuesClass by forwarding a variable number of arguments as an initializer list to ...
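A hedged sketch of the usual way this helper is used when declaring a command-line option over a fixed set of enumerators; the option name, enum, and help strings here are made up for illustration:

#include "llvm/Support/CommandLine.h"

using namespace llvm;

enum Color { Red, Green, Blue };

// clEnumValN pairs an enumerator with its command-line spelling and help
// text; cl::values() forwards the whole list to the option parser.
static cl::opt<Color> PaintColor(
    "paint-color", cl::desc("Choose a color"), cl::init(Red),
    cl::values(clEnumValN(Red, "red", "use red"),
               clEnumValN(Green, "green", "use green"),
               clEnumValN(Blue, "blue", "use blue")));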
static Expected< BitVector > expand(StringRef S, StringRef Original)
auto count(R &&Range, const E &Element)
Wrapper function around std::count to count the number of times an element Element occurs in the give...
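A quick sketch of typical use, assuming this is the llvm::count() helper from llvm/ADT/STLExtras.h:

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"

// Count how many elements of V are zero, without spelling out the
// begin()/end() iterator pair that std::count would require.
static long countZeros(const llvm::SmallVector<int, 8> &V) {
  return llvm::count(V, 0);
}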
static GCRegistry::Add< StatepointGC > D("statepoint-example", "an example strategy for statepoint")
expand Expand reduction intrinsics
@ LOAD
LOAD and STORE have token chains as their first operand, then the same operands as an LLVM load/store...
Predicate all(Predicate P0, Predicate P1)
True iff P0 and P1 are true.
StringView(const char *Str)
Combine AArch64 machine instrs before legalization
initializer< Ty > init(const Ty &Val)
size_t find(char C, size_t From=0) const
int compare(DigitsT LDigits, int16_t LScale, DigitsT RDigits, int16_t RScale)
Compare two scaled numbers.
Straight line strength reduction
SI optimize exec mask operations pre RA
ArrayRef< int > hi(ArrayRef< int > Vuu)
class_match< Value > m_Value()
Match an arbitrary value and ignore it.
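For context, a small sketch of how m_Value() is normally combined with other matchers from llvm/IR/PatternMatch.h; the no-argument form matches any value without binding it to a variable:

#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Value.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// True if V is an integer add, no matter what its operands are.
static bool isAnyAdd(Value *V) {
  return match(V, m_Add(m_Value(), m_Value()));
}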
void clearbit(int *target, int bit)
std::error_code remove(const Twine &path, bool IgnoreNonExisting=true)
Remove path.
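A minimal sketch of calling this, assuming it is llvm::sys::fs::remove() from llvm/Support/FileSystem.h; with IgnoreNonExisting left at its default, a missing file is not treated as an error:

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/raw_ostream.h"

static void removeIfPresent(llvm::StringRef Path) {
  // A non-zero error_code signals a real failure (permissions, etc.);
  // a file that simply does not exist is ignored.
  if (std::error_code EC = llvm::sys::fs::remove(Path))
    llvm::errs() << "cannot remove " << Path << ": " << EC.message() << "\n";
}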
gvn sink
When an instruction is found to only be used outside of the loop, this function moves it to the exit ...
Expected< ExpressionValue > min(const ExpressionValue &Lhs, const ExpressionValue &Rhs)
SmallVector< MachineOperand, 4 > Cond
static SDValue LowerSELECT_CC(SDValue Op, SelectionDAG &DAG, const SparcTargetLowering &TLI, bool hasHardQuad, bool isV9, bool is64Bit)
bool startsWith(char C) const
format_object< Ts... > format(const char *Fmt, const Ts &... Vals)
These are helper functions used to produce formatted output.
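A short sketch, assuming these are the llvm::format() helpers from llvm/Support/Format.h, which build a format_object that raw_ostream knows how to print:

#include "llvm/Support/Format.h"
#include "llvm/Support/raw_ostream.h"

static void printAddress(unsigned long long Addr) {
  // The format string follows printf conventions; width and padding are
  // fixed at the call site rather than taken from stream state.
  llvm::outs() << llvm::format("0x%016llx", Addr) << "\n";
}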
static bool rewrite(Function &F)
@ FSEL
FSEL - Traditional three-operand fsel node.
bool operator==(const StringView &LHS, const StringView &RHS)
#define DEMANGLE_NAMESPACE_BEGIN
value_type read(const void *memory, endianness endian)
Read a value of a particular endianness from memory.
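A small sketch of reading a fixed-width little-endian value from a raw buffer; read32le() is the fixed-endianness convenience form from llvm/Support/Endian.h, while the generic read<>() shown above takes the endianness as an argument:

#include "llvm/Support/Endian.h"
#include <cstdint>

// Decode a little-endian uint32_t from an arbitrary (possibly unaligned)
// byte buffer.
static uint32_t decodeU32LE(const unsigned char *Bytes) {
  return llvm::support::endian::read32le(Bytes);
}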
loop Interchanges loops for cache reuse
static void query(const MachineInstr &MI, bool &Read, bool &Write, bool &Effects, bool &StackPointer)
StringView dropBack(size_t N=1) const
Merge disjoint stack slots
static double log2(double V)
static void require(bool Assertion, const char *Msg, llvm::StringRef Code)
@ Keep
No function return thunk.
const char * begin() const
bool CannotBeNegativeZero(const Value *V, const TargetLibraryInfo *TLI, unsigned Depth=0)
Return true if we can prove that the specified FP value is never equal to -0.0.
StringView dropFront(size_t N=1) const
lower matrix intrinsics minimal
static const uint32_t IV[8]
auto reverse(ContainerTy &&C)
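Assuming this is llvm::reverse() from llvm/ADT/STLExtras.h, a typical use is walking a container backwards in a range-based for loop:

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"

// Return the last non-zero element of V, or 0 if there is none.
static int lastNonZero(const llvm::SmallVector<int, 8> &V) {
  for (int X : llvm::reverse(V))
    if (X != 0)
      return X;
  return 0;
}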
amdgpu printf runtime AMDGPU Printf lowering
Type
MessagePack types as defined in the standard, with the exception of Integer being divided into a sign...
Tests are split between test_demangle and llvm/unittest/Demangle; the llvm directory should only get tests for functionality not included in the core library. In the future, though, we should probably move all the tests to LLVM. It is also a really good idea to run libFuzzer after non-trivial changes.
BlockVerifier::State From
int FirstOnet(unsigned long long arg1)
bool consumeFront(StringView S)
StringView(const char *First_, const char *Last_)
arc branch ARC finalize branches
APFloat abs(APFloat X)
Returns the absolute value of the argument.
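A tiny sketch of the APFloat abs() helper; it returns its argument with the sign cleared, so it is well defined for NaNs and infinities too (the wrapper function name here is made up):

#include "llvm/ADT/APFloat.h"

static double absViaAPFloat(double D) {
  llvm::APFloat X(D); // wrap the host double in an APFloat
  return llvm::abs(X).convertToDouble();
}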
ArrayRef< int > lo(ArrayRef< int > Vuu)
print Instructions which execute on loop entry
static void write(bool isBE, void *P, T V)