LLVM 13.0.0git
X86TargetInfo.cpp
1 //===-- X86TargetInfo.cpp - X86 Target Implementation ---------------------===//
2 //
3 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4 // See https://llvm.org/LICENSE.txt for license information.
5 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
6 //
7 //===----------------------------------------------------------------------===//
8 
9 #include "TargetInfo/X86TargetInfo.h"
10 #include "llvm/Support/TargetRegistry.h"
11 using namespace llvm;
12 
13 Target &llvm::getTheX86_32Target() {
14   static Target TheX86_32Target;
15   return TheX86_32Target;
16 }
17 Target &llvm::getTheX86_64Target() {
18   static Target TheX86_64Target;
19   return TheX86_64Target;
20 }
21 
22 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeX86TargetInfo() {
23   RegisterTarget<Triple::x86, /*HasJIT=*/true> X(
24       getTheX86_32Target(), "x86", "32-bit X86: Pentium-Pro and above", "X86");
25 
26   RegisterTarget<Triple::x86_64, /*HasJIT=*/true> Y(
27       getTheX86_64Target(), "x86-64", "64-bit X86: EM64T and AMD64", "X86");
28 }
llvm
Definition: AllocatorList.h:23
llvm::getTheX86_64Target
Target & getTheX86_64Target()
Definition: X86TargetInfo.cpp:17
llvm::Triple::x86
@ x86
Definition: Triple.h:83
llvm::Target
Target - Wrapper for Target specific information.
Definition: TargetRegistry.h:124
llvm::Triple::x86_64
@ x86_64
Definition: Triple.h:84
llvm::getTheX86_32Target
Target & getTheX86_32Target()
Definition: X86TargetInfo.cpp:13
LLVM_EXTERNAL_VISIBILITY
#define LLVM_EXTERNAL_VISIBILITY
Definition: Compiler.h:132
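
The listing above only registers the two X86 Target entries with LLVM's TargetRegistry; nothing in this file looks them up. As a minimal sketch of how that registration is consumed — assuming an LLVM 13 tree with the X86 backend configured, linking against the LLVMX86Info and LLVMSupport libraries, and an illustrative target triple — a client might do:

// Hypothetical standalone example, not part of X86TargetInfo.cpp.
#include "llvm/Support/TargetRegistry.h" // llvm::Target, llvm::TargetRegistry
#include "llvm/Support/TargetSelect.h"   // declares LLVMInitializeX86TargetInfo()
#include "llvm/Support/raw_ostream.h"
#include <string>

int main() {
  // Runs the RegisterTarget constructors from X86TargetInfo.cpp, adding the
  // "x86" and "x86-64" entries to the global target registry.
  LLVMInitializeX86TargetInfo();

  std::string Error;
  const llvm::Target *T =
      llvm::TargetRegistry::lookupTarget("x86_64-unknown-linux-gnu", Error);
  if (!T) {
    llvm::errs() << "lookup failed: " << Error << "\n";
    return 1;
  }
  // Prints the name and description passed to RegisterTarget, e.g.
  // "x86-64: 64-bit X86: EM64T and AMD64".
  llvm::outs() << T->getName() << ": " << T->getShortDescription() << "\n";
  return 0;
}

The RegisterTarget constructors inside LLVMInitializeX86TargetInfo() perform the actual insertion into the registry, which is why that initializer must run before TargetRegistry::lookupTarget can find either X86 target.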
tmp10
< i128 > tmp10
Definition: README-SSE.txt:791
llvm::ARM_MB::ST
@ ST
Definition: ARMBaseInfo.h:73
SSE4
gcc mainline compiles it xmm0 xmm1 movaps xmm2 movlhps xmm2 movaps xmm0 ret We compile vector multiply by constant into poor< i32 10, i32 10, i32 10, i32 10 > ret< 4 x i32 > A On targets without SSE4
Definition: README-SSE.txt:571
AH
AMD64 Optimization Manual has some nice information about optimizing integer multiplication by a constant How much of it applies to Intel s X86 implementation There are definite trade offs to xmm0 cvttss2siq rdx jb L3 subss xmm0 rax cvttss2siq rdx xorq rdx rax ret instead of xmm1 cvttss2siq rcx movaps xmm2 subss xmm2 cvttss2siq rax rdx xorq rax ucomiss xmm0 cmovb rax ret Seems like the jb branch has high likelihood of being taken It would have saved a few instructions It s not possible to reference AH
Definition: README-X86-64.txt:44
tests
Reference model for inliner Oz decision policy Note this model is also referenced by test Transforms Inline ML tests if replacing check those tests
Definition: README.txt:3
slot
This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like xmm1 movaps xmm0 ret saving two instructions The basic idea is that a reload from a spill slot
Definition: README-SSE.txt:269
llvm::LegalityPredicates::all
Predicate all(Predicate P0, Predicate P1)
True iff P0 and P1 are true.
Definition: LegalizerInfo.h:196
s
multiplies can be turned into SHL s
Definition: README.txt:370
llvm::ARM_AM::add
@ add
Definition: ARMAddressingModes.h:39
move
compiles ldr LCPI1_0 ldr ldr mov lsr tst moveq r1 ldr LCPI1_1 and r0 bx lr It would be better to do something like to fold the shift into the conditional move
Definition: README.txt:546
llvm::numbers::e
constexpr double e
Definition: MathExtras.h:57
slightly
entry stw r5 blr GCC r3 srawi xor r4 subf r0 stw r5 blr which is much nicer This theoretically may help improve twolf slightly(used in dimbox.c:142?).
to
This compiles xmm1 mulss xmm1 xorps xmm0 movss xmm0 ret Because mulss doesn t modify the top the top elements of xmm1 are already zero d We could compile this to
Definition: README-SSE.txt:224
I
#define I(x, y, z)
Definition: MD5.cpp:59
something
This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like xmm1 movaps xmm0 ret saving two instructions The basic idea is that a reload from a spill if only one byte chunk is bring in zeros the one element instead of elements This can be used to simplify a variety of shuffle where the elements are fixed zeros This code generates ugly probably due to costs being off or something
Definition: README-SSE.txt:278
call
S is passed via registers r2 But gcc stores them to the and then reload them to and r3 before issuing the call(r0 contains the address of the format string)
Definition: README.txt:190
node
This currently compiles esp xmm0 movsd esp eax eax esp ret We should use not the dag combiner This is because dagcombine2 needs to be able to see through the X86ISD::Wrapper node
Definition: README-SSE.txt:406
generated
The following code is currently generated
Definition: README.txt:954
numbers
SSE has instructions for doing operations on complex numbers
Definition: README-SSE.txt:22
LCPI1_0
> ldr r0, pc, #((LCPI1_0-(LPCRELL0+4))&0xfffffffc) We compile the following:define i16 @func_entry_2E_ce(i32 %i) { switch i32 %i, label %bb12.exitStub[i32 0, label %bb4.exitStub i32 1, label %bb9.exitStub i32 2, label %bb4.exitStub i32 3, label %bb4.exitStub i32 7, label %bb9.exitStub i32 8, label %bb.exitStub i32 9, label %bb9.exitStub] bb12.exitStub:ret i16 0 bb4.exitStub:ret i16 1 bb9.exitStub:ret i16 2 bb.exitStub:ret i16 3 } into:_func_entry_2E_ce:mov r2, #1 lsl r2, r0 cmp r0, #9 bhi LBB1_4 @bb12.exitStub LBB1_1:@newFuncRoot mov r1, #13 tst r2, r1 bne LBB1_5 @bb4.exitStub LBB1_2:@newFuncRoot ldr r1, LCPI1_0 tst r2, r1 bne LBB1_6 @bb9.exitStub LBB1_3:@newFuncRoot mov r1, #1 lsl r1, r1, #8 tst r2, r1 bne LBB1_7 @bb.exitStub LBB1_4:@bb12.exitStub mov r0, #0 bx lr LBB1_5:@bb4.exitStub mov r0, #1 bx lr LBB1_6:@bb9.exitStub mov r0, #2 bx lr LBB1_7:@bb.exitStub mov r0, #3 bx lr LBB1_8:.align 2 LCPI1_0:.long 642 gcc compiles to:cmp r0, #9 @ lr needed for prologue bhi L2 ldr r3, L11 mov r2, #1 mov r1, r2, asl r0 ands r0, r3, r2, asl r0 movne r0, #2 bxne lr tst r1, #13 beq L9 L3:mov r0, r2 bx lr L9:tst r1, #256 movne r0, #3 bxne lr L2:mov r0, #0 bx lr L12:.align 2 L11:.long 642 GCC is doing a couple of clever things here:1. It is predicating one of the returns. This isn 't a clear win though:in cases where that return isn 't taken, it is replacing one condbranch with two 'ne' predicated instructions. 2. It is sinking the shift of "1 << i" into the tst, and using ands instead of tst. This will probably require whole function isel. 3. GCC emits:tst r1, #256 we emit:mov r1, #1 lsl r1, r1, #8 tst r2, r1 When spilling in thumb mode and the sp offset is too large to fit in the ldr/str offset field, we load the offset from a constpool entry and add it to sp:ldr r2, LCPI add r2, sp ldr r2,[r2] These instructions preserve the condition code which is important if the spill is between a cmp and a bcc instruction. However, we can use the(potentially) cheaper sequence if we know it 's ok to clobber the condition register. add r2, sp, #255 *4 add r2, #132 ldr r2,[r2, #7 *4] This is especially bad when dynamic alloca is used. The all fixed size stack objects are referenced off the frame pointer with negative offsets. See oggenc for an example. Poor codegen test/CodeGen/ARM/select.ll f7:ldr r5, LCPI1_0 LPC0:add r5, pc ldr r6, LCPI1_1 ldr r2, LCPI1_2 mov r3, r6 mov lr, pc bx r5 Make register allocator/spiller smarter so we can re-materialize "mov r, imm", etc. Almost all Thumb instructions clobber condition code. Thumb load/store address mode offsets are scaled. The values kept in the instruction operands are pre-scale values. This probably ought to be changed to avoid extra work when we convert Thumb2 instructions to Thumb1 instructions. We need to make(some of the) Thumb1 instructions predicable. That will allow shrinking of predicated Thumb2 instructions. To allow this, we need to be able to toggle the 's' bit since they do not set CPSR when they are inside IT blocks. Make use of hi register variants of cmp:tCMPhir/tCMPZhir. Thumb1 immediate field sometimes keep pre-scaled values. See ThumbRegisterInfo::eliminateFrameIndex. This is inconsistent from ARM and Thumb2. Rather than having tBR_JTr print a ".align 2" and constant island pass pad it, add a target specific ALIGN instruction instead. That way, getInstSizeInBytes won 't have to over-estimate. It can also be used for loop alignment pass. We generate conditional code for icmp when we don 't need to. This code:int foo(int s) { return s==1 LCPI1_0
Definition: README-Thumb.txt:249
llvm::shuffle
void shuffle(Iterator first, Iterator last, RNG &&g)
Definition: STLExtras.h:1309
Guide
This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like xmm1 movaps xmm0 ret saving two instructions The basic idea is that a reload from a spill if only one byte chunk is bring in zeros the one element instead of elements This can be used to simplify a variety of shuffle where the elements are fixed zeros This code generates ugly probably due to costs being off or< 4 x float > eax xmm0 pxor xmm1 movaps xmm2 xmm2 xmm0 movaps eax ret Would it be better to ecx xmm0 xor eax xmm0 xmm0 movaps ecx ret Some useful information in the Apple Altivec SSE Migration Guide
Definition: README-SSE.txt:318
ops
xray Insert XRay ops
Definition: XRayInstrumentation.cpp:268
memcpy
<%struct.s * > cast struct s *S to sbyte *< sbyte * > sbyte uint cast struct s *agg result to sbyte *< sbyte * > sbyte uint cast struct s *memtmp to sbyte *< sbyte * > sbyte uint ret void llc ends up issuing two memcpy or custom lower memcpy(of small size) to be ldmia/stmia. I think option 2 is better but the current register allocator cannot allocate a chunk of registers at a time. A feasible temporary solution is to use specific physical registers at the lowering time for small(<
extract
loop extract
Definition: LoopExtractor.cpp:97
B
Instead we get xmm1 addss xmm1 xmm2 movhlps xmm0 movaps xmm3 addss xmm3 movdqa xmm0 addss xmm0 ret there are cases where some simple SLP would improve codegen a bit compiling _Complex float B
Definition: README-SSE.txt:46
llvm::ScaledNumbers::compare
int compare(DigitsT LDigits, int16_t LScale, DigitsT RDigits, int16_t RScale)
Compare two scaled numbers.
Definition: ScaledNumber.h:251
LLVMInitializeX86TargetInfo
LLVM_EXTERNAL_VISIBILITY void LLVMInitializeX86TargetInfo()
Definition: X86TargetInfo.cpp:22
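The entry above is the registration entry point defined in this file. As a rough, hedged sketch (not code from this file), client code usually consumes that registration by running the TargetInfo initializer and then resolving a Target from the registry by triple string; the triple below is only an example, and the header path follows LLVM 13 (TargetRegistry.h lives under llvm/MC/ in later releases).

  #include "llvm/Support/TargetRegistry.h"
  #include "llvm/Support/TargetSelect.h"
  #include <string>

  int main() {
    // Registers the X86 Target objects with the registry. MC and codegen
    // support still require the later LLVMInitializeX86TargetMC() /
    // LLVMInitializeX86Target() calls.
    LLVMInitializeX86TargetInfo();

    std::string Error;
    const llvm::Target *T =
        llvm::TargetRegistry::lookupTarget("x86_64-unknown-linux-gnu", Error);
    if (!T)
      return 1; // Error explains the failure, e.g. the target is not registered.

    // T can now be queried (name, short description) or used to construct MC
    // and TargetMachine objects once the remaining initializers have run.
    return 0;
  }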
llvm::RegisterTarget
RegisterTarget - Helper template for registering a target, for use in the target's initialization function.
Definition: TargetRegistry.h:948
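For illustration, here is a hedged sketch of how this helper is typically used, following the generic registration pattern documented in TargetRegistry.h rather than this file's code; the backend name "Foo", its accessor, and its initializer are hypothetical.

  #include "llvm/ADT/Triple.h"
  #include "llvm/Support/TargetRegistry.h"
  using namespace llvm;

  // Lazily-constructed singleton Target object for the hypothetical backend.
  static Target &getTheFooTarget() {
    static Target TheFooTarget;
    return TheFooTarget;
  }

  // Registration entry point: the RegisterTarget constructor performs the
  // actual TargetRegistry::RegisterTarget call. Triple::UnknownArch is used
  // because the made-up "Foo" backend has no real ArchType.
  extern "C" void LLVMInitializeFooTargetInfo() {
    RegisterTarget<Triple::UnknownArch, /*HasJIT=*/false> X(
        getTheFooTarget(), "foo", "Hypothetical Foo backend", "Foo");
  }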
X86TargetInfo.h
TargetRegistry.h