Target &llvm::getThePPC32Target() {
  static Target ThePPC32Target;
  return ThePPC32Target;
}
Target &llvm::getThePPC32LETarget() {
  static Target ThePPC32LETarget;
  return ThePPC32LETarget;
}
Target &llvm::getThePPC64Target() {
  static Target ThePPC64Target;
  return ThePPC64Target;
}
Target &llvm::getThePPC64LETarget() {
  static Target ThePPC64LETarget;
  return ThePPC64LETarget;
}
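These accessors are consumed by the file's initialization hook. A sketch of its shape (hedged: this follows my reading of the in-tree LLVMInitializePowerPCTargetInfo; the triple enum values and HasJIT flags are assumptions):

extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializePowerPCTargetInfo() {
  // Register each PowerPC flavour under its command-line name and description.
  RegisterTarget<Triple::ppc, /*HasJIT=*/true> W(getThePPC32Target(), "ppc32",
                                                 "PowerPC 32", "PPC");
  RegisterTarget<Triple::ppcle, /*HasJIT=*/true> X(
      getThePPC32LETarget(), "ppc32le", "PowerPC 32 LE", "PPC");
  RegisterTarget<Triple::ppc64, /*HasJIT=*/true> Y(getThePPC64Target(), "ppc64",
                                                   "PowerPC 64", "PPC");
  RegisterTarget<Triple::ppc64le, /*HasJIT=*/true> Z(
      getThePPC64LETarget(), "ppc64le", "PowerPC 64 LE", "PPC");
}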
Move/duplicate certain instructions close to their uses.
SSE variable shift can be custom lowered to something which uses a small table + unaligned load + shuffle instead of going through memory.
We currently generate a libcall, but we really shouldn't: movl %eax, %ecx; xorl %edx, %edx; divl %ecx ... A similar code sequence works for division, since divl produces both quotient and remainder. We currently compile this with a jo LBB1_2 overflow check. The same is not yet done for atomic and other read-modify-write instructions, nor when the OF or CF flags are needed. The shift operators have the complication that when the shift count is zero, EFLAGS is not set.
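A small C++ illustration of the div/rem point (hypothetical function, not from the note): x86's divl leaves the quotient in EAX and the remainder in EDX, so the pair below should cost one division, not two.

void divrem(unsigned a, unsigned b, unsigned *q, unsigned *r) {
  *q = a / b;  // one divl computes both results...
  *r = a % b;  // ...so no second divl should be emitted
}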
This should just be implemented with a CLZ instruction. Since there are other targets, e.g. PowerPC, that share this behavior, it would be best to implement this in a target-independent way. Zero is the default value for the binary encoder (e.g. add r0, add r5). Register operands should be distinct; that is, when the encoding does not require two syntactical operands to refer to the same register, two different registers should be used.
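As a hedged illustration of what the note asks for (my example, not the original testcase): with a target-independent ctlz lowering, the function below becomes a single CLZ on targets such as ARM, where clz(0) is defined as 32 and the zero-guard folds away. __builtin_clz is the GCC/Clang builtin that maps to llvm.ctlz.i32.

int leadingZeros(unsigned x) {
  // The ternary guards the input the builtin leaves undefined.
  return x ? __builtin_clz(x) : 32;
}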
This compiles to a shl-by-5 / or-by-10 sequence (conv: shl 5, ..., or 10, ret i32); it would be better as:
We generate the following IR (for i32 %b, nounwind readnone):
We should consider alternate ways to model stack dependencies. Lots of things could be done in WebAssemblyTargetTransformInfo.cpp; there are numerous optimization-related hooks that can be overridden in WebAssemblyTargetLowering. Instead of the OptimizeReturned pass, consider preserving the 'returned' attribute through to MachineInstrs and extending the MemIntrinsicResults pass to do this optimization on calls too. That would also let the WebAssemblyPeephole pass clean up dead defs for such calls, as it does for stores. Consider implementing further hooks and/or getMachineCombinerPatterns. Find a clean way to fix the problem which leads to the Shrink Wrapping pass being run after the WebAssembly PEI pass. When setting multiple variables to the same constant, we currently get repeated const code; it could be done with a smaller encoding, like local.tee $pop5 / local.get $pop6. WebAssembly registers are implicitly initialized to zero; explicit zeroing is therefore often redundant and could be optimized away. Small indices may use smaller encodings than large indices: WebAssemblyRegColoring and/or WebAssemblyRegRenumbering should sort registers according to their usage frequency to maximize the usage of smaller encodings. Many cases of irreducible control flow could be transformed more optimally than via the transform in WebAssemblyFixIrreducibleControlFlow.cpp. It may also be worthwhile to do transforms before register coloring, particularly when duplicating code, to allow register coloring to be aware of the duplication. WebAssemblyRegStackify could use AliasAnalysis to reorder loads and stores more aggressively. WebAssemblyRegStackify is currently a greedy algorithm.
This is an optimization pass for GlobalISel generic memory operations.
QP Compare Ordered (outs/ins as in xscmpudp $XA, $XB): no SDAG, intrinsic or builtin is required; alternatively, llvm fcmp ordered/unordered compare of DP can be used. For the QP compare, builtins are required. The DP compares xscmp*dp write to a VSX register: use int_ppc_vsx_xscmpeqdp, int_ppc_vsx_xscmpgedp, int_ppc_vsx_xscmpgtdp, int_ppc_vsx_xscmpnedp (v2i64/v2f64).
vector float f(vector float a, vector float b)
Decimal Convert From/To National/Zoned/Signed-QWord: int_ppc_altivec_bcdcfno, int_ppc_altivec_bcdcfzo, int_ppc_altivec_bcdctno, int_ppc_altivec_bcdctzo, int_ppc_altivec_bcdcfsqo, int_ppc_altivec_bcdctsqo, int_ppc_altivec_bcdcpsgno, int_ppc_altivec_bcdsetsgno, int_ppc_altivec_bcdso, int_ppc_altivec_bcduso, int_ppc_altivec_bcdsro (i.e. VA.byte[7]). Decimal (Unsigned) Truncate: define DAG nodes in PPCInstrInfo (SDTFPBinOp/SDTFPUnaryOp: def PPCfsqrtrto, def PPCfmulrto, def PPCfdivrto, def PPCfsubrto) and the DAG patterns of each instruction (PPCInstrVSX.td).
into: addss xmm2, xmm1; addss xmm3; movaps xmm0; unpcklps xmm0; ret. This seems silly when it could just be one addps. Expand libm rounding functions. main() should enable SSE DAZ mode and other fast SSE modes. Think about doing i64 math in SSE regs on x86-32. This testcase should have no SSE instructions in it, and only one load from a constant pool for the returned double: the select is being lowered, which prevents the dag combiner from turning 'select(load CPI1), (load CPI2)' into 'load (select CPI1, CPI2)'.
Vector Rotate Left Mask / Mask-Insert (v4i32).
It looks like we only need to define PPCfmarto for these instructions, because according to PowerISA V3.0 these instructions perform RTO on the fma result.
The object format emitted by the WebAssembly backend is documented in the WebAssembly tool-conventions repository; see the project home page. Emscripten provides an environment and packaging for producing WebAssembly applications that can run in browsers and other environments. wasi-sdk provides a more minimal C/C++ SDK based on llvm and a libc based on musl, for producing WebAssembly applications that use the WASI ABI. Rust provides WebAssembly support integrated into Cargo. There are two main options: wasm32-unknown-unknown, which provides a relatively minimal environment that has an emphasis on being "native" wasm; and wasm32-unknown-emscripten, which uses Emscripten internally and provides standard C/C++ filesystem, GL and SDL bindings.
cond_next: The vcmpeqfp / vcmpeqfp. instructions currently cannot be merged when the vcmpeqfp result is used by a branch. This can be improved. The code generated for this is truly awful (float b ...).
SI optimize exec mask operations
We currently emit a multiply through %eax. Perhaps what we really should generate is a cheaper sequence. Is imull three or four cycles? Note:
A function over <16 x i8> produces the following code (with the given mtriple): addis/addi pairs materializing LCPI0_0@toc@ha / @toc@l and LCPI0_1@toc@ha / @toc@l, lxvw4x loads, vpmsumb, stxvw4x, ori, blr.
Vector Multiply-by-10 Extended & Write Unsigned Quadword: int_ppc_altivec_vmul10euq (vmul10euq).
Common register allocation / spilling problem: (mul lr; str lr; ldr; sxth r3; ldr r4; mla r4) can become (lr mov; lr str; ldr; sxth r3; mla r4), and we can then merge the mul and the (lr str; ldr; sxth r3; mla r4). It also increases the likelihood the store may become dead. bb27: Successors according to CFG / Predecessors according to CFG: mbb<bb27,0x8b0a7c0>. Note that ADDri is not a two-address instruction; its result reg1037 is an operand of the PHI node in bb76 and its operand reg1039 is the result of the PHI node. We should treat it as two-address code and make sure the ADDri is scheduled after any node that reads reg1039. Use info (i.e. the register scavenger) to assign it a free register to allow reuse. Otherwise the collector could move the objects and invalidate the derived pointer. This is bad enough in the first place, but safe points can crop up unpredictably: (**array_addr, i32 n ... store obj, obj **nth_el). If the i64 division is lowered to a libcall, then a safe point will (must) appear for the call site; if a collection occurs there, array and nth_el no longer point into the correct object. The fix for this is to copy address calculations so that dependent pointers are never live across safe point boundaries. But the loads cannot be copied like this if there was an intervening store.
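A hypothetical sketch of the derived-pointer hazard just described (all names invented for illustration): a pointer computed from a GC-managed base that stays live across a safe point dangles if the collector moves the underlying object.

void mayCollect();                  // assumed: a call that is a GC safe point
int *nth_el(int **array_addr, long n) {
  int *p = *array_addr + n;         // derived pointer into the array
  mayCollect();                     // the collector may move *array_addr here
  return p;                         // unsafe; the fix: recompute *array_addr + n
}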
Target - Wrapper for Target specific information.
Since we know that 'Vector' is byte-aligned and we know the element offset, we should change the load into an lve*x instruction, instead of doing a load/store/lve*x sequence. Implement passing vectors by value into calls and receiving them as arguments. GCC apparently tries to codegen this as a constant pool load, then a load and vperm of the variable element. We need a way to teach tblgen that some operands of an intrinsic are required to be constants; the verifier should enforce this constraint. We currently codegen SCALAR_TO_VECTOR as a store of the scalar to a byte-aligned stack slot, followed by a load/vperm. We should probably just store it to a scalar stack slot, then use lvsl/vperm to load it; if the value is already in memory, this is a big win. extract_vector_elt of an arbitrary constant vector can be done with the following instructions:
TODO: use the VCMP/VCMPo form (support intrinsic). Vector Extract Unsigned (v16i8).
Implement PPCInstrInfo::isLoadFromStackSlot / isStoreToStackSlot for vector registers, to generate better spill code. The first should be a single lvx from the constant pool; the second should be a xor/stvx.
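A minimal sketch of what such a hook could look like (an assumption-laden outline, not the in-tree implementation: the operand layout is simplified, and it presumes the frame index is still an explicit operand when the hook runs):

unsigned PPCInstrInfo::isLoadFromStackSlot(const MachineInstr &MI,
                                           int &FrameIndex) const {
  // Recognize a vector reload: an LVX whose address operand is still an
  // abstract frame index, so generic spill logic can fold or delete it.
  if (MI.getOpcode() == PPC::LVX && MI.getOperand(1).isFI()) {
    FrameIndex = MI.getOperand(1).getIndex();
    return MI.getOperand(0).getReg();  // the vector register being reloaded
  }
  return 0;  // not a stack reload we recognize
}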
So we should use XX3Form_Rcr to implement the intrinsic. Convert DP -> SP (xscvdpsp): no builtin is required. Round & Convert QP -> DP (dword[1] is set to zero): no builtin is required. Round to Quad-Precision Integer: because you need to assign the rounding mode in the instruction, provide a builtin: (set f128:$vT, (int_ppc_vsx_xsrqpi f128:$vB)); (set f128 ... yields <n x <ty>> <result>; yields <ty> <result>). Load/Store: no builtin is required; see def memrix16 in PPCInstrInfo.td. Load/Store Vector: lxsdx (load/store DP); lxsspx (load/store with conversion from/to SP); lxsiwzx (PPClfiwzx) and stxsiwx (PPCstfiwx) for integer load/store; lxvd2x, int_ppc_vsx_lxvh8x, int_ppc_vsx_lxvb16x, stxvd2x, int_ppc_vsx_lxvl, int_ppc_vsx_lxvll, int_ppc_vsx_lxvwsx.
This returns a reference to a raw_fd_ostream for standard output.
Code Generation Notes: to reduce the size of the ISel table and reduce repetition in the implementation, in a small number of cases this can cause different (semantically equivalent) instructions to be used in place of the requested instruction, even when no optimisation has taken place.
const_iterator end(StringRef path)
Get end iterator over path.
Should compile to something like: a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0)
Vector Multiply-by-10 & Write Unsigned Quadword: vmul10uq (v1i128).
Function (on <16 x i8>):
  store <16 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16>, <16 x i8>* %a, align 16
  store <16 x i8> <i8 113, i8 114, i8 115, i8 116, i8 117, i8 118, i8 119, i8 120, i8 121, i8 122, i8 123, i8 124, i8 125, i8 126, i8 127, i8 112>, <16 x i8>* ..., align 16
@ STXVD2X
CHAIN = STXVD2X CHAIN, VSRC, Ptr - Occurs only for little endian.
Target & getThePPC64LETarget()
@ Ref
The access may reference the value stored in memory.
bool match(Val *V, const Pattern &P)
(vector float) vec_cmpeq(*A, *B)
static GCMetadataPrinterRegistry::Add< OcamlGCMetadataPrinter > Y("ocaml", "ocaml 3.10-compatible collector")
  %t = bitcast float %x to i32
  %s = and i32 %t, 2147483647
  %d = bitcast i32 %s to float
  ret float %d
}
declare float @fabsf(float %n)
define float @bar(float %x) nounwind {
  %d = call float @fabsf(float %x)
  ret float %d
}
This IR (from PR6194):
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin10.0.0"
%0 = type { double, double }
%struct.float3 = type { float, float, float }
define void @test(%0, %struct.float3* nocapture %res) nounwind noinline ssp {
entry:
  %tmp18 = extractvalue %0 %0, 0
  ...
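For reference, a C++ rendering of the two IR functions above (my sketch; @bar's name is from the IR, while 'foo' is an invented label for the first, unnamed function). Both should lower to a single sign-bit clear.

#include <cmath>
#include <cstdint>
#include <cstring>
float foo(float x) {
  uint32_t t;
  std::memcpy(&t, &x, sizeof t);  // bitcast float -> i32
  t &= 0x7fffffffu;               // and i32 %t, 2147483647
  float d;
  std::memcpy(&d, &t, sizeof d);  // bitcast i32 -> float
  return d;
}
float bar(float x) { return std::fabs(x); }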
This requires reassociating to forms of expressions that are already available, something that reassoc doesn't think about yet. These two functions should generate the same code on big-endian systems (int *l ...):
static GCRegistry::Add< OcamlGC > B("ocaml", "ocaml 3.10-compatible GC")
Vector Shift Left/Right don't map to llvm shl and lshr, because they have different semantics, e.g. vslv (int_ppc_altivec_vslv).
Statically lint-checks LLVM IR.
The initial backend is deliberately restricted to z10. We should add support for later architectures at some point. If an asm ties an i32 'r' result to an i64 input, the input will be treated as an i32, leaving the upper bits uninitialised. For example, a function that stores the i32 result (store i32 %val, i32 *%dst; ret void) from the CodeGen/SystemZ asm tests will use LHI rather than LGHI to load the value. This seems to be a general target-independent problem. The tuning of the choice between LOAD, XC and CLC for constant-length block operations: we could extend them to variable-length operations too.
The resulting code requires compares and branches; the revised code would be better with conditional branches instead. Moreover, there is a byte-to-word extend before each compare where there should be only one, and the condition codes are not remembered when the same two values are compared twice. More LSR enhancements are possible: i8 and i32 load/store addressing modes are identical.
static GCMetadataPrinterRegistry::Add< ErlangGCPrinter > X("erlang", "erlang-compatible garbage collector")
We should do a little better with eliminating dead stores: the stores to the stack are dead, since a and b are not needed.
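A hypothetical shape of such a testcase (invented here for illustration): the stack slots are written but never read, so both stores should be deleted.

int f(int a, int b) {
  int stack[2];
  stack[0] = a;   // dead store: stack[] is never read
  stack[1] = b;   // dead store
  return a - b;   // only the incoming values are used
}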
The AMD64 Optimization Manual has some nice information about optimizing integer multiplication by a constant; how much of it applies to Intel's X86 implementation? There are definite trade-offs to consider: a cvttss2siq / jb L3 / subss / cvttss2siq / xorq / ret sequence, instead of cvttss2siq / movaps / subss / cvttss2siq / xorq / ucomiss / cmovb / ret. Seems like the jb branch has a high likelihood of being taken; it would have saved a few instructions. It's not possible to reference the AH, BH, CH and DH registers in an instruction requiring a REX prefix. divb and mulb both produce results in AH. If isel emits a CopyFromReg which gets turned into a movb, and that can be allocated an r8b-r15b register, the result is unencodable. To get around this, isel emits a CopyFromReg from AX and then right-shifts it down by 8 and truncates it. It's not pretty, but it works. We need some register allocation magic to make the hack go away. This would often require a callee-saved register; callees usually need to keep this value live for most of their body, so it doesn't add a significant burden on them. We currently implement this in codegen; however, this is suboptimal because it means that it would be quite awkward to implement the optimization for callers. A better implementation would be to relax the LLVM IR rules for sret arguments to allow a function with an sret argument to have a non-void return type.
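To make the AH constraint concrete, a small example (mine, not from the note): an 8-bit remainder typically comes back in AH, which cannot be named in a REX-prefixed instruction, hence the copy-from-AX-and-shift workaround described above.

unsigned char urem8(unsigned short a, unsigned char b) {
  return a % b;  // compilers commonly emit divb; the remainder lands in AH
}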
entry: mr r6, r2; blr. CodeGen/PowerPC/vec_constants.ll has an 'and' operation that should be codegen'd to 'andc'. The issue is that the 'all ones' build vector is SelectNodeTo'd to a VSPLTISB instruction node before the and/xor is selected, which prevents the vnot pattern from matching. An alternative to the store/store/load approach for illegal insert_element lowering would be:
This is equivalent to the following
static GCRegistry::Add< StatepointGC > D("statepoint-example", "an example strategy for statepoint")
#define LLVM_EXTERNAL_VISIBILITY
Predicate all(Predicate P0, Predicate P1)
True iff P0 and P1 are true.
LLVM currently emits a pair of full-width movq loads/stores through %rax and returns; it could narrow the loads and stores. The trouble is that there is a TokenFactor between the store and the load.
llvm/lib/Support/Unix: the directory structure underneath this directory could look like the sketch below. Only those directories actually needing to be created should be created; further subdirectories could be created to reflect versions of the various standards. For example:
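A purely hypothetical layout in the spirit of the note (every directory name below is illustrative, not from the original text):

llvm/lib/Support/Unix/
  README.txt     - this note
  Base/          - APIs common to all Unix variants
  SUS/v3/        - Single UNIX Specification, version 3
  POSIX/2001/    - POSIX.1-2001 APIs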
@ Define
Register definition.
support::ulittle32_t Word
This could be done in SelectionDAGISel, along with other special cases.
Should compile to something using conditional moves and stores (movcc/movcs, str/strb, mov lr), not the longer sequence currently emitted.
@ LXVD2X
VSRC, CHAIN = LXVD2X_LE CHAIN, Ptr - Occurs only for little endian.
Target & getThePPC64Target()
This compiles to: ldr (LCPI1_0), ldr, ldr, mov, lsr, tst, moveq r1, ldr (LCPI1_1), and r0, bx lr. It would be better to fold the shift into the conditional test: ldr (LCPI1_0), ldr, ldr, tst, movne, lsr, ldr (LCPI1_1), and r0, bx lr; it saves an instruction and a register. It might be profitable to CSE MOVi16 if there are lots of 32-bit immediates with the same bottom half. Robert Muth started working on an alternate jump table implementation that does not put the tables in-line in the text; this is more like the llvm default jump table implementation and might be useful sometime. Several revisions of patches are on the mailing list, beginning ... while CMP sets the flags like a subtract. Therefore, to be able to use CMN for comparisons other than the Z bit:
* Add support for compiling functions in both ARM and Thumb mode, then taking the smallest.
* Add support for compiling individual basic blocks in Thumb mode when in a larger ARM function; this can be used for presumed cold code.
The same transformation can work with an even modulo with the addition of a rotate
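A self-contained sketch of the rotate variant (my constants, worked out for divisibility by 6 with 32-bit unsigned arithmetic; the function name is invented): multiply by the modular inverse of the odd factor, rotate right past the power-of-two factor, then compare.

#include <cstdint>
bool divisibleBy6(uint32_t x) {
  uint32_t q = x * 0xAAAAAAABu;  // x * inverse(3) mod 2^32
  q = (q >> 1) | (q << 31);      // rotr(q, 1) absorbs the factor of 2
  return q <= 0xFFFFFFFFu / 6;   // floor((2^32 - 1) / 6) = 0x2AAAAAAA
}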
vcmpeq* to generate a select mask, lvsl/vperm to rotate the result into the correct slot, vsel the results together. Should codegen branches on vec_any/vec_all to avoid mfcr. Two examples:
In x86-64 we generate this spiffy code: xmm0 ... ret. In x86-32 we generate this code, which could be better:
RegisterTarget - Helper template for registering a target, for use in the target's initialization function.
LLVM_EXTERNAL_VISIBILITY void LLVMInitializePowerPCTargetInfo()
cond_true: lis, lo16(), lo16(), lo16(), f1, fsel, f3, blr
Target & getThePPC32LETarget()
This might compile to: xorps xmm1; movss xmm0; ret. Now consider if the code caused xmm1 to get spilled; this might produce: movaps (reload); movaps xmm1; movss xmm0; ret. Since the reload is only used by these instructions, we could fold it into the producer, producing something like: movaps xmm0 (folded reload); ret, saving two instructions. The basic idea is that a reload from a spill slot can be folded when only one chunk of it is used.
* Reimplement 'select' in terms of 'SEL'.
* We would really like to support this, but we need to prove that the add doesn't need to overflow between the two 16-bit chunks.
* Implement pre/post increment support (e.g. PR935).
* Implement smarter constant generation for binops with large immediates.
A few ARMv6T2 ops should be pattern matched.
MIPS Relocation Principles in LLVM
S is passed via registers r2 and r3, but gcc stores them to the stack.
amdgpu printf runtime: AMDGPU printf lowering
Type
MessagePack types as defined in the standard, with the exception of Integer being divided into a signed and an unsigned variant.
fma RegConstraint<"$vTi = $vT">
This lets us change the cmpl into a testl, which is shorter, and eliminate the shift. We compile this function (i32 %a, i32 %b, i8 zeroext %d) nounwind:
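An illustrative shape (my example, not the note's testcase): when only the flags are needed, the shift can fold into the mask of a single testl.

bool bit23(unsigned x) {
  return (x >> 23) & 1;  // can be: testl $0x800000, %edi; setne %al
}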
BlockVerifier::State From
The custom lowered code happens to be right, but we shouldn't have to custom lower anything. This is probably related to <2 x i64> ops being so bad. LLVM currently generates stack realignment code when it is not necessary. The problem is that we need to know about stack alignment too early, before RA runs; at that point we don't know whether there will be vector spills.
arc branch: ARC finalize branches
print: Instructions which execute on loop entry
Add support for conditional and other related patterns. Instead of:
Error build(ArrayRef< Module * > Mods, SmallVector< char, 0 > &Symtab, StringTableBuilder &StrtabBuilder, BumpPtrAllocator &Alloc)
Fills in Symtab and StrtabBuilder with a valid symbol and string table for Mods.
Target & getThePPC32Target()