LLVM
17.0.0git
#include <xmmintrin.h>
#include <math.h>
#include <emmintrin.h>
Typedefs
using t

  %t = bitcast float %x to i32
  %s = and i32 %t, 2147483647
  %d = bitcast i32 %s to float
  ret float %d
}

declare float @fabsf(float %n)

define float @bar(float %x) nounwind {
  %d = call float @fabsf(float %x)
  ret float %d
}

This IR (from PR6194):

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin10.0.0"

%0 = type { double, double }
%struct.float3 = type { float, float, float }

define void @test(%0, %struct.float3* nocapture %res) nounwind noinline ssp {
entry:
  %tmp18 = extractvalue %0 %0, 0
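For reference only, a C-level sketch of the same fabs-by-masking computation as the IR above; the function name and the memcpy-based bitcasts are choices made here, not something taken from the file:

#include <stdint.h>
#include <string.h>

/* fabsf computed by clearing the sign bit with an integer AND,
   mirroring the bitcast/and/bitcast sequence in the IR above. */
static float fabs_by_mask(float x) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);   /* bitcast float -> i32 */
  bits &= 0x7fffffffu;              /* and i32 %t, 2147483647 */
  memcpy(&x, &bits, sizeof bits);   /* bitcast i32 -> float */
  return x;
}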
Functions
SSE variable shift can be custom lowered to something like this, which uses a small table + unaligned load + shuffle instead of going through memory:

__m128i shift_right (__m128i value, unsigned long offset)
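A minimal sketch of the table + unaligned load + shuffle idea with intrinsics, assuming a byte-granularity shift (offset 0..16) and SSSE3's pshufb; the local table below stands in for the __m128i_shift_right data listed under Variables, and none of this is copied from the file:

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* 32 control bytes: indices 0..15, then bytes with the high bit set so
   pshufb writes zeros for lanes shifted in from beyond the value. */
static const unsigned char shift_tbl[32] = {
    0,    1,    2,    3,    4,    5,    6,    7,
    8,    9,   10,   11,   12,   13,   14,   15,
  0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
  0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
};

static __m128i shift_right_bytes(__m128i value, unsigned long offset) {
  /* Unaligned load of 16 control bytes starting at 'offset'. */
  __m128i ctl = _mm_loadu_si128((const __m128i *)(shift_tbl + offset));
  return _mm_shuffle_epi8(value, ctl);
}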
float f32 (v4f32 A)
This compiles into addss / addss / movaps / unpcklps / ret, which seems silly when it could just be one addps.

Expand libm rounding functions inline.

main() should enable SSE DAZ mode and other fast SSE modes.

Think about doing i64 math in SSE regs on x86.

This testcase should have no SSE instructions in it, and only one load from a constant pool (ret double %C). Currently the select is being lowered, which prevents the dag combiner from turning 'load (select CPI1, CPI2)' into 'select (load CPI1), (load CPI2)'. The pattern isel got this one right.

Lower memcpy/memset to a series of SSE 128-bit move instructions when it's feasible.
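On the DAZ point, a sketch of what enabling the fast SSE modes early in main() might look like; the function name is made up, and the masks are the standard MXCSR FTZ (bit 15) and DAZ (bit 6) bits:

#include <xmmintrin.h>

/* Set flush-to-zero and denormals-are-zero in MXCSR.  DAZ needs hardware
   support (a CPUID-reported feature), so a real implementation would
   check for it first. */
static void enable_fast_sse_modes(void) {
  _mm_setcsr(_mm_getcsr() | 0x8000 | 0x0040);   /* FTZ | DAZ */
}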
return _mm_set_ps(0.0, 0.0, 0.0, b);
void x (unsigned short n)
void y (unsigned n)
compile to (-O3 -static -fomit-frame-pointer)
This currently compiles to code that bounces the value through the stack and %eax: movsd (%esp) / movl (%esp) / shrl / addl / ret. We should use movmskp{s|d} instead, not the dag combiner: dagcombine2 needs to be able to see through the X86ISD::Wrapper node, which DAGCombine can't really do.

The code for turning 4 x load into a single vector load is target independent and should be moved to the dag combiner. The code for turning 4 x load into a vector load can only handle a direct load from a global or a direct load from the stack; it should be generalized to handle any load from P, where P can be anything.

The alignment inference code cannot handle loads from globals in non-static mode because it doesn't look through the extra dyld stub load. If you try vec_align.ll without relocation model=static, you'll see what I mean.

We should lower store (fneg(load p), q) into an integer load+xor+store, which eliminates a constant pool load (the example involved a float %z argument marked nounwind readonly).
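A sketch in C terms of the store (fneg(load p), q) -> integer load+xor+store rewrite; the function names are made up and this shows only the shape of the transform, not the file's testcase:

#include <stdint.h>
#include <string.h>

/* Source form: negate a float loaded from p and store it to q. */
static void store_fneg_float(const float *p, float *q) {
  *q = -*p;
}

/* The suggested lowering: flip the sign bit with integer ops, so no
   floating-point constant (the sign-bit mask) has to be loaded. */
static void store_fneg_int(const float *p, float *q) {
  uint32_t bits;
  memcpy(&bits, p, sizeof bits);   /* integer load  */
  bits ^= 0x80000000u;             /* xor sign bit  */
  memcpy(q, &bits, sizeof bits);   /* integer store */
}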
vSInt16 madd (vSInt16 b)
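Presumably madd maps onto PMADDWD; a minimal illustration with plain __m128i operands, since the file's vSInt16 typedef and the body of madd are not shown here:

#include <emmintrin.h>

/* PMADDWD multiplies corresponding signed 16-bit lanes and adds adjacent
   products into four signed 32-bit lanes. */
static __m128i madd_i16(__m128i a, __m128i b) {
  return _mm_madd_epi16(a, b);
}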
Generated code (x86-32, linux)
__m128 foo1 (float x1, float x4)
compiles to (x86-32)
In fpstack mode this compiles to: movl / fildl (%esp) / fmuls LCPI1_0 / addl $4, %esp / ret. With SSE it compiles into significantly slower code: cvtsi2sd (%esp) / mulsd / movsd / fldl (%esp) / addl $12, %esp / ret (the result is stored back to the stack and reloaded into an x87 register to be returned).
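The kind of source in question is an int converted to floating point and scaled by a constant (x87: fild + fmul; SSE2: cvtsi2sd + mulsd); a trivial stand-in, with an arbitrary constant:

static double scale_int(int x) {
  return (double)x * 2.5;   /* 2.5 stands in for the constant at LCPI1_0 */
}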
void foo (double, double, double)
void norm (double x, double y, double z)
Variables
__m128i_shift_right - the small table of .byte data that shift_right() above indexes with an unaligned load.
SSE has instructions for doing operations on complex numbers; we should pattern match them. For example, this should turn into a horizontal add. Instead we get: addss xmm1 / pshufd xmm2 / movhlps xmm0 / movaps xmm3 / addss xmm1 / movdqa xmm2 / addss xmm3 / ret.

Also, there are cases where some simple SLP would improve codegen a bit, e.g. compiling code that operates on a _Complex float B.
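Two sketches of the shapes being discussed, with made-up names: a pair sum that maps onto a single haddps (SSE3), and a _Complex float addition, which is two independent float adds that could be one addps:

#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

/* z = { re, im, _, _ }; lane 0 of the result is re + im. */
static float sum_re_im(__m128 z) {
  return _mm_cvtss_f32(_mm_hadd_ps(z, z));
}

/* Adding two complex values: two float adds on adjacent lanes. */
static _Complex float add_complex(_Complex float a, _Complex float b) {
  return a + b;
}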
This compiles into: mulss xmm1 / xorps xmm0 / movss xmm0 / ret. Because mulss doesn't modify the top elements, the top elements of xmm1 are already zero'd; we could compile this to a shorter sequence.
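The source pattern under discussion builds a vector whose upper lanes are zero from one scalar; a sketch (not the file's testcase) of that pattern and of the intrinsic that says the same thing directly:

#include <xmmintrin.h>

static __m128 from_scalar_setps(float b) {
  return _mm_set_ps(0.0f, 0.0f, 0.0f, b);   /* upper three lanes zero */
}

static __m128 from_scalar_setss(float b) {
  return _mm_set_ss(b);                     /* zeroed upper lanes by definition */
}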
This might compile to this code: xorps xmm1 / movss xmm0 / ret. Now consider if the code caused xmm1 to get spilled. This might produce this: movaps c2(%esp) / xorps %xmm0 / movaps xmm0 / movaps xmm1 / movss xmm0 / ret. However, since the reload is only used by these instructions, we could fold it into the uses, producing something like movaps xmm0 / ret and saving two instructions. The basic idea is that a reload from a spill slot, when only one 4-byte chunk of it is used, can bring in zeros for the other elements instead of reloading all of them. This can be used to simplify a variety of shuffle operations where some of the elements are fixed zeros.

This code generates ugly code for a <4 x float>* P2, probably due to costs being off or something: movaps (%eax) / pxor xmm1 / movaps xmm2 / shufps xmm2 / movaps / ret. Would it be better to generate: xor eax / pinsrw xmm0 / movaps / ret?

Some useful information is in the Apple AltiVec/SSE Migration Guide. Various SSE compare translations (andnot, or). Add hooks to commute some CMPP operations. Apply the same transformation that merged four float loads into a single 128-bit load to loads from the constant pool. Floating point max/min are commutable when enable-unsafe-fp-path is specified; we should turn int_x86_sse_max_ss and X86ISD::FMIN etc. into other nodes which are selected to max/min instructions that are marked commutable. We should materialize vector constants like all-ones and signbit with code rather than with constant pool loads.
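A sketch of materializing the two constants named in the note without touching the constant pool; the function names are made up, but the instruction choice (pcmpeqd for all-ones, then a per-lane shift for the sign bit) is the standard trick:

#include <emmintrin.h>

static __m128i all_ones(void) {
  __m128i z = _mm_setzero_si128();
  return _mm_cmpeq_epi32(z, z);             /* 0xFFFFFFFF in every lane */
}

static __m128 signbit_mask(void) {
  /* Shift each all-ones lane left by 31: 0x80000000 per lane. */
  return _mm_castsi128_ps(_mm_slli_epi32(all_ones(), 31));
}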
The custom lowered code happens to be right, but we shouldn't have to custom lower anything. This is probably related to <2 x i64> ops being so bad.

LLVM currently generates stack realignment when it is not necessarily needed. The problem is that we need to know about stack alignment too early, before RA runs; at that point we don't know whether there will be vector spills or not. Stack realignment logic is overly conservative here, but otherwise we can produce unaligned loads/stores. Fixing this will require some huge RA changes.
static const vSInt16 a
In one x86 mode we generate this spiffy code: (two instructions on xmm0, then ret). In the other we generate this, which could be better: movss xmm1 / xmm0 / ret. In SSE4 we could use insertps to make both better. Here's another testcase that could use insertps [mem]; it involves the global x3.
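For reference, what the SSE4.1 insertps mentioned above looks like as an intrinsic; the immediate selects source and destination lanes (and an optional zero mask), and this particular use is only an illustration, not the file's testcase:

#include <smmintrin.h>   /* SSE4.1: _mm_insert_ps */

/* Copy lane 0 of x into lane 0 of v, leaving the other lanes of v alone. */
static __m128 set_lane0(__m128 v, __m128 x) {
  return _mm_insert_ps(v, x, 0x00);
}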
gcc mainline compiles it to: x2(%rip) / x3(%rip) / movaps xmm2 / movlhps xmm2 / movaps xmm0 / ret.

We compile vector multiply by constant (<i32 10, i32 10, i32 10, i32 10>, returning <4 x i32> %A) into poor code. On targets without SSE4, this compiles to a scalarized sequence (globl _f: xmm1 / movd eax / imull LCPI1_0).
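One way the multiply-by-10 case can avoid both the scalarized imull sequence and a pmulld (which needs SSE4.1) is shifts and adds, since x*10 = x*8 + x*2; a sketch, not what the compiler emits:

#include <emmintrin.h>

static __m128i mul_by_10(__m128i x) {
  __m128i x8 = _mm_slli_epi32(x, 3);
  __m128i x2 = _mm_slli_epi32(x, 1);
  return _mm_add_epi32(x8, x2);
}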
This compiles into: movzbl (%esp) / xorps xmm0 / movl eax / ret, while gcc produces a movdqa of LC0(%rip) plus a .const/.align data block; it should be a single movdqa and a pinsrb of %edi into xmm0.

We should transform a shuffle of two vectors of constants into a single vector of constants. Insertelement of a constant into a vector of constants should also result in a vector of constants, e.g. VecISelBug.ll. We compiled it to something horrible: globl _t, _t: xmm0 / movhps xmm0 / movss LCPI1_1 / xmm1 / movaps xmm2 / movaps (rdar).
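The constant-folding point above in miniature: a shuffle whose two inputs are both constant vectors should become a single constant vector at compile time rather than a runtime shuffle. The names here are made up:

#include <emmintrin.h>

static __m128i shuffled_constants(void) {
  __m128i a = _mm_set_epi32(3, 2, 1, 0);    /* lanes 0..3 = 0,1,2,3 */
  __m128i b = _mm_set_epi32(7, 6, 5, 4);    /* lanes 0..3 = 4,5,6,7 */
  /* Interleave the low halves; should fold to the constant {0,4,1,5}. */
  return _mm_unpacklo_epi32(a, b);
}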
Continuing the @test function from the PR6194 IR above:

  %tmp19 = bitcast double %tmp18 to i64
  %tmp10 = lshr i128 %tmp20, 32
  %tmp11 = trunc i128 %tmp10 to i32
  %tmp12 = bitcast i32 %tmp11 to float
  %tmp5 = getelementptr inbounds %struct.float3* %res
  store float %tmp12, float* %tmp5
  ret void

This compiles to: shrq %rax / movl / ret, going through %rax and %rdi. It would be better kept in the SSE unit by treating XMM0 as a vector and doing a shuffle from v[1] to v[0], then a float store.
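A sketch of the suggested lowering: keep the value in the XMM unit, move lane 1 down to lane 0 with a shuffle, and do a scalar float store, instead of routing the bits through a general-purpose register with shrq. Names are made up:

#include <xmmintrin.h>

static void store_lane1(__m128 v, float *out) {
  __m128 hi = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));  /* lane1 -> lane0 */
  _mm_store_ss(out, hi);
}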
[UNSAFE FP] We currently generate sqrtsd and divsd instructions. This is bad: fp div is slow and not pipelined. In -ffast-math mode we could compute the scale first and emit mulsd in place of the divs; this can be done as a target-independent transform. If we're dealing with floats instead of doubles, we could even replace the sqrtss and inversion with an rsqrtss instruction.
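For the float case, a sketch of the rsqrtss idea with one Newton-Raphson refinement step, which is the usual way to recover accuracy from the roughly 12-bit estimate; this is an illustration, not the transform the note asks for verbatim:

#include <xmmintrin.h>

static float rsqrt_fast(float x) {
  __m128 v = _mm_set_ss(x);
  __m128 r = _mm_rsqrt_ss(v);                /* ~12-bit 1/sqrt estimate */
  /* One Newton-Raphson step: r = r * (1.5 - 0.5 * x * r * r). */
  __m128 half_xrr = _mm_mul_ss(_mm_set_ss(0.5f),
                               _mm_mul_ss(v, _mm_mul_ss(r, r)));
  r = _mm_mul_ss(r, _mm_sub_ss(_mm_set_ss(1.5f), half_xrr));
  return _mm_cvtss_f32(r);
}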
Current: eax / eax / eax / ret. Ideal: eax / eax / ret. Re-implement atomic builtins: x86 does not have to use add to implement these; it can use add.
llvm::PassNameParser::ignorablePass(), llvm::FunctionVarLocs::init(), INITIALIZE_PASS(), llvm::PMDataManager::initializeAnalysisImpl(), llvm::yaml::CustomMappingTraits< std::map< std::vector< uint64_t >, WholeProgramDevirtResolution::ByArg > >::inputOne(), llvm::codeview::DebugStringTableSubsection::insert(), llvm::rdf::RegisterAggr::insert(), llvm::json::Array::insert(), llvm::IntervalMap< KeyT, ValT, N, Traits >::iterator::insert(), llvm::InsertPreheaderForLoop(), insertSpills(), insertUniqueBackedgeBlock(), instCombineSVEVectorFuseMulAddSub(), llvm::HexagonShuffler::insts(), llvm::xray::Profile::internPath(), isAlternateInstruction(), isAtLineEnd(), llvm::ScalarEvolution::isBasicBlockEntryGuardedByCond(), llvm::isBitcodeWriterPass(), isBlockInLCSSAForm(), llvm::isEqual(), llvm::CmpInst::isEquality(), llvm::ICmpInst::isEquality(), llvm::RecurrenceDescriptor::isFixedOrderRecurrence(), llvm::CmpInst::isFPPredicate(), llvm::ICmpInst::isGE(), llvm::ICmpInst::isGT(), isIntegerWideningViable(), llvm::CmpInst::isIntPredicate(), llvm::isIRPrintingPass(), llvm::LiveRangeCalc::isJointlyDominated(), llvm::ICmpInst::isLE(), isLoopDead(), llvm::ICmpInst::isLT(), isObjectSizeLessThanOrEq(), llvm::isOfRegClass(), isRegUsedByPhiNodes(), llvm::CmpInst::isRelational(), llvm::ICmpInst::isRelational(), IsStoredObjCPointer(), isVectorPromotionViable(), isVectorPromotionViableForSlice(), LinearizeExprTree(), LLVMCreateFunctionPassManager(), LLVMCreateGenericValueOfPointer(), LLVMGetAlignment(), LLVMGetCmpXchgFailureOrdering(), LLVMGetCmpXchgSuccessOrdering(), LLVMGetMaskValue(), LLVMGetNumMaskElements(), LLVMGetOrdering(), LLVMGetVolatile(), LLVMIsAtomicSingleThread(), LLVMSetAlignment(), LLVMSetAtomicSingleThread(), LLVMSetCmpXchgFailureOrdering(), LLVMSetCmpXchgSuccessOrdering(), LLVMSetOrdering(), LLVMSetVolatile(), llvm::PassPlugin::Load(), llvm::pdb::HashTable< llvm::support::detail::packed_endian_specific_integral >::load(), llvm::xray::loadProfile(), llvm::HexagonTargetLowering::LowerBUILD_VECTOR(), llvm::HexagonTargetLowering::LowerCONCAT_VECTORS(), llvm::NVPTXTargetLowering::LowerFormalArguments(), llvm::HexagonTargetLowering::LowerUnalignedLoad(), llvm::HexagonTargetLowering::LowerVECTOR_SHUFFLE(), llvm::MIPatternMatch::m_c_GFCmp(), llvm::MIPatternMatch::m_c_GICmp(), llvm::MIPatternMatch::m_GFCmp(), llvm::MIPatternMatch::m_GICmp(), llvm::MIPatternMatch::m_Pred(), llvm::PatternMatch::m_SpecificInt_ICMP(), makeImportedSymbolIterator(), makeReducible(), llvm::rdf::RegisterAggr::makeRegRef(), mapFCmpPred(), llvm::yaml::MappingTraits< ArchYAML::Archive::Child >::mapping(), llvm::yaml::MappingTraits< DXContainerYAML::Part >::mapping(), llvm::yaml::MappingTraits< InstrProfCorrelator::Probe >::mapping(), mapToSinitPriority(), llvm::rdf::DataFlowGraph::markBlock(), llvm::PatternMatch::match(), llvm::MIPatternMatch::And< Pred, Preds... >::match(), llvm::MIPatternMatch::Or< Pred, Preds... 
>::match(), llvm::MCInstPrinter::matchAliasPatterns(), matchDoublePermute(), matchIsNotNaN(), matchPermute(), llvm::matchSimpleRecurrence(), matchUnorderedInfCompare(), llvm::JumpThreadingPass::maybethreadThroughTwoBasicBlocks(), llvm::rdf::CodeNode::members_if(), mergeConditionalStores(), llvm::xray::mergeProfilesByStack(), llvm::xray::mergeProfilesByThread(), llvm::MIPatternMatch::mi_match(), llvm::MIRAddFSDiscriminators::MIRAddFSDiscriminators(), llvm::MIRProfileLoaderPass::MIRProfileLoaderPass(), moveLCSSAPhis(), llvm::object::DiceRef::moveNext(), llvm::object::MachOBindEntry::moveNext(), llvm::PeelingModuloScheduleExpander::moveStageBetweenBlocks(), llvm::orc::OrcV2CAPIHelper::moveToSymbolStringPtr(), multipleIterations(), llvm::ShuffleBlockStrategy::mutate(), needToReserveScavengingSpillSlots(), node_eq(), llvm::none_of(), llvm::orc::ObjectLinkingLayerJITLinkContext::notifyMaterializing(), llvm::orc::shared::numDeallocActions(), llvm::json::Object::Object(), llvm::TargetInstrInfo::RegSubRegPair::operator!=(), llvm::orc::ExecutorAddr::Tag::operator()(), llvm::orc::ExecutorAddr::Untag::operator()(), llvm::pair_hash< First, Second >::operator()(), llvm::object::symbol_iterator::operator*(), AllocaSlices::partition_iterator::operator*(), llvm::bfi_detail::BlockMass::operator*=(), llvm::MachineRegisterInfo::defusechain_iterator< ReturnUses, ReturnDefs, SkipDebug, ByOperand, ByInstr, ByBundle >::operator++(), llvm::MachineRegisterInfo::defusechain_instr_iterator< ReturnUses, ReturnDefs, SkipDebug, ByOperand, ByInstr, ByBundle >::operator++(), llvm::object::symbol_iterator::operator->(), llvm::rdf::operator<<(), llvm::DiagnosticPrinterRawOStream::operator<<(), llvm::operator<<(), llvm::raw_ostream::operator<<(), llvm::PluginLoader::operator=(), llvm::xray::Profile::operator=(), llvm::TargetInstrInfo::RegSubRegPair::operator==(), AllocaSlices::partition_iterator::operator==(), llvm::xray::Graph< VertexAttribute, EdgeAttribute, VI >::operator[](), or32le(), llvm::or32le(), llvm::SMSchedule::orderDependence(), llvm::yaml::CustomMappingTraits< std::map< std::vector< uint64_t >, WholeProgramDevirtResolution::ByArg > >::output(), llvm::yaml::CustomMappingTraits< std::map< uint64_t, WholeProgramDevirtResolution > >::output(), llvm::yaml::CustomMappingTraits< GlobalValueSummaryMapTy >::output(), ParameterPack::ParameterPack(), llvm::json::parse(), llvm::X86::parseArchX86(), llvm::parseCachePruningPolicy(), llvm::AMDGPUMangledLibFunc::parseFuncName(), parseNamePrefix(), parsePredicateConstraint(), parseSegmentOrSectionName(), AbstractManglingParser< ManglingParser< Alloc >, Alloc >::parseTemplateParamDecl(), AbstractManglingParser< ManglingParser< Alloc >, Alloc >::parseType(), AbstractManglingParser< ManglingParser< Alloc >, Alloc >::parseUnnamedTypeName(), llvm::partition(), llvm::partition_point(), llvm::PassNameParser::passEnumerate(), llvm::PassNameParser::passRegistered(), llvm::ProfOStream::patch(), llvm::HexagonTargetLowering::PerformDAGCombine(), llvm::rdf::PhysicalRegisterInfo::PhysicalRegisterInfo(), llvm::ScheduleDAGMI::placeDebugValues(), llvm::AAResults::pointsToConstantMemory(), llvm::rdf::DataFlowGraph::DefStack::pop(), llvm::PMDataManager::preserveHigherLevelAnalysis(), llvm::ConvergingVLIWScheduler::pressureChange(), llvm::cl::OptionDiffPrinter< ParserDT, ValDT >::print(), llvm::cl::OptionDiffPrinter< DT, DT >::print(), llvm::BitTracker::print_cells(), printAsmMRegister(), Node::printAsOperand(), llvm::SIScheduleBlock::printDebug(), PrintLoadStoreResults(), PrintLoopInfo(), 
PrintModRefResults(), llvm::cl::printOptionDiff(), llvm::PassManager< Loop, LoopAnalysisManager, LoopStandardAnalysisResults &, LPMUpdater & >::printPipeline(), llvm::PassManager< LazyCallGraph::SCC, CGSCCAnalysisManager, LazyCallGraph &, CGSCCUpdateResult & >::printPipeline(), PrintResults(), processPHI(), processRemarkVersion(), processStrTab(), llvm::FoldingSetTrait< std::pair< T1, T2 > >::Profile(), llvm::xray::profileFromTrace(), profitImm(), llvm::ModuleSummaryIndex::propagateAttributes(), llvm::PTOGV(), llvm::support::endian::read(), llvm::support::endian::read16(), llvm::support::endian::read16be(), llvm::support::endian::read16le(), llvm::support::endian::read32(), llvm::support::endian::read32be(), llvm::support::endian::read32le(), llvm::support::endian::read64(), llvm::support::endian::read64be(), llvm::support::endian::read64le(), llvm::readPGOFuncNameStrings(), llvm::WebAssemblyExceptionInfo::recalculate(), llvm::RegPressureTracker::recedeSkipDebugValues(), llvm::PMDataManager::recordAvailableAnalysis(), TransferTracker::redefVar(), llvm::PrintIRInstrumentation::registerCallbacks(), llvm::PseudoProbeVerifier::registerCallbacks(), llvm::OptNoneInstrumentation::registerCallbacks(), llvm::TimePassesHandler::registerCallbacks(), llvm::PreservedCFGCheckerInstrumentation::registerCallbacks(), llvm::DebugifyEachInstrumentation::registerCallbacks(), llvm::VerifyInstrumentation::registerCallbacks(), llvm::TimeProfilingPassesHandler::registerCallbacks(), llvm::registerCodeGenCallback(), llvm::RuntimeDyldMachOCRTPBase< RuntimeDyldMachOX86_64 >::registerEHFrames(), llvm::orc::registerFrameWrapper(), registerPartialPipelineCallback(), llvm::ChangeReporter< IRDataT< EmptyData > >::registerRequiredCallbacks(), llvm::rdf::DataFlowGraph::releaseBlock(), llvm::orc::OrcV2CAPIHelper::releasePoolEntry(), llvm::detail::IEEEFloat::remainder(), llvm::MCContext::RemapDebugPaths(), rematerializeLiveValuesAtUses(), llvm::SetVector< llvm::MCSection *, SmallVector< llvm::MCSection *, N >, SmallDenseSet< llvm::MCSection *, N > >::remove_if(), llvm::NodeSet::remove_if(), llvm::remove_if(), llvm::PMDataManager::removeDeadPasses(), llvm::PMDataManager::removeNotPreservedAnalysis(), llvm::ScalarEvolution::removePointerBase(), llvm::SUnit::removePred(), llvm::replace_copy_if(), replaceAllPrepares(), replaceConstantExprOp(), llvm::HexagonTargetLowering::ReplaceNodeResults(), llvm::json::Path::report(), llvm::reportMismatch(), llvm::MachineFunctionProperties::reset(), llvm::orc::OrcV2CAPIHelper::retainPoolEntry(), rewriteNonInstructionUses(), llvm::DevirtSCCRepeatedPass::run(), llvm::orc::LocalCXXRuntimeOverridesBase::runDestructors(), llvm::LPPassManager::runOnFunction(), llvm::RGPassManager::runOnFunction(), llvm::MachineFunction::salvageCopySSAImpl(), llvm::StringSaver::save(), llvm::RegScavenger::scavengeRegisterBackwards(), llvm::SCEVUnionPredicate::SCEVUnionPredicate(), llvm::PMTopLevelManager::schedulePass(), separateNestedLoop(), llvm::orc::shared::SPSSerializationTraits< SPSTuple< SPSTagT1, SPSTagT2 >, std::pair< T1, T2 > >::serialize(), llvm::FunctionLoweringInfo::set(), llvm::MachineFunctionProperties::set(), llvm::MipsABIFlagsSection::setAllFromPredicates(), llvm::MipsABIFlagsSection::setASESetFromPredicates(), llvm::MipsABIFlagsSection::setCPR1SizeFromPredicates(), llvm::vfs::InMemoryFileSystem::setCurrentWorkingDirectory(), llvm::sampleprof::SampleProfileReader::setDiscriminatorMaskedBitFrom(), llvm::MCAssembler::setDWARFLinetableParams(), llvm::MipsABIFlagsSection::setFpAbiFromPredicates(), 
llvm::MIRProfileLoader::setFSPass(), llvm::MipsABIFlagsSection::setGPRSizeFromPredicates(), llvm::MipsABIFlagsSection::setISAExtensionFromPredicates(), llvm::MipsABIFlagsSection::setISALevelAndRevisionFromPredicates(), llvm::PMTopLevelManager::setLastUser(), llvm::VPBlockBase::setParent(), llvm::orc::ExecutionSession::setPlatform(), llvm::CmpInst::setPredicate(), llvm::ScopedPrinter::setPrefix(), llvm::LineEditor::setPrompt(), llvm::msf::MSFBuilder::setStreamSize(), llvm::MCAsmParser::setTargetParser(), llvm::MIRParserImpl::setupRegisterInfo(), llvm::TrackingVH< Value >::setValPtr(), llvm::CallbackVH::setValPtr(), shouldSplitOnPredicatedArgument(), simplifyCommonValuePhi(), simplifyGEPInst(), simplifyICmpWithMinMax(), simplifyOneLoop(), llvm::JumpThreadingPass::simplifyPartiallyRedundantLoad(), llvm::orc::shared::SPSSerializationTraits< SPSTuple< SPSTagT1, SPSTagT2 >, std::pair< T1, T2 > >::size(), skipIfAtLineEnd(), llvm::MachineBasicBlock::SplitCriticalEdge(), llvm::SplitKnownCriticalEdge(), llvm::stable_hash_combine_array(), llvm::BitTracker::subst(), swapAntiDependences(), llvm::SwingSchedulerDAG::SwingSchedulerDAG(), targets(), test(), llvm::OpenMPIRBuilder::tileLoops(), llvm::TimerGroup::TimerGroup(), llvm::to_address(), llvm::SymbolTableListTraits< ValueSubClass >::toPtr(), llvm::orc::LocalCXXRuntimeOverridesBase::toTargetAddress(), llvm::ConvergingVLIWScheduler::traceCandidate(), llvm::GenericSchedulerBase::traceCandidate(), llvm::TrackingVH< Value >::TrackingVH(), tryAdjustICmpImmAndPred(), llvm::unique(), unwrap(), llvm::unwrap(), llvm::MipsTargetStreamer::updateABIInfo(), llvm::VFShape::updateParam(), llvm::ScheduleDAGMILive::updatePressureDiffs(), llvm::updateVCallVisibilityInIndex(), llvm::yaml::MappingTraits< ArchYAML::Archive::Child >::validate(), valueDominatesPHI(), llvm::GraphTraits< ValueInfo >::valueInfoFromEdge(), llvm::PMDataManager::verifyPreservedAnalysis(), llvm::InstCombinerImpl::visitFNeg(), llvm::InstCombinerImpl::visitIntToPtr(), llvm::InstCombinerImpl::visitPtrToInt(), wrap(), llvm::wrap(), write(), llvm::StringTableBuilder::write(), llvm::support::endian::write(), llvm::support::endian::write16(), llvm::support::endian::write16be(), llvm::support::endian::write16le(), llvm::support::endian::write32(), llvm::support::endian::write32be(), llvm::support::endian::write32le(), llvm::support::endian::write64(), llvm::support::endian::write64be(), llvm::support::endian::write64le(), llvm::writeIndex(), writeTypeIdCompatibleVtableSummaryRecord(), llvm::xxHash64(), llvm::yaml::yaml2archive(), llvm::objcarc::BundledRetainClaimRVs::~BundledRetainClaimRVs(), llvm::orc::InProcessMemoryMapper::~InProcessMemoryMapper(), llvm::PMDataManager::~PMDataManager(), and llvm::PMTopLevelManager::~PMTopLevelManager().
This might compile to code that zeroes %xmm0 with xorps and then copies the scalar in with movss. Now consider if the code caused %xmm1 to get spilled: the reload would be followed by the same movss. Since the reload is only used by these instructions, we could fold it into the uses, producing something like a single movss from the spill slot and saving two instructions. The basic idea is that a reload from a spill slot can, if only one 4-byte chunk is used, bring in zeros for the other elements instead of reloading all of them. This can be used to simplify a variety of shuffle operations where the elements are fixed zeros. A related testcase that stores through a <4 x float>* parameter (%P2) generates ugly code, probably due to costs being off.
Definition at line 278 of file README-SSE.txt.
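A rough illustration of the folding idea (the function names and the use of intrinsics here are invented for this sketch, not taken from README-SSE.txt): when the upper lanes are known zero, the two routines below are equivalent, but the second reloads only the 4-byte chunk that is actually used.

    #include <xmmintrin.h>

    /* Full 16-byte reload followed by a shuffle that keeps only lane 0. */
    __m128 reload_full_then_mask(const float *spill_slot) {
      __m128 v = _mm_loadu_ps(spill_slot);
      return _mm_move_ss(_mm_setzero_ps(), v);   /* { v[0], 0, 0, 0 } */
    }

    /* Single 4-byte reload; a movss-style load zeroes the upper lanes. */
    __m128 reload_low_lane_only(const float *spill_slot) {
      return _mm_load_ss(spill_slot);            /* { v[0], 0, 0, 0 } */
    }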
This compiles into a sequence that builds the vector with scalar inserts going through %eax (and, in gcc, a round trip through memory with movdqa plus an aligned constant-pool load); with SSE4 it should simply be a movdqa of the constant followed by a single pinsrb.
Definition at line 650 of file README-SSE.txt.
Definition at line 304 of file README-SSE.txt.
Compiling this into a chain of scalar addss instructions followed by unpcklps to repack the result seems silly when it could just be one addps. Related notes: expand the libm rounding functions inline; main() should enable SSE DAZ mode and other fast SSE modes; think about doing i64 math in SSE registers on x86. The following testcase should have no SSE instructions in it, and only one load from a constant pool.
Definition at line 85 of file README-SSE.txt.
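A hedged sketch of the addss-versus-addps point (helper names and the intrinsics-based framing are illustrative, not from the README): the scalarized form does four element-by-element adds and then repacks, while the vector form is a single addps.

    #include <xmmintrin.h>

    /* Roughly what the poor codegen does: element-by-element adds, then repack. */
    __m128 add4_scalar_style(__m128 a, __m128 b) {
      float av[4], bv[4], r[4];
      _mm_storeu_ps(av, a);
      _mm_storeu_ps(bv, b);
      for (int i = 0; i < 4; ++i)
        r[i] = av[i] + bv[i];
      return _mm_loadu_ps(r);
    }

    /* What it could be: one addps. */
    __m128 add4_vector_style(__m128 a, __m128 b) {
      return _mm_add_ps(a, b);
    }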
gcc mainline compiles it with a movlhps-based sequence. We compile a vector multiply by a constant, such as mul <4 x i32> %A, <i32 10, i32 10, i32 10, i32 10>, into poor code: on targets without SSE4.1 each element is extracted with movd, multiplied with imull, moved back, and recombined with pshufd.
Definition at line 35 of file README-SSE.txt.
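A minimal sketch of a shift-and-add alternative, assuming SSE2 only (the helper name is made up): every i32 lane can be multiplied by 10 as (x << 3) + (x << 1) without scalarizing through movd/imull.

    #include <emmintrin.h>

    static __m128i mul_by_10(__m128i x) {
      __m128i x8 = _mm_slli_epi32(x, 3);   /* x * 8 */
      __m128i x2 = _mm_slli_epi32(x, 1);   /* x * 2 */
      return _mm_add_epi32(x8, x2);        /* x * 10 */
    }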
As above, it should be a movdqa of the constant plus a single insert. We should transform a shuffle of two vectors of constants into a single vector of constants; an insertelement of a constant into a vector of constants should also result in a vector of constants (e.g. VecISelBug.ll). We compiled it to a poor sequence of movhps, movss, movaps and shuffle instructions (the rdar number is in README-SSE.txt).
Definition at line 689 of file README-SSE.txt.
This currently compiles through the stack with movsd and movl. Lowering a store of a negated load to an integer load, xor and store eliminates a constant-pool load: for example, a nounwind readonly function that negates its float argument %z before passing it on.
Definition at line 421 of file README-SSE.txt.
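A hedged sketch of the integer load + xor + store idea (the function name and types are illustrative only): flipping the IEEE sign bit with an integer xor avoids materializing a sign-mask or -0.0 constant from the constant pool.

    #include <stdint.h>
    #include <string.h>

    /* Store the negation of *p to *q without any FP instructions. */
    void store_negated(const float *p, float *q) {
      uint32_t bits;
      memcpy(&bits, p, sizeof bits);   /* integer load  */
      bits ^= 0x80000000u;             /* flip sign bit */
      memcpy(q, &bits, sizeof bits);   /* integer store */
    }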
Definition at line 480 of file README-SSE.txt.
Definition at line 803 of file README-SSE.txt.
As above, a shuffle of two constant vectors should fold to a single constant vector; instead VecISelBug.ll was compiled to a poor sequence ending in a shufps.
Definition at line 293 of file README-SSE.txt.
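A small illustration of the folding opportunity (values and function name invented for this sketch): both shuffle operands are compile-time constants, so the result is itself a constant vector and nothing needs to be shuffled at run time.

    #include <xmmintrin.h>

    __m128 constant_shuffle(void) {
      const __m128 a = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);   /* lanes {0,1,2,3} */
      const __m128 b = _mm_set_ps(7.0f, 6.0f, 5.0f, 4.0f);   /* lanes {4,5,6,7} */
      /* Result lanes are { a[0], a[1], b[2], b[3] } = { 0, 1, 6, 7 }. */
      return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
    }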
As in the spill example above: the reload comes from a spill slot and feeds only a movss, so it can be folded into its use.
Definition at line 269 of file README-SSE.txt.
This code generates ugly code, probably due to costs being off or something.
Definition at line 278 of file README-SSE.txt.
The custom-lowered code happens to be right, but we shouldn't have to custom lower anything; this is probably related to <2 x i64> operations being so bad. Separately, LLVM currently generates stack realignment code when it is not actually needed: we have to decide about stack alignment too early, before register allocation runs, at which point we don't know whether there will be any vector spills.
Definition at line 489 of file README-SSE.txt.
As above, gcc mainline compiles it with a movlhps-based sequence, while we compile a vector multiply by the constant <i32 10, i32 10, i32 10, i32 10> into poor code on targets without SSE4.1.
Definition at line 571 of file README-SSE.txt.
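On SSE4.1 targets the same multiply is a single pmulld; a minimal sketch (helper name invented), assuming smmintrin.h is available:

    #include <smmintrin.h>

    static __m128i mul_by_10_sse41(__m128i x) {
      return _mm_mullo_epi32(x, _mm_set1_epi32(10));   /* one pmulld */
    }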
At that point we don't know whether there will be vector spills or not. The stack realignment logic is overly conservative here, but otherwise we can produce unaligned loads and stores; fixing this will require some huge register-allocator changes. A testcase follows at the cited line of README-SSE.txt.
Definition at line 498 of file README-SSE.txt.
Since the reload is only used by these instructions, we could fold it into the uses, producing something like a single movss from the spill slot.
Definition at line 7 of file README-SSE.txt.
Definition at line 791 of file README-SSE.txt.
Definition at line 793 of file README-SSE.txt.
i64 %tmp20
Definition at line 424 of file README-SSE.txt.
%tmp5 = getelementptr inbounds %struct.float3* %res
Definition at line 794 of file README-SSE.txt.
Definition at line 224 of file README-SSE.txt.
The basic idea is that a reload from a spill slot can, if only one 4-byte chunk is used, avoid bringing in the elements that are not needed.
Definition at line 270 of file README-SSE.txt.
Since the reload is only used by these instructions, we could fold it into the uses.
Definition at line 258 of file README-SSE.txt.
In x86-32 mode we generate spiffy code for this; in x86-64 mode we generate code which could be better (an extra xorps and movss). In SSE4 mode we could use insertps to make both better; another testcase at the cited line (involving x3) could use it as well.
Definition at line 547 of file README-SSE.txt.
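A hedged SSE4.1 sketch of the insertps suggestion (the function name and lane choice are illustrative): a single _mm_insert_ps can place the incoming scalar into the desired lane and zero the rest, replacing the xorps/movss/shuffle sequence.

    #include <smmintrin.h>

    /* Build { 0, x[0], 0, 0 } with one insertps:
       source lane 0 of x goes to destination lane 1, zero-mask 0b1101. */
    __m128 scalar_into_lane1(__m128 x) {
      return _mm_insert_ps(x, x, (0 << 6) | (1 << 4) | 0x0D);
    }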
In fpstack mode this compiles into a short sequence through %esp and %eax; in SSE mode it compiles into significantly slower code that goes through the stack with movsd and mulsd.
Definition at line 33 of file README-SSE.txt.
This gets compiled, on x86-64, into a long run of movaps stores through %rsp followed by a movaps into %xmm1.
Definition at line 38 of file README-SSE.txt.
As above, VecISelBug.ll was compiled to a poor sequence of movhps, movss and movaps instructions involving %xmm2; a shuffle of two constant vectors should instead fold to a single constant vector.
Definition at line 39 of file README-SSE.txt.