## AVX10: Advanced Vector Extensions

#### The converged vector ISA for all Intel CPUs

- Introduces generational umbrella enumeration of all vector ISA
  - Replaces the current numerous, disjoint vector features of AVX/AVX2/AVX-512
  - Single CPUID enumeration for the AVX10 version number and the maximum supported vector length (VL)
- All future Intel CPUs will support some version of AVX10, at least AVX10.1/256
- AVX10.1/256: supported on all Intel CPUs (P/E-cores)
  - All modern AVX-512 vector instructions with a maximum VL=256
  - 32 vector registers (through the EVEX prefix)
  - 8 mask registers (32-bit)
  - New: embedded rounding with 256-bit instructions
- AVX10.1/512: will continue to be supported on all P-cores
- Inclusive: AVX10.N supports all of AVX10.N-1 plus new features!

## AVX10: enabling in SW

#### This is WIP, details can change

- -m[no-]evex512
  - A proxy option that lets current SW be recompiled and keep running on current HW, even HW that is not the very latest. It guarantees that no 512-bit instructions are generated (even through intrinsics).
- -mavx10.1[-256,-512] (default is 256)
  - Introduced for early SW enablement; supports a subset of AVX10.1:
    - all of the Intel AVX-512 instruction set available on P-cores codenamed Granite Rapids
    - does not include the new 256-bit vector instructions supporting embedded rounding
- -mavx10.2[-256,-512] (default is 256)
  - Includes the new 256-bit vector instructions supporting embedded rounding
  - A suite of new Intel AVX10 instructions covering new AI data types and conversions, data movement optimizations, and standards support

## APX: Advanced Performance Extensions

#### General-purpose extension of 64-bit x86 for all Intel CPUs

- +16 EGPRs (extended GPRs), for a total of 32 integer GPRs, via the new REX2 prefix
- NDD: adds a unique destination register for legacy GPR instructions
- XSAVE-enabled (overlays the deprecated MPX state)
- New instructions/capabilities:
  - PP2: PUSH2/POP2 instructions bundle a pair of registers in one instruction
  - FSFP+PPX: PPX hints let the Fast Store Forwarding Predictor (FSFP) optimize matched push/pop pairs faster and more reliably
  - CCMP+CFCMOV: replace more branches with conditional instructions
  - NF: encodes suppression of status-flag writes for common instructions
  - Zero-upper SETcc: writes the full register, which removes extra pre-zeroing instructions and reduces data dependencies
  - JMPABS: replaces indirect branches with direct branches (at link time) for better branch prediction, with additional security and power benefits
- Transparent interaction with legacy x86 code using a legacy-compliant ABI (the new EGPRs are all caller-saved/volatile)

## EGPR: 32 GPR

#### Design principle: least intrusive and not affecting legacy

- Value: eliminate relatively expensive memory operations by keeping more state in registers
- The static register class of each instruction in the TableGen file is unchanged, so passes whose analysis relies on the static operand types in TD (e.g. the machine instruction scheduler) are not affected
- Leverage the target hook TargetInstrInfo::getRegClass to update the register class before RA
- Reserve R16-R31 for all instructions when GPR32 is not supported (X86RegisterInfo::getReservedRegs)

## New destination register (NDD)

#### Principle: always prefer NDD over spilling

- Value: eliminate relatively expensive memory operations by keeping more state in registers
- Prefer NDD over non-NDD at instruction selection
- Give RA a hint to make the source and destination the same register when it is profitable (e.g. the source register is killed)
- Compress the NDD instruction to a non-NDD instruction, if possible, for code size

```
Current
  Dst1 (coalesced with src1) = ADD src1, src2
NDD
  Dst3 = ADD_NDD src1, src2

reg3 = ADD_NDD reg1, reg2
If reg1 is killed
  reg1 = ADD_NDD reg1, reg2
  reg1 = ADD reg1, reg2
```

## Prologue/epilogue (PP2, PPX)

- Value: reduce the number of push/pop memory operations
- PPX applies FSFP optimizations for matched push/pop in a quick and stable manner
- PUSH2/POP2 require 16B stack alignment (avoids splits in fused operations)
- Red (on the slide) = alignment padding to maximize PP2 opportunities

Level of optimization, in increasing order:

#### Current

```
push rbp
push r15
push r14
push r13
push r12
push rbx
subq 16, rsp

addq 16, rsp
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
```

#### PPX hints

```
push.p rbp
push.p r15
push.p r14
push.p r13
push.p r12
push.p rbx
subq 16, rsp

addq 16, rsp
pop.p rbx
pop.p r12
pop.p r13
pop.p r14
pop.p r15
pop.p rbp
ret
```

#### PPX & PP2 Alignment

```
push.p rbp
push2.p r15, r14
push2.p r13, r12
push.p rbx
subq 16, rsp

addq 16, rsp
pop.p rbx
pop2.p r12, r13
pop2.p r14, r15
pop.p rbp
ret
```

## Call-site optimization (PP2, PPX)

- Legacy-compatible ABI: all new EGPRs are caller-saved, and parameter passing and value return are unchanged.
- Instead of spilling with MOV to pre-allocated slots in the stack frame, aggressively use PUSH2/POP2 around calls.
- Value: less spill code, which is also more efficient thanks to PPX hints.
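The pairing rule from the prologue/epilogue slide (use PUSH2 only while the stack is 16B-aligned, fall back to a single push otherwise) can be sketched as a small planner. This is a minimal illustration in C, not LLVM's frame-lowering code: `plan_pushes`, its parameters, and the text output are all hypothetical, and it assumes RSP is 8 bytes off 16B alignment on entry, as it is right after a CALL on x86-64.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical planner (not LLVM code). Writes one instruction per line into
 * `out` and returns the number of instructions, or -1 if `out` is too small.
 * `rsp_mod16` is RSP modulo 16 before the first push: 8 right after a CALL,
 * because the return address occupies 8 bytes. */
static int plan_pushes(const char *const regs[], int n, int rsp_mod16,
                       char *out, size_t cap)
{
    size_t used = 0;
    int count = 0;
    int i = 0;

    assert(cap > 0);
    out[0] = '\0';

    while (i < n) {
        int w;
        if (rsp_mod16 == 0 && i + 1 < n) {
            /* Aligned and two registers left: bundle them with PUSH2.
             * 16 bytes are pushed, so the alignment is unchanged. */
            w = snprintf(out + used, cap - used, "push2.p %s, %s\n",
                         regs[i], regs[i + 1]);
            i += 2;
        } else {
            /* Misaligned (this push acts as padding) or one register left.
             * 8 bytes are pushed, so the alignment flips. */
            w = snprintf(out + used, cap - used, "push.p %s\n", regs[i]);
            i += 1;
            rsp_mod16 = (rsp_mod16 + 8) % 16;
        }
        if (w < 0 || (size_t)w >= cap - used)
            return -1;              /* output was truncated */
        used += (size_t)w;
        count++;
    }
    return count;
}
```

Called with `rbp, r15, r14, r13, r12, rbx` and `rsp_mod16 = 8`, the planner emits `push.p rbp`, `push2.p r15, r14`, `push2.p r13, r12`, `push.p rbx`, the same four-instruction prologue as the PPX & PP2 example: the first single push serves as the alignment padding.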
#### Callee

```
PUSH2X r
CALL foo
POP2X r
PUSH2X r14, r13
PUSH2X r12, rbx
[..def/use r12-r17]
POP2X rbx, r12
POP2X r13, r14
POPX r15
RET
```

#### Caller

```
PUSHX r15
PUSH2X r14, r13
PUSH2X r12, rbx
[..def r12-r17]
PUSH2X r16, r17
CALL foo
POP2X r17, r16
[..use r12-r17]
```

## Conditional compares (CCMP)

- Example:

```
if (a == 5 || b == 17) foo();
```

- Execute the second compare conditionally, based on the result of a prior compare
- Value: eliminate conditional branches to reduce branch misprediction
- Update the probabilities of edges after Head and Compare are merged:
  - P(Tail|Compare) = P(Tail|Head) + P(Compare|Head) \* P(Tail|Compare)
  - P(I|Compare) = P(Compare|Head) \* P(I|Compare)

```
# With branches
Head:
  cmpl  $5, %edi
  je    Tail
Compare:
  cmpl  $17, %esi
  je    Tail
  ...
Tail:
  call  foo

# With CCMP
Head:
  cmpl  $5, %edi
  ccmpnel {dfv=zf} $17, %esi   # if a != 5, compare b with 17; otherwise set ZF=1
  jne   ...                    # neither compare matched: skip the call
  call  foo
```

## Conditional load/store (CFCMOV)

- Example:

```
// int *p, *q, n;
if (*p > n) *p = *q;
```

```
Head:
  cmpl  %edx, (%rdi)
  jle   Tail
TBB:
  movl  (%rsi), %eax
  movl  %eax, (%rdi)
Tail:
```

```
Head:
  cmpl  %edx, (%rdi)
  cfcmovgl (%rsi), %eax
  cfcmovgl %eax, (%rdi)
Tail:
```

- Load/store instructions in the conditional blocks TBB and/or FBB are spliced into the Head block.
- Increases the scope of if-conversion
- Value: eliminate conditional branches to reduce branch misprediction

#### Diamond: [figure]

## Summary

- Intel intends to provide/enable the necessary compilers, debuggers, tools, and libraries (LLVM, GCC, etc.) well in advance of HW to support general APX and AVX10 SW enablement
- Whitepapers and further reading:
  - APX: https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html
  - AVX10: https://cdrdv2-public.intel.com/784343/356368-intel-avx10-tech-paper.pdf
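As a worked check of the conditional-compare slide, the flag behaviour of the `a == 5 || b == 17` example can be modelled in plain C. This is a deliberately simplified sketch that tracks only ZF: the helper names (`cmp_zf`, `ccmp_zf`, `with_branches`, `with_ccmp`) are hypothetical, and a real CCMP also writes OF, SF, and CF from its default flag values.

```c
#include <stdbool.h>

/* cmp lhs, rhs: ZF is set when the operands are equal. */
static bool cmp_zf(int lhs, int rhs)
{
    return lhs == rhs;
}

/* Conditional compare, ZF only: if the source condition holds, behave like
 * cmp; otherwise load the default flag value (dfv) instead of comparing. */
static bool ccmp_zf(bool source_cond, bool dfv_zf, int lhs, int rhs)
{
    return source_cond ? (lhs == rhs) : dfv_zf;
}

/* Reference: the two-branch form of the example. */
static bool with_branches(int a, int b)
{
    if (a == 5)
        return true;    /* first je Tail */
    if (b == 17)
        return true;    /* second je Tail */
    return false;
}

/* Branch-reduced form: cmp, then a "not equal" conditional compare whose
 * default flag value sets ZF. foo() is called when ZF ends up set. */
static bool with_ccmp(int a, int b)
{
    bool zf = cmp_zf(a, 5);
    zf = ccmp_zf(!zf, true, b, 17);
    return zf;
}
```

When `a == 5` the source condition "not equal" is false, so the second compare is skipped and ZF is forced to 1. Otherwise ZF reflects `b == 17`. Either way a single flag test after the pair decides whether to call `foo`.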