The x86 instruction set has been extended several times with SIMD (Single Instruction, Multiple Data) instruction set extensions. These extensions, starting with the MMX instruction set extension introduced with the Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.
Summary of SIMD extensions
The main SIMD instruction set extensions that have been introduced for x86 are:
"Katmai New Instructions" - introduced a set of 70 new instructions. Most but not all of these instructions provide scalar and vector operations on 32-bit floating-point values in 128-bit SIMD vector registers. (Some of the SSE instructions were instead new MMX instructions and non-SIMD instructions such as SFENCE - the subset of SSE that excludes the 128-bit SIMD register instructions is known as "MMX+", and is supported on some AMD processors that didn't implement full SSE, notably early Athlons and Geode LX.)
SSE introduced a new set of eight vector registers XMM0..XMM7, each 128 bits, and a status/control register MXCSR.
This set of eight vector registers would later be extended to 16 registers with the introduction of x86-64.
SSE2 - Extended SSE with 144 new instructions - mainly additional instructions to work on scalars and vectors of 64-bit floating-point values, as well as 128-bit-vector forms of most of the MMX integer instructions.
SSE4.1 - Added a set of 47 instructions, including variants of integer min/max, widening integer conversions, vector lane insert/extract, and dot-product instructions.
AVX - Extended the XMM0..XMM15 vector registers to 256-bit registers, referred to as YMM0..YMM15 when used as full 256-bit registers.
Added three-operand variants of most of the SSE1-4 vector instructions, as well as 256-bit vector variants of most of the SSE1-4 vector instructions acting on 32/64-bit floating-point values. These new instruction variants are all encoded with the new VEX prefix.
AVX-512 - Extended the YMM0..YMM15 vector registers to a set of 32 registers, each 512 bits wide - referred to as ZMM0..ZMM31 when used as 512-bit registers. Also added eight opmask registers K0..K7.
Added 512-bit versions of most of the MMX/SSE/AVX vector instructions, as well as a substantial number of additional instructions. These are mostly encoded with the new EVEX prefix (except for opmask management instructions, which continue to use the VEX prefix.)
Added the ability to perform per-vector-lane masking of the operation of most of its vector instructions, by using the opmask registers. Also added embedded rounding controls for floating-point instructions and a scalar-to-vector broadcast function for most instructions that can accept memory operands.
AMX (Advanced Matrix Extensions) - Added a set of eight new tile registers, referred to as TMM0..TMM7. Each of these tile registers has a size of 8192 bits (16 rows of 64 bytes each). Also added a 64-byte tile configuration register TILECFG, and instructions to perform matrix multiplication on the tile registers with various data formats.
AVX10 - Reformulation of AVX-512 that includes most of the optional AVX-512 subsets (F, CD, BW, DQ, VL, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPOPCNTDQ, FP16) as baseline functionality, and switches feature enumeration from the flag-based scheme of AVX-512 to a version-based scheme.[c] No new instructions are added.
AVX10.2 - Adds instructions to convert to/from MXFP8 datatypes, perform arithmetic on BF16 numbers, perform saturating conversions from floating-point to integer, perform IEEE754-compliant min/max, and a few other instructions.
^The count of 13 instructions for SSE3 includes the non-SIMD instructions MONITOR and MWAIT that were also introduced as part of "Prescott New Instructions" - these two instructions are considered to be SSE3 instructions by Intel but not by AMD.
^On older Zhaoxin processors, such as KX-6000 "LuJiaZui", AVX2 instructions are present but not exposed through CPUID due to the lack of FMA3 support.[1]
^Early drafts of the AVX10 specification also added an option for implementations to limit the maximum supported vector-register width to 128/256 bits[2] - however, as of March 2025, this option has been removed, making support for 512-bit vector-register width mandatory again.[3][4]
MMX instructions and extended variants thereof
These instructions are, unless otherwise noted, available in the following forms:
MMX: 64-bit vectors, operating on mm0..mm7 registers (aliased on top of the old x87 register file)
SSE2: 128-bit vectors, operating on xmm0..xmm15 registers (xmm0..xmm7 in 32-bit mode)
AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix. (AVX introduced 256-bit vector registers, but the full width of these vectors was in general not made available for integer SIMD instructions until AVX2.)
AVX2: 256-bit vectors, operating on ymm0..ymm15 registers (extended versions of the xmm0..xmm15 registers)
AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers (zmm0..zmm15 are extended versions of the ymm0..ymm15 registers, while zmm16..zmm31 are new to AVX-512). AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register (the lane width varies from one instruction to another). AVX-512 also adds broadcast functionality for many of its instructions - this is used with memory source arguments to replicate a single value to all lanes of a vector calculation. The tables below provide indications of whether opmasks and broadcasts are supported for each instruction, and if so, what lane-widths they are using.
For many of the instruction mnemonics, (V) is used to indicate that the instruction mnemonic exists in forms with and without a leading V - the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without VEX/EVEX-prefix.
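As an illustration of the AVX-512 opmask and broadcast features described above, the following C sketch (Intel intrinsics from immintrin.h, AVX-512F assumed; function and variable names are illustrative) performs a masked 32-bit add where the second source is a broadcast scalar:

#include <immintrin.h>

/* Adds *scalar to the even-numbered 32-bit lanes of src[0..15]; odd lanes are
   passed through unchanged by the opmask. */
void masked_broadcast_add(int *dst, const int *src, const int *scalar)
{
    __mmask16 k = 0x5555;                      /* opmask selecting the even 32-bit lanes */
    __m512i v = _mm512_loadu_si512(src);       /* VMOVDQU32 */
    __m512i b = _mm512_set1_epi32(*scalar);    /* may be encoded as an EVEX embedded
                                                  broadcast ({1to16}) from memory */
    __m512i r = _mm512_mask_add_epi32(v, k, v, b);  /* VPADDD with opmask: masked-off
                                                       lanes keep the value of v */
    _mm512_storeu_si512(dst, r);
}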
Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof
Description
Instruction mnemonics
Basic opcode
MMX (no prefix)
SSE2 (66h prefix)
AVX (VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported
subset
lane
bcst
Empty MMX technology state. (MMX)
Mark all the FP/MMX registers as Empty, so that they can be freely used by later x87 code.[a]
^EMMS will also set the x87 top-of-stack to 0. Unlike the older FNINIT instruction, EMMS will not update the FPU Control Word, nor will it update any part of the FPU Status Register other than the top-of-stack.
^For code that may potentially mix use of legacy-SSE instructions with AVX instructions, it is strongly recommended to execute a VZEROUPPER or VZEROALL instruction after executing AVX instructions but before executing SSE instructions. If this is not done, any subsequent legacy-SSE code may be subject to severe performance degradation.[5]
^On some early AVX implementations (e.g. Sandy Bridge[6]) encoding the VZEROUPPER and VZEROALL instructions with VEX.W=1 will result in #UD - for this reason, it is recommended to encode these instructions with VEX.W=0.
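As an illustration of the VZEROUPPER recommendation above, the following C sketch (intrinsics from immintrin.h, AVX assumed; names are illustrative) issues a VZEROUPPER via the _mm256_zeroupper() intrinsic after a block of 256-bit AVX code:

#include <immintrin.h>

/* Double eight floats using 256-bit AVX, then clear the upper YMM halves
   before any legacy-SSE code runs. */
void avx_work_then_clear_upper(float *data)
{
    __m256 v = _mm256_loadu_ps(data);
    v = _mm256_add_ps(v, v);
    _mm256_storeu_ps(data, v);
    _mm256_zeroupper();   /* emits VZEROUPPER so that later legacy-SSE code does not
                             incur the SSE/AVX state-transition penalty */
}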
^ abThe 64-bit move instruction forms that are encoded by using a REX.W prefix with the 0F 6E and 0F 7E opcodes are listed with different mnemonics in Intel and AMD documentation — MOVQ in Intel documentation[7] and MOVD in AMD documentation.[8] This is a documentation difference only — the operation performed by these opcodes is the same for Intel and AMD. This documentation difference applies only to the MMX/SSE forms of these opcodes — for VEX/EVEX-encoded forms, both Intel and AMD use the mnemonic VMOVQ.
^ abThe REX.W-encoded variants of MOVQ are available in 64-bit "long mode" only. For SSE2 and later, MOVQ to and from xmm/ymm/zmm registers can also be encoded with F3 0F 7E /r and 66 0F D6 /r respectively - these encodings are shorter and available outside 64-bit mode.
^On all Intel,[9] AMD[10] and Zhaoxin[11] processors that support AVX, the 128-bit forms of VMOVDQA (encoded with a VEX prefix and VEX.L=0) are, when used with a memory argument addressing WB (write-back cacheable) memory, architecturally guaranteed to perform the 128-bit memory access atomically - this applies to both load and store.
(Intel and AMD provide somewhat wider guarantees covering more 128-bit instruction variants, but Zhaoxin provides the guarantee for cacheable VMOVDQA only.)
While 128-bit VMOVDQA is atomic, it is not locked — it can be reordered in the same way as normal x86 loads/stores (e.g. loads passing older stores).
On processors that support SSE but don't support AVX, the 128-bit forms of SSE load/store instructions such as MOVAPS/MOVAPD/MOVDQA are not guaranteed to execute atomically — examples of processors where such instructions have been observed to execute non-atomically include Intel Core Duo and AMD K10.[12]
^ abVMOVDQA is available with a vector length of 256 bits under AVX, not requiring AVX2.
Unlike the 128-bit form, the 256-bit form of VMOVDQA does not provide any special atomicity guarantees.
^ abcdefghiFor the VPACK* and VPUNPCK* instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
^ abcFor the memory argument forms of (V)PUNPCKL* instructions, the memory argument is half-width only for the MMX variants of the instructions. For SSE/AVX/AVX-512 variants, the width of the memory argument is the full vector width even though only half of it is actually used.
^ abcdefThe EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
^The (V)PMADDWD instruction will add multiplication results pairwise, but will not add the sum to an accumulator. AVX512_VNNI provides the instructions VPDPWSSD and VPDPWSSDS, which will add multiplication results pairwise, and then also add them to a per-32-bit-lane accumulator.
^ abcdefghFor the MMX packed shift instructions PSLL*, PSRL* and PSRA* with a shift-argument taken from a vector source (mm or m64), the shift-amount is considered to be a single 64-bit scalar value - the same shift-amount is used for all lanes of the destination vector. This shift-amount is unsigned and is not masked - all bits are considered (e.g. a shift-amount of 0x80000000_00000000 can be specified and will have the same effect as a shift-amount of 64).
For all SSE2/AVX/AVX512 extended variants of these instructions, the shift-amount vector argument is considered to be a 128-bit (xmm or m128) argument - the bottom 64 bits are used as the shift-amount.
Packed shift-instructions that can take a variable per-lane shift-amount were introduced in AVX2 for 32/64-bit lanes and AVX512BW for 16-bit lanes (VPSLLV*, VPSRLV*, VPSRAV* instructions).
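As an illustration of the three kinds of shift-amount described above, the following C sketch (intrinsics from immintrin.h, SSE2 and AVX2 assumed; names are illustrative) uses an immediate shift, a shift-amount taken from a vector register, and a per-lane variable shift:

#include <immintrin.h>

void shift_amount_forms(void)
{
    __m128i v16 = _mm_set1_epi16(0x1234);

    /* (V)PSLLW with an immediate shift-amount */
    __m128i a = _mm_slli_epi16(v16, 3);

    /* (V)PSLLW with the shift-amount in the bottom 64 bits of an XMM register;
       the same amount applies to every 16-bit lane, and amounts >= 16 clear the lane */
    __m128i b = _mm_sll_epi16(v16, _mm_cvtsi32_si128(3));

    /* AVX2 VPSLLVD: an independent shift-amount for each 32-bit lane */
    __m256i data    = _mm256_set1_epi32(1);
    __m256i amounts = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i c = _mm256_sllv_epi32(data, amounts);

    (void)a; (void)b; (void)c;
}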
MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof
Description
Instruction mnemonics
Basic opcode
MMX (no prefix)
SSE2 (66h prefix)
AVX (VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported
subset
lane
bcst
Added with SSE and MMX+
Perform shuffle of four 16-bit integers in 64-bit vector (MMX)[a]
PSHUFW mm,mm/m64,imm8(MMX)
0F 70 /r ib
PSHUFW
PSHUFD
VPSHUFD
VPSHUFD (W=0)
F
32
32
Perform shuffle of four 32-bit integers in 128-bit vector (SSE2)
Compute sum of absolute differences for eight 8-bit unsigned integers, storing the result as a 64-bit integer.
For vector widths wider than 64 bits (SSE/AVX/AVX-512), this calculation is done separately for each 64-bit lane of the vectors, producing a vector of 64-bit integers.
(V)PSADBW mm,mm/m64
0F F6 /r
Yes
Yes
Yes
Yes
BW
No
No
Unaligned store vector register to memory using byte write-mask, with Non-Temporal Hint.
First argument provides data to store, second argument provides byte write-mask (top bit of each byte).[g] Address to store to is given by DS:DI/EDI/RDI (DS: segment overridable with segment-prefix).
Multiply packed 8-bit signed and unsigned integers, add results pairwise into 16-bit signed integers with saturation. First operand is treated as unsigned, second operand as signed.
(V)PMADDUBSW mm,mm/m64
0F38 04 /r
Yes
Yes
Yes
Yes
BW
16
No
Pairwise horizontal subtract of packed integers.
The higher-order integer of each pair is subtracted from the lower-order integer.
Modify packed integers in first source argument based on the sign of packed signed integers in second source argument. The per-lane operation performed is: dst ← (src2 < 0) ? -src1 : ((src2 == 0) ? 0 : src1)
Multiply packed 16-bit signed integers, then perform rounding and scaling to produce a 16-bit signed integer result.
The calculation performed per 16-bit lane is: dst ← (src1*src2 + (1<<14)) >> 15
(V)PMULHRSW mm,mm/m64
0F38 0B /r
Yes
Yes
Yes
Yes
BW
16
No
Absolute value of packed signed integers
8-bit
(V)PABSB mm,mm/m64
0F38 1C /r
Yes
Yes
Yes
Yes
BW
8
No
16-bit
(V)PABSW mm,mm/m64
0F38 1D /r
Yes
Yes
Yes
Yes
BW
16
No
32-bit
(V)PABSD mm,mm/m64
0F38 1E /r
PABSD
PABSD
VPABSD
VPABSD(W0)
F
32
32
64-bit
VPABSQ xmm,xmm/m128(AVX-512)
VPABSQ(W1)
F
64
64
Packed Align Right.
Concatenate two input vectors into a double-size vector, then right-shift by the number of bytes specified by the imm8 argument. The shift-amount is not masked - if the shift-amount is greater than the input vector size, zeroes will be shifted in.
^For shuffle of four 16-bit integers in a 64-bit section of a 128-bit XMM register, the SSE2 instructions PSHUFLW (opcode F2 0F 70 /r) or PSHUFHW (opcode F3 0F 70 /r) may be used.
^ abcdefghiFor the VPSHUFD, VPSHUFB, VPHADD*, VPHSUB* and VPALIGNR instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
^ abFor the VEX-encoded forms of the VPINSRW and VPEXTRW instructions, the Intel SDM (as of rev 084) indicates that the instructions must be encoded with VEX.W=0; however, neither Intel XED nor the AMD APM indicates any such requirement.
^The 0F C5 /r ib variant of PEXTRW allows register destination only. For SSE4.1 and later, a variant that allows a memory destination is available with the opcode 66 0F 3A 15 /r ib.
^EVEX-prefixed opcode not available. Under AVX-512, a bitmask made from the top bit of each byte can instead be constructed with the VPMOVB2M instruction, with opcode EVEX.F3.0F38.W0 29 /r, which will store such a bitmask to an opmask register.
^VMOVNTDQ is available with a vector length of 256 bits under AVX, not requiring AVX2.
^For the MASKMOVQ and (V)MASKMOVDQU instructions, exception and trap behavior for disabled lanes is implementation-dependent. For example, a given implementation may signal a data breakpoint or a page fault for bytes that are zero-masked and not actually written.
^For AVX, masked stores to memory are also available using the VMASKMOVPS instruction with opcode VEX.66.0F38 2E /r - unlike VMASKMOVDQU, this instruction allows 256-bit stores without temporal hints, although its mask is coarser - 4 bytes vs 1 byte per lane.
^Opcode not available under AVX-512. Under AVX-512, unaligned masked stores to memory (albeit without temporal hints) can be done with the VMOVDQU(8|16|32|64) instructions with opcode EVEX.F2/F3.0F 7F /r, using an opmask register to provide a write mask.
^For AVX2 and AVX-512 with vectors wider than 128 bits, the VPSHUFB instruction is restricted to byte-shuffle within each 128-bit lane. Instructions that can do shuffles across 128-bit lanes include e.g. AVX2's VPERMD (shuffle of 32-bit lanes across 256-bit YMM register) and AVX512_VBMI's VPERMB (full byte shuffle across 64-byte ZMM register).
^For AVX-512, VPALIGNR is supported but will perform its operation within each 128-bit lane. For packed alignment shifts that can shift data across 128-bit lanes, AVX512F's VALIGND instruction may be used, although its shift-amount is specified in units of 32-bits rather than bytes.
SSE instructions and extended variants thereof
Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof
For the instructions in the below table, the following considerations apply unless otherwise noted:
Packed instructions are available at all vector lengths (128-bit for SSE2, 128/256-bit for AVX, 128/256/512-bit for AVX-512)
FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations. (Broadcasts are available for vector operations only.)
From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical; however, some processors with SSE2 implement the integer, FP32 and FP64 execution units as three different execution clusters, where forwarding of results from one cluster to another may come with performance penalties and where such penalties can be minimized by choosing instruction forms appropriately. (For example, there exist three forms of vector bitwise XOR instructions under SSE2 - PXOR, XORPS, and XORPD - intended for use on integer, FP32, and FP64 data, respectively.)
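As an illustration, the following C snippet (intrinsics from immintrin.h; function names are illustrative) expresses the same bitwise XOR in the three SSE2 instruction forms:

#include <immintrin.h>

/* The same bitwise operation in the three SSE2 "domains"; matching the form to the
   data type of the surrounding computation may avoid bypass (forwarding) delays on
   processors with split execution clusters. */
__m128i xor_int (__m128i a, __m128i b) { return _mm_xor_si128(a, b); }  /* PXOR  */
__m128  xor_fp32(__m128  a, __m128  b) { return _mm_xor_ps(a, b); }     /* XORPS */
__m128d xor_fp64(__m128d a, __m128d b) { return _mm_xor_pd(a, b); }     /* XORPD */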
Floating-point compare. Result is written as all-0s/all-1s values (all-1s for comparison true) to vector registers for SSE/AVX, but opmask register for AVX-512. Comparison function is specified by imm8 argument.[u]
Performs a shuffle on each of its two input arguments, then keeps the bottom half of the shuffle result from its first argument and the top half of the shuffle result from its second argument.
^ abcdefThe VEX-prefix-encoded variants of the scalar instructions listed in this table should be encoded with VEX.L=0. Setting VEX.L=1 for any of these instructions is allowed but will result in what the Intel SDM describes as "unpredictable behavior across different processor generations". This also applies to VEX-encoded variants of V(U)COMISS and V(U)COMISD. (This behavior does not apply to scalar instructions outside this table, such as e.g. VMOVD/VMOVQ, where VEX.L=1 results in an #UD exception.)
^ abcdefghEVEX-encoded variants of VMOVAPS, VMOVUPS, VMOVAPD and VMOVUPD support opmasks but do not support broadcast.
^ abcThe SSE2 MOVSD (MOVe Scalar Double-precision) and CMPSD (CoMPare Scalar Double-precision) instructions have the same names as the older i386 MOVSD (MOVe String Doubleword) and CMPSD (CoMPare String Doubleword) instructions, however their operations are completely unrelated.
At the assembly language level, they can be distinguished by their use of XMM register operands.
^ abcdefghijklmnopqrstFor variants of VMOVLPS, VMOVHPS, VMOVLPD, VMOVHPD, VMOVLHPS, VMOVHLPS encoded with VEX or EVEX prefixes, the only supported vector length is 128 bits (VEX.L=0 or EVEX.L=0).
For the EVEX-encoded variants, broadcasts and opmasks are not supported.
^ abcThe MOVSLDUP, MOVSHDUP and MOVDDUP instructions are not regularly-encoded scalar SSE1/2 instructions, but instead irregularly-assigned SSE3 vector instructions. For a description of these instructions, see table below.
^ abcdefghijFor the VUNPCK*, VSHUFPS and VSHUFPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction (except that for VSHUFPD, each 128-bit lane will use a different 2-bit part of the instruction's imm8 argument).
^ abThe CVTPI2PS and CVTPI2PD instructions take their input data as a vector of two 32-bit signed integers from either memory or MMX register. They will cause an x87→MMX transition even if the source operand is a memory operand.
For vector int→FP conversions that can accept an xmm/ymm/zmm register or vectors wider than 64 bits as input arguments, SSE2 provides the following irregularly-assigned instructions (see table below):
CVTDQ2PS (0F 5B /r)
CVTDQ2PD (F3 0F E6 /r)
These exist in AVX/AVX-512 extended forms as well.
^ abFor the (V)CVTSI2SS and (V)CVTSI2SD instructions, variants with a 64-bit source argument are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.
In 32-bit mode, their source argument is always 32-bit even if VEX.W or EVEX.W is set to 1.
^ abcdThe CVT(T)PS2PI and CVT(T)PD2PI instructions write their result to MMX register as a vector of two 32-bit signed integers.
For vector FP→int conversions that can write results to xmm/ymm/zmm registers, SSE2 provides the following irregularly-assigned instructions (see table below):
CVTPS2DQ (66 0F 5B /r)
CVTTPS2DQ (F3 0F 5B /r)
CVTPD2DQ (F2 0F E6 /r)
CVTTPD2DQ (66 0F E6 /r)
These exist in AVX/AVX-512 extended forms as well.
^ abcdFor the (V)CVT(T)SS2SI and (V)CVT(T)SD2SI instructions, variants with a 64-bit destination register are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.
In 32-bit mode, their destination register is always 32-bit even if VEX.W or EVEX.W is set to 1.
^ abThis instruction cannot be EVEX-encoded. Under AVX512DQ, extracting packed floating-point sign-bits can instead be done with the VPMOVD2M and VPMOVQ2M instructions.
^ abThe (V)RCPSS, (V)RCPPS, (V)RSQRTSS and (V)RSQRTPS approximation instructions compute their result with a relative error of at most 1.5×2^−12. The exact calculation is implementation-specific and known to vary between different x86 CPUs.[13]
^ abcdThis instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4E/4F /r - for its new VRSQRT14* reciprocal square root approximation instructions.
The main difference between the AVX-512 VRSQRT14* instructions and the older SSE/AVX (V)RSQRT* instructions is that the AVX-512 VRSQRT14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[14]
^ abcdThis instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4C/4D /r - for its new VRCP14* reciprocal approximation instructions.
The main difference between the AVX-512 VRCP14* instructions and the older SSE/AVX (V)RCP* instructions is that the AVX-512 VRCP14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[14]
^ abcdefghThe EVEX-encoded versions of the VANDPS, VANDPD, VANDNPS, VANDNPD, VORPS, VORPD, VXORPS, VXORPD instructions are not introduced as part of the AVX512F subset, but instead the AVX512DQ subset.
^XORPS/VXORPS with both source operands being the same register is commonly used as a register-zeroing idiom, and is recognized by most x86 CPUs as an instruction that does not depend on its source arguments. Under AVX or AVX-512, it is recommended to use a 128-bit form of VXORPS for this purpose - this will, on some CPUs, result in fewer micro-ops than wider forms while still achieving register-zeroing of the whole 256 or 512 bit vector-register.[15]
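A minimal sketch of the zeroing idiom in C (intrinsics from immintrin.h, AVX-512F assumed; the generated instruction shown in the comment is typical, not guaranteed):

#include <immintrin.h>

/* Compilers typically implement _mm512_setzero_ps() with a 128-bit zeroing idiom
   such as vxorps xmm0, xmm0, xmm0, which zeroes the full 512-bit register while
   being recognized as independent of its source operands. */
__m512 zero_vector(void)
{
    return _mm512_setzero_ps();
}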
^ abcdFor EVEX-encoded variants of conversions between FP formats of different widths, the opmask lane width is determined by the result format: 64-bit for VCVTPS2PD and VCVTSS2SD, and 32-bit for VCVTPD2PS and VCVTSD2SS.
^Widening FP→FP conversions (CVTPS2PD, CVTSS2SD, VCVTPH2PD, VCVTSH2SD) support the SAE modifier. Narrowing conversions (CVTPD2PS, CVTSD2SS) support the RC modifier.
^ abFor the floating-point minimum-value and maximum-value instructions (V)MIN* and (V)MAX*, if the two input operands are both zero or at least one of the input operands is NaN, then the second input operand is returned. This matches the behavior of common C programming-language expressions such as ((op1)>(op2)?(op1):(op2)) for maximum-value and ((op1)<(op2)?(op1):(op2)) for minimum-value.
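A short C illustration of this operand-order dependence (intrinsics from immintrin.h and NAN from math.h; names are illustrative):

#include <immintrin.h>
#include <math.h>

/* When an input lane is NaN, (V)MINPS returns the second source operand. */
void minps_nan_behavior(void)
{
    __m128 qnan = _mm_set1_ps(NAN);
    __m128 one  = _mm_set1_ps(1.0f);
    __m128 r1 = _mm_min_ps(qnan, one);  /* second operand returned: 1.0f in every lane */
    __m128 r2 = _mm_min_ps(one, qnan);  /* second operand returned: NaN in every lane  */
    (void)r1; (void)r2;
}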
^For the SIMD floating-point compares, the imm8 argument has the following format:
Bits
Usage
1:0
Basic comparison predicate
2
Invert comparison result
3
Invert comparison result if unordered (VEX/EVEX only)
4
Invert signalling behavior (VEX/EVEX only)
The basic comparison predicates are:
Value
Meaning
00b
Equal (non-signalling)
01b
Less-than (signalling)
10b
Less-than-or-equal (signalling)
11b
Unordered (non-signalling)
A signalling compare will cause an exception if any of the inputs are QNaN.
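As a worked example of this encoding, predicate value 05h has bits[1:0]=01b (less-than, signalling) and bit 2 set (invert result), giving a "not-less-than" compare in which unordered inputs compare true; the AVX intrinsics headers name this predicate _CMP_NLT_US. A minimal C sketch (AVX assumed):

#include <immintrin.h>

__m256 compare_not_less_than(__m256 a, __m256 b)
{
    return _mm256_cmp_ps(a, b, _CMP_NLT_US);   /* VCMPPS ymm, ymm, ymm, 05h */
}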
Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof
These instructions do not have any MMX forms, and do not support any encodings without the 66h prefix.
Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:
The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits (VEX.L=0 encoding) - under AVX2, they are (with some exceptions noted with "L=0") also made available with a vector length of 256 bits.
The EVEX-encoded forms are available under AVX-512 - the specific AVX-512 subset needed for each instruction is listed along with the instruction.
Move 64-bit scalar value from xmm register to xmm register or memory
(V)MOVQ xmm/m64,xmm
0F D6 /r
Yes
Yes (L=0)
Yes (L=0,W=1)
F
No
No
Added with SSE4.1
Variable blend packed bytes.
For each byte lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding byte lane of XMM0.
Zero-extend packed integers into wider packed integers
8-bit → 16-bit
(V)PMOVZXBW xmm,xmm/m64
0F38 30 /r
Yes
Yes
Yes
BW
16
No
8-bit → 32-bit
(V)PMOVZXBD xmm,xmm/m32
0F38 31 /r
Yes
Yes
Yes
F
32
No
8-bit → 64-bit
(V)PMOVZXBQ xmm,xmm/m16
0F38 32 /r
Yes
Yes
Yes
F
64
No
16-bit → 32-bit
(V)PMOVZXWD xmm,xmm/m64
0F38 33 /r
Yes
Yes
Yes
F
32
No
16-bit → 64-bit
(V)PMOVZXWQ xmm,xmm/m32
0F38 34 /r
Yes
Yes
Yes
F
64
No
32-bit → 64-bit
(V)PMOVZXDQ xmm,xmm/m64
0F38 35 /r
Yes
Yes
Yes (W=0)
F
64
No
Packed minimum-value of signed integers
8-bit
(V)PMINSB xmm,xmm/m128
0F38 38 /r
Yes
Yes
Yes
BW
8
No
32-bit
(V)PMINSD xmm,xmm/m128
0F38 39 /r
PMINSD
VPMINSD
VPMINSD(W0)
F
32
32
64-bit
VPMINSQ xmm,xmm/m128(AVX-512)
VPMINSQ(W1)
F
64
64
Packed minimum-value of unsigned integers
16-bit
(V)PMINUW xmm,xmm/m128
0F38 3A /r
Yes
Yes
Yes
BW
16
No
32-bit
(V)PMINUD xmm,xmm/m128
0F38 3B /r
PMINUD
VPMINUD
VPMINUD(W0)
F
32
32
64-bit
VPMINUQ xmm,xmm/m128(AVX-512)
VPMINUQ(W1)
F
64
64
Packed maximum-value of signed integers
8-bit
(V)PMAXSB xmm,xmm/m128
0F38 3C /r
Yes
Yes
Yes
BW
8
No
32-bit
(V)PMAXSD xmm,xmm/m128
0F38 3D /r
PMAXSD
VPMAXSD
VPMAXSD(W0)
F
32
32
64-bit
VPMAXSQ xmm,xmm/m128(AVX-512)
VPMAXSQ(W1)
F
64
64
Packed maximum-value of unsigned integers
16-bit
(V)PMAXUW xmm,xmm/m128
0F38 3E /r
Yes
Yes
Yes
BW
16
No
32-bit
(V)PMAXUD xmm,xmm/m128
0F38 3F /r
PMAXUD
VPMAXUD
VPMAXUD(W0)
F
32
32
64-bit
VPMAXUQ xmm,xmm/m128(AVX-512)
VPMAXUQ(W1)
F
64
64
Multiply packed 32/64-bit integers, store low half of results
(V)PMULLD xmm,xmm/m128 VPMULLQ xmm,xmm/m128(AVX-512)
0F38 40 /r
PMULLD
VPMULLD
VPMULLD(W0)
F
32
32
VPMULLQ(W1)
DQ
64
64
Packed Horizontal Word Minimum
Find the smallest 16-bit integer in a packed vector of 16-bit unsigned integers, then return the integer and its index in the bottom two 16-bit lanes of the result vector.
(V)PHMINPOSUW xmm,xmm/m128
0F38 41 /r
Yes
Yes (L=0)
No
—
—
—
Blend Packed Words.
For each 16-bit lane of the result, pick a 16-bit value from either the first or the second source argument depending on the corresponding bit of the imm8.
Compute Multiple Packed Sums of Absolute Difference.
The 128-bit form of this instruction computes 8 sums of absolute differences from sequentially selected groups of four bytes in the first source argument and a selected group of four contiguous bytes in the second source operand, and writes the sums to sequential 16-bit lanes of destination register. If the two source arguments src1 and src2 are considered to be two 16-entry arrays of uint8 values and temp is considered to be an 8-entry array of uint16 values, then the operation of the instruction is:
for i = 0 to 7 do
temp[i] := 0
for j = 0 to 3 do
a := src1[ i+(imm8[2]*4)+j ]
b := src2[ (imm8[1:0]*4)+j ]
temp[i] := temp[i] + abs(a-b)
done
done
dst := temp
For wider forms of this instruction under AVX2 and AVX10.2, the operation is split into 128-bit lanes where each lane internally performs the same operation as the 128-bit variant of the instruction - except that odd-numbered lanes use bits 5:3 rather than bits 2:0 of the imm8.
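A minimal C illustration of the 128-bit form (intrinsic from immintrin.h, SSE4.1 assumed): with imm8 = 0, the four bytes starting at byte 0 of the second source are compared against the eight overlapping four-byte groups starting at bytes 0..7 of the first source.

#include <immintrin.h>

__m128i mpsadbw_block0(__m128i src1, __m128i src2)
{
    return _mm_mpsadbw_epu8(src1, src2, 0);    /* (V)MPSADBW xmm, xmm/m128, 0 */
}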
^ abcdefFor the (V)PUNPCK*, (V)PACKUSDW, (V)PBLENDW, (V)PSLLDQ and (V)PSRLDQ instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
^Assemblers may accept PBLENDVB with or without XMM0 as a third argument.
^The PBLENDVB instruction with opcode 66 0F38 10 /r is not VEX-encodable. AVX does provide a VPBLENDVB instruction that is similar to PBLENDVB, however, it uses a different opcode and operand encoding - VEX.66.0F3A.W0 4C /r /is4.
^Opcode not EVEX-encodable. Under AVX-512, variable blend of packed bytes may be done with the VPBLENDMB instruction (opcode EVEX.66.0F38.W0 66 /r).
^ abThe EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
^The load performed by (V)MOVNTDQA is weakly-ordered. It may be reordered with respect to other loads, stores and even LOCKs - to impose ordering with respect to other loads/stores, MFENCE or serialization is needed.
If (V)MOVNTDQA is used with uncached memory, it may fetch a cache-line-sized block of data around the data actually requested - subsequent (V)MOVNTDQA instructions may return data from blocks fetched in this manner as long as they are not separated by an MFENCE or serialization.
^Under AVX and AVX2 respectively, the VBLENDPS and VPBLENDD instructions can be used to perform a blend with 32-bit lanes, allowing one imm8 mask to span a full 256-bit vector without repetition.
^Opcode not EVEX-encodable. Under AVX-512, variable blend of packed words may be done with the VPBLENDMW instruction (opcode EVEX.66.0F38.W1 66 /r).
^ ab For (V)PEXTRB and (V)PEXTRW, if the destination argument is a register, then the extracted 8/16-bit value is zero-extended to 32/64 bits.
^ abFor the VPEXTRD and VPINSRD instructions in non-64-bit mode, the instructions are documented as being permitted to be encoded with VEX.W=1 on Intel[16] but not AMD[17] CPUs (although exceptions to this do exist, e.g. Bulldozer permits such encodings[18] while Sandy Bridge does not[19]). In 64-bit mode, these instructions require VEX.W=0 on both Intel and AMD processors — encodings with VEX.W=1 are interpreted as VPEXTRQ/VPINSRQ.
^In the case of a register source argument to (V)PINSRB, the argument is considered to be a 32-bit register of which the 8 bottom bits are used, not an 8-bit register proper. This means that it is not possible to specify AH/BH/CH/DH as a source argument to (V)PINSRB.
^EVEX-encoded variants of the VMPSADBW instruction are only available if AVX10.2 is supported.
^ abcdThe SSE4.2 packed string compare PCMP*STR* instructions allow their 16-byte memory operands to be misaligned even when using legacy SSE encoding.
Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof
SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.
Description
Instruction mnemonics
Basic opcode
SSE
AVX (VEX prefix)
AVX-512 (EVEX prefix)
supported
subset
lane
bcst
rc/sae
Added with SSE
Load MXCSR (Media eXtension Control and Status Register) from memory
(V)LDMXCSR m32
NP 0F AE /2
Yes
Yes (L=0)
No
—
—
—
—
Store MXCSR to memory
(V)STMXCSR m32
NP 0F AE /3
Yes
Yes (L=0)
No
—
—
—
—
Added with SSE2
Move a 64-bit data item from MMX register to bottom half of XMM register. Top half is zeroed out.
MOVQ2DQ xmm,mm
F3 0F D6 /r
Yes
No
No
—
—
—
—
Move a 64-bit data item from bottom half of XMM register to MMX register.
MOVDQ2Q mm,xmm
F2 0F D6 /r
Yes
No
No
—
—
—
—
Load a 64-bit integer from memory or XMM register to bottom 64 bits of XMM register, with zero-fill
(V)MOVQ xmm,xmm/m64
F3 0F 7E /r
Yes
Yes (L=0)
Yes (L=0,W=1)
F
No
No
No
Vector load from unaligned memory or vector register
(V)MOVDQU xmm,xmm/m128
F3 0F 6F /r
Yes
Yes
VMOVDQU64(W1)
F
64
No
No
VMOVDQU32(W0)
F
32
No
No
F2 0F 6F /r
No
No
VMOVDQU16(W1)
BW
16
No
No
VMOVDQU8(W0)
BW
8
No
No
Vector store to unaligned memory or vector register
(V)MOVDQU xmm/m128,xmm
F3 0F 7F /r
Yes
Yes
VMOVDQU64(W1)
F
64
No
No
VMOVDQU32(W0)
F
32
No
No
F2 0F 7F /r
No
No
VMOVDQU16(W1)
BW
16
No
No
VMOVDQU8(W0)
BW
8
No
No
Shuffle the four top 16-bit lanes of the source vector, then place the result in the top half of the destination vector (the bottom half of the destination is copied unchanged from the source)
Packed floating-point add/subtract in alternating lanes. Even-numbered lanes (counting from 0) do subtract, odd-numbered lanes do add.
32-bit
(V)ADDSUBPS xmm,xmm/m128
F2 0F D0 /r
Yes
Yes
No
—
—
—
—
64-bit
(V)ADDSUBPD xmm,xmm/m128
66 0F D0 /r
Yes
Yes
No
—
—
—
—
Vector load from unaligned memory with looser semantics than (V)MOVDQU.
Unlike (V)MOVDQU, it may fetch data more than once or, for a misaligned access, fetch additional data up until the next 16/32-byte alignment boundaries below/above the actually-requested data.
(V)LDDQU xmm,m128
F2 0F F0 /r
Yes
Yes
No
—
—
—
—
Added with SSE4.1
Vector logical test.
Sets ZF=1 if bitwise-AND between first operand and second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if bitwise-AND between second operand and bitwise-NOT of first operand results in all-0s, CF=0 otherwise
Blend packed floating-point values. For each lane of the result, pick the value from either the first or the second argument depending on the corresponding imm8 bit.
32-bit
(V)BLENDPS xmm,xmm/m128,imm8
66 0F3A 0C /r ib
Yes
Yes
No
—
—
—
—
64-bit
(V)BLENDPD xmm,xmm/m128,imm8
66 0F3A 0D /r ib
Yes
Yes
No
—
—
—
—
Extract 32-bit lane of XMM register to general-purpose register or memory location.
Bits [1:0] of the imm8 are used to select the lane.
(V)EXTRACTPS r/m32,xmm,imm8
66 0F3A 17 /r ib
Yes
Yes (L=0)
Yes (L=0)
F
No
No
No
Obtain 32-bit value from source XMM register or memory, and insert into the specified lane of destination XMM register.
If the source argument is an XMM register, then bits [7:6] of the imm8 select which 32-bit lane of the source to use; otherwise the specified 32-bit memory value is used. This 32-bit value is then inserted into the destination-register lane specified by bits [5:4] of the imm8. After insertion, each 32-bit lane of the destination register may optionally be zeroed out - bits [3:0] of the imm8 provide a bitmap of which lanes to zero out.
(V)INSERTPS xmm,xmm/m32,imm8
66 0F3A 21 /r ib
Yes
Yes (L=0)
Yes (L=0,W=0)
F
No
No
No
4-component dot-product of 32-bit floating-point values.
Bits [7:4] of the imm8 specify which lanes should participate in the dot-product, bits[3:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)
2-component dot-product of 64-bit floating-point values.
Bits [5:4] of the imm8 specify which lanes should participate in the dot-product, bits[1:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)
64-bit bitfield insert, using the low 64 bits of XMM registers.
First argument is an XMM register to insert bitfield into, second argument is a source register containing the bitfield to insert (starting from bit 0).
For the 4-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bit-offset to insert bitfield at. For the 2-argument version, the length and offset are instead taken from bits [69:64] and [77:72] of the second argument, respectively.
64-bit bitfield extract, from the lower 64 bits of an XMM register.
The first argument serves as both source that bitfield is extracted from and destination that bitfield is written to.
For the 3-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bitfield bit-offset. For the 2-argument version, the second argument is an XMM register that contains bitfield length at bits[5:0] and bit-offset at bits[13:8].
^ abcdefghFor the VPSHUFLW, VPSHUFHW, VHADDP*, VHSUBP*, VDPPS and VDPPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
^ abUnder AVX, the VPSHUFHW and VPSHUFLW instructions are only available in 128-bit forms - the 256-bit forms of these instructions require AVX2.
^For the EVEX-encoded form of VCVTDQ2PD, EVEX embedded rounding controls are permitted but have no effect.
^Opcode not EVEX-encodable. Performing a vector logical test under AVX-512 requires a sequence of at least 2 instructions, e.g. VPTESTMD followed by KORTESTW.
^ abAssemblers may accept the BLENDVPS/BLENDVPD instructions with or without XMM0 as a third argument.
^ abWhile AVX does provide VBLENDVPS/VPD instructions that are similar in function to BLENDVPS/VPD, they use a different opcode and operand encoding - VEX.66.0F3A.W0 4A/4B /r /is4.
^ abcdOpcode not available under AVX-512. Instead, AVX512F provides different opcodes - EVEX.66.0F3A (08..0B) /r ib - for its new VRNDSCALE* rounding instructions.
^ abcdUnder AVX-512, EVEX-encoding the INSERTQ/EXTRQ opcodes results in AVX-512 instructions completely unrelated to SSE4a, namely VCVT(T)P(S|D)2UQQ and VCVT(T)S(S|D)2USI.
AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer.
Vector operations on 256-bit registers.
Instruction
Description
VBROADCASTSS
Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of an XMM or YMM vector register.
VBROADCASTSD
VBROADCASTF128
VINSERTF128
Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTF128
Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VMASKMOVPS
Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. On the AMD Jaguar processor architecture, this instruction with a memory source operand takes more than 300 clock cycles when the mask is zero, in which case the instruction should do nothing. This appears to be a design flaw.[20]
VMASKMOVPD
VPERMILPS
Permute In-Lane. Shuffle the 32-bit or 64-bit vector elements of one input operand. These are in-lane instructions: the 256-bit forms perform two separate 128-bit shuffles, so they cannot shuffle across the 128-bit lanes.[21]
VPERMILPD
VPERM2F128
Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VZEROALL
Set all YMM registers to zero and tag them as unused. Used when switching between 128-bit use and 256-bit use.
VZEROUPPER
Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use.
VCVTPH2PS xmmreg,xmmrm64
Convert four half-precision floating-point values in memory or the bottom half of an XMM register to four single-precision floating-point values in an XMM register
VCVTPH2PS ymmreg,xmmrm128
Convert eight half-precision floating point values in memory or an XMM register (the bottom half of a YMM register) to eight single-precision floating-point values in a YMM register
VCVTPS2PH xmmrm64,xmmreg,imm8
Convert four single-precision floating-point values in an XMM register to half-precision floating-point values in memory or the bottom half of an XMM register
VCVTPS2PH xmmrm128,ymmreg,imm8
Convert eight single-precision floating point values in a YMM register to half-precision floating-point values in memory or an XMM register
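A short C sketch of the four-element F16C conversions above (intrinsics from immintrin.h, F16C assumed; names are illustrative): widen four FP16 values to FP32, scale them, and narrow them back.

#include <immintrin.h>
#include <stdint.h>

/* the 64-bit value holds four packed FP16 numbers */
void scale_four_fp16(uint64_t *four_halves, float factor)
{
    __m128i h = _mm_loadl_epi64((const __m128i *)four_halves);
    __m128  f = _mm_cvtph_ps(h);                      /* VCVTPH2PS xmmreg,xmmrm64 */
    f = _mm_mul_ps(f, _mm_set1_ps(factor));
    h = _mm_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
                                                      /* VCVTPS2PH xmmrm64,xmmreg,imm8 */
    _mm_storel_epi64((__m128i *)four_halves, h);
}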
Expansion of most vector integer SSE and AVX instructions to 256 bits
Instruction
Description
VBROADCASTSS
Copy a 32-bit or 64-bit register operand to all elements of an XMM or YMM vector register. These are register versions of the same instructions in AVX1. There is, however, no 128-bit version, but the same effect can be achieved using VINSERTF128.
VBROADCASTSD
VPBROADCASTB
Copy an 8, 16, 32 or 64-bit integer register or memory operand to all elements of an XMM or YMM vector register.
VPBROADCASTW
VPBROADCASTD
VPBROADCASTQ
VBROADCASTI128
Copy a 128-bit memory operand to all elements of a YMM vector register.
VINSERTI128
Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTI128
Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VGATHERDPD
Gathers single or double precision floating point values using either 32 or 64-bit indices and scale.
VGATHERQPD
VGATHERDPS
VGATHERQPS
VPGATHERDD
Gathers 32 or 64-bit integer values using either 32 or 64-bit indices and scale.
VPGATHERDQ
VPGATHERQD
VPGATHERQQ
VPMASKMOVD
Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPMASKMOVQ
VPERMPS
Shuffle the eight 32-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMD
VPERMPD
Shuffle the four 64-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMQ
VPERM2I128
Shuffle (two of) the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VPBLENDD
Doubleword immediate version of the PBLEND instructions from SSE4.
VPSLLVD
Shift left logical. Allows variable shifts where each element is shifted according to the packed input.
VPSLLVQ
VPSRLVD
Shift right logical. Allows variable shifts where each element is shifted according to the packed input.
VPSRLVQ
VPSRAVD
Shift right arithmetically. Allows variable shifts where each element is shifted according to the packed input.
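As an illustration of the gather instructions above, the following C sketch (intrinsic from immintrin.h, AVX2 assumed; names are illustrative) loads eight non-contiguous floats with a single VGATHERDPS:

#include <immintrin.h>

/* the scale argument (4) is the element size in bytes */
__m256 gather_eight_floats(const float *table, __m256i indices)
{
    return _mm256_i32gather_ps(table, indices, 4);   /* VGATHERDPS */
}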
Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write the result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that use four operands – a destination operand and three source operands.
FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes – of the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects the operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (x and y outside the given ranges will result in something that is not an FMA3 instruction.)
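For example, the opcode byte A8h decodes as x=A (operand ordering '213') and y=8 (packed fused multiply-add): VEX.66.0F38.W0 A8 /r is VFMADD213PS, and the same opcode with W=1 is VFMADD213PD.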
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:
vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1
For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
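As an illustration, the C intrinsic below (from immintrin.h, FMA3 support assumed; the function name is illustrative) maps to one of these instructions; the compiler chooses among the 132/213/231 orderings depending on which source register it is allowed to overwrite.

#include <immintrin.h>

/* r = a*b + c computed with a single rounding */
__m256d fused_multiply_add(__m256d a, __m256d b, __m256d c)
{
    return _mm256_fmadd_pd(a, b, c);   /* e.g. VFMADD213PD ymm, ymm, ymm/m256 */
}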
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions – these all take the form EVEX.66.MAP6.W0 xy /r with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024,[22] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions – these all take the form EVEX.NP.MAP6.W0 xy /r with the opcode byte again working similar to the FP32/FP64 variants.
(For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, of the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.
For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:
vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.
Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:
^Vector register lanes are counted from 0 upwards in a little-endian manner – the lane that contains the first byte of the vector is considered to be even-numbered.
AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory.[23] Most of the added instructions may also be used with the 256- and 128-bit registers.
Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.
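A minimal C sketch of this programming model (compiler AMX intrinsics from immintrin.h, AMX-TILE and AMX-INT8 assumed; tile numbers, dimensions and names are illustrative, and the sketch glosses over the interleaved data layout required for the second multiplication operand and over the OS opt-in needed to use tile state, e.g. the ARCH_REQ_XCOMP_PERM arch_prctl request on Linux):

#include <immintrin.h>
#include <string.h>
#include <stdint.h>

struct tileconfig {            /* the 64-byte TILECFG layout loaded by LDTILECFG */
    uint8_t  palette;          /* 1 = the palette with 8 tiles of 16x64 bytes    */
    uint8_t  start_row;        /* used to restart after an interruption          */
    uint8_t  reserved[14];
    uint16_t colsb[16];        /* bytes per row for each tile                    */
    uint8_t  rows[16];         /* number of rows for each tile                   */
};

void amx_int8_matmul_sketch(const int8_t *a, const int8_t *b, int32_t *c)
{
    struct tileconfig cfg;
    memset(&cfg, 0, sizeof cfg);
    cfg.palette = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tmm0: 16x16 int32 accumulator */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tmm1: 16x64 int8 source       */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tmm2: 16x64 int8 source       */
    _tile_loadconfig(&cfg);                /* LDTILECFG                     */

    _tile_loadd(1, a, 64);                 /* TILELOADD tmm1 (stride 64 bytes/row) */
    _tile_loadd(2, b, 64);                 /* TILELOADD tmm2                       */
    _tile_zero(0);                         /* TILEZERO tmm0                        */
    _tile_dpbssd(0, 1, 2);                 /* TDPBSSD: int8 dot-product accumulate
                                              into the int32 accumulator tile      */
    _tile_stored(0, c, 64);                /* TILESTORED tmm0                      */

    _tile_release();                       /* TILERELEASE                          */
}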
AMX subset
Instruction mnemonics
Opcode
Instruction description
Added in
AMX-TILE
AMX control and tile management.
LDTILECFG m512
VEX.128.NP.0F38.W0 49 /0
Load the AMX tile configuration from memory, as a 64-byte data structure.
Matrix multiplication of tiles, with source data interpreted as complex numbers represented as pairs of FP16 values, and destination data accumulated as FP32 floating-point values.
Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating imaginary part of result in tmm1.
^For TILEZERO, the tile-register to clear is specified by bits 5:3 of the instruction's ModR/M byte. Bits 7:6 must be set to 11b, and bits 2:0 must be set to 000b.
^ abcFor the TILELOADD, TILELOADDT1 and TILESTORED instructions, the memory argument must use a memory addressing mode with the SIB-byte. Under this addressing mode, the base register and displacement are used to specify the starting address for the first row of the tile to load/store from/to memory – the scale and index are used to specify a per-row stride. These instructions are all interruptible – an interrupt or memory exception taken in the middle of these instructions will cause progress tracking information to be written to TILECFG.start_row, so that the instruction may continue on a partially-loaded/stored tile after the interruption.
^ abcdefghFor all of the AMX matrix multiply instructions, the three arguments are required to be three different tile registers, or else the instruction will #UD.