AMX was introduced by Intel in June 2020 and first supported in the Sapphire Rapids microarchitecture for Xeon servers, released in January 2023.[2][3] It introduced two-dimensional registers called tiles, upon which accelerators can perform operations. It is intended as an extensible architecture; the first accelerator implemented is called the tile matrix multiply unit (TMUL).[4][5]
In Intel Architecture Instruction Set Extensions and Future Features revision 46, published in September 2022, a new AMX-FP16 extension was documented. This extension adds support for half-precision floating-point numbers. In revision 48, published in March 2023, AMX-COMPLEX was documented, adding support for half-precision floating-point complex numbers. Both extensions are available in the Granite Rapids set of server processors, although AMX-COMPLEX is available only in Granite Rapids-D.[6]
Tile matrix multiply unit
The TMUL unit supports BF16 and INT8 input types.[7] AMX-FP16 and AMX-COMPLEX add support for real and complex FP16 numbers, respectively. The register file consists of 8 tiles, each with 16 rows of 64 bytes (32 BF16/FP16 or 64 INT8 elements). The only supported operation is matrix multiplication.[4]
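The tile geometry described above can be made concrete with a small software model. The following Python sketch is illustrative only: every name in it (TILE_COUNT, elements_per_row, tile_matmul_accumulate, and so on) is hypothetical and not part of any Intel API, and it assumes a plain row-major logical layout with the product accumulated into the destination tile, rather than the packed memory layout that real hardware expects for the second operand.

```python
# Illustrative model of the AMX tile register file; all names are hypothetical.
TILE_COUNT = 8        # tiles in the register file
TILE_ROWS = 16        # rows per tile
TILE_ROW_BYTES = 64   # bytes per row

# Element sizes in bytes for the input types mentioned in the text.
ELEMENT_BYTES = {"INT8": 1, "BF16": 2, "FP16": 2}


def elements_per_row(dtype):
    """Number of elements of the given type that fit in one 64-byte row."""
    return TILE_ROW_BYTES // ELEMENT_BYTES[dtype]


def tile_bytes():
    """Size of one tile in bytes."""
    return TILE_ROWS * TILE_ROW_BYTES


def register_file_bytes():
    """Total size of all tiles in bytes."""
    return TILE_COUNT * tile_bytes()


def tile_matmul_accumulate(c, a, b):
    """Reference C += A * B on nested lists (logical layout, assumption)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c
```

Under these assumptions, one tile holds 1 KiB and the whole register file holds 8 KiB; a full-size INT8 product multiplies a 16 × 64 operand by a 64 × 16 operand and accumulates into a 16 × 16 result.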
A 4th Gen Intel Xeon Scalable processor can perform 2048 INT8 or 1024 BF16 operations per cycle:[8][9] the maximal input sizes are $16 \times J$ for A and $J \times 16$ for B, where J is 64 for INT8 and 32 for BF16. The matrix multiplication requires $16^2 \times J$ multiplications and additions, thus performing $2 \times 16^2 \times J$ operations in 16 cycles.[9]
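The per-cycle figures can be checked with a short calculation. The sketch below is a back-of-the-envelope model, not a description of the hardware pipeline: it assumes A has 16 rows of J elements, B has J rows of 16 elements, each multiply and each add counts as one operation, and the whole product takes 16 cycles. The function name ops_per_cycle and its parameters are hypothetical.

```python
def ops_per_cycle(j, rows=16, cols=16, cycles=16):
    """Operations per cycle for a (rows x j) by (j x cols) tile product.

    Assumes one multiply and one add per element pair, spread evenly
    over the given number of cycles.
    """
    multiply_adds = rows * cols * j   # one multiply-add per (i, k, j) triple
    operations = 2 * multiply_adds    # count the multiply and the add separately
    return operations // cycles


int8_rate = ops_per_cycle(64)  # J = 64 for INT8
bf16_rate = ops_per_cycle(32)  # J = 32 for BF16
```

With J = 64 this gives 2 × 16 × 16 × 64 / 16 = 2048 operations per cycle, and with J = 32 it gives 1024, matching the figures quoted for INT8 and BF16.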