Intel AMX support in 5.16

By Jonathan Corbet
November 8, 2021

The x86 instruction set is large, but that doesn't mean it can't get bigger yet. Upcoming Intel processors will feature a new set of instructions under the name of "Advanced Matrix Extensions" (AMX) that can be used to operate on matrix data. After a somewhat bumpy development process, support for AMX has found its way into the upcoming 5.16 kernel. Using it will, naturally, require some changes by application developers.

AMX (which is described in this document) is meant to be a general architecture for the acceleration of matrix operations on x86 processors. In its initial form, it implements a set of up to eight "tiles", each of which is an array of 16 64-byte rows. Programmers can store matrices of any dimension that will fit in these tiles; a matrix of 16x16 32-bit floating-point values would work, but other geometries are supported too. The one operation currently supported multiplies the matrices stored in two tiles, then adds the result to a third tile. By chaining these operations, multiplication of matrices of any size can be implemented. Evidently other operations are meant to come in the future.
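To make the tile geometry and the chaining idea concrete, here is a minimal sketch in plain C. It uses no AMX instructions at all; tile_mul_acc() is a hypothetical software stand-in for the hardware multiply-and-accumulate operation, and it works on 32-bit floats purely for readability, which should not be taken as a description of the data types the real instructions accept. The function names and the restriction to square matrices whose size is a multiple of 16 are assumptions made for the example.

```c
#include <stddef.h>

/* Tile geometry as described above: up to 16 rows of 64 bytes each. */
#define TILE_MAX_ROWS      16
#define TILE_MAX_ROW_BYTES 64

/* Block edge used below: 16 floats of 4 bytes fill one 64-byte row. */
#define T 16

/* Return 1 if a rows x cols matrix of elem_size-byte elements fits in a tile. */
int tile_fits(size_t rows, size_t cols, size_t elem_size)
{
    return rows <= TILE_MAX_ROWS && cols * elem_size <= TILE_MAX_ROW_BYTES;
}

/*
 * Hypothetical software stand-in for the tile operation: c += a * b,
 * where a, b and c each point at a T x T block inside a larger
 * row-major matrix whose rows are ld elements apart.
 */
static void tile_mul_acc(float *c, const float *a, const float *b, size_t ld)
{
    for (size_t i = 0; i < T; i++)
        for (size_t k = 0; k < T; k++)
            for (size_t j = 0; j < T; j++)
                c[i * ld + j] += a[i * ld + k] * b[k * ld + j];
}

/*
 * Multiply two n x n row-major matrices by chaining tile-sized
 * multiply-accumulate steps. n must be a multiple of T and the
 * caller must zero c beforehand.
 */
void blocked_matmul(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += T)
        for (size_t j = 0; j < n; j += T)
            for (size_t k = 0; k < n; k += T)
                tile_mul_acc(&c[i * n + j], &a[i * n + k], &b[k * n + j], n);
}
```

Each pass of the innermost loop in blocked_matmul() corresponds to one "multiply two tiles, add into a third" step; the output block at (i, j) accumulates the contributions from every k.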

While AMX may seem like a feature aimed at numerical analysis, the real target use case would appear to be machine-learning applications. That would explain why 16-bit floating-point arithmetic is supported, but 64-bit is not.

The design of AMX gives the kernel control over whether these features can be used by any given process. There are a couple of reasons for this, one being that AMX instructions, as one might imagine, use a lot of processor resources. A process doing heavy AMX work on a shared computer may adversely affect other processes. But AMX also cannot be supported properly unless both the kernel and the user-space process are ready for it.

Development process

Support for AMX was first posted by Chang Bae in October 2020, but got relatively few review comments. By the time version 4 came out in February, more developers were paying attention, and they were not entirely pleased with how this feature was being integrated into the kernel's existing floating-point unit (FPU) code. Various versions followed, with the frustration level seeming to increase; at the end of September, Len Brown posted minutes from a conversation that, seemingly, showed a path forward.

Unfortunately, version 11, posted the very next day, seemed to ignore many of the decisions that had been made. This posting drew a sharp rebuke from Thomas Gleixner, who felt that the feature was being shoved into the kernel without listening to the complaints that were being raised. Things weren't looking great for AMX, but work was happening behind the scenes; in mid-October, Gleixner posted a massive reworking of the FPU code meant to ease the task of supporting AMX. A new AMX patch set followed shortly thereafter, and that, more or less, is what ended up in 5.16.

Gleixner's pull request for this code acknowledged its relatively unseasoned nature:

Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words...

The FPU code is relatively tricky, low-level stuff, so it would indeed be unsurprising to find a dragon or two lurking in the new work.

Using AMX

As noted above, the kernel is able to control which processes are able to use the AMX instructions. The first step for a user-space process would be to use a new arch_prctl() command (ARCH_GET_XCOMP_SUPP) to get a list of supported features; if the appropriate bit is set in the result, AMX is available. Then, another arch_prctl() command (ARCH_REQ_XCOMP_PERM) can be used to request permission to use AMX. Some checks are made (one to be described shortly), and there is an opportunity for security modules to express an opinion as well. Normally, though, the request will be granted. Permissions apply to all threads in a process and are carried over a fork; calling execve() will reset them, though.
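The two-step sequence might look something like the sketch below. The command names come from the text above; the numeric values and the feature number 18 for AMX tile data are taken from the kernel's x86 headers and documentation as best I know them, and are defined here only as a fallback for older installed headers, so they should be checked against the headers on the target system. The helper name amx_request_permission() and its return convention are invented for this example.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Fallback definitions in case the installed headers predate 5.16. */
#ifndef ARCH_GET_XCOMP_SUPP
#define ARCH_GET_XCOMP_SUPP 0x1021
#endif
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif

/* Assumed xstate feature number for the AMX tile data. */
#define XFEATURE_XTILEDATA 18

/*
 * Hypothetical helper: returns 1 if permission to use AMX was granted,
 * 0 if AMX is not available here, and -1 if it is available but the
 * permission request was refused.
 */
int amx_request_permission(void)
{
#if defined(__linux__) && defined(__x86_64__)
    uint64_t features = 0;

    /* Step 1: ask which features the kernel and CPU support. */
    if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features) != 0)
        return 0;   /* for example, a kernel without this command */
    if (!(features & (1ULL << XFEATURE_XTILEDATA)))
        return 0;   /* the AMX bit is not set */

    /* Step 2: request permission for the whole process. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM,
                (unsigned long)XFEATURE_XTILEDATA) != 0)
        return -1;  /* refused; errno holds the reason */
    return 1;
#else
    return 0;       /* not x86-64 Linux */
#endif
}
```

Since permissions do not survive execve(), a program would presumably make this call early in its own startup rather than relying on a parent process having done so.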

One challenge presented by AMX is that processors can create a great deal of internal state while the AMX instructions are running. If the CPU is interrupted in the middle of an operation, that state must be saved somewhere or a lot of work could be lost. So, if a process is using AMX, the kernel must be prepared to save up to about 10KB of data in its interrupt handlers before doing much of anything else. This saving is done using the XSAVE instruction.
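A bit of arithmetic shows where most of that space goes. The tile count and geometry below come from the description earlier in this article; the 64-byte size of the tile-configuration state is an assumption drawn from Intel's documentation rather than from anything stated here, and the function names are made up for illustration.

```c
#include <stddef.h>

#define AMX_TILE_COUNT     8    /* up to eight tiles */
#define AMX_TILE_ROWS      16   /* rows per tile */
#define AMX_TILE_ROW_BYTES 64   /* bytes per row */
#define AMX_TILECFG_BYTES  64   /* assumed size of the tile configuration */

/* Bytes needed to save the contents of all tiles. */
size_t amx_tile_data_bytes(void)
{
    return (size_t)AMX_TILE_COUNT * AMX_TILE_ROWS * AMX_TILE_ROW_BYTES;
}

/* Tile contents plus the (assumed) configuration state. */
size_t amx_state_bytes(void)
{
    return amx_tile_data_bytes() + AMX_TILECFG_BYTES;
}
```

The tile contents alone come to 8 x 16 x 64 = 8192 bytes. The rest of the "about 10KB" figure is presumably the older floating-point and vector register state that is saved in the same buffer.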

The kernel allocates memory for each process that can be used for this purpose. Allocating 10KB for every process in the system would waste a lot of memory, though; most processes will never use AMX instructions. Happily, the processor can be configured to trap into the kernel the first time any process executes an AMX instruction; the kernel can then check whether permission to use those instructions has been granted and, if so, allocate an appropriately sized buffer to hold the FPU state and allow the operation to continue.
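The allocate-on-first-use logic can be modeled in a few lines of user-space C. This is a toy model only: the structure, the function name, and the result codes are all hypothetical and do not correspond to the kernel's actual data structures or to the signals it delivers; the buffer size is passed in by the caller rather than derived from the hardware.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-task bookkeeping; not the kernel's real structures. */
struct toy_task {
    int    amx_permitted;   /* set if permission was granted earlier */
    void  *xstate_buf;      /* buffer used to save FPU state */
    size_t xstate_size;     /* current size of that buffer */
};

enum toy_trap_result { TOY_RESUME, TOY_DENIED, TOY_NOMEM };

/*
 * Models the trap taken when a task first executes an AMX instruction:
 * check permission, then grow the save buffer to large_size bytes.
 */
enum toy_trap_result toy_first_amx_use(struct toy_task *t, size_t large_size)
{
    void *bigger;

    if (!t->amx_permitted)
        return TOY_DENIED;      /* no permission: do not allocate */
    if (t->xstate_size >= large_size)
        return TOY_RESUME;      /* buffer is already big enough */

    bigger = realloc(t->xstate_buf, large_size);
    if (!bigger)
        return TOY_NOMEM;       /* allocation failed */

    t->xstate_buf = bigger;
    t->xstate_size = large_size;
    return TOY_RESUME;          /* let the instruction run again */
}
```

The point of the design is that tasks which never touch AMX never reach this path, so they keep their small buffers.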

One potential problem has to do with the sigaltstack() system call, which allows a thread to establish a new stack for signal handling. That stack, too, must be large enough to hold the FPU state if the process involved is using AMX. For years, developers have been told to use MINSIGSTKSZ as the minimum size for this stack; that size, which is 2KB, is nowhere near large enough for AMX-using processes. Indeed, it is not even large enough to use the AVX-512 extensions, a fact that has caused some corrupted-stack problems in the past.

To avoid this problem for AMX, the kernel will check to ensure that all signal stacks are sufficiently large. This check is done at each call to sigaltstack(), but a check of existing stacks is also done when a process requests permission to use AMX in the first place. Processes not using AMX will not need the larger stack and, thus, will not be broken by these checks. Processes that do want to use AMX will not be allowed to unless all of their signal stacks are big enough.
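For an application, the practical consequence is that the signal stack should be sized at run time rather than from a compile-time constant. One way to do that, sketched below, is to ask the kernel for the minimum size through the AT_MINSIGSTKSZ auxiliary-vector entry. That mechanism is not described in this article; it is an assumption that the running kernel and C library provide it, and the code falls back to MINSIGSTKSZ when they do not. The helper names and the 16KB of headroom for the handler's own frames are arbitrary choices for the example.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/auxv.h>

/* Fallback for older headers; 51 is the value used by the kernel. */
#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51
#endif

/* Arbitrary extra room for the signal handler's own stack frames. */
#define HANDLER_HEADROOM (16 * 1024)

/* Hypothetical helper: pick a signal-stack size at run time. */
size_t signal_stack_size(void)
{
    /* getauxval() returns 0 if the kernel does not supply this entry. */
    size_t min = (size_t)getauxval(AT_MINSIGSTKSZ);

    if (min < (size_t)MINSIGSTKSZ)
        min = (size_t)MINSIGSTKSZ;
    return min + HANDLER_HEADROOM;
}

/* Hypothetical helper: allocate and install an alternate signal stack. */
int install_signal_stack(void)
{
    stack_t ss;

    ss.ss_size = signal_stack_size();
    ss.ss_sp = malloc(ss.ss_size);
    if (!ss.ss_sp)
        return -1;
    ss.ss_flags = 0;

    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

A program intending to use AMX would want to install a stack like this before requesting permission, given that existing stacks are checked at that point.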

Once the infrastructure to perform these checks was put in place, the kernel also gained the ability to ensure that processes using AVX-512 have adequately sized signal stacks. Enforcing that condition, though, has the potential to break applications that seemingly work now, perhaps because their signal handlers are never actually called. To avoid this problem, there is a kernel configuration option (STRICT_SIGALTSTACK_SIZE) and a command-line option (strict_sas_size=), either of which can be used to control whether the strict checks are carried out when AVX-512 is in use.

Assuming that all of the pieces hold together, this is the form that AMX support will take in 5.16. Those wanting more information can look at the commits containing AMX test cases and some documentation on the arch_prctl() commands. Meanwhile, keep an eye out for dragons for the next nine weeks or so.

packet processing

Posted Nov 8, 2021 16:30 UTC (Mon) by mtaht (guest, #11087) [Link] (3 responses)

maybe I'm weird, but why aren't we ever seeing any improvements for packet processing? For most workloads the SSE side of the chip is idle, and yet a few instructions (invsqrt for codel, all sorts of comparisons and masking stuff for ipv6 traffic) would probably help there.

packet processing

Posted Nov 8, 2021 18:33 UTC (Mon) by mebrown (subscriber, #7960) [Link]

Last time I checked, the kernel code could not use FPU instructions, which includes SSE and all these newer instructions.

... aaand after a quick google search: "kernel_fpu_begin() / kernel_fpu_end()"

looks like they added a standardized way to do FPU/SSE/etc in the kernel. Likely the reason it's not used in packet processing is that you'd have to call the above functions, and the speed increase would have to be worth the fpu state save/restore.

packet processing

Posted Nov 9, 2021 8:59 UTC (Tue) by Sesse (subscriber, #53779) [Link]

In addition to what mebrown said, there _is_ a fast instruction for inverse square root (rsqrtss).

In general, packet processing is not dominated by a single operation; it's more like various small tasks all over, so one specific instruction is unlikely to help a whole lot. But CPUs just getting generally faster helps, of course (even though there was a long time where progress was rather disappointing).

packet processing

Posted Nov 10, 2021 7:34 UTC (Wed) by flussence (guest, #85566) [Link]

I imagine anyone with a workload where they need SIMD to keep up with line rate is already using a userspace thing like DPDK where they can use all the fancy instruction sets they want, along with heavily tuning scheduling, core isolation, PREEMPT_RT and the like (SSE, context switches, tolerable latency - pick 2)

Intel AMX support in 5.16

Posted Nov 8, 2021 16:38 UTC (Mon) by ncm (guest, #165) [Link]

Quality reporting like this keeps me subscribed to LWN.

Intel AMX support in 5.16

Posted Nov 10, 2021 11:36 UTC (Wed) by epa (subscriber, #39769) [Link] (7 responses)

I wondered whether this was the feature derided as a waste of silicon by Linus. But no, that was AVX-512: https://www.realworldtech.com/forum/?threadid=193189&...

Intel AMX support in 5.16

Posted Nov 10, 2021 18:40 UTC (Wed) by jak90 (subscriber, #123821) [Link] (6 responses)

And just like that, AVX-512 is indeed kind of a waste of silicon on Alder Lake processors right now, since firmware, if it offers anything at all, only allows a choice between AVX-512 support on the performance cores and enabling the efficiency cores, to avoid asymmetry entirely.
For what dubious advantages it brings to the table, it would need a distinction from the OS to either declare that a task is planning to use AVX-512 instructions or to restrict it to performance cores on the first related illegal instruction encountered on an efficiency core.

Intel AMX support in 5.16

Posted Nov 10, 2021 22:25 UTC (Wed) by bartoc (guest, #124262) [Link] (3 responses)

both options are kinda bad though, you don't want just "I'm going to use AVX-512 please pin me" you also need to tell the scheduler when you're done using it. Similarly if the kernel just pins on the first illegal instruction it would need to occasionally unpin if it wants to be able to do the non-avx512 things on the non-avx cores.

Also, most apps using avx-512 are not doing it unconditionally, but rather call cpuid and check the results. Because cpuid completely serializes execution it's very much not fast, and so apps tend to just call it once, in a static initializer or similar. Calling it before each portion of code using AVX-512 is just not fast at all, so you'd want to make the "can I do avx-512" part of the per-thread state, and have the scheduler change it for you when it decided to schedule you on a CPU with different features.

For AVX-512 on something like alder lake I think you'd need a system call that's essentially "do I have AVX-512" and the kernel could say yes or no (even if there are some cores with AVX-512), but if it said yes then it would promise not to schedule you on any cores without AVX-512 until you were done. Hopefully this would be implementable without actually making a real transition to kernel mode by setting some per-thread flag the scheduler could look at when needed. Even this (pretty complicated) mechanism poses some problems, because apps might not tell the kernel when they are done, either because they forget, or because the kernel told them they could use the fancy instructions and they don't want to give the core back. This would be a particular problem on laptops where I'd imagine the kernel might want to get everyone off the P cores so they could be completely powered off. Unfortunately once the app has started doing its fancy AVX-512 things the kernel can't unilaterally decide to take back the permission to do AVX-512, as even if it handled the illegal instructions after moving the thread to an E core it can't go back in time to have the process take the non-avx branch. So you might get situations (a little like with switchable graphics) where you have long running apps that ask for AVX-512, don't tell the kernel when they are done with it, and then cause pretty dramatic reductions in battery life for no reason.

I suppose the kernel _could_ forcibly reschedule the process by somehow implementing a software version of the AVX-512 instructions, that way you'd just get somewhat extreme slowness.

Another option would be for intel themselves to implement such software versions of AVX-512 instructions, and use their execution as input into their new hardware scheduler thingy to indicate that maybe the thread should be scheduled on a P core.

Intel AMX support in 5.16

Posted Nov 11, 2021 10:05 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (2 responses)

No, the situation is much worse: applications are using it by accident during a memcpy() or such stuff that they most often do not even require the tiny savings brought by the instruction set, and such calls may happen way more often than one would accept to migrate the tasks, so the real result is that any task using a given libc would end up running exclusively on the avx-enabled core. What we really need is to turn such features into opt-in at the libc level so that we're not inflicted that trouble without consent. And by the way there are plenty of cases where using this significantly lowers the CPU's frequency and dramatically slows down the useful workload, which is another reason some people explicitly disable AVX512 on their machines.

Intel AMX support in 5.16

Posted Nov 11, 2021 18:07 UTC (Thu) by anton (subscriber, #25547) [Link] (1 responses)

AVX slowdown and even AVX-512 slowdown does not seem to be bad in recent Intel CPUs.

I think that AVX-512 support on heterogeneous CPUs where some cores don't support AVX-512 is not a big problem. There are several ways to deal with the situation. Sure you can come up with a scenario for every one of them where you would prefer a different result, but even in these scenarios the disadvantage of the not-preferred result is not that big, certainly not worse than outright disabling AVX-512 or outright disabling E-cores.

E.g. if you automatically reduce the cpu-list of a thread to the P-cores once an AVX-512 instruction is used, the worst case is that the E-cores won't be used. I guess many threads don't use AVX-512, so there is enough left for the E-cores; as for memcpy() and friends, the code for selecting the actual routine could be made more CPU-specific (rather than just checking the AVX-512 flags in cpuinfo).

Alternatively, only report the AVX-512 flags on threads where the cpu-list is limited to the cores that have AVX-512. So you won't get AVX-512 on ordinary threads. Given that relatively little code actually makes significant use of AVX-512, it's not a big problem that the user then has to call such code with taskset or somesuch.

In any case, in order to have such problems at all, we need CPUs that enable AVX-512 at the same time as E-cores. From what I read, Intel wanted to give us no AVX-512 at all, and board manufacturers give us either AVX-512 or E-cores, but not both.

Intel AMX support in 5.16

Posted Nov 12, 2021 11:44 UTC (Fri) by wtarreau (subscriber, #51152) [Link]

I also noticed newer cores are less impacted by this, they're making progress.

For memcpy() ideally the solution would be to only consider features that intersect all CPUs the task may run on, and not just the starting one. It's not very complicated after all; the most painful part is already done (except if it's relying on a cpuid instruction).

Intel AMX support in 5.16

Posted Nov 19, 2021 17:48 UTC (Fri) by flussence (guest, #85566) [Link] (1 responses)

Intel's painted themselves into a corner with this design of putting exponentially-expanding vector instructions inline in the x86 ISA. Interestingly it looks like Apple are doing a thing similar to AMX, but they've put it in a separate part of the chip (and their users have to use an OpenCL-ish library to get at it). Higher overhead as a result but it also means they keep the performance/efficiency core arrangement without dealing with instruction set mismatches, or all of AVX512's downclocking ballet, or humongous context switch overhead.

I think with the way things are headed, other CPUs are going to follow the uncore design and AVX512 itself is going to get Itanium'd out of existence.

Intel AMX support in 5.16

Posted Nov 20, 2021 13:12 UTC (Sat) by farnz (subscriber, #17727) [Link]

There is also the ARM SVE2 or RISC-V V extension route to allowing for large vectors. In both designs, you have the same instructions used regardless of the number of elements in the vector, and the length of the vector is agreed between software and hardware at runtime. SVE2 allows for the vector to be up to 2048 bits long, while RISC-V V extension permits the vector to be up to 65536 bits in length.

In both cases, the hardware can choose how long the implemented vectors actually are, and there are ways for software to find out how large the real vectors are; for code that's happy getting ~90% of the possible speed-up from vector execution, they're designed so that you run a loop asking the hardware to choose as large a vector length as possible that's smaller than the total elements to process, and repeat until you've exhausted the data to process. For code that needs all the possible speed-up, you'll need to worry about the vector length your target CPUs support, but that's less common.